Search Results (108)

Search Parameters:
Keywords = Flickr

26 pages, 5665 KB  
Article
SwinT-SRGAN: Swin Transformer Enhanced Generative Adversarial Network for Image Super-Resolution
by Qingyu Liu, Lei Chen, Yeguo Sun and Lei Liu
Electronics 2025, 14(17), 3511; https://doi.org/10.3390/electronics14173511 - 2 Sep 2025
Viewed by 450
Abstract
To resolve the conflict between global structure modeling and local detail preservation in image super-resolution, we propose SwinT-SRGAN, a novel framework integrating the Swin Transformer with a GAN. Key innovations include: (1) a dual-path generator in which the Transformer captures long-range dependencies via window attention while a CNN branch extracts high-frequency textures; (2) an end-to-end Detail Recovery Block (DRB) that suppresses artifacts through dual-path attention; (3) a triple-branch discriminator enabling hierarchical adversarial supervision; and (4) a dynamic loss scheduler that adaptively balances six loss components (pixel, perceptual, and high-frequency constraints). Experiments on CelebA-HQ and Flickr2K show strong performance (maximum gains of 0.71 dB PSNR and 0.83% SSIM, and a 4.67 LPIPS reduction vs. Swin-IR), and ablation studies validate the critical role of the DRB. This work offers a robust solution for high-frequency-sensitive applications.
(This article belongs to the Section Artificial Intelligence)
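
To make the dual-path idea concrete, here is a minimal PyTorch sketch of a block that runs window self-attention for global structure alongside a CNN branch for local texture and fuses the two; the module names, sizes, and fusion rule are illustrative assumptions, not the SwinT-SRGAN implementation.

import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, channels=64, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.cnn = nn.Sequential(            # local high-frequency path
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                    # x: (B, C, H, W), H and W divisible by window
        b, c, h, w = x.shape
        ws = self.window
        # partition into non-overlapping windows and run self-attention per window
        t = x.view(b, c, h // ws, ws, w // ws, ws).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, ws * ws, c)
        t, _ = self.attn(self.norm(t), self.norm(t), self.norm(t))
        t = t.reshape(b, h // ws, w // ws, ws, ws, c).permute(0, 5, 1, 3, 2, 4)
        attn_out = t.reshape(b, c, h, w)
        return self.fuse(torch.cat([attn_out, self.cnn(x)], dim=1)) + x

x = torch.randn(1, 64, 64, 64)
print(DualPathBlock()(x).shape)              # torch.Size([1, 64, 64, 64])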

20 pages, 1818 KB  
Article
Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation
by Liang Wang, Meiqing Jiao, Zhihai Li, Mengxue Zhang, Haiyan Wei, Yuru Ma, Honghui An, Jiaqi Lin and Jun Wang
Electronics 2025, 14(16), 3325; https://doi.org/10.3390/electronics14163325 - 21 Aug 2025
Viewed by 833
Abstract
To address the semantic mismatch between limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as the underutilized external commonsense knowledge, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, BERT text encoder, and GPT-2 text decoder. It incorporates two core mechanisms: a multi-step cross-attention mechanism that iteratively aligns image and text features across multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. Moreover, the model employs Faster R-CNN to extract region-based object features. These features are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k. Ablations confirm both modules enhance caption quality.
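
A rough sketch of the multi-step alignment idea described above: image region features and text token features attend to each other repeatedly over several rounds. The layer sizes, number of steps, and residual/normalization choices are assumptions for illustration, not the authors' configuration.

import torch
import torch.nn as nn

class MultiStepCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8, steps=3):
        super().__init__()
        self.steps = steps
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, img, txt):             # img: (B, R, D) regions, txt: (B, L, D) tokens
        for _ in range(self.steps):
            # image features attend to text, then text attends to the updated image
            upd_i, _ = self.img2txt(img, txt, txt)
            img = self.norm_i(img + upd_i)
            upd_t, _ = self.txt2img(txt, img, img)
            txt = self.norm_t(txt + upd_t)
        return img, txt

img = torch.randn(2, 36, 512)                # e.g., 36 Faster R-CNN region features
txt = torch.randn(2, 20, 512)                # e.g., 20 BERT token embeddings
i2, t2 = MultiStepCrossAttention()(img, txt)
print(i2.shape, t2.shape)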

32 pages, 3272 KB  
Article
Bridging Modalities: An Analysis of Cross-Modal Wasserstein Adversarial Translation Networks and Their Theoretical Foundations
by Joseph Tafataona Mtetwa, Kingsley A. Ogudo and Sameerchand Pudaruth
Mathematics 2025, 13(16), 2545; https://doi.org/10.3390/math13162545 - 8 Aug 2025
Viewed by 818
Abstract
What if machines could seamlessly translate between the visual richness of images and the semantic depth of language with mathematical precision? This paper presents a theoretical and empirical analysis of five novel cross-modal Wasserstein adversarial translation networks that challenge conventional approaches to cross-modal understanding. Unlike traditional generative models that rely on stochastic noise, our frameworks learn deterministic translation mappings that preserve semantic fidelity across modalities through rigorous mathematical foundations. We systematically examine: (1) cross-modality consistent dual-critical networks; (2) Wasserstein cycle consistency; (3) multi-scale Wasserstein distance; (4) regularization through modality invariance; and (5) Wasserstein information bottleneck. Each approach employs adversarial training with Wasserstein distances to establish theoretically grounded translation functions between heterogeneous data representations. Through mathematical analysis—including information-theoretic frameworks, differential geometry, and convergence guarantees—we establish the theoretical foundations underlying cross-modal translation. Our empirical evaluation across MS-COCO, Flickr30K, and Conceptual Captions datasets, including comparisons with transformer-based baselines, reveals that our proposed multi-scale Wasserstein cycle consistent (MS-WCC) framework achieves remarkable performance gains—12.1% average improvement in FID scores and 8.0% enhancement in cross-modal translation accuracy—compared to state-of-the-art methods, while maintaining superior computational efficiency. These results demonstrate that principled mathematical approaches to cross-modal translation can significantly advance machine understanding of multimodal data, opening new possibilities for applications requiring seamless communication between visual and textual domains.
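
The kind of objective analyzed here can be sketched in a few lines: adversarial (Wasserstein-style) terms for two deterministic translators plus a cycle-consistency term. The toy linear translators, critics, and loss weights below are placeholders, not the paper's networks.

import torch
import torch.nn as nn

dim_v = dim_t = 512
f_v2t = nn.Linear(dim_v, dim_t)              # image-embedding -> text-embedding translator
g_t2v = nn.Linear(dim_t, dim_v)              # text-embedding -> image-embedding translator
critic_t = nn.Sequential(nn.Linear(dim_t, 256), nn.ReLU(), nn.Linear(256, 1))
critic_v = nn.Sequential(nn.Linear(dim_v, 256), nn.ReLU(), nn.Linear(256, 1))

def translator_loss(v, t, lam_cyc=10.0):
    # Wasserstein adversarial terms: translators try to raise the critics' scores
    adv = -critic_t(f_v2t(v)).mean() - critic_v(g_t2v(t)).mean()
    # cycle consistency: v -> f -> g -> v and t -> g -> f -> t
    cyc = (g_t2v(f_v2t(v)) - v).abs().mean() + (f_v2t(g_t2v(t)) - t).abs().mean()
    return adv + lam_cyc * cyc

v = torch.randn(8, dim_v)                    # batch of image embeddings
t = torch.randn(8, dim_t)                    # batch of text embeddings
print(translator_loss(v, t).item())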

13 pages, 736 KB  
Article
Birding via Facebook—Methodological Considerations When Crowdsourcing Observations of Bird Behavior via Social Media
by Dirk H. R. Spennemann
Birds 2025, 6(3), 39; https://doi.org/10.3390/birds6030039 - 28 Jul 2025
Viewed by 604
Abstract
This paper outlines a methodology to compile geo-referenced observational data of Australian birds acting as pollinators of Strelitzia sp. (Bird of Paradise) flowers and dispersers of their seeds. Given the absence of systematic published records, a crowdsourcing approach was employed, combining data from natural history platforms (e.g., iNaturalist, eBird), image hosting websites (e.g., Flickr) and, in particular, social media. Facebook emerged as the most productive channel, with 61.4% of the 301 usable observations sourced from 43 ornithology-related groups. The strategy included direct solicitation of images and metadata via group posts and follow-up communication. The holistic, snowballing search strategy yielded a unique, behavior-focused dataset suitable for analysis. While the process exposed limitations due to user self-censorship on image quality and completeness, the approach demonstrates the viability of crowdsourced behavioral ecology data and contributes a replicable methodology for similar studies in under-documented ecological contexts.

20 pages, 4538 KB  
Article
Image Captioning Method Based on CLIP-Combined Local Feature Enhancement and Multi-Scale Semantic Guidance
by Liang Wang, Mengxue Zhang, Meiqing Jiao, Enru Chen, Yuru Ma and Jun Wang
Electronics 2025, 14(14), 2809; https://doi.org/10.3390/electronics14142809 - 12 Jul 2025
Cited by 1 | Viewed by 1126
Abstract
To address the issues of modeling the relationships between multiple local region objects in images and enhancing local region features, as well as mapping global image semantics to global text semantics and local region image semantics to local text semantics, a novel image captioning method based on CLIP, integrating local feature enhancement and multi-scale semantic guidance, is proposed. The model employs ViT as the global visual encoder, Faster R-CNN as the local region visual encoder, BERT as the text encoder, and GPT-2 as the text decoder. By constructing a KNN graph of local image features, the model captures the relationships between local region objects and then enhances the local region features using a graph attention network. Additionally, a multi-scale semantic guidance method is utilized to calculate the global and local semantic weights, thereby improving the accuracy of the scene descriptions and attribute details generated by the GPT-2 decoder. Evaluated on the MSCOCO and Flickr30k datasets, the model achieves a significant improvement in the core metric CIDEr over established strong baselines, with 4.7% higher CIDEr than OFA on MSCOCO and 16.6% higher CIDEr than Unified VLP on Flickr30k. Ablation studies and qualitative analysis validate the effectiveness of each proposed module.
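
As a concrete illustration of the local-feature step, the sketch below builds a k-NN graph over region features and aggregates each region's neighbors with attention weights; the value of k, the feature dimension, and the aggregation rule are assumptions, not the paper's graph attention network.

import torch
import torch.nn.functional as F

def knn_graph_attention(regions, k=5):
    # regions: (R, D) local region features for one image
    normed = F.normalize(regions, dim=-1)
    sim = normed @ normed.t()                # cosine similarity between regions
    sim.fill_diagonal_(float('-inf'))        # exclude self-loops
    topk_val, topk_idx = sim.topk(k, dim=-1) # k nearest neighbors per region
    attn = topk_val.softmax(dim=-1)          # attention weights over neighbors
    neighbors = regions[topk_idx]            # (R, k, D)
    return regions + (attn.unsqueeze(-1) * neighbors).sum(dim=1)

regions = torch.randn(36, 2048)              # 36 region features from Faster R-CNN
print(knn_graph_attention(regions).shape)    # torch.Size([36, 2048])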

21 pages, 14585 KB  
Article
Unsupervised Contrastive Graph Kolmogorov–Arnold Networks Enhanced Cross-Modal Retrieval Hashing
by Hongyu Lin, Shaofeng Shen, Yuchen Zhang and Renwei Xia
Mathematics 2025, 13(11), 1880; https://doi.org/10.3390/math13111880 - 4 Jun 2025
Cited by 1 | Viewed by 878
Abstract
To address modality heterogeneity and accelerate large-scale retrieval, cross-modal hashing strategies generate compact binary codes that enhance computational efficiency. Existing approaches often struggle with suboptimal feature learning due to fixed activation functions and limited cross-modal interaction. We propose Unsupervised Contrastive Graph Kolmogorov–Arnold Networks (GraphKAN) Enhanced Cross-modal Retrieval Hashing (UCGKANH), integrating GraphKAN with contrastive learning and hypergraph-based enhancement. GraphKAN enables more flexible cross-modal representation through enhanced nonlinear expression of features. We introduce contrastive learning that captures modality-invariant structures through sample pairs. To preserve high-order semantic relations, we construct a hypergraph-based information propagation mechanism, refining hash codes by enforcing global consistency. The efficacy of our UCGKANH approach is validated by thorough tests on the MIR-FLICKR, NUS-WIDE, and MS COCO datasets, which show significant gains in retrieval accuracy coupled with strong computational efficiency.
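
A minimal sketch of the general cross-modal hashing recipe this line of work builds on: project image and text features into a shared space, relax the binary codes with tanh, pull paired samples together with an InfoNCE-style contrastive term, and binarize with sign at retrieval time. The projection heads and 64-bit code length are illustrative assumptions, and the hypergraph refinement is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

bits = 64
img_head = nn.Linear(2048, bits)
txt_head = nn.Linear(768, bits)

def contrastive_hash_loss(img_feat, txt_feat, temperature=0.07):
    zi = F.normalize(torch.tanh(img_head(img_feat)), dim=-1)  # relaxed image codes
    zt = F.normalize(torch.tanh(txt_head(txt_feat)), dim=-1)  # relaxed text codes
    logits = zi @ zt.t() / temperature
    labels = torch.arange(zi.size(0))        # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

img_feat = torch.randn(16, 2048)
txt_feat = torch.randn(16, 768)
print(contrastive_hash_loss(img_feat, txt_feat).item())
# at retrieval time the codes are binarized, e.g. torch.sign(img_head(img_feat))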

16 pages, 2542 KB  
Article
The Eyes: A Source of Information for Detecting Deepfakes
by Elisabeth Tchaptchet, Elie Fute Tagne, Jaime Acosta, Danda B. Rawat and Charles Kamhoua
Information 2025, 16(5), 371; https://doi.org/10.3390/info16050371 - 30 Apr 2025
Viewed by 1299
Abstract
Currently, the phenomenon of deepfakes is becoming increasingly significant, as they enable the creation of extremely realistic images capable of deceiving anyone thanks to deep learning tools based on generative adversarial networks (GANs). These images are used as profile pictures on social media with the intent to sow discord and perpetrate scams on a global scale. In this study, we demonstrate that these images can be identified through various imperfections present in the synthesized eyes, such as the irregular shape of the pupil and the difference between the corneal reflections of the two eyes. These defects result from the absence of physical and physiological constraints in most GAN models. We develop a two-level architecture capable of detecting these fake images. This approach begins with an automatic segmentation method for the pupils to verify their shape, as real pupils naturally have a regular, typically round shape. Next, for all images where the pupils are not regular, the entire image is analyzed to verify the reflections. This step involves passing the facial image through an architecture that extracts and compares the specular reflections of the corneas of the two eyes, assuming that the eyes of a real person observing a light source should reflect the same thing. Our experiments with a large dataset of real images from the Flickr-Faces-HQ (FFHQ) and CelebA datasets, as well as fake images from StyleGAN2 and ProGAN, show the effectiveness of our method. On the FFHQ dataset and images generated by StyleGAN2, our algorithm achieved a remarkable detection accuracy of 0.968 and a sensitivity of 0.911, with a specificity of 0.907 and a precision of 0.90. On the CelebA dataset and images generated by ProGAN, it achieved a detection accuracy of 0.870 and a sensitivity of 0.901, with a specificity of 0.807 and a precision of 0.88. Our approach maintains good stability of physiological properties during deep learning, making it as robust as some single-class deepfake detection methods. The results of the tests on the selected datasets demonstrate higher accuracy compared to other methods.
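
The pupil-regularity cue can be illustrated with a simple circularity measure on a segmented pupil contour; the threshold-based segmentation below is a crude stand-in for the paper's automatic pupil segmentation, and the 0.8 cutoff is an arbitrary example.

import cv2
import numpy as np

def pupil_circularity(eye_gray):
    # crude pupil mask: darkest region of the eye crop
    _, mask = cv2.threshold(eye_gray, 50, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    c = max(contours, key=cv2.contourArea)   # assume the largest blob is the pupil
    area = cv2.contourArea(c)
    perim = cv2.arcLength(c, True)
    if perim == 0:
        return 0.0
    return 4 * np.pi * area / (perim ** 2)   # 1.0 for a perfect circle

eye = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder eye crop
score = pupil_circularity(eye)
print("circularity:", score, "-> suspicious" if score < 0.8 else "-> regular")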

38 pages, 2033 KB  
Article
DCAT: A Novel Transformer-Based Approach for Dynamic Context-Aware Image Captioning in the Tamil Language
by Jothi Prakash Venugopal, Arul Antran Vijay Subramanian, Manikandan Murugan, Gopikrishnan Sundaram, Marco Rivera and Patrick Wheeler
Appl. Sci. 2025, 15(9), 4909; https://doi.org/10.3390/app15094909 - 28 Apr 2025
Cited by 1 | Viewed by 784
Abstract
The task of image captioning in low-resource languages like Tamil is fraught with challenges due to limited linguistic resources and complex semantic structures. This paper addresses the problem of generating contextually and linguistically coherent captions in Tamil. We introduce the Dynamic Context-Aware Transformer (DCAT), a novel approach that synergizes the Vision Transformer (ViT) with the Generative Pre-trained Transformer (GPT-3), reinforced by a unique Context Embedding Layer. The DCAT model, tailored for Tamil, innovatively employs dynamic attention mechanisms during its Initialization, Training, and Inference phases to focus on pertinent visual and textual elements. Our method distinctively leverages the nuances of Tamil syntax and semantics, a novelty in the realm of low-resource language image captioning. Comparative evaluations against established models on datasets like Flickr8k, Flickr30k, and MSCOCO reveal DCAT’s superiority, with a notable 12% increase in BLEU score (0.7425) and a 15% enhancement in METEOR score (0.4391) over leading models. Despite its computational demands, DCAT sets a new benchmark for image captioning in Tamil, demonstrating potential applicability to other similar languages.

33 pages, 36897 KB  
Article
Making Images Speak: Human-Inspired Image Description Generation
by Chifaa Sebbane, Ikram Belhajem and Mohammed Rziza
Information 2025, 16(5), 356; https://doi.org/10.3390/info16050356 - 28 Apr 2025
Cited by 2 | Viewed by 680
Abstract
Despite significant advances in deep learning-based image captioning, many state-of-the-art models still struggle to balance visual grounding (i.e., accurate object and scene descriptions) with linguistic coherence (i.e., grammatical fluency and appropriate use of non-visual tokens such as articles and prepositions). To address these limitations, we propose a hybrid image captioning framework that integrates handcrafted and deep visual features. Specifically, we combine local descriptors—Scale-Invariant Feature Transform (SIFT) and Bag of Features (BoF)—with high-level semantic features extracted using ResNet50. This dual representation captures both fine-grained spatial details and contextual semantics. The decoder employs Bahdanau attention refined with an Attention-on-Attention (AoA) mechanism to optimize visual-textual alignment, while GloVe embeddings and a GRU-based sequence model ensure fluent language generation. The proposed system is trained on 200,000 image-caption pairs from the MS COCO train2014 dataset and evaluated on 50,000 held-out MS COCO pairs plus the Flickr8K benchmark. Our model achieves a CIDEr score of 128.3 and a SPICE score of 29.24, reflecting clear improvements over baselines in both semantic precision—particularly for spatial relationships—and grammatical fluency. These results validate that combining classical computer vision techniques with modern attention mechanisms yields more interpretable and linguistically precise captions, addressing key limitations in neural caption generation.
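
The hybrid representation can be sketched as a Bag-of-Features histogram over SIFT descriptors concatenated with pooled ResNet50 features; the vocabulary size, preprocessing, and placeholder inputs below are illustrative assumptions rather than the paper's training setup.

import cv2
import numpy as np
import torch
from sklearn.cluster import KMeans
from torchvision import models, transforms

def bof_histogram(gray_img, vocab):
    # Bag of Features: quantize SIFT descriptors against a learned visual vocabulary
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_img, None)
    if desc is None:
        return np.zeros(vocab.n_clusters, dtype=np.float32)
    words = vocab.predict(desc.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)

def resnet_features(rgb_img):
    # high-level semantics from ResNet50 (pass weights=ResNet50_Weights.DEFAULT for pretrained features)
    net = models.resnet50(weights=None)
    net.fc = torch.nn.Identity()             # keep the 2048-d pooled features
    net.eval()
    tensor = transforms.ToTensor()(cv2.resize(rgb_img, (224, 224))).unsqueeze(0)
    with torch.no_grad():
        return net(tensor).squeeze(0).numpy()

rng = np.random.default_rng(0)
# the visual vocabulary would normally be fit on SIFT descriptors from the training set
vocab = KMeans(n_clusters=32, random_state=0).fit(rng.random((500, 128), dtype=np.float32))
gray = (rng.random((256, 256)) * 255).astype(np.uint8)        # placeholder image
feat = np.concatenate([bof_histogram(gray, vocab), resnet_features(np.stack([gray] * 3, axis=-1))])
print(feat.shape)                            # (32 + 2048,)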

14 pages, 668 KB  
Article
Fine-Grained Local and Global Semantic Fusion for Multimodal Image–Text Retrieval
by Shenao Peng, Zhongmei Wang, Jianhua Liu, Changfan Zhang and Lin Jia
Big Data Cogn. Comput. 2025, 9(3), 53; https://doi.org/10.3390/bdcc9030053 - 25 Feb 2025
Viewed by 1138
Abstract
An image–text retrieval method that integrates intramodal fine-grained local semantic information and intermodal global semantic information is proposed to address the weak fine-grained discrimination of semantic features between the image and text modalities in cross-modal retrieval tasks. First, the original features of images and texts are extracted, and a graph attention network is employed for region relationship reasoning to obtain relation-enhanced local features. Then, an attention mechanism is used for different semantically interacting samples within the same modality, enabling comprehensive intramodal relationship learning and producing semantically enhanced image and text embeddings. Finally, a triplet loss function, enhanced with an angular constraint, is used to train the entire model. Extensive comparative experiments conducted on the Flickr30K and MS-COCO benchmark datasets verified the effectiveness and superiority of the proposed method, which outperformed current methods by a relative 6.4% for image retrieval and 1.3% for caption retrieval on MS-COCO (Recall@1 on the 1K test set).
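
For orientation, here is a hedged sketch of a triplet loss with an added angular-style term on normalized embeddings; it shows the general shape of such an objective and does not reproduce the paper's exact angular constraint.

import torch
import torch.nn.functional as F

def triplet_angular_loss(anchor, positive, negative, margin=0.2, alpha=0.5):
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    # standard triplet term on cosine similarity
    triplet = F.relu(margin + (a * n).sum(-1) - (a * p).sum(-1))
    # angular-style term: penalize negatives close to the anchor-positive direction
    ap_mid = F.normalize(a + p, dim=-1)
    angular = F.relu((ap_mid * n).sum(-1) - (a * p).sum(-1) + margin)
    return (triplet + alpha * angular).mean()

anchor, pos, neg = (torch.randn(32, 512) for _ in range(3))
print(triplet_angular_loss(anchor, pos, neg).item())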

20 pages, 6296 KB  
Article
Privacy-Preserving Image Captioning with Partial Encryption and Deep Learning
by Antoinette Deborah Martin and Inkyu Moon
Mathematics 2025, 13(4), 554; https://doi.org/10.3390/math13040554 - 7 Feb 2025
Viewed by 1042
Abstract
Although image captioning has gained remarkable interest, privacy concerns are raised because it relies heavily on images, and there is a risk of exposing sensitive information in the image data. In this study, a privacy-preserving image captioning framework that leverages partial encryption using Double Random Phase Encoding (DRPE) and deep learning is proposed to address privacy concerns. Unlike previous methods that rely on full encryption or masking, our approach involves encrypting sensitive regions of the image while preserving the image’s overall structure and context. Partial encryption ensures that the sensitive regions’ information is preserved instead of lost by masking it with a black or gray box. It also allows the model to process both encrypted and unencrypted regions, which could be problematic for models with fully encrypted images. Our framework follows an encoder–decoder architecture where a dual-stream encoder based on ResNet50 extracts features from the partially encrypted images, and a transformer architecture is employed in the decoder to generate captions from these features. We utilize the Flickr8k dataset and encrypt the sensitive regions using DRPE. The partially encrypted images are then fed to the dual-stream encoder, which processes the real and imaginary parts of the encrypted regions separately for effective feature extraction. Our model is evaluated using standard metrics and compared with models trained on the original images. Our results demonstrate that our method achieves comparable performance to models trained on original and masked images and outperforms models trained on fully encrypted data, thus verifying the feasibility of partial encryption in privacy-preserving image captioning.
(This article belongs to the Section E2: Control Theory and Mechanics)
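
The partial-encryption step can be illustrated with the classical Double Random Phase Encoding recipe applied to one image region using numpy FFTs; the region coordinates and key handling below are illustrative, not the paper's pipeline.

import numpy as np

def drpe_encrypt(region, rng):
    phase1 = np.exp(2j * np.pi * rng.random(region.shape))   # input-plane key
    phase2 = np.exp(2j * np.pi * rng.random(region.shape))   # Fourier-plane key
    return np.fft.ifft2(np.fft.fft2(region * phase1) * phase2)

rng = np.random.default_rng(42)
img = rng.random((256, 256))                 # grayscale image in [0, 1]
y0, y1, x0, x1 = 64, 128, 64, 128            # sensitive region to encrypt
encrypted = img.astype(complex).copy()
encrypted[y0:y1, x0:x1] = drpe_encrypt(img[y0:y1, x0:x1], rng)
# the dual-stream encoder later consumes real and imaginary parts separately
real_part, imag_part = encrypted.real, encrypted.imag
print(real_part.shape, imag_part.shape)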

23 pages, 4874 KB  
Article
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
by Shakhnoza Muksimova, Sabina Umirzakova, Murodjon Sultanov and Young Im Cho
Sensors 2025, 25(3), 707; https://doi.org/10.3390/s25030707 - 24 Jan 2025
Cited by 7 | Viewed by 2475
Abstract
Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, the current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on three benchmark datasets, YouCook2, Flickr30k, and ActivityNet Captions, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks.
(This article belongs to the Section Sensing and Imaging)
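
A much-simplified sketch of ODE-style temporal localization: a learned dynamics function is integrated over frame features (fixed-step Euler here, in place of an adaptive ODE solver) and a small head scores event boundaries per frame. The dimensions, the solver, and the way frame features are injected are assumptions for illustration, not the CMSTR-ODE design.

import torch
import torch.nn as nn

class ODEBoundaryLocalizer(nn.Module):
    def __init__(self, dim=256, steps=4):
        super().__init__()
        self.steps = steps
        self.dynamics = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.boundary_head = nn.Linear(dim, 1)

    def forward(self, frames):               # frames: (B, T, D) video features
        h = torch.zeros_like(frames[:, 0])
        scores = []
        dt = 1.0 / self.steps
        for t in range(frames.size(1)):
            h = h + frames[:, t]             # inject the current frame feature
            for _ in range(self.steps):      # Euler steps of dh/dt = f(h)
                h = h + dt * self.dynamics(h)
            scores.append(self.boundary_head(h))
        return torch.sigmoid(torch.cat(scores, dim=1))   # boundary probability per frame

video = torch.randn(2, 30, 256)              # 30 frames of pooled features
print(ODEBoundaryLocalizer()(video).shape)   # torch.Size([2, 30])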

25 pages, 2229 KB  
Article
MIRA-CAP: Memory-Integrated Retrieval-Augmented Captioning for State-of-the-Art Image and Video Captioning
by Sabina Umirzakova, Shakhnoza Muksimova, Sevara Mardieva, Murodjon Sultanov Baxtiyarovich and Young-Im Cho
Sensors 2024, 24(24), 8013; https://doi.org/10.3390/s24248013 - 15 Dec 2024
Cited by 18 | Viewed by 2072
Abstract
Generating accurate and contextually rich captions for images and videos is essential for various applications, from assistive technology to content recommendation. However, challenges such as maintaining temporal coherence in videos, reducing noise in large-scale datasets, and enabling real-time captioning remain significant. We introduce MIRA-CAP (Memory-Integrated Retrieval-Augmented Captioning), a novel framework designed to address these issues through three core innovations: a cross-modal memory bank, adaptive dataset pruning, and a streaming decoder. The cross-modal memory bank retrieves relevant context from prior frames, enhancing temporal consistency and narrative flow. The adaptive pruning mechanism filters noisy data, which improves alignment and generalization. The streaming decoder allows for real-time captioning by generating captions incrementally, without requiring access to the full video sequence. Evaluated across standard datasets like MS COCO, YouCook2, ActivityNet, and Flickr30k, MIRA-CAP achieves state-of-the-art results, with high scores on CIDEr, SPICE, and Polos metrics, underscoring its alignment with human judgment and its effectiveness in handling complex visual and temporal structures. This work demonstrates that MIRA-CAP offers a robust, scalable solution for both static and dynamic captioning tasks, advancing the capabilities of vision–language models in real-world applications.
(This article belongs to the Section Sensing and Imaging)
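
The cross-modal memory-bank idea can be sketched as a buffer of recent embeddings that each new frame queries by cosine similarity, mixing in the top-k retrieved entries; the capacity, k, and mixing rule below are assumptions for illustration, not MIRA-CAP's design.

import torch
import torch.nn.functional as F

class CrossModalMemory:
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.bank = []                        # list of (D,) embeddings

    def write(self, emb):
        self.bank.append(emb.detach())
        self.bank = self.bank[-self.capacity:]   # drop the oldest entries

    def read(self, query, k=4):
        if not self.bank:
            return query
        mem = torch.stack(self.bank)          # (N, D)
        sim = F.normalize(query, dim=-1) @ F.normalize(mem, dim=-1).t()
        topk = sim.topk(min(k, mem.size(0)), dim=-1)
        retrieved = mem[topk.indices].mean(dim=-2)        # average of top-k entries
        return query + retrieved              # context-enriched feature

memory = CrossModalMemory()
for t in range(10):                           # stream of frame embeddings
    frame = torch.randn(1, 256)
    enriched = memory.read(frame)
    memory.write(frame.squeeze(0))
print(enriched.shape)                         # torch.Size([1, 256])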

27 pages, 4935 KB  
Article
Diverse Dataset for Eyeglasses Detection: Extending the Flickr-Faces-HQ (FFHQ) Dataset
by Dalius Matuzevičius
Sensors 2024, 24(23), 7697; https://doi.org/10.3390/s24237697 - 1 Dec 2024
Cited by 2 | Viewed by 3023
Abstract
Facial analysis is an important area of research in computer vision and machine learning, with applications spanning security, healthcare, and user interaction systems. The data-centric AI approach emphasizes the importance of high-quality, diverse, and well-annotated datasets in driving advancements in this field. However, current facial datasets, such as Flickr-Faces-HQ (FFHQ), lack detailed annotations for detecting facial accessories, particularly eyeglasses. This work addresses this limitation by extending the FFHQ dataset with precise bounding box annotations for eyeglasses detection, enhancing its utility for data-centric AI applications. The extended dataset comprises 70,000 images, including over 16,000 images containing eyewear, and it exceeds the CelebAMask-HQ dataset in size and diversity. A semi-automated protocol was employed to efficiently generate accurate bounding box annotations, minimizing the demand for extensive manual labeling. This enriched dataset serves as a valuable resource for training and benchmarking eyewear detection models. Additionally, baseline benchmark results for eyeglasses detection are presented using deep learning methods, including YOLOv8 and MobileNetV3. The evaluation, conducted through cross-dataset validation, demonstrated the robustness of models trained on the extended FFHQ dataset, which outperformed models trained on the existing CelebAMask-HQ alternative. The extended dataset, which has been made publicly available, is expected to support future research and development in eyewear detection, contributing to advancements in facial analysis and related fields.
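
When benchmarking detectors against bounding-box annotations like these, evaluation typically reduces to an intersection-over-union check per predicted box; a small helper follows, assuming (x1, y1, x2, y2) pixel coordinates, which is an assumption about the released annotation format rather than a documented fact.

def iou(box_a, box_b):
    # intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) form
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

pred, gt = (120, 80, 260, 140), (118, 85, 250, 150)
print(round(iou(pred, gt), 3))               # counted as a match if IoU >= 0.5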

21 pages, 35716 KB  
Article
Exploring Visitor Patterns in Island Natural Parks: The Relationship Between Photo Locations, Trails, and Land Use
by Eva Calicis, Jorge Costa, Augusto Pérez-Alberti and Alberto Gomes
Land 2024, 13(12), 2003; https://doi.org/10.3390/land13122003 - 25 Nov 2024
Viewed by 1593
Abstract
Overcrowding in national parks and protected areas can cause irreversible damage to the environment, compromising the quality of soil, water, wildlife, and vegetation. Thus, it is critical for park managers to have detailed information on visitor activities and spatial dynamics in order to prioritise actions capable of mitigating undesirable impacts in the most frequently visited areas. In this article, we use georeferenced trails and photographs from the Wikiloc and Flickr web platforms to determine the spatial visitation patterns in the Atlantic Islands of Galicia National Park (AINP) from 2008 to 2023. Maps showing trail usage intensity and the distribution of photographs according to land use allowed us to identify the most frequented land uses by visitors and the areas of highest tourist pressure within the AINP. The results show that distribution patterns vary between platforms. Shrubland (37%) and marine cliffs (27%) were the most photographed land uses by visitors, while artificial areas (14%) were the most frequented by Wikiloc users. Cíes island emerges as the most popular tourist destination, as evidenced by the greater number of trails and photographs compared to Ons, Sálvora, and Cortegada. This study shows how social media data, specifically trails and geotagged photographs from Wikiloc and Flickr, can support and complement the monitoring of visitor use and impact in protected areas.
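
The photo-to-land-use assignment behind this kind of analysis is essentially a spatial join; the sketch below counts geotagged points per land-use polygon with geopandas, using tiny made-up polygons and points in place of the AINP land-use map and the Flickr/Wikiloc data.

import geopandas as gpd
from shapely.geometry import Point, box

land_use = gpd.GeoDataFrame(
    {"land_use": ["shrubland", "marine cliff"],
     "geometry": [box(0, 0, 10, 10), box(10, 0, 20, 10)]},
    crs="EPSG:4326",
)
photos = gpd.GeoDataFrame(
    {"geometry": [Point(2, 3), Point(4, 8), Point(15, 5)]},
    crs="EPSG:4326",
)
joined = gpd.sjoin(photos, land_use, how="inner", predicate="within")
counts = joined.groupby("land_use").size()
print(counts / counts.sum() * 100)           # share of photos per land-use class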
