Search Results (108)

Search Parameters:
Keywords = Flickr

26 pages, 5665 KB  
Article
SwinT-SRGAN: Swin Transformer Enhanced Generative Adversarial Network for Image Super-Resolution
by Qingyu Liu, Lei Chen, Yeguo Sun and Lei Liu
Electronics 2025, 14(17), 3511; https://doi.org/10.3390/electronics14173511 - 2 Sep 2025
Viewed by 450
Abstract
To resolve the conflict between global structure modeling and local detail preservation in image super-resolution, we propose SwinT-SRGAN, a novel framework integrating the Swin Transformer with a GAN. Key innovations include: (1) a dual-path generator in which the Transformer captures long-range dependencies via window attention while a CNN branch extracts high-frequency textures; (2) an end-to-end Detail Recovery Block (DRB) that suppresses artifacts through dual-path attention; (3) a triple-branch discriminator enabling hierarchical adversarial supervision; and (4) a dynamic loss scheduler that adaptively balances six loss components (pixel, perceptual, and high-frequency constraints). Experiments on CelebA-HQ and Flickr2K show strong performance (maximum gains of 0.71 dB PSNR and 0.83% SSIM, and a 4.67 LPIPS reduction vs. Swin-IR), and ablation studies validate the critical role of the DRB. This work offers a robust solution for high-frequency-sensitive applications.
(This article belongs to the Section Artificial Intelligence)
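
To make the dual-path idea concrete, here is a minimal PyTorch sketch of a block that runs window self-attention for global structure alongside a CNN branch for local texture and fuses the two; the module names, sizes, and fusion rule are illustrative assumptions, not the SwinT-SRGAN implementation.

import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    def __init__(self, channels=64, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.cnn = nn.Sequential(            # local high-frequency path
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                    # x: (B, C, H, W), H and W divisible by window
        b, c, h, w = x.shape
        ws = self.window
        # partition into non-overlapping windows and run self-attention per window
        t = x.view(b, c, h // ws, ws, w // ws, ws).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, ws * ws, c)
        t, _ = self.attn(self.norm(t), self.norm(t), self.norm(t))
        t = t.reshape(b, h // ws, w // ws, ws, ws, c).permute(0, 5, 1, 3, 2, 4)
        attn_out = t.reshape(b, c, h, w)
        return self.fuse(torch.cat([attn_out, self.cnn(x)], dim=1)) + x

x = torch.randn(1, 64, 64, 64)
print(DualPathBlock()(x).shape)              # torch.Size([1, 64, 64, 64])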

20 pages, 1818 KB  
Article
Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation
by Liang Wang, Meiqing Jiao, Zhihai Li, Mengxue Zhang, Haiyan Wei, Yuru Ma, Honghui An, Jiaqi Lin and Jun Wang
Electronics 2025, 14(16), 3325; https://doi.org/10.3390/electronics14163325 - 21 Aug 2025
Viewed by 833
Abstract
To address the semantic mismatch between limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as the underutilized external commonsense knowledge, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, BERT text encoder, and GPT-2 text decoder. It incorporates two core mechanisms: a multi-step cross-attention mechanism that iteratively aligns image and text features across multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. Moreover, the model employs Faster R-CNN to extract region-based object features. These features are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k. Ablations confirm both modules enhance caption quality.
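
A rough sketch of the multi-step alignment idea described above: image region features and text token features attend to each other repeatedly over several rounds. The layer sizes, number of steps, and residual/normalization choices are assumptions for illustration, not the authors' configuration.

import torch
import torch.nn as nn

class MultiStepCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8, steps=3):
        super().__init__()
        self.steps = steps
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, img, txt):             # img: (B, R, D) regions, txt: (B, L, D) tokens
        for _ in range(self.steps):
            # image features attend to text, then text attends to the updated image
            upd_i, _ = self.img2txt(img, txt, txt)
            img = self.norm_i(img + upd_i)
            upd_t, _ = self.txt2img(txt, img, img)
            txt = self.norm_t(txt + upd_t)
        return img, txt

img = torch.randn(2, 36, 512)                # e.g., 36 Faster R-CNN region features
txt = torch.randn(2, 20, 512)                # e.g., 20 BERT token embeddings
i2, t2 = MultiStepCrossAttention()(img, txt)
print(i2.shape, t2.shape)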

32 pages, 3272 KB  
Article
Bridging Modalities: An Analysis of Cross-Modal Wasserstein Adversarial Translation Networks and Their Theoretical Foundations
by Joseph Tafataona Mtetwa, Kingsley A. Ogudo and Sameerchand Pudaruth
Mathematics 2025, 13(16), 2545; https://doi.org/10.3390/math13162545 - 8 Aug 2025
Viewed by 818
Abstract
What if machines could seamlessly translate between the visual richness of images and the semantic depth of language with mathematical precision? This paper presents a theoretical and empirical analysis of five novel cross-modal Wasserstein adversarial translation networks that challenge conventional approaches to cross-modal understanding. Unlike traditional generative models that rely on stochastic noise, our frameworks learn deterministic translation mappings that preserve semantic fidelity across modalities through rigorous mathematical foundations. We systematically examine: (1) cross-modality consistent dual-critical networks; (2) Wasserstein cycle consistency; (3) multi-scale Wasserstein distance; (4) regularization through modality invariance; and (5) Wasserstein information bottleneck. Each approach employs adversarial training with Wasserstein distances to establish theoretically grounded translation functions between heterogeneous data representations. Through mathematical analysis—including information-theoretic frameworks, differential geometry, and convergence guarantees—we establish the theoretical foundations underlying cross-modal translation. Our empirical evaluation across MS-COCO, Flickr30K, and Conceptual Captions datasets, including comparisons with transformer-based baselines, reveals that our proposed multi-scale Wasserstein cycle consistent (MS-WCC) framework achieves remarkable performance gains—12.1% average improvement in FID scores and 8.0% enhancement in cross-modal translation accuracy—compared to state-of-the-art methods, while maintaining superior computational efficiency. These results demonstrate that principled mathematical approaches to cross-modal translation can significantly advance machine understanding of multimodal data, opening new possibilities for applications requiring seamless communication between visual and textual domains.
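
The kind of objective analyzed here can be sketched in a few lines: adversarial (Wasserstein-style) terms for two deterministic translators plus a cycle-consistency term. The toy linear translators, critics, and loss weights below are placeholders, not the paper's networks.

import torch
import torch.nn as nn

dim_v = dim_t = 512
f_v2t = nn.Linear(dim_v, dim_t)              # image-embedding -> text-embedding translator
g_t2v = nn.Linear(dim_t, dim_v)              # text-embedding -> image-embedding translator
critic_t = nn.Sequential(nn.Linear(dim_t, 256), nn.ReLU(), nn.Linear(256, 1))
critic_v = nn.Sequential(nn.Linear(dim_v, 256), nn.ReLU(), nn.Linear(256, 1))

def translator_loss(v, t, lam_cyc=10.0):
    # Wasserstein adversarial terms: translators try to raise the critics' scores
    adv = -critic_t(f_v2t(v)).mean() - critic_v(g_t2v(t)).mean()
    # cycle consistency: v -> f -> g -> v and t -> g -> f -> t
    cyc = (g_t2v(f_v2t(v)) - v).abs().mean() + (f_v2t(g_t2v(t)) - t).abs().mean()
    return adv + lam_cyc * cyc

v = torch.randn(8, dim_v)                    # batch of image embeddings
t = torch.randn(8, dim_t)                    # batch of text embeddings
print(translator_loss(v, t).item())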

13 pages, 736 KB  
Article
Birding via Facebook—Methodological Considerations When Crowdsourcing Observations of Bird Behavior via Social Media
by Dirk H. R. Spennemann
Birds 2025, 6(3), 39; https://doi.org/10.3390/birds6030039 - 28 Jul 2025
Viewed by 604
Abstract
This paper outlines a methodology to compile geo-referenced observational data of Australian birds acting as pollinators of Strelitzia sp. (Bird of Paradise) flowers and dispersers of their seeds. Given the absence of systematic published records, a crowdsourcing approach was employed, combining data from natural history platforms (e.g., iNaturalist, eBird), image hosting websites (e.g., Flickr) and, in particular, social media. Facebook emerged as the most productive channel, with 61.4% of the 301 usable observations sourced from 43 ornithology-related groups. The strategy included direct solicitation of images and metadata via group posts and follow-up communication. The holistic, snowballing search strategy yielded a unique, behavior-focused dataset suitable for analysis. While the process exposed limitations due to user self-censorship on image quality and completeness, the approach demonstrates the viability of crowdsourced behavioral ecology data and contributes a replicable methodology for similar studies in under-documented ecological contexts.

20 pages, 4538 KB  
Article
Image Captioning Method Based on CLIP-Combined Local Feature Enhancement and Multi-Scale Semantic Guidance
by Liang Wang, Mengxue Zhang, Meiqing Jiao, Enru Chen, Yuru Ma and Jun Wang
Electronics 2025, 14(14), 2809; https://doi.org/10.3390/electronics14142809 - 12 Jul 2025
Cited by 1 | Viewed by 1126
Abstract
To address the issues of modeling the relationships between multiple local region objects in images and enhancing local region features, as well as mapping global image semantics to global text semantics and local region image semantics to local text semantics, a novel image captioning method based on CLIP, integrating local feature enhancement and multi-scale semantic guidance, is proposed. The model employs ViT as the global visual encoder, Faster R-CNN as the local region visual encoder, BERT as the text encoder, and GPT-2 as the text decoder. By constructing a KNN graph of local image features, the model captures the relationships between local region objects and then enhances the local region features using a graph attention network. Additionally, a multi-scale semantic guidance method is utilized to calculate the global and local semantic weights, thereby improving the accuracy of the scene descriptions and attribute details generated by the GPT-2 decoder. Evaluated on the MSCOCO and Flickr30k datasets, the model achieves a significant improvement in the core metric CIDEr over established strong baselines, with 4.7% higher CIDEr than OFA on MSCOCO and 16.6% higher CIDEr than Unified VLP on Flickr30k. Ablation studies and qualitative analysis validate the effectiveness of each proposed module.
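
As a concrete illustration of the local-feature step, the sketch below builds a k-NN graph over region features and aggregates each region's neighbors with attention weights; the value of k, the feature dimension, and the aggregation rule are assumptions, not the paper's graph attention network.

import torch
import torch.nn.functional as F

def knn_graph_attention(regions, k=5):
    # regions: (R, D) local region features for one image
    normed = F.normalize(regions, dim=-1)
    sim = normed @ normed.t()                # cosine similarity between regions
    sim.fill_diagonal_(float('-inf'))        # exclude self-loops
    topk_val, topk_idx = sim.topk(k, dim=-1) # k nearest neighbors per region
    attn = topk_val.softmax(dim=-1)          # attention weights over neighbors
    neighbors = regions[topk_idx]            # (R, k, D)
    return regions + (attn.unsqueeze(-1) * neighbors).sum(dim=1)

regions = torch.randn(36, 2048)              # 36 region features from Faster R-CNN
print(knn_graph_attention(regions).shape)    # torch.Size([36, 2048])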

21 pages, 14585 KB  
Article
Unsupervised Contrastive Graph Kolmogorov–Arnold Networks Enhanced Cross-Modal Retrieval Hashing
by Hongyu Lin, Shaofeng Shen, Yuchen Zhang and Renwei Xia
Mathematics 2025, 13(11), 1880; https://doi.org/10.3390/math13111880 - 4 Jun 2025
Cited by 1 | Viewed by 878
Abstract
To address modality heterogeneity and accelerate large-scale retrieval, cross-modal hashing strategies generate compact binary codes that enhance computational efficiency. Existing approaches often struggle with suboptimal feature learning due to fixed activation functions and limited cross-modal interaction. We propose Unsupervised Contrastive Graph Kolmogorov–Arnold Networks (GraphKAN) Enhanced Cross-modal Retrieval Hashing (UCGKANH), integrating GraphKAN with contrastive learning and hypergraph-based enhancement. GraphKAN enables more flexible cross-modal representation through enhanced nonlinear expression of features. We introduce contrastive learning that captures modality-invariant structures through sample pairs. To preserve high-order semantic relations, we construct a hypergraph-based information propagation mechanism, refining hash codes by enforcing global consistency. The efficacy of our UCGKANH approach is validated by thorough tests on the MIR-FLICKR, NUS-WIDE, and MS COCO datasets, which show significant gains in retrieval accuracy coupled with strong computational efficiency.
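
A minimal sketch of the general cross-modal hashing recipe this line of work builds on: project image and text features into a shared space, relax the binary codes with tanh, pull paired samples together with an InfoNCE-style contrastive term, and binarize with sign at retrieval time. The projection heads and 64-bit code length are illustrative assumptions, and the hypergraph refinement is not shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

bits = 64
img_head = nn.Linear(2048, bits)
txt_head = nn.Linear(768, bits)

def contrastive_hash_loss(img_feat, txt_feat, temperature=0.07):
    zi = F.normalize(torch.tanh(img_head(img_feat)), dim=-1)  # relaxed image codes
    zt = F.normalize(torch.tanh(txt_head(txt_feat)), dim=-1)  # relaxed text codes
    logits = zi @ zt.t() / temperature
    labels = torch.arange(zi.size(0))        # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

img_feat = torch.randn(16, 2048)
txt_feat = torch.randn(16, 768)
print(contrastive_hash_loss(img_feat, txt_feat).item())
# at retrieval time the codes are binarized, e.g. torch.sign(img_head(img_feat))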

16 pages, 2542 KB  
Article
The Eyes: A Source of Information for Detecting Deepfakes
by Elisabeth Tchaptchet, Elie Fute Tagne, Jaime Acosta, Danda B. Rawat and Charles Kamhoua
Information 2025, 16(5), 371; https://doi.org/10.3390/info16050371 - 30 Apr 2025
Viewed by 1299
Abstract
Currently, the phenomenon of deepfakes is becoming increasingly significant, as they enable the creation of extremely realistic images capable of deceiving anyone thanks to deep learning tools based on generative adversarial networks (GANs). These images are used as profile pictures on social media with the intent to sow discord and perpetrate scams on a global scale. In this study, we demonstrate that these images can be identified through various imperfections present in the synthesized eyes, such as the irregular shape of the pupil and the difference between the corneal reflections of the two eyes. These defects result from the absence of physical and physiological constraints in most GAN models. We develop a two-level architecture capable of detecting these fake images. This approach begins with an automatic segmentation method for the pupils to verify their shape, as real pupils naturally have a regular, typically round shape. Next, for all images where the pupils are not regular, the entire image is analyzed to verify the reflections. This step involves passing the facial image through an architecture that extracts and compares the specular reflections of the corneas of the two eyes, assuming that the eyes of a real person observing a light source should reflect the same thing. Our experiments with a large dataset of real images from the Flickr-Faces-HQ (FFHQ) and CelebA datasets, as well as fake images from StyleGAN2 and ProGAN, show the effectiveness of our method. On the FFHQ dataset and images generated by StyleGAN2, our algorithm achieved a remarkable detection accuracy of 0.968 and a sensitivity of 0.911, with a specificity of 0.907 and a precision of 0.90. On the CelebA dataset and images generated by ProGAN, it achieved a detection accuracy of 0.870 and a sensitivity of 0.901, with a specificity of 0.807 and a precision of 0.88. Our approach maintains good stability of physiological properties during deep learning, making it as robust as some single-class deepfake detection methods. The results of the tests on the selected datasets demonstrate higher accuracy compared to other methods.
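
The pupil-regularity cue can be illustrated with a simple circularity measure on a segmented pupil contour; the threshold-based segmentation below is a crude stand-in for the paper's automatic pupil segmentation, and the 0.8 cutoff is an arbitrary example.

import cv2
import numpy as np

def pupil_circularity(eye_gray):
    # crude pupil mask: darkest region of the eye crop
    _, mask = cv2.threshold(eye_gray, 50, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    c = max(contours, key=cv2.contourArea)   # assume the largest blob is the pupil
    area = cv2.contourArea(c)
    perim = cv2.arcLength(c, True)
    if perim == 0:
        return 0.0
    return 4 * np.pi * area / (perim ** 2)   # 1.0 for a perfect circle

eye = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder eye crop
score = pupil_circularity(eye)
print("circularity:", score, "-> suspicious" if score < 0.8 else "-> regular")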

38 pages, 2033 KB  
Article
DCAT: A Novel Transformer-Based Approach for Dynamic Context-Aware Image Captioning in the Tamil Language
by Jothi Prakash Venugopal, Arul Antran Vijay Subramanian, Manikandan Murugan, Gopikrishnan Sundaram, Marco Rivera and Patrick Wheeler
Appl. Sci. 2025, 15(9), 4909; https://doi.org/10.3390/app15094909 - 28 Apr 2025
Cited by 1 | Viewed by 784
Abstract
The task of image captioning in low-resource languages like Tamil is fraught with challenges due to limited linguistic resources and complex semantic structures. This paper addresses the problem of generating contextually and linguistically coherent captions in Tamil. We introduce the Dynamic Context-Aware Transformer (DCAT), a novel approach that synergizes the Vision Transformer (ViT) with the Generative Pre-trained Transformer (GPT-3), reinforced by a unique Context Embedding Layer. The DCAT model, tailored for Tamil, innovatively employs dynamic attention mechanisms during its Initialization, Training, and Inference phases to focus on pertinent visual and textual elements. Our method distinctively leverages the nuances of Tamil syntax and semantics, a novelty in the realm of low-resource language image captioning. Comparative evaluations against established models on datasets like Flickr8k, Flickr30k, and MSCOCO reveal DCAT’s superiority, with a notable 12% increase in BLEU score (0.7425) and a 15% enhancement in METEOR score (0.4391) over leading models. Despite its computational demands, DCAT sets a new benchmark for image captioning in Tamil, demonstrating potential applicability to other similar languages.

33 pages, 36897 KB  
Article
Making Images Speak: Human-Inspired Image Description Generation
by Chifaa Sebbane, Ikram Belhajem and Mohammed Rziza
Information 2025, 16(5), 356; https://doi.org/10.3390/info16050356 - 28 Apr 2025
Cited by 2 | Viewed by 680
Abstract
Despite significant advances in deep learning-based image captioning, many state-of-the-art models still struggle to balance visual grounding (i.e., accurate object and scene descriptions) with linguistic coherence (i.e., grammatical fluency and appropriate use of non-visual tokens such as articles and prepositions). To address these limitations, we propose a hybrid image captioning framework that integrates handcrafted and deep visual features. Specifically, we combine local descriptors—Scale-Invariant Feature Transform (SIFT) and Bag of Features (BoF)—with high-level semantic features extracted using ResNet50. This dual representation captures both fine-grained spatial details and contextual semantics. The decoder employs Bahdanau attention refined with an Attention-on-Attention (AoA) mechanism to optimize visual-textual alignment, while GloVe embeddings and a GRU-based sequence model ensure fluent language generation. The proposed system is trained on 200,000 image-caption pairs from the MS COCO train2014 dataset and evaluated on 50,000 held-out MS COCO pairs plus the Flickr8K benchmark. Our model achieves a CIDEr score of 128.3 and a SPICE score of 29.24, reflecting clear improvements over baselines in both semantic precision—particularly for spatial relationships—and grammatical fluency. These results validate that combining classical computer vision techniques with modern attention mechanisms yields more interpretable and linguistically precise captions, addressing key limitations in neural caption generation.
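
The hybrid representation can be sketched as a Bag-of-Features histogram over SIFT descriptors concatenated with pooled ResNet50 features; the vocabulary size, preprocessing, and placeholder inputs below are illustrative assumptions rather than the paper's training setup.

import cv2
import numpy as np
import torch
from sklearn.cluster import KMeans
from torchvision import models, transforms

def bof_histogram(gray_img, vocab):
    # Bag of Features: quantize SIFT descriptors against a learned visual vocabulary
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_img, None)
    if desc is None:
        return np.zeros(vocab.n_clusters, dtype=np.float32)
    words = vocab.predict(desc.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
    return hist / max(hist.sum(), 1)

def resnet_features(rgb_img):
    # high-level semantics from ResNet50 (pass weights=ResNet50_Weights.DEFAULT for pretrained features)
    net = models.resnet50(weights=None)
    net.fc = torch.nn.Identity()             # keep the 2048-d pooled features
    net.eval()
    tensor = transforms.ToTensor()(cv2.resize(rgb_img, (224, 224))).unsqueeze(0)
    with torch.no_grad():
        return net(tensor).squeeze(0).numpy()

rng = np.random.default_rng(0)
# the visual vocabulary would normally be fit on SIFT descriptors from the training set
vocab = KMeans(n_clusters=32, random_state=0).fit(rng.random((500, 128), dtype=np.float32))
gray = (rng.random((256, 256)) * 255).astype(np.uint8)        # placeholder image
feat = np.concatenate([bof_histogram(gray, vocab), resnet_features(np.stack([gray] * 3, axis=-1))])
print(feat.shape)                            # (32 + 2048,)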

14 pages, 668 KB  
Article
Fine-Grained Local and Global Semantic Fusion for Multimodal Image–Text Retrieval
by Shenao Peng, Zhongmei Wang, Jianhua Liu, Changfan Zhang and Lin Jia
Big Data Cogn. Comput. 2025, 9(3), 53; https://doi.org/10.3390/bdcc9030053 - 25 Feb 2025
Viewed by 1138
Abstract
An image–text retrieval method that integrates intramodal fine-grained local semantic information and intermodal global semantic information is proposed to address the weak fine-grained discrimination of semantic features between the image and text modalities in cross-modal retrieval tasks. First, the original features of images and texts are extracted, and a graph attention network is employed for region relationship reasoning to obtain relation-enhanced local features. Then, an attention mechanism is used for different semantically interacting samples within the same modality, enabling comprehensive intramodal relationship learning and producing semantically enhanced image and text embeddings. Finally, a triplet loss function, enhanced with an angular constraint, is used to train the entire model. Extensive comparative experiments conducted on the Flickr30K and MS-COCO benchmark datasets verified the effectiveness and superiority of the proposed method, which outperformed current methods by a relative 6.4% for image retrieval and 1.3% for caption retrieval on MS-COCO (Recall@1 on the 1K test set).
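
For orientation, here is a hedged sketch of a triplet loss with an added angular-style term on normalized embeddings; it shows the general shape of such an objective and does not reproduce the paper's exact angular constraint.

import torch
import torch.nn.functional as F

def triplet_angular_loss(anchor, positive, negative, margin=0.2, alpha=0.5):
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    # standard triplet term on cosine similarity
    triplet = F.relu(margin + (a * n).sum(-1) - (a * p).sum(-1))
    # angular-style term: penalize negatives close to the anchor-positive direction
    ap_mid = F.normalize(a + p, dim=-1)
    angular = F.relu((ap_mid * n).sum(-1) - (a * p).sum(-1) + margin)
    return (triplet + alpha * angular).mean()

anchor, pos, neg = (torch.randn(32, 512) for _ in range(3))
print(triplet_angular_loss(anchor, pos, neg).item())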

20 pages, 6296 KB  
Article
Privacy-Preserving Image Captioning with Partial Encryption and Deep Learning
by Antoinette Deborah Martin and Inkyu Moon
Mathematics 2025, 13(4), 554; https://doi.org/10.3390/math13040554 - 7 Feb 2025
Viewed by 1042
Abstract
Although image captioning has gained remarkable interest, privacy concerns are raised because it relies heavily on images, and there is a risk of exposing sensitive information in the image data. In this study, a privacy-preserving image captioning framework that leverages partial encryption using Double Random Phase Encoding (DRPE) and deep learning is proposed to address privacy concerns. Unlike previous methods that rely on full encryption or masking, our approach involves encrypting sensitive regions of the image while preserving the image’s overall structure and context. Partial encryption ensures that the sensitive regions’ information is preserved instead of lost by masking it with a black or gray box. It also allows the model to process both encrypted and unencrypted regions, which could be problematic for models with fully encrypted images. Our framework follows an encoder–decoder architecture where a dual-stream encoder based on ResNet50 extracts features from the partially encrypted images, and a transformer architecture is employed in the decoder to generate captions from these features. We utilize the Flickr8k dataset and encrypt the sensitive regions using DRPE. The partially encrypted images are then fed to the dual-stream encoder, which processes the real and imaginary parts of the encrypted regions separately for effective feature extraction. Our model is evaluated using standard metrics and compared with models trained on the original images. Our results demonstrate that our method achieves comparable performance to models trained on original and masked images and outperforms models trained on fully encrypted data, thus verifying the feasibility of partial encryption in privacy-preserving image captioning.
(This article belongs to the Section E2: Control Theory and Mechanics)
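
The partial-encryption step can be illustrated with the classical Double Random Phase Encoding recipe applied to one image region using numpy FFTs; the region coordinates and key handling below are illustrative, not the paper's pipeline.

import numpy as np

def drpe_encrypt(region, rng):
    phase1 = np.exp(2j * np.pi * rng.random(region.shape))   # input-plane key
    phase2 = np.exp(2j * np.pi * rng.random(region.shape))   # Fourier-plane key
    return np.fft.ifft2(np.fft.fft2(region * phase1) * phase2)

rng = np.random.default_rng(42)
img = rng.random((256, 256))                 # grayscale image in [0, 1]
y0, y1, x0, x1 = 64, 128, 64, 128            # sensitive region to encrypt
encrypted = img.astype(complex).copy()
encrypted[y0:y1, x0:x1] = drpe_encrypt(img[y0:y1, x0:x1], rng)
# the dual-stream encoder later consumes real and imaginary parts separately
real_part, imag_part = encrypted.real, encrypted.imag
print(real_part.shape, imag_part.shape)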

23 pages, 4874 KB  
Article
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
by Shakhnoza Muksimova, Sabina Umirzakova, Murodjon Sultanov and Young Im Cho
Sensors 2025, 25(3), 707; https://doi.org/10.3390/s25030707 - 24 Jan 2025
Cited by 7 | Viewed by 2475
Abstract
Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, the current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on three benchmark datasets, YouCook2, Flickr30k, and ActivityNet Captions, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks.
(This article belongs to the Section Sensing and Imaging)
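
A much-simplified sketch of ODE-style temporal localization: a learned dynamics function is integrated over frame features (fixed-step Euler here, in place of an adaptive ODE solver) and a small head scores event boundaries per frame. The dimensions, the solver, and the way frame features are injected are assumptions for illustration, not the CMSTR-ODE design.

import torch
import torch.nn as nn

class ODEBoundaryLocalizer(nn.Module):
    def __init__(self, dim=256, steps=4):
        super().__init__()
        self.steps = steps
        self.dynamics = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.boundary_head = nn.Linear(dim, 1)

    def forward(self, frames):               # frames: (B, T, D) video features
        h = torch.zeros_like(frames[:, 0])
        scores = []
        dt = 1.0 / self.steps
        for t in range(frames.size(1)):
            h = h + frames[:, t]             # inject the current frame feature
            for _ in range(self.steps):      # Euler steps of dh/dt = f(h)
                h = h + dt * self.dynamics(h)
            scores.append(self.boundary_head(h))
        return torch.sigmoid(torch.cat(scores, dim=1))   # boundary probability per frame

video = torch.randn(2, 30, 256)              # 30 frames of pooled features
print(ODEBoundaryLocalizer()(video).shape)   # torch.Size([2, 30])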

25 pages, 2229 KB  
Article
MIRA-CAP: Memory-Integrated Retrieval-Augmented Captioning for State-of-the-Art Image and Video Captioning
by Sabina Umirzakova, Shakhnoza Muksimova, Sevara Mardieva, Murodjon Sultanov Baxtiyarovich and Young-Im Cho
Sensors 2024, 24(24), 8013; https://doi.org/10.3390/s24248013 - 15 Dec 2024
Cited by 18 | Viewed by 2072
Abstract
Generating accurate and contextually rich captions for images and videos is essential for various applications, from assistive technology to content recommendation. However, challenges such as maintaining temporal coherence in videos, reducing noise in large-scale datasets, and enabling real-time captioning remain significant. We introduce MIRA-CAP (Memory-Integrated Retrieval-Augmented Captioning), a novel framework designed to address these issues through three core innovations: a cross-modal memory bank, adaptive dataset pruning, and a streaming decoder. The cross-modal memory bank retrieves relevant context from prior frames, enhancing temporal consistency and narrative flow. The adaptive pruning mechanism filters noisy data, which improves alignment and generalization. The streaming decoder allows for real-time captioning by generating captions incrementally, without requiring access to the full video sequence. Evaluated across standard datasets like MS COCO, YouCook2, ActivityNet, and Flickr30k, MIRA-CAP achieves state-of-the-art results, with high scores on CIDEr, SPICE, and Polos metrics, underscoring its alignment with human judgment and its effectiveness in handling complex visual and temporal structures. This work demonstrates that MIRA-CAP offers a robust, scalable solution for both static and dynamic captioning tasks, advancing the capabilities of vision–language models in real-world applications.
(This article belongs to the Section Sensing and Imaging)
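
The cross-modal memory-bank idea can be sketched as a buffer of recent embeddings that each new frame queries by cosine similarity, mixing in the top-k retrieved entries; the capacity, k, and mixing rule below are assumptions for illustration, not MIRA-CAP's design.

import torch
import torch.nn.functional as F

class CrossModalMemory:
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.bank = []                        # list of (D,) embeddings

    def write(self, emb):
        self.bank.append(emb.detach())
        self.bank = self.bank[-self.capacity:]   # drop the oldest entries

    def read(self, query, k=4):
        if not self.bank:
            return query
        mem = torch.stack(self.bank)          # (N, D)
        sim = F.normalize(query, dim=-1) @ F.normalize(mem, dim=-1).t()
        topk = sim.topk(min(k, mem.size(0)), dim=-1)
        retrieved = mem[topk.indices].mean(dim=-2)        # average of top-k entries
        return query + retrieved              # context-enriched feature

memory = CrossModalMemory()
for t in range(10):                           # stream of frame embeddings
    frame = torch.randn(1, 256)
    enriched = memory.read(frame)
    memory.write(frame.squeeze(0))
print(enriched.shape)                         # torch.Size([1, 256])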

27 pages, 4935 KB  
Article
Diverse Dataset for Eyeglasses Detection: Extending the Flickr-Faces-HQ (FFHQ) Dataset
by Dalius Matuzevičius
Sensors 2024, 24(23), 7697; https://doi.org/10.3390/s24237697 - 1 Dec 2024
Cited by 2 | Viewed by 3023
Abstract
Facial analysis is an important area of research in computer vision and machine learning, with applications spanning security, healthcare, and user interaction systems. The data-centric AI approach emphasizes the importance of high-quality, diverse, and well-annotated datasets in driving advancements in this field. However, current facial datasets, such as Flickr-Faces-HQ (FFHQ), lack detailed annotations for detecting facial accessories, particularly eyeglasses. This work addresses this limitation by extending the FFHQ dataset with precise bounding box annotations for eyeglasses detection, enhancing its utility for data-centric AI applications. The extended dataset comprises 70,000 images, including over 16,000 images containing eyewear, and it exceeds the CelebAMask-HQ dataset in size and diversity. A semi-automated protocol was employed to efficiently generate accurate bounding box annotations, minimizing the demand for extensive manual labeling. This enriched dataset serves as a valuable resource for training and benchmarking eyewear detection models. Additionally, baseline benchmark results for eyeglasses detection are presented using deep learning methods, including YOLOv8 and MobileNetV3. The evaluation, conducted through cross-dataset validation, demonstrated the robustness of models trained on the extended FFHQ dataset, which outperformed models trained on the existing CelebAMask-HQ alternative. The extended dataset, which has been made publicly available, is expected to support future research and development in eyewear detection, contributing to advancements in facial analysis and related fields.
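
When benchmarking detectors against bounding-box annotations like these, evaluation typically reduces to an intersection-over-union check per predicted box; a small helper follows, assuming (x1, y1, x2, y2) pixel coordinates, which is an assumption about the released annotation format rather than a documented fact.

def iou(box_a, box_b):
    # intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) form
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

pred, gt = (120, 80, 260, 140), (118, 85, 250, 150)
print(round(iou(pred, gt), 3))               # counted as a match if IoU >= 0.5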

21 pages, 35716 KB  
Article
Exploring Visitor Patterns in Island Natural Parks: The Relationship Between Photo Locations, Trails, and Land Use
by Eva Calicis, Jorge Costa, Augusto Pérez-Alberti and Alberto Gomes
Land 2024, 13(12), 2003; https://doi.org/10.3390/land13122003 - 25 Nov 2024
Viewed by 1593
Abstract
Overcrowding in national parks and protected areas can cause irreversible damage to the environment, compromising the quality of soil, water, wildlife, and vegetation. Thus, it is critical for park managers to have detailed information on visitor activities and spatial dynamics in order to prioritise actions capable of mitigating undesirable impacts in the most frequently visited areas. In this article, we use georeferenced trails and photographs from the Wikiloc and Flickr web platforms to determine the spatial visitation patterns in the Atlantic Islands of Galicia National Park (AINP) from 2008 to 2023. Maps showing trail usage intensity and the distribution of photographs according to land use allowed us to identify the most frequented land uses by visitors and the areas of highest tourist pressure within the AINP. The results show that distribution patterns vary between platforms. Shrubland (37%) and marine cliffs (27%) were the most photographed land uses by visitors, while artificial areas (14%) were the most frequented by Wikiloc users. Cíes island emerges as the most popular tourist destination, as evidenced by the greater number of trails and photographs compared to Ons, Sálvora, and Cortegada. This study shows how social media data, specifically trails and geotagged photographs from Wikiloc and Flickr, can support and complement the monitoring of visitor use and impact in protected areas.
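
The photo-to-land-use assignment behind this kind of analysis is essentially a spatial join; the sketch below counts geotagged points per land-use polygon with geopandas, using tiny made-up polygons and points in place of the AINP land-use map and the Flickr/Wikiloc data.

import geopandas as gpd
from shapely.geometry import Point, box

land_use = gpd.GeoDataFrame(
    {"land_use": ["shrubland", "marine cliff"],
     "geometry": [box(0, 0, 10, 10), box(10, 0, 20, 10)]},
    crs="EPSG:4326",
)
photos = gpd.GeoDataFrame(
    {"geometry": [Point(2, 3), Point(4, 8), Point(15, 5)]},
    crs="EPSG:4326",
)
joined = gpd.sjoin(photos, land_use, how="inner", predicate="within")
counts = joined.groupby("land_use").size()
print(counts / counts.sum() * 100)           # share of photos per land-use class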
