You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.

Search for Topics:

Title/Keyword

Journal

Submission Status

Category

Propose a Topic

Topic Menu

Topic Editors

Dr. Silvia Liberata Ullo

E-Mail Website

Remote Sensing and Telecommunication Laboratory, Engineering Department, University of Sannio, 82100 Benevento, Italy

Prof. Dr. Li Zhang

E-Mail Website

Department of Computer Science, Royal Holloway, University of London, Surrey TW20 0EX, UK

Computer Vision and Image Processing, 2nd Edition

Abstract submission deadline

closed (30 September 2024)

Manuscript submission deadline

closed (31 December 2024)

Viewed by

125343

Topic Information

Dear Colleagues,

The field of computer vision and image processing has advanced significantly in recent years, with new techniques and applications emerging constantly. Building on the success of our first edition, we are pleased to announce a second edition on this exciting topic. We invite researchers, academics, and practitioners to submit original research articles, reviews, or case studies that address the latest developments in computer vision and image processing. Topics of interest include but are not limited to:

Deep learning for image classification and recognition
Object detection and tracking
Image segmentation and analysis
3D reconstruction and modeling
Image and video compression
Image enhancement and restoration
Medical image processing and analysis
Augmented and virtual reality

Submissions should be original and should not have been published or submitted elsewhere. All papers will be peer-reviewed by at least two experts in the field, and accepted papers will be published together on the topic website. To submit your paper, please visit the journal's website and follow the submission guidelines. For any queries, please contact the guest editors of the topic.

We look forward to receiving your submissions and sharing the latest advancements in computer vision and image processing with our readers.

Prof. Silvia Liberata Ullo
Prof. Dr. Li Zhang
Topic Editors

Keywords

3D acquisition, processing, and visualization
scene understanding
multimodal sensor processing and fusion
multispectral, color, and greyscale image processing
industrial quality inspection
computer vision for robotics
computer vision for surveillance
airborne and satellite on-board image acquisition platforms
computational models of vision
imaging psychophysics

Participating Journals

Journal Name	Impact Factor	CiteScore	Launched Year	First Decision (median)	APC
Applied Sciences applsci	2.5	5.5	2011	19.8 Days	CHF 2400
Electronics electronics	2.6	6.1	2012	16.8 Days	CHF 2400
Journal of Imaging jimaging	3.3	6.7	2015	15.3 Days	CHF 1800
Mathematics mathematics	2.2	4.6	2013	18.4 Days	CHF 2600
Remote Sensing remotesensing	4.1	8.6	2009	24.9 Days	CHF 2700

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

Share your research immediately: disseminate your ideas prior to publication and establish priority for your work.
Safeguard your intellectual contribution: Protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
Boost visibility and impact: Increase the reach and influence of your research by making it accessible to a global audience.
Gain early feedback: Receive valuable input and insights from peers before submitting to a journal.
Ensure broad indexing: Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Computer Vision and Image Processing (101 articles)
Computer Vision and Image Processing, 3rd Edition

Published Papers (70 papers)

Download All Papers

Order results

Result details

Journals

Show export options Show export options

Select all

Export citation of selected articles as:

25 pages, 8272 KiB

Open AccessArticle

Dark-YOLO: A Low-Light Object Detection Algorithm Integrating Multiple Attention Mechanisms

by Ye Liu, Shixin Li, Liming Zhou, Haichen Liu and Zhiyu Li

Appl. Sci. 2025, 15(9), 5170; https://doi.org/10.3390/app15095170 - 6 May 2025

Cited by 1 | Viewed by 1661

Object detection in low-light environments is often hampered by unfavorable factors such as low brightness, low contrast, and noise, which lead to issues like missed detections and false positives. To address these challenges, this paper proposes a low-light object detection algorithm named Dark-YOLO, [...] Read more.

Object detection in low-light environments is often hampered by unfavorable factors such as low brightness, low contrast, and noise, which lead to issues like missed detections and false positives. To address these challenges, this paper proposes a low-light object detection algorithm named Dark-YOLO, which dynamically extracts features. First, an adaptive image enhancement module is introduced to restore image information and enrich feature details. Second, the spatial feature pyramid module is improved by incorporating cross-overlapping average pooling and max pooling to extract salient features while retaining global and local information. Then, a dynamic feature extraction module is designed, which combines partial convolution with a parameter-free attention mechanism, allowing the model to flexibly capture critical and effective information from the image. Finally, a dimension reciprocal attention module is introduced to ensure the model can comprehensively consider various features within the image. Experimental results show that the proposed model achieves an mAP@50 of 71.3% and an mAP@50-95 of 44.2% on the real-world low-light dataset ExDark, demonstrating that Dark-YOLO effectively detects objects under low-light conditions. Furthermore, facial recognition in dark environments is a particularly challenging task. Dark-YOLO demonstrates outstanding performance on the DarkFace dataset, achieving an mAP@50 of 49.1% and an mAP@50-95 of 21.9%, further validating its effectiveness for face detection under complex low-light conditions. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

20 pages, 3551 KiB

Open AccessArticle

Research on Multi-Modal Point Cloud Completion Algorithm Guided by Image Rotation Attention

by Shangtai Gu, Ke Xu, Jianwei Wan, Baolin Hou and Yanxin Ma

Remote Sens. 2025, 17(8), 1448; https://doi.org/10.3390/rs17081448 - 18 Apr 2025

Viewed by 709

This paper proposes a novel point cloud multi-scale completion algorithm guided by image rotation attention mechanisms to address the challenges of severe information loss and suboptimal fusion effects in multi-modal feature extraction and integration during point cloud shape completion. The proposed network employs [...] Read more.

This paper proposes a novel point cloud multi-scale completion algorithm guided by image rotation attention mechanisms to address the challenges of severe information loss and suboptimal fusion effects in multi-modal feature extraction and integration during point cloud shape completion. The proposed network employs an encoder–decoder structure, integrating a Rotating Channel Attention (RCA) module for enhanced image feature extraction and a multi-scale feature extraction method for point clouds to improve both local and global feature information. The network also utilizes multi-level self-attention mechanisms to achieve effective multi-modal feature fusion. The decoder employs a multi-branch completion method, guided by Chamfer distance loss, to accomplish the point cloud completion task. Extensive experiments on the ShapeNet-ViPC and ModelNet40ViPC datasets demonstrate the effectiveness of the proposed algorithm. Compared to eight related algorithms, the proposed method achieves superior performance in terms of completion accuracy and efficiency. Specifically, compared to the state-of-the-art XMFnet, the average Chamfer distance (CD) value is reduced by 11.71%. The algorithm also shows significant improvements in visual comparisons, with more distinct structural details and a more uniform density distribution in the completed point clouds. The ablation studies further validate the effectiveness of the RCA module and the multi-scale module, highlighting their complementary nature in enhancing point cloud completion accuracy. Future work will focus on improving the network’s performance and exploring its application in more complex 3D vision tasks. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

12 pages, 1902 KiB

Open AccessArticle

Edge-Supervised Attention-Aware Fusion Network for RGB-T Semantic Segmentation

by Ming Wang, Zhongjie Zhu, Yuer Wang, Renwei Tu, Jiuxing Weng and Xianchao Yu

Electronics 2025, 14(8), 1489; https://doi.org/10.3390/electronics14081489 - 8 Apr 2025

Viewed by 682

To address the limitations in the efficiency of modality feature fusion in existing RGB-T semantic segmentation methods, which restrict segmentation performance, this paper proposes an edge-supervised attention-aware algorithm to enhance segmentation capabilities. Firstly, we design a feature fusion module incorporating channel and spatial [...] Read more.

To address the limitations in the efficiency of modality feature fusion in existing RGB-T semantic segmentation methods, which restrict segmentation performance, this paper proposes an edge-supervised attention-aware algorithm to enhance segmentation capabilities. Firstly, we design a feature fusion module incorporating channel and spatial attention mechanisms to achieve effective complementation and enhancement of RGB-T features. Secondly, we introduce an edge-aware refinement module that processes low-level modality features using global and local attention mechanisms, obtaining fine-grained feature information through element-wise multiplication. Building on this, we design a parallel structure of dilated convolutions to extract multi-scale detail information. Additionally, an EdgeHead is introduced after the edge-aware refinement module, with edge supervision applied to further enhance edge detail capture. Finally, the optimized fused features are fed into a decoder to complete the RGB-T semantic segmentation task. Experimental results demonstrate that our algorithm achieves mean Intersection over Union (mIoU) scores of 58.52% and 85.38% on the MFNet and PST900 datasets, respectively, significantly improving the accuracy of RGB-T semantic segmentation. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

21 pages, 15502 KiB

Open AccessArticle

Multi-Scale Spatiotemporal Feature Enhancement and Recursive Motion Compensation for Satellite Video Geographic Registration

by Yu Geng, Jingguo Lv, Shuwei Huang and Boyu Wang

J. Imaging 2025, 11(4), 112; https://doi.org/10.3390/jimaging11040112 - 8 Apr 2025

Viewed by 520

Satellite video geographic alignment can be applied to target detection and tracking, true 3D scene construction, image geometry measurement, etc., which is a necessary preprocessing step for satellite video applications. In this paper, a multi-scale spatiotemporal feature enhancement and recursive motion compensation method [...] Read more.

Satellite video geographic alignment can be applied to target detection and tracking, true 3D scene construction, image geometry measurement, etc., which is a necessary preprocessing step for satellite video applications. In this paper, a multi-scale spatiotemporal feature enhancement and recursive motion compensation method for satellite video geographic alignment is proposed. Based on the SuperGlue matching algorithm, the method achieves automatic matching of inter-frame image points by introducing the multi-scale dilated attention (MSDA) to enhance the feature extraction and adopting a joint multi-frame optimization strategy (MFMO), designing a recursive motion compensation model (RMCM) to eliminate the cumulative effect of the orbit error and improve the accuracy of the inter-frame image point matching, and using a rational function model to establish the geometrical mapping between the video and the ground points to realize the georeferencing of satellite video. The geometric mapping between video and ground points is established by using the rational function model to realize the geographic alignment of satellite video. The experimental results show that the method achieves the inter-frame matching accuracy of 0.8 pixel level, and the georeferencing accuracy error is 3 m, which is a significant improvement compared with the traditional single-frame method, and the method in this paper can provide a certain reference for the subsequent related research. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

21 pages, 3911 KiB

Open AccessArticle

KT-Deblur: Kolmogorov–Arnold and Transformer Networks for Remote Sensing Image Deblurring

by Baoyu Zhu, Zekun Li, Qunbo Lv, Zheng Tan and Kai Zhang

Remote Sens. 2025, 17(5), 834; https://doi.org/10.3390/rs17050834 - 27 Feb 2025

Viewed by 1079

Aiming to address the fundamental limitation of fixed activation functions that constrain network expressiveness in existing deep deblurring models, in this pioneering study, we introduced Kolmogorov–Arnold Networks (KANs) into the field of full-color/RGB image deblurring, proposing the Kolmogorov–Arnold and Transformer Network (KT-Deblur) framework [...] Read more.

Aiming to address the fundamental limitation of fixed activation functions that constrain network expressiveness in existing deep deblurring models, in this pioneering study, we introduced Kolmogorov–Arnold Networks (KANs) into the field of full-color/RGB image deblurring, proposing the Kolmogorov–Arnold and Transformer Network (KT-Deblur) framework based on dynamically learnable activation functions. This framework overcomes the constraints of traditional networks’ fixed nonlinear transformations by employing adaptive activation regulation for different blur types through KANs’ differentiable basis functions. Integrated with a U-Net architecture within a generative adversarial network framework, it significantly enhances detail restoration capabilities in complex scenarios. The innovatively designed Unified Attention Feature Extraction (UAFE) module combines neighborhood self-attention with linear self-attention mechanisms, achieving synergistic optimization of noise suppression and detail enhancement through adaptive feature space weighting. Supported by the Fast Spatial Feature Module (FSFM), it effectively improves the model’s ability to handle complex blur patterns. Our experimental results demonstrate that the proposed method outperforms existing algorithms in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics across multiple standard datasets, achieving an average PSNR of 41.25 dB on the RealBlur-R dataset, surpassing the latest state-of-the-art (SOTA) algorithms. This model exhibits strong robustness, providing a new paradigm for image-deblurring network design. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

20 pages, 4702 KiB

Open AccessArticle

Improving Rock Type Identification Through Advanced Deep Learning-Based Segmentation Models: A Comparative Study

by İlhan Aydın, Ayşe Didem Kılıç and Taha Kubilay Şener

Appl. Sci. 2025, 15(3), 1630; https://doi.org/10.3390/app15031630 - 6 Feb 2025

Viewed by 1501

The accurate identification of rock types is crucial for understanding geological structures and planning mining activities. Therefore, the precise labeling of rock types is a fundamental requirement for researchers and industry experts in these fields. This study aims to identify rock types by [...] Read more.

The accurate identification of rock types is crucial for understanding geological structures and planning mining activities. Therefore, the precise labeling of rock types is a fundamental requirement for researchers and industry experts in these fields. This study aims to identify rock types by segmenting thin-section rock images using advanced deep learning models. Commonly used models such as DeepLabv3+, SegFormer, ConvNext, and Mask2Former were evaluated to compare the performance of different segmentation models. Additionally, an improved version of Mask2Former was analyzed to enhance its performance. Post-segmentation enhancements with SLIC super pixel and morphological operations further improved boundary delineation. An improved version of Mask2Former achieved a validation accuracy of 91.26% and a mean Intersection over Union (mIoU) of 82.59%, representing an improvement of more than 5% over the base Mask2Former. Another significant aspect of the study is the detailed analysis of other models using the relevant dataset. The Quartz Feldspar Plagioclase (QAP) diagram was utilized for mineral identification, achieving a mineral recognition accuracy of 85.7%. These results indicate the robustness of the proposed approach, which exceeds the accuracy and mIoU of comparable methods reported in the literature. This study significantly enhances the effectiveness of image processing techniques for rock type identification. Furthermore, the detailed comparison of different models provides valuable guidance for researchers in selecting the most suitable model. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

19 pages, 31258 KiB

Open AccessArticle

Pyramid Fine and Coarse Attentions for Land Cover Classification from Compact Polarimetric SAR Imagery

by Saeid Taleghanidoozdoozan, Linlin Xu and David A. Clausi

Remote Sens. 2025, 17(3), 367; https://doi.org/10.3390/rs17030367 - 22 Jan 2025

Cited by 1 | Viewed by 813

Land cover classification from compact polarimetry (CP) imagery captured by the launched RADARSAT Constellation Mission (RCM) is important but challenging due to class signature ambiguity issues and speckle noise. This paper presents a new land cover classification method to improve the learning of [...] Read more.

Land cover classification from compact polarimetry (CP) imagery captured by the launched RADARSAT Constellation Mission (RCM) is important but challenging due to class signature ambiguity issues and speckle noise. This paper presents a new land cover classification method to improve the learning of discriminative features based on a novel pyramid fine- and coarse-grained self-attention transformer (PFC transformer). The fine-grained dependency inside a non-overlapping window and coarse-grained dependencies between non-overlapping windows are explicitly modeled and concatenated using a learnable linear function. This process is repeated in a hierarchical manner. Finally, the output of each stage of the proposed method is spatially reduced and concatenated to take advantage of both low- and high-level features. Two high-resolution (3 m) RCM CP SAR scenes are used to evaluate the performance of the proposed method and compare it to other state-of-the-art deep learning methods. The results show that the proposed approach achieves an overall accuracy of 93.63%, which was 4.83% higher than the best comparable method, demonstrating the effectiveness of the proposed approach for land cover classification from RCM CP SAR images. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

20 pages, 7090 KiB

Open AccessArticle

An Infrared and Visible Image Alignment Method Based on Gradient Distribution Properties and Scale-Invariant Features in Electric Power Scenes

by Lin Zhu, Yuxing Mao, Chunxu Chen and Lanjia Ning

J. Imaging 2025, 11(1), 23; https://doi.org/10.3390/jimaging11010023 - 13 Jan 2025

Viewed by 1114

In grid intelligent inspection systems, automatic registration of infrared and visible light images in power scenes is a crucial research technology. Since there are obvious differences in key attributes between visible and infrared images, direct alignment is often difficult to achieve the expected [...] Read more.

In grid intelligent inspection systems, automatic registration of infrared and visible light images in power scenes is a crucial research technology. Since there are obvious differences in key attributes between visible and infrared images, direct alignment is often difficult to achieve the expected results. To overcome the high difficulty of aligning infrared and visible light images, an image alignment method is proposed in this paper. First, we use the Sobel operator to extract the edge information of the image pair. Second, the feature points in the edges are recognised by a curvature scale space (CSS) corner detector. Third, the Histogram of Orientation Gradients (HOG) is extracted as the gradient distribution characteristics of the feature points, which are normalised with the Scale Invariant Feature Transform (SIFT) algorithm to form feature descriptors. Finally, initial matching and accurate matching are achieved by the improved fast approximate nearest-neighbour matching method and adaptive thresholding, respectively. Experiments show that this method can robustly match the feature points of image pairs under rotation, scale, and viewpoint differences, and achieves excellent matching results. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

17 pages, 1000 KiB

Open AccessArticle

Zero-Shot Day–Night Domain Adaptation for Face Detection Based on DAl-CLIP-Dino

by Huadong Sun, Yinghui Liu, Ziyang Chen and Pengyi Zhang

Electronics 2025, 14(1), 143; https://doi.org/10.3390/electronics14010143 - 1 Jan 2025

Viewed by 1365

Two challenges in computer vision (CV) related to face detection are the difficulty of acquisition in the target domain and the degradation of image quality. Especially in low-light situations, the poor visibility of images is difficult to label, which results in detectors trained [...] Read more.

Two challenges in computer vision (CV) related to face detection are the difficulty of acquisition in the target domain and the degradation of image quality. Especially in low-light situations, the poor visibility of images is difficult to label, which results in detectors trained under well-lit conditions exhibiting reduced performance in low-light environments. Conventional works image enhancement and object detection techniques are unable to resolve the inherent difficulties in collecting and labeling low-light images. The Dark-Illuminated Network with Contrastive Language–Image Pretraining (CLIP) and Self-Supervised Vision Transformer (Dino), abbreviated as DAl-CLIP-Dino is proposed to address the degradation of object detection performance in low-light environments and achieve zero-shot day–night domain adaptation. Specifically, an advanced reflectance representation learning module (which leverages Retinex decomposition to extract reflectance and illumination features from both low-light and well-lit images) and an interchange–redecomposition coherence process (which performs a second decomposition on reconstructed images after the exchange to generate a second round of reflectance and illumination predictions while validating their consistency using redecomposition consistency loss) are employed to achieve illumination invariance and enhance model performance. CLIP (VIT-based image encoder part) and Dino have been integrated for feature extraction, improving performance under extreme lighting conditions and enhancing its generalization capability. Our model achieves a mean average precision (mAP) of

29.6 %

for face detection on the DARK FACE dataset, outperforming other models in zero-shot domain adaptation for face detection. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 783 KiB

Open AccessArticle

IOF-Tracker: A Two-Stage Multiple Targets Tracking Method Using Spatial-Temporal Fusion Algorithm

by Hongbin Liu, Yongze Zhao, Peng Dong, Xiuyi Guo and Yilin Wang

Appl. Sci. 2025, 15(1), 107; https://doi.org/10.3390/app15010107 - 26 Dec 2024

Cited by 2 | Viewed by 2429

Multi-object tracking aims to track multiple objects across consecutive frames in a video, assigning a unique classifier to each object. However, issues such as occlusions, directional changes, or shape alterations can cause appearance variations, leading to detection and matching problems that in turn [...] Read more.

Multi-object tracking aims to track multiple objects across consecutive frames in a video, assigning a unique classifier to each object. However, issues such as occlusions, directional changes, or shape alterations can cause appearance variations, leading to detection and matching problems that in turn result in frequent ID switches. To solve these issues, this paper proposes a two-stage multi-object tracking framework based on a spatial and temporal fusion algorithm. First, the video frames are processed by a detector to identify objects and form rectangular detection areas. Meanwhile, an estimator predicts the target rectangular areas in the next frame. Then, we extract the optical flow of the target pixels within the detection and prediction areas, and then a temporal information model is established by calculating the average of the target pixels’ optical flow. Afterward, we present a spatial information model using the R-IoU (Reverse of Intersection over Union) between the detection and prediction areas. This spatial and temporal information is combined with weighted matrix fusion, which achieves the feature matching and association task. Finally, we implement a two-stage association multi-object tracking model using the mentioned fusion algorithm. Experiments on the MOTChallenge dataset using the official detector show that our two-stage multi-object tracking method based on the spatial and temporal fusion algorithm is robust in handling occlusions and ID switch issues. As of the submission of this paper, the proposed method has achieved the top ranking in the MOT17 benchmark when evaluated with the official detector. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

27 pages, 9095 KiB

Open AccessArticle

BMFusion: Bridging the Gap Between Dark and Bright in Infrared-Visible Imaging Fusion

by Chengwen Liu, Bin Liao and Zhuoyue Chang

Electronics 2024, 13(24), 5005; https://doi.org/10.3390/electronics13245005 - 19 Dec 2024

Viewed by 1121

The fusion of infrared and visible light images is a crucial technology for enhancing visual perception in complex environments. It plays a pivotal role in improving visual perception and subsequent performance in advanced visual tasks. However, due to the significant degradation of visible [...] Read more.

The fusion of infrared and visible light images is a crucial technology for enhancing visual perception in complex environments. It plays a pivotal role in improving visual perception and subsequent performance in advanced visual tasks. However, due to the significant degradation of visible light image quality in low-light or nighttime scenes, most existing fusion methods often struggle to obtain sufficient texture details and salient features when processing such scenes. This can lead to a decrease in fusion quality. To address this issue, this article proposes a new image fusion method called BMFusion. Its aim is to significantly improve the quality of fused images in low-light or nighttime scenes and generate high-quality fused images around the clock. This article first designs a brightness attention module composed of brightness attention units. It extracts multimodal features by combining the SimAm attention mechanism with a Transformer architecture. Effective enhancement of brightness and features has been achieved, with gradual brightness attention performed during feature extraction. Secondly, a complementary fusion module was designed. This module deeply fuses infrared and visible light features to ensure the complementarity and enhancement of each modal feature during the fusion process, minimizing information loss to the greatest extent possible. In addition, a feature reconstruction network combining CLIP-guided semantic vectors and neighborhood attention enhancement was proposed in the feature reconstruction stage. It uses the KAN module to perform channel adaptive optimization on the reconstruction process, ensuring semantic consistency and detail integrity of the fused image during the reconstruction phase. The experimental results on a large number of public datasets demonstrate that the BMFusion method can generate fusion images with higher visual quality and richer details in night and low-light environments compared with various existing state-of-the-art (SOTA) algorithms. At the same time, the fusion image can significantly improve the performance of advanced visual tasks. This shows the great potential and application prospect of this method in the field of multimodal image fusion. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

23 pages, 6881 KiB

Open AccessArticle

Research on Lightweight Open-Pit Mine Driving Obstacle Detection Algorithm Based on Improved YOLOv8s

by Bo Xu, Wubin Xu, Bing Li, Hanwen Zhang, Yuanbin Xiao and Weixin Zhou

Appl. Sci. 2024, 14(24), 11741; https://doi.org/10.3390/app142411741 - 16 Dec 2024

Viewed by 835

The road environment of open-pit mines is complex and unstructured. Unmanned construction machinery driving faces huge challenges. Improving the accuracy and speed of obstacle detection during driving is of great significance to ensuring the safety of mine automation construction and improving the efficiency [...] Read more.

The road environment of open-pit mines is complex and unstructured. Unmanned construction machinery driving faces huge challenges. Improving the accuracy and speed of obstacle detection during driving is of great significance to ensuring the safety of mine automation construction and improving the efficiency of overall unmanned operations. In view of the fact that the current obstacle detection algorithm struggles to strike a balance between high precision and real-time performance, and there are problems such as difficulty in model deployment or unsuitability for practical applications, a lightweight open-pit mine driving obstacle detection algorithm based on improved YOLOv8s is proposed, which is committed to improving the driving safety of unmanned engineering machinery in open-pit mines. In order to enhance the ability of the backbone to capture features, the idea of the guidance module (CGBlock) of contextual information is introduced to construct a new CGC2f module; the efficient squeeze excitation (ESE) attention mechanism is embedded in the feature fusion layer to make the model pay more attention to the channels containing important feature information; in order to enhance the model’s learning ability for obstacles of different sizes in the open-pit mine, a more suitable dynamic head network (DyHead) is used at the output end; in order to further improve real-time performance, the layer-based adaptive amplitude pruning (LAMP) score algorithm is used to prune redundant weight parameters. To verify the effectiveness of the algorithm in this paper, an experimental verification is carried out on the constructed open-pit mine driving obstacle dataset. The results show that compared with YOLOv8s, the mAP50 of this algorithm reaches 95.3%, the detection speed is increased by 40.2%, the model parameters are reduced by 71.2%, and the calculation amount is reduced by 73.7%. It meets the requirements of real-time and high-precision obstacle detection in open-pit mine driving and provides technical support for smart mine driving. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 15701 KiB

Open AccessArticle

Unsupervised Semantic Scene Reconstruction via Transformer-Based Quantized Vector Reconstruction and Autoregressive Completion

by Yubin Miao, Shuxin Xie, Tianrui Quan, Junkang Wan and Mengxiang Hao

Electronics 2024, 13(24), 4922; https://doi.org/10.3390/electronics13244922 - 13 Dec 2024

Viewed by 803

Semantic scene reconstruction from sparse and incomplete point clouds is a vital task in understanding point scenes. This task involves assigning semantic labels to objects and reconstructing their complete shapes as meshes. In recent years, researchers have adopted a “reconstruction from recognition” approach, [...] Read more.

Semantic scene reconstruction from sparse and incomplete point clouds is a vital task in understanding point scenes. This task involves assigning semantic labels to objects and reconstructing their complete shapes as meshes. In recent years, researchers have adopted a “reconstruction from recognition” approach, which first segments foreground objects from the point cloud and then completes and reconstructs them as mesh representations. This method has successfully facilitated both the semantic and geometric understanding of point scenes. However, existing approaches based on deep learning often depend on supervised training, requiring extensive annotations and incurring high training costs. To address this limitation, we introduce unsupervised algorithms for completing and reconstructing partial observations. While Transformer-based autoregressive shape completion shows great potential, there has been limited research on applying it to complete instances segmented from real-world scenes. To bridge this gap, we propose VRC (unsupervised semantic scene reconstruction via Transformer-based quantized Vector Reconstruction and autoregressive Completion), a novel framework that integrates unsupervised algorithms with Transformer-based autoregressive completion. Our approach enables the unsupervised reconstruction of real-world scenes. Comparisons with state-of-the-art methods on authoritative public datasets demonstrate that VRC achieves superior reconstruction performance with significantly reduced data costs. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

16 pages, 1289 KiB

Open AccessArticle

DAT: Deep Learning-Based Acceleration-Aware Trajectory Forecasting

by Ali Asghar Sharifi, Ali Zoljodi and Masoud Daneshtalab

J. Imaging 2024, 10(12), 321; https://doi.org/10.3390/jimaging10120321 - 13 Dec 2024

Cited by 1 | Viewed by 1496

As the demand for autonomous driving (AD) systems has increased, the enhancement of their safety has become critically important. A fundamental capability of AD systems is object detection and trajectory forecasting of vehicles and pedestrians around the ego-vehicle, which is essential for preventing [...] Read more.

As the demand for autonomous driving (AD) systems has increased, the enhancement of their safety has become critically important. A fundamental capability of AD systems is object detection and trajectory forecasting of vehicles and pedestrians around the ego-vehicle, which is essential for preventing potential collisions. This study introduces the Deep learning-based Acceleration-aware Trajectory forecasting (DAT) model, a deep learning-based approach for object detection and trajectory forecasting, utilizing raw sensor measurements. DAT is an end-to-end model that processes sequential sensor data to detect objects and forecasts their future trajectories at each time step. The core innovation of DAT lies in its novel forecasting module, which leverages acceleration data to enhance trajectory forecasting, leading to the consideration of a variety of agent motion models. We propose a robust and innovative method for estimating ground-truth acceleration for objects, along with an object detector that predicts acceleration attributes for each detected object and a novel method for trajectory forecasting. DAT is trained and evaluated on the NuScenes dataset, demonstrating its empirical effectiveness through extensive experiments. The results indicate that DAT significantly surpasses state-of-the-art methods, particularly in enhancing forecasting accuracy for objects exhibiting both linear and nonlinear motion patterns, achieving up to a

2 \times

improvement. This advancement highlights the critical role of incorporating acceleration data into predictive models, representing a substantial step forward in the development of safer autonomous driving systems. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

14 pages, 9209 KiB

Open AccessCommunication

Implementation of an FPGA-Based System to Process Images and Match Keypoints on High-Resolution Pictures

by Sina Bundschuh, Jan Kunze and Klaus-Dieter Kuhnert

Electronics 2024, 13(23), 4774; https://doi.org/10.3390/electronics13234774 - 3 Dec 2024

Viewed by 2052

Processing scenery and finding points of interest is crucial for applications in robotics and aerospace missions. Those areas require efficient and reliable visual input processing. Here, field programmable gate arrays (FPGAs) offer essential advantages, like low power consumption compared to CPUs, performing a [...] Read more.

Processing scenery and finding points of interest is crucial for applications in robotics and aerospace missions. Those areas require efficient and reliable visual input processing. Here, field programmable gate arrays (FPGAs) offer essential advantages, like low power consumption compared to CPUs, performing a large number of calculations simultaneously, and having compact hardware. This paper presents an FPGA system that processes incoming camera data, finds points of interest, and matches them across different images on high-resolution images (2048 × 1088). It is a novel approach to implement the complete image processing pipeline on high-resolution images within the FPGA fabric without additional hardware. For keypoint detection and matching, our work uses a modified SIFT algorithm optimized for FPGA implementation processing and a nearest neighbor-based matching method. It was implemented on a Xilinx Kintex-7 FPGA and partially on a NanoXplore NG-Ultra to evaluate a radiation-hardened FPGA for space applications. On the Kintex-7, the keypoint detection achieves a speed of 33 ms per image, and its features are matched on up to 5 images per second. Judging by the resource utilization of one image processing module on the NG-Ultra, porting the entire system on a radiation-hardened FPGA appears feasible. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 27038 KiB

Open AccessArticle

A Knowledge Base Driven Task-Oriented Image Semantic Communication Scheme

by Chang Guo, Junhua Xi, Zhanhao He, Jiaqi Liu and Jungang Yang

Remote Sens. 2024, 16(21), 4044; https://doi.org/10.3390/rs16214044 - 30 Oct 2024

Cited by 1 | Viewed by 1454

With the development of artificial intelligence and computer hardware, semantic communication has been attracting great interest. As an emerging communication paradigm, semantic communication can reduce the requirement for channel bandwidth by extracting semantic information. This is an effective method that can be applied [...] Read more.

With the development of artificial intelligence and computer hardware, semantic communication has been attracting great interest. As an emerging communication paradigm, semantic communication can reduce the requirement for channel bandwidth by extracting semantic information. This is an effective method that can be applied to image acquisition of unmanned aerial vehicles, which can transmit high-data-volume images within the constraints of limited available bandwidth. However, the existing semantic communication schemes fail to adequately incorporate the guidance of task requirements into the semantic communication process and are difficult to adapt to the dynamic changes of tasks. A task-oriented image semantic communication scheme driven by knowledge base is proposed, aiming at achieving high compression ratio and high quality image reconstruction, and effectively solving the bandwidth limitation. This scheme segments the input image into several semantic information unit under the guidance of task requirements by Yolo-World and Segment Anything Model. The assigned bandwidth for each unit is according to the task relevance scores, which enables high-quality transmission of task-related information with lower communication overheads. An improved metric weighted learned perceptual image patch similarity (LPIPS) is proposed to evaluate the transmission accuracy of the novel scheme. Experimental results show that our scheme achieves a notable performance improvement on weighted LPIPS while the same compression ratio compared with traditional image compression schemes. Our scheme has a higher target capture ratio than traditional image compression schemes under the task of target detection. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

16 pages, 3941 KiB

Open AccessArticle

DecoupleCLIP: A Novel Cross-Modality Decouple Model for Painting Captioning

by Mingliang Zhang, Xia Hou, Yujing Yan and Meng Sun

Electronics 2024, 13(21), 4207; https://doi.org/10.3390/electronics13214207 - 27 Oct 2024

Viewed by 926

Image captioning aims to describe the content in an image, which plays a critical role in image understanding. Existing methods tend to generate the text for more distinct natural images. These models can not be well for paintings containing more abstract meaning due [...] Read more.

Image captioning aims to describe the content in an image, which plays a critical role in image understanding. Existing methods tend to generate the text for more distinct natural images. These models can not be well for paintings containing more abstract meaning due to the limitation of objective parsing without related knowledge. To alleviate, we propose a novel cross-modality decouple model to generate the objective and subjective parsing separately. Concretely, we propose to encode both subjective semantic and implied knowledge contained in the paintings. The key point of our framework is decoupled CLIP-based branches (DecoupleCLIP). For the objective caption branch, we utilize the CLIP model as the global feature extractor and construct a feature fusion module for global clues. Based on the objective caption branch structure, we add a multimodal fusion module called the artistic conception branch. In this way, the objective captions can constrain artistic conception content. We conduct extensive experiments to demonstrate our DecoupleCLIP’s superior ability over our new dataset. Our model achieves nearly 2% improvement over other comparison models on CIDEr. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

20 pages, 16843 KiB

Open AccessTechnical Note

STCA: High-Altitude Tracking via Single-Drone Tracking and Cross-Drone Association

by Yu Qiao, Huijie Fan, Qiang Wang, Tinghui Zhao and Yandong Tang

Remote Sens. 2024, 16(20), 3861; https://doi.org/10.3390/rs16203861 - 17 Oct 2024

Viewed by 1250

In this paper, we introduce a high-altitude multi-drone multi-target (HAMDMT) tracking method called STCA, which aims to collaboratively track similar targets that are easily confused. We approach this challenge by categorizing the HAMDMT tracking into two principal tasks: Single-Drone Tracking and Cross-Drone Association. [...] Read more.

In this paper, we introduce a high-altitude multi-drone multi-target (HAMDMT) tracking method called STCA, which aims to collaboratively track similar targets that are easily confused. We approach this challenge by categorizing the HAMDMT tracking into two principal tasks: Single-Drone Tracking and Cross-Drone Association. Single-Drone Tracking employs positional and appearance data vectors to overcome the challenges arising from similar target appearances within the field of view of a single drone. The Cross-Drone Association employs image-matching technology (LightGlue) to ascertain the topological relationships between images captured by disparate drones, thereby accurately determining the associations between targets across multiple drones. In Cross-Drone Association, we enhanced LightGlue into a more efficacious method, designated T-LightGlue, for cross-drone target tracking. This approach markedly accelerates the tracking process while reducing indicator dropout. To narrow down the range of targets involved in the cross-drone association, we develop a Common View Area Model based on the four vertices of the image. Considering to mitigate the occlusion encountered by high-altitude drones, we design a Local-Matching Model that assigns the same ID to the mutually nearest pair of targets from different drones after mapping the centroids of the targets across drones. The MDMT dataset is the only one captured by a high-altitude drone and contains a substantial number of similar vehicles. In the MDMT dataset, the STCA achieves the highest MOTA in Single-Drone Tracking, with the IDF1 system achieving the second-highest performance and the MDA system achieving the highest performance in Cross-Drone Association. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

22 pages, 29294 KiB

Open AccessArticle

Ghost Removal from Forward-Scan Sonar Views near the Sea Surface for Image Enhancement and 3-D Object Modeling

by Yuhan Liu and Shahriar Negahdaripour

Remote Sens. 2024, 16(20), 3814; https://doi.org/10.3390/rs16203814 - 14 Oct 2024

Cited by 1 | Viewed by 1419

Underwater sonar is the primary remote sensing and imaging modality within turbid environments with poor visibility. The two-dimensional (2-D) images of a target near the air–sea interface (or resting on a hard seabed), acquired by forward-scan sonar (FSS), are generally corrupted by the [...] Read more.

Underwater sonar is the primary remote sensing and imaging modality within turbid environments with poor visibility. The two-dimensional (2-D) images of a target near the air–sea interface (or resting on a hard seabed), acquired by forward-scan sonar (FSS), are generally corrupted by the ghost and sometimes mirror components, formed by the multipath propagation of transmitted acoustic beams. In the processing of the 2-D FSS views to generate an accurate three-dimensional (3-D) object model, the corrupted regions have to be discarded. The sonar tilt angle and distance from the sea surface are two important parameters for the accurate localization of the ghost and mirror components. We propose a unified optimization technique for improving both the measurements of these two parameters from inexpensive sensors and the accuracy of a 3-D object model using 2-D FSS images at known poses. The solution is obtained by the recursive updating of sonar parameters and 3-D object model. Utilizing the 3-D object model, we can enhance the original images and generate synthetic views for arbitrary sonar poses. We demonstrate the performance of our method in experiments with the synthetic and real images of three targets: two dominantly convex coral rocks and a highly concave toy wood table. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

18 pages, 3999 KiB

Open AccessArticle

SS-YOLOv8: A Lightweight Algorithm for Surface Litter Detection

by Zhipeng Fan, Zheng Qin, Wei Liu, Ming Chen and Zeguo Qiu

Appl. Sci. 2024, 14(20), 9283; https://doi.org/10.3390/app14209283 - 12 Oct 2024

Viewed by 2077

With the advancement of science and technology, pollution in rivers and water surfaces has increased, impacting both ecology and public health. Timely identification of surface waste is crucial for effective cleanup. Traditional edge detection devices struggle with limited memory and resources, making the [...] Read more.

With the advancement of science and technology, pollution in rivers and water surfaces has increased, impacting both ecology and public health. Timely identification of surface waste is crucial for effective cleanup. Traditional edge detection devices struggle with limited memory and resources, making the YOLOv8 algorithm inefficient. This paper introduces a lightweight network model for detecting water surface litter. We enhance the CSP Bottleneck with a two-convolutions (C2f) module to improve image recognition tasks. By implementing the powerful intersection over union 2 (PIoU2), we enhance model accuracy over the original CIoU. Our novel Shared Convolutional Detection Head (SCDH) minimizes parameters, while the scale layer optimizes feature scaling. Using a slimming pruning method, we further reduce the model’s size and computational needs. Our model achieves a mean average precision (mAP) of 79.9% on the surface litter dataset, with a compact size of 2.3 MB and a processing rate of 128 frames per second, meeting real-time detection requirements. This work significantly contributes to efficient environmental monitoring and offers a scalable solution for deploying advanced detection models on resource-constrained devices. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 9898 KiB

Open AccessArticle

Land Cover Mapping in East China for Enhancing High-Resolution Weather Simulation Models

by Bingxin Ma, Yang Shao, Hequn Yang, Yiwen Lu, Yanqing Gao, Xinyao Wang, Ying Xie and Xiaofeng Wang

Remote Sens. 2024, 16(20), 3759; https://doi.org/10.3390/rs16203759 - 10 Oct 2024

Viewed by 1532

This study was designed to develop a 30 m resolution land cover dataset to improve the performance of regional weather forecasting models in East China. A 10-class land cover mapping scheme was established, reflecting East China’s diverse landscape characteristics and incorporating a new [...] Read more.

This study was designed to develop a 30 m resolution land cover dataset to improve the performance of regional weather forecasting models in East China. A 10-class land cover mapping scheme was established, reflecting East China’s diverse landscape characteristics and incorporating a new category for plastic greenhouses. Plastic greenhouses are key to understanding surface heterogeneity in agricultural regions, as they can significantly impact local climate conditions, such as heat flux and evapotranspiration, yet they are often not represented in conventional land cover classifications. This is mainly due to the lack of high-resolution datasets capable of detecting these small yet impactful features. For the six-province study area, we selected and processed Landsat 8 imagery from 2015–2018, filtering for cloud cover. Complementary datasets, such as digital elevation models (DEM) and nighttime lighting data, were integrated to enrich the inputs for the Random Forest classification. A comprehensive training dataset was compiled to support Random Forest training and classification accuracy. We developed an automated workflow to manage the data processing, including satellite image selection, preprocessing, classification, and image mosaicking, thereby ensuring the system’s practicality and facilitating future updates. We included three Weather Research and Forecasting (WRF) model experiments in this study to highlight the impact of our land cover maps on daytime and nighttime temperature predictions. The resulting regional land cover dataset achieved an overall accuracy of 83.2% and a Kappa coefficient of 0.81. These accuracy statistics are higher than existing national and global datasets. The model results suggest that the newly developed land cover, combined with a mosaic option in the Unified Noah scheme in WRF, provided the best overall performance for both daytime and nighttime temperature predictions. In addition to supporting the WRF model, our land cover map products, with a planned 3–5-year update schedule, could serve as a valuable data source for ecological assessments in the East China region, informing environmental policy and promoting sustainability. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

20 pages, 6728 KiB

Open AccessArticle

Diffusion Model for Camouflaged Object Segmentation with Frequency Domain

by Wei Cai, Weijie Gao, Yao Ding, Xinhao Jiang, Xin Wang and Xingyu Di

Electronics 2024, 13(19), 3922; https://doi.org/10.3390/electronics13193922 - 3 Oct 2024

Viewed by 2102

The task of camouflaged object segmentation (COS) is a challenging endeavor that entails the identification of objects that closely blend in with their surrounding background. Furthermore, the camouflaged object’s obscure form and its subtle differentiation from the background present significant challenges during the [...] Read more.

The task of camouflaged object segmentation (COS) is a challenging endeavor that entails the identification of objects that closely blend in with their surrounding background. Furthermore, the camouflaged object’s obscure form and its subtle differentiation from the background present significant challenges during the feature extraction phase of the network. In order to extract more comprehensive information, thereby improving the accuracy of COS, we propose a diffusion model for a COS network that utilizes frequency domain information as auxiliary input, and we name it FreDiff. Firstly, we proposed a frequency auxiliary module (FAM) to extract frequency domain features. Then, we designed a Global Fusion Module (GFM) to make FreDiff pay attention to the global features. Finally, we proposed an Upsample Enhancement Module (UEM) to enhance the detailed information of the features and perform upsampling before inputting them into the diffusion model. Additionally, taking into account the specific characteristics of COS, we develop the specialized training strategy for FreDiff. We compared FreDiff with 17 COS models on the four challenging COS datasets. Experimental results showed that FreDiff outperforms or is consistent with other state-of-the-art methods under five evaluation metrics. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 3869 KiB

Open AccessArticle

Spatio-Temporal Dynamic Attention Graph Convolutional Network Based on Skeleton Gesture Recognition

by Xiaowei Han, Ying Cui, Xingyu Chen, Yunjing Lu and Wen Hu

Electronics 2024, 13(18), 3733; https://doi.org/10.3390/electronics13183733 - 20 Sep 2024

Cited by 6 | Viewed by 2110

Dynamic gesture recognition based on skeletal data has garnered significant attention with the rise of graph convolutional networks (GCNs). Existing methods typically calculate dependencies between joints and utilize spatio-temporal attention features. However, they often rely on joint topological features of limited spatial extent [...] Read more.

Dynamic gesture recognition based on skeletal data has garnered significant attention with the rise of graph convolutional networks (GCNs). Existing methods typically calculate dependencies between joints and utilize spatio-temporal attention features. However, they often rely on joint topological features of limited spatial extent and short-time features, making it challenging to extract intra-frame spatial features and long-term inter-frame temporal features. To address this, we propose a new GCN architecture for dynamic hand gesture recognition, called a spatio-temporal dynamic attention graph convolutional network (STDA-GCN). This model employs dynamic attention spatial graph convolution, enhancing spatial feature extraction capabilities while reducing computational complexity through improved cross-channel information interaction. Additionally, a salient location channel attention mechanism is integrated between spatio-temporal convolutions to extract useful spatial features and avoid redundancy. Finally, dynamic multi-scale temporal convolution is used to extract richer inter-frame gesture features, effectively capturing information across various time scales. Evaluations on the SHREC’17 Track and DHG-14/28 benchmark datasets show that our model achieves 97.14% and 95.84% accuracy, respectively. These results demonstrate the superior performance of STDA-GCN in dynamic gesture recognition tasks. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

22 pages, 11714 KiB

Open AccessArticle

A Light-Weight Self-Supervised Infrared Image Perception Enhancement Method

by Yifan Xiao, Zhilong Zhang and Zhouli Li

Electronics 2024, 13(18), 3695; https://doi.org/10.3390/electronics13183695 - 18 Sep 2024

Cited by 1 | Viewed by 1364

Convolutional Neural Networks (

C N N s

) have achieved remarkable results in the field of infrared image enhancement. However, the research on the visual perception mechanism and the objective evaluation indicators for enhanced infrared images is still not in-depth enough. To [...] Read more.

Convolutional Neural Networks (

C N N s

) have achieved remarkable results in the field of infrared image enhancement. However, the research on the visual perception mechanism and the objective evaluation indicators for enhanced infrared images is still not in-depth enough. To make the subjective and objective evaluation more consistent, this paper uses a perceptual metric to evaluate the enhancement effect of infrared images. The perceptual metric mimics the early conversion process of the human visual system and uses the normalized Laplacian pyramid distance (

N L P D

) between the enhanced image and the original scene radiance to evaluate the image enhancement effect. Based on this, this paper designs an infrared image-enhancement algorithm that is more conducive to human visual perception. The algorithm uses a lightweight Fully Convolutional Network (

F C N

), with

N L P D

as the similarity measure, and trains the network in a self-supervised manner by minimizing the

N L P D

between the enhanced image and the original scene radiance to achieve infrared image enhancement. The experimental results show that the infrared image enhancement method in this paper outperforms existing methods in terms of visual perception quality, and due to the use of a lightweight network, it is also the fastest enhancement method currently. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

21 pages, 9454 KiB

Open AccessArticle

Denoising Diffusion Implicit Model for Camouflaged Object Detection

by Wei Cai, Weijie Gao, Xinhao Jiang, Xin Wang and Xingyu Di

Electronics 2024, 13(18), 3690; https://doi.org/10.3390/electronics13183690 - 17 Sep 2024

Viewed by 1609

Camouflaged object detection (COD) is a challenging task that involves identifying objects that closely resemble their background. In order to detect camouflaged objects more accurately, we propose a diffusion model for the COD network called DMNet. DMNet formulates COD as a denoising diffusion [...] Read more.

Camouflaged object detection (COD) is a challenging task that involves identifying objects that closely resemble their background. In order to detect camouflaged objects more accurately, we propose a diffusion model for the COD network called DMNet. DMNet formulates COD as a denoising diffusion process from noisy boxes to prediction boxes. During the training stage, random boxes diffuse from ground-truth boxes, and DMNet learns to reverse this process. In the sampling stage, DMNet progressively refines random boxes to prediction boxes. In addition, due to the camouflaged object’s blurred appearance and the low contrast between it and the background, the feature extraction stage of the network is challenging. Firstly, we proposed a parallel fusion module (PFM) to enhance the information extracted from the backbone. Then, we designed a progressive feature pyramid network (PFPN) for feature fusion, in which the upsample adaptive spatial fusion module (UAF) balances the different feature information by assigning weights to different layers. Finally, a location refinement module (LRM) is constructed to make DMNet pay attention to the boundary details. We compared DMNet with other classical object-detection models on the COD10K dataset. Experimental results indicated that DMNet outperformed others, achieving optimal effects across six evaluation metrics and significantly enhancing detection accuracy. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

20 pages, 7583 KiB

Open AccessArticle

Object/Scene Recognition Based on a Directional Pixel Voting Descriptor

by Abiel Aguilar-González, Alejandro Medina Santiago and J. A. de Jesús Osuna-Coutiño

Appl. Sci. 2024, 14(18), 8187; https://doi.org/10.3390/app14188187 - 11 Sep 2024

Viewed by 1111

Detecting objects in images is crucial for several applications, including surveillance, autonomous navigation, augmented reality, and so on. Although AI-based approaches such as Convolutional Neural Networks (CNNs) have proven highly effective in object detection, in scenarios where the objects being recognized are unknow, [...] Read more.

Detecting objects in images is crucial for several applications, including surveillance, autonomous navigation, augmented reality, and so on. Although AI-based approaches such as Convolutional Neural Networks (CNNs) have proven highly effective in object detection, in scenarios where the objects being recognized are unknow, it is difficult to generalize an AI model for such tasks. In another trend, feature-based approaches like SIFT, SURF, and ORB offer the capability to search any object but have limitations under complex visual variations. In this work, we introduce a novel edge-based object/scene recognition method. We propose that utilizing feature edges, instead of feature points, offers high performance under complex visual variations. Our primary contribution is a directional pixel voting descriptor based on image segments. Experimental results are promising; compared to previous approaches, ours demonstrates superior performance under complex visual variations and high processing speed. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 5025 KiB

Open AccessArticle

Adaptive Channel-Enhanced Graph Convolution for Skeleton-Based Human Action Recognition

by Xiao-Wei Han, Xing-Yu Chen, Ying Cui, Qiu-Yang Guo and Wen Hu

Appl. Sci. 2024, 14(18), 8185; https://doi.org/10.3390/app14188185 - 11 Sep 2024

Viewed by 1105

Obtaining discriminative joint features is crucial for skeleton-based human action recognition. Current models mainly focus on the research of skeleton topology encoding. However, their predefined topology is the same and fixed for all action samples, making it challenging to obtain discriminative joint features. [...] Read more.

Obtaining discriminative joint features is crucial for skeleton-based human action recognition. Current models mainly focus on the research of skeleton topology encoding. However, their predefined topology is the same and fixed for all action samples, making it challenging to obtain discriminative joint features. Although some studies have considered the complex non-natural connection relationships between joints, the existing methods cannot fully capture this complexity by using high-order adjacency matrices or adding trainable parameters and instead increase the computation parameters. Therefore, this study constructs a novel adaptive channel-enhanced graph convolution (ACE-GCN) model for human action recognition. The model generates similar and affinity attention maps by encoding channel attention in the input features. These maps are complementarily applied to the input feature map and graph topology, which can realize the refinement of joint features and construct an adaptive and non-shared channel-based adjacency matrix. This method of constructing the adjacency matrix improves the model’s capacity to capture intricate non-natural connections between joints, prevents the accumulation of unnecessary information, and minimizes the number of computational parameters. In addition, integrating the Edgeconv module into a multi-branch aggregation improves the model’s ability to aggregate different scale and temporal features. Ultimately, comprehensive experiments were carried out on NTU-RGB+D 60 and NTU-RGB+D 120, which are two substantial datasets. On the NTU RGB+D 60 dataset, the accuracy of human action recognition was 92% (X-Sub) and 96.3% (X-View). The model achieved an accuracy of 96.6% on the NW-UCLA dataset. The experimental results confirm that the ACE-GCN exhibits superior recognition accuracy and lower computing complexity compared to current methodologies. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

20 pages, 22277 KiB

Open AccessArticle

Attention-Based Spatiotemporal-Aware Network for Fine-Grained Visual Recognition

by Yili Ren, Ruidong Lu, Guan Yuan, Dashuai Hao and Hongjue Li

Appl. Sci. 2024, 14(17), 7755; https://doi.org/10.3390/app14177755 - 2 Sep 2024

Viewed by 1373

On public benchmarks, current macro facial expression recognition technologies have achieved significant success. However, in real-life scenarios, individuals may attempt to conceal their true emotions. Conventional expression recognition often overlooks subtle facial changes, necessitating more fine-grained micro-expression recognition techniques. Different with prevalent facial [...] Read more.

On public benchmarks, current macro facial expression recognition technologies have achieved significant success. However, in real-life scenarios, individuals may attempt to conceal their true emotions. Conventional expression recognition often overlooks subtle facial changes, necessitating more fine-grained micro-expression recognition techniques. Different with prevalent facial expressions, weak intensity and short duration are the two main obstacles for perceiving and interpreting a micro-expression correctly. Meanwhile, correlations between pixels of visual data in spatial and channel dimensions are ignored in most existing methods. In this paper, we propose a novel network structure, the Attention-based Spatiotemporal-aware network (ASTNet), for micro-expression recognition. In ASTNet, we combine ResNet and ConvLSTM as a holistic framework (ResNet-ConvLSTM) to extract the spatial and temporal features simultaneously. Moreover, we innovatively integrate two level attention mechanisms, channel-level attention and spatial-level attention, into the ResNet-ConvLSTM. Channel-level attention is used to discriminate the importance of different channels because the contributions for the overall presentation of micro-expression vary between channels. Spatial-level attention is leveraged to dynamically estimate weights for different regions due to the diversity of regions’ reflections to micro-expression. Extensive experiments conducted on two benchmark datasets demonstrate that ASTNet achieves performance improvements of 4.25–16.02% and 0.79–12.93% over several state-of-the-art methods. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

13 pages, 5007 KiB

Open AccessArticle

Infrared Image Enhancement Method of Substation Equipment Based on Self-Attention Cycle Generative Adversarial Network (SA-CycleGAN)

by Yuanbin Wang and Bingchao Wu

Electronics 2024, 13(17), 3376; https://doi.org/10.3390/electronics13173376 - 26 Aug 2024

Cited by 1 | Viewed by 1355

During the acquisition of infrared images in substations, low-quality images with poor contrast, blurred details, and missing texture information frequently appear, which adversely affects subsequent advanced visual tasks. To address this issue, this paper proposes an infrared image enhancement algorithm for substation equipment [...] Read more.

During the acquisition of infrared images in substations, low-quality images with poor contrast, blurred details, and missing texture information frequently appear, which adversely affects subsequent advanced visual tasks. To address this issue, this paper proposes an infrared image enhancement algorithm for substation equipment based on a self-attention cycle generative adversarial network (SA-CycleGAN). The proposed algorithm incorporates a self-attention mechanism into the CycleGAN model’s transcoding network to improve the mapping ability of infrared image information, enhance image contrast, and reducing the number of model parameters. The addition of an efficient local attention mechanism (EAL) and a feature pyramid structure within the encoding network enhances the generator’s ability to extract features and texture information from small targets in infrared substation equipment images, effectively improving image details. In the discriminator part, the model’s performance is further enhanced by constructing a two-channel feature network. To accelerate the model’s convergence, the loss function of the original CycleGAN is optimized. Compared to several mainstream image enhancement algorithms, the proposed algorithm improves the quality of low-quality infrared images by an average of 10.91% in color degree, 18.89% in saturation, and 29.82% in feature similarity indices. Additionally, the number of parameters in the proposed algorithm is reduced by 37.89% compared to the original model. Finally, the effectiveness of the proposed method in improving recognition accuracy is validated by the Centernet target recognition algorithm. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

26 pages, 9607 KiB

Open AccessArticle

A Global Spatial-Spectral Feature Fused Autoencoder for Nonlinear Hyperspectral Unmixing

by Mingle Zhang, Mingyu Yang, Hongyu Xie, Pinliang Yue, Wei Zhang, Qingbin Jiao, Liang Xu and Xin Tan

Remote Sens. 2024, 16(17), 3149; https://doi.org/10.3390/rs16173149 - 26 Aug 2024

Cited by 3 | Viewed by 1343

Hyperspectral unmixing (HU) aims to decompose mixed pixels into a set of endmembers and corresponding abundances. Deep learning-based HU methods are currently a hot research topic, but most existing unmixing methods still rely on per-pixel training or employ convolutional neural networks (CNNs), which [...] Read more.

Hyperspectral unmixing (HU) aims to decompose mixed pixels into a set of endmembers and corresponding abundances. Deep learning-based HU methods are currently a hot research topic, but most existing unmixing methods still rely on per-pixel training or employ convolutional neural networks (CNNs), which overlook the non-local correlations of materials and spectral characteristics. Furthermore, current research mainly focuses on linear mixing models, which limits the feature extraction capability of deep encoders and further improvement in unmixing accuracy. In this paper, we propose a nonlinear unmixing network capable of extracting global spatial-spectral features. The network is designed based on an autoencoder architecture, where a dual-stream CNNs is employed in the encoder to separately extract spectral and local spatial information. The extracted features are then fused together to form a more complete representation of the input data. Subsequently, a linear projection-based multi-head self-attention mechanism is applied to capture global contextual information, allowing for comprehensive spatial information extraction while maintaining lightweight computation. To achieve better reconstruction performance, a model-free nonlinear mixing approach is adopted to enhance the model’s universality, with the mixing model learned entirely from the data. Additionally, an initialization method based on endmember bundles is utilized to reduce interference from outliers and noise. Comparative results on real datasets against several state-of-the-art unmixing methods demonstrate the superior of the proposed approach. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

21 pages, 8951 KiB

Open AccessArticle

Radiation Anomaly Detection of Sub-Band Optical Remote Sensing Images Based on Multiscale Deep Dynamic Fusion and Adaptive Optimization

by Jinlong Ci, Hai Tan, Haoran Zhai and Xinming Tang

Remote Sens. 2024, 16(16), 2953; https://doi.org/10.3390/rs16162953 - 12 Aug 2024

Cited by 1 | Viewed by 1506

Radiation anomalies in optical remote sensing images frequently occur due to electronic issues within the image sensor or data transmission errors. These radiation anomalies can be categorized into several types, including CCD, StripeNoise, RandomCode1, RandomCode2, ImageMissing, and Tap. To ensure the retention of [...] Read more.

Radiation anomalies in optical remote sensing images frequently occur due to electronic issues within the image sensor or data transmission errors. These radiation anomalies can be categorized into several types, including CCD, StripeNoise, RandomCode1, RandomCode2, ImageMissing, and Tap. To ensure the retention of image data with minimal radiation issues as much as possible, this paper adopts a self-made radiation dataset and proposes a FlexVisionNet-YOLO network to detect radiation anomalies more accurately. Firstly, RepViT is used as the backbone network with a vision transformer architecture to better capture global and local features. Its multiscale feature fusion mechanism efficiently handles targets of different sizes and shapes, enhancing the detection ability for radiation anomalies. Secondly, a feature depth fusion network is proposed in the Feature Fusion part, which significantly improves the flexibility and accuracy of feature fusion and thus enhances the detection and classification performance of complex remote sensing images. Finally, Inner-CIoU is used in the Head part for edge regression, which significantly improves the localization accuracy by finely adjusting the target edges; Slide-Loss is used for classification loss, which enhances the classification robustness by dynamically adjusting the category probabilities and markedly improves the classification accuracy, especially in the sample imbalance dataset. Experimental results show that, compared to YOLOv8, the proposed FlexVisionNet-YOLO method improves precision, recall, mAP0.5, and mAP0.5:0.9 by 3.5%, 7.1%, 4.4%, and 13.6%, respectively. Its effectiveness in detecting radiation anomalies surpasses that of other models. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

25 pages, 11593 KiB

Open AccessArticle

An Effective and Lightweight Full-Scale Target Detection Network for UAV Images Based on Deformable Convolutions and Multi-Scale Contextual Feature Optimization

by Wanwan Yu, Junping Zhang, Dongyang Liu, Yunqiao Xi and Yinhu Wu

Remote Sens. 2024, 16(16), 2944; https://doi.org/10.3390/rs16162944 - 11 Aug 2024

Cited by 3 | Viewed by 3454

Currently, target detection on unmanned aerial vehicle (UAV) images is a research hotspot. Due to the significant scale variability of targets and the interference of complex backgrounds, current target detection models face challenges when applied to UAV images. To address these issues, we [...] Read more.

Currently, target detection on unmanned aerial vehicle (UAV) images is a research hotspot. Due to the significant scale variability of targets and the interference of complex backgrounds, current target detection models face challenges when applied to UAV images. To address these issues, we designed an effective and lightweight full-scale target detection network, FSTD-Net. The design of FSTD-Net is based on three principal aspects. Firstly, to optimize the extracted target features at different scales while minimizing background noise and sparse feature representations, a multi-scale contextual information extraction module (MSCIEM) is developed. The multi-scale information extraction module (MSIEM) in MSCIEM can better capture multi-scale features, and the contextual information extraction module (CIEM) in MSCIEM is designed to capture long-range contextual information. Secondly, to better adapt to various target shapes at different scales in UAV images, we propose the feature extraction module fitting different shapes (FEMFDS), based on deformable convolutions. Finally, considering low-level features contain rich details, a low-level feature enhancement branch (LLFEB) is designed. The experiments demonstrate that, compared to the second-best model, the proposed FSTD-Net achieves improvements of 3.8%, 2.4%, and 2.0% in AP₅₀, AP, and AP₇₅ on the VisDrone2019, respectively. Additionally, FSTD-Net achieves enhancements of 3.4%, 1.7%, and 1% on the UAVDT dataset. Our proposed FSTD-Net has better detection performance compared to state-of-the-art detection models. The experimental results indicate the effectiveness of the FSTD-Net for target detection in UAV images. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

19 pages, 5414 KiB

Open AccessArticle

Implicit Sharpness-Aware Minimization for Domain Generalization

by Mingrong Dong, Yixuan Yang, Kai Zeng, Qingwang Wang and Tao Shen

Remote Sens. 2024, 16(16), 2877; https://doi.org/10.3390/rs16162877 - 6 Aug 2024

Cited by 1 | Viewed by 1895

Domain generalization (DG) aims to learn knowledge from multiple related domains to achieve a robust generalization performance in unseen target domains, which is an effective approach to mitigate domain shift in remote sensing image classification. Although the sharpness-aware minimization (SAM) method enhances DG [...] Read more.

Domain generalization (DG) aims to learn knowledge from multiple related domains to achieve a robust generalization performance in unseen target domains, which is an effective approach to mitigate domain shift in remote sensing image classification. Although the sharpness-aware minimization (SAM) method enhances DG capability and improves remote sensing image classification performance by promoting the convergence of the loss minimum to a flatter loss surface, the perturbation loss (maximum loss within the neighborhood of a local minimum) of SAM fails to accurately measure the true sharpness of the loss landscape. Furthermore, its variants often overlook gradient conflicts, thereby limiting further improvement in DG performance. In this paper, we introduce implicit sharpness-aware minimization (ISAM), a novel method that addresses the deficiencies of SAM and mitigates gradient conflicts. Specifically, we demonstrate that the discrepancy in training loss during gradient ascent or descent serves as an equivalent measure of the dominant eigenvalue of the Hessian matrix. This discrepancy provides a reliable measure for sharpness. ISAM effectively reduces sharpness and mitigates potential conflicts between gradients by implicitly minimizing the discrepancy between training losses while ensuring a sufficiently low minimum through minimizing perturbation loss. Extensive experiments and analyses demonstrate that ISAM significantly enhances the model’s generalization ability on remote sensing and DG datasets, outperforming existing state-of-the-art methods. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 7392 KiB

Open AccessArticle

Lightweight Water Surface Object Detection Network for Unmanned Surface Vehicles

by Chenlong Li, Lan Wang, Yitong Liu and Shuaike Zhang

Electronics 2024, 13(15), 3089; https://doi.org/10.3390/electronics13153089 - 4 Aug 2024

Cited by 3 | Viewed by 1874

The detection algorithms for water surface objects considerably assist unmanned surface vehicles in rapidly perceiving their surrounding environment, providing essential environmental information and evaluating object attributes. This study proposes a lightweight water surface target detection algorithm called YOLO-WSD (water surface detection), based on [...] Read more.

The detection algorithms for water surface objects considerably assist unmanned surface vehicles in rapidly perceiving their surrounding environment, providing essential environmental information and evaluating object attributes. This study proposes a lightweight water surface target detection algorithm called YOLO-WSD (water surface detection), based on YOLOv8n, to address the need for real-time, high-precision, and lightweight target detection algorithms that can adapt to the rapid changes in the surrounding environment during specific tasks. Initially, we designed the C2F-E module, enriched in gradient flow compared to the conventional C2F module, enabling the backbone network to extract richer multi-level features while maintaining lightness. Additionally, this study redesigns the feature fusion network structure by introducing low-level features and achieving multi-level fusion to enhance the network’s capability of integrating multiple levels. Meanwhile, it investigates the impact of channel number differences in the Concat module fusion on model performance, thereby optimizing the neural network structure. Lastly, it introduces the WIOU localization loss function to bolster model robustness. Experiments demonstrated that YOLO-WSD achieves a 4.6% and 3.4% improvement in mAP0.5 on the water surface object detection dataset and Seaship public dataset, respectively, with recall rates improving by 5.4% and 8.5% relative to the baseline YOLOv8n model. The model’s parameter size is 3.3 M. YOLO-WSD exhibits superior performance compared to other mainstream lightweight algorithms. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

23 pages, 3243 KiB

Open AccessArticle

StarCAN-PFD: An Efficient and Simplified Multi-Scale Feature Detection Network for Small Objects in Complex Scenarios

by Zongxuan Chai, Tingting Zheng and Feixiang Lu

Electronics 2024, 13(15), 3076; https://doi.org/10.3390/electronics13153076 - 3 Aug 2024

Cited by 1 | Viewed by 2244

Small object detection in traffic sign applications often faces challenges like complex backgrounds, blurry samples, and multi-scale variations. Existing solutions tend to complicate the algorithms. In this study, we designed an efficient and simple algorithm network called StarCAN-PFD, based on the single-stage YOLOv8 [...] Read more.

Small object detection in traffic sign applications often faces challenges like complex backgrounds, blurry samples, and multi-scale variations. Existing solutions tend to complicate the algorithms. In this study, we designed an efficient and simple algorithm network called StarCAN-PFD, based on the single-stage YOLOv8 framework, to accurately recognize small objects in complex scenarios. We proposed the StarCAN feature extraction network, which was enhanced with the Context Anchor Attention (CAA). We designed the Pyramid Focus and Diffusion Network (PFDNet) to address multi-scale information loss and developed the Detail-Enhanced Conv Shared Detect (DESDetect) module to improve the recognition of complex samples while keeping the network lightweight. Experiments on the CCTSDB dataset validated the effectiveness of each module. Compared to YOLOv8, our algorithm improved mAP@0.5 by 4%, reduced the model size to less than half, and demonstrated better performance on different traffic sign datasets. It excels at detecting small traffic sign targets in complex scenes, including challenging samples such as blurry, low-light night, occluded, and overexposed conditions, showcasing strong generalization ability. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

19 pages, 8202 KiB

Open AccessArticle

An Accurate and Robust Multimodal Template Matching Method Based on Center-Point Localization in Remote Sensing Imagery

by Jiansong Yang, Yongbin Zheng, Wanying Xu, Peng Sun and Shengjian Bai

Remote Sens. 2024, 16(15), 2831; https://doi.org/10.3390/rs16152831 - 1 Aug 2024

Viewed by 2601

Deep learning-based template matching in remote sensing has received increasing research attention. Existing anchor box-based and anchor-free methods often suffer from low template localization accuracy in the presence of multimodal, nonrigid deformation and occlusion. To address this problem, we transform the template matching [...] Read more.

Deep learning-based template matching in remote sensing has received increasing research attention. Existing anchor box-based and anchor-free methods often suffer from low template localization accuracy in the presence of multimodal, nonrigid deformation and occlusion. To address this problem, we transform the template matching task into a center-point localization task for the first time and propose an end-to-end template matching method based on a novel fully convolutional Siamese network. Furthermore, we propose an adaptive shrinkage cross-correlation scheme, which improves the precision of template localization and alleviates the impact of background clutter without adding any parameters. We also design a scheme that leverages keypoint information to assist in locating the template center, thereby enhancing the precision of template localization. We construct a multimodal template matching dataset to verify the performance of the method in dealing with differences in view, scale, rotation and occlusion in practical application scenarios. Extensive experiments on a public dataset, OTB, the proposed dataset, as well as a remote sensing dataset, SEN1-2, demonstrate that our method achieves state-of-the-art performance. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

22 pages, 5295 KiB

Open AccessArticle

Research on Clothing Image Retrieval Combining Topology Features with Color Texture Features

by Xu Zhang, Huadong Sun and Jian Ma

Mathematics 2024, 12(15), 2363; https://doi.org/10.3390/math12152363 - 29 Jul 2024

Viewed by 1335

Topological data analysis (TDA) is a method of feature extraction based on data topological structure. Image feature extraction using TDA has been shown to be superior to other feature extraction techniques in some problems, so it has recently received the attention of researchers. [...] Read more.

Topological data analysis (TDA) is a method of feature extraction based on data topological structure. Image feature extraction using TDA has been shown to be superior to other feature extraction techniques in some problems, so it has recently received the attention of researchers. In this paper, clothing image retrieval based on topology features and color texture features is studied. The main work is as follows: (1) Based on the analysis of image data by persistent homology, the feature construction method of a topology feature histogram is proposed, which can represent the ruler of image local topological data, and make up for the shortcomings of traditional feature extraction methods. (2) The improvement of Wasserstein distance is presented, while the similarity measure method named topology feature histogram distance is proposed. (3) Because the single feature has some problems such as the incomplete description of image information and poor robustness, the clothing image retrieval is realized by combining the topology feature with the color texture feature. The experimental results show that the proposed algorithm, namely topology feature histogram + corresponding distance, can effectively reduce the computation time while ensuring the accuracy. Compared with the method using only color texture, the retrieval rate of top5 is improved by 14.9%. Compared with the method using cubic complex + Wasserstein distance, the retrieval rate of top5 is improved by 3.8%, while saving 3.93 s computation time. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

30 pages, 15406 KiB

Open AccessArticle

Addressing Demographic Bias in Age Estimation Models through Optimized Dataset Composition

by Nenad Panić, Marina Marjanović and Timea Bezdan

Mathematics 2024, 12(15), 2358; https://doi.org/10.3390/math12152358 - 28 Jul 2024

Cited by 3 | Viewed by 2423

Bias in facial recognition systems often results in unequal performance across demographic groups. This study addresses this by investigating how dataset composition affects the performance and bias of age estimation models across ethnicities. We fine-tuned pre-trained Convolutional Neural Networks (CNNs) like VGG19 on [...] Read more.

Bias in facial recognition systems often results in unequal performance across demographic groups. This study addresses this by investigating how dataset composition affects the performance and bias of age estimation models across ethnicities. We fine-tuned pre-trained Convolutional Neural Networks (CNNs) like VGG19 on the diverse UTKFace dataset (23,705 samples: 10,078 White, 4526 Black, 3434 Asian) and APPA-REAL (7691 samples: 6686 White, 231 Black, 674 Asian). Our approach involved adjusting dataset compositions by oversampling minority groups or reducing samples from overrepresented groups to mitigate bias. We conducted experiments to identify the optimal dataset composition that minimizes performance disparities among ethnic groups. The primary performance metric was Mean Absolute Error (MAE), measuring the average magnitude of prediction errors. We also analyzed the standard deviation of MAE across ethnic groups to assess performance consistency and equity. Our findings reveal that simple oversampling of minority groups does not ensure equitable performance. Instead, systematic adjustments, including reducing samples from overrepresented groups, led to more balanced performance and lower MAE standard deviations across ethnicities. These insights highlight the importance of tailored dataset adjustments and suggest exploring advanced data processing methods and algorithmic tweaks to enhance fairness and accuracy in facial recognition technologies. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

15 pages, 5580 KiB

Open AccessArticle

DANet: A Domain Alignment Network for Low-Light Image Enhancement

by Qiao Li, Bin Jiang, Xiaochen Bo, Chao Yang and Xu Wu

Electronics 2024, 13(15), 2954; https://doi.org/10.3390/electronics13152954 - 26 Jul 2024

Viewed by 1606

We propose restoring low-light images suffering from severe degradation using a deep-learning approach. A significant domain gap exists between low-light and real images, which previous methods have failed to address with domain alignment. To tackle this, we introduce a domain alignment network leveraging [...] Read more.

We propose restoring low-light images suffering from severe degradation using a deep-learning approach. A significant domain gap exists between low-light and real images, which previous methods have failed to address with domain alignment. To tackle this, we introduce a domain alignment network leveraging dual encoders and a domain alignment loss. Specifically, we train two dual encoders to transform low-light and real images into two latent spaces and align these spaces using a domain alignment loss. Additionally, we design a Convolution-Transformer module (CTM) during the encoding process to comprehensively extract both local and global features. Experimental results on four benchmark datasets demonstrate that our proposed A Domain Alignment Network(DANet) method outperforms state-of-the-art methods. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 1884 KiB

Open AccessReview

Image-Based 3D Reconstruction in Laparoscopy: A Review Focusing on the Quantitative Evaluation by Applying the Reconstruction Error

by Birthe Göbel, Alexander Reiterer and Knut Möller

J. Imaging 2024, 10(8), 180; https://doi.org/10.3390/jimaging10080180 - 24 Jul 2024

Cited by 3 | Viewed by 2939

Image-based 3D reconstruction enables laparoscopic applications as image-guided navigation and (autonomous) robot-assisted interventions, which require a high accuracy. The review’s purpose is to present the accuracy of different techniques to label the most promising. A systematic literature search with PubMed and google scholar [...] Read more.

Image-based 3D reconstruction enables laparoscopic applications as image-guided navigation and (autonomous) robot-assisted interventions, which require a high accuracy. The review’s purpose is to present the accuracy of different techniques to label the most promising. A systematic literature search with PubMed and google scholar from 2015 to 2023 was applied by following the framework of “Review articles: purpose, process, and structure”. Articles were considered when presenting a quantitative evaluation (root mean squared error and mean absolute error) of the reconstruction error (Euclidean distance between real and reconstructed surface). The search provides 995 articles, which were reduced to 48 articles after applying exclusion criteria. From these, a reconstruction error data set could be generated for the techniques of stereo vision, Shape-from-Motion, Simultaneous Localization and Mapping, deep-learning, and structured light. The reconstruction error varies from below one millimeter to higher than ten millimeters—with deep-learning and Simultaneous Localization and Mapping delivering the best results under intraoperative conditions. The high variance emerges from different experimental conditions. In conclusion, submillimeter accuracy is challenging, but promising image-based 3D reconstruction techniques could be identified. For future research, we recommend computing the reconstruction error for comparison purposes and use ex/in vivo organs as reference objects for realistic experiments. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

14 pages, 4700 KiB

Open AccessArticle

Few-Shot Conditional Learning: Automatic and Reliable Device Classification for Medical Test Equipment

by Eva Pachetti, Giulio Del Corso, Serena Bardelli and Sara Colantonio

J. Imaging 2024, 10(7), 167; https://doi.org/10.3390/jimaging10070167 - 13 Jul 2024

Cited by 2 | Viewed by 1381

The limited availability of specialized image databases (particularly in hospitals, where tools vary between providers) makes it difficult to train deep learning models. This paper presents a few-shot learning methodology that uses a pre-trained ResNet integrated with an encoder as a backbone to [...] Read more.

The limited availability of specialized image databases (particularly in hospitals, where tools vary between providers) makes it difficult to train deep learning models. This paper presents a few-shot learning methodology that uses a pre-trained ResNet integrated with an encoder as a backbone to encode conditional shape information for the classification of neonatal resuscitation equipment from less than 100 natural images. The model is also strengthened by incorporating a reliability score, which enriches the prediction with an estimation of classification reliability. The model, whose performance is cross-validated, reached a median accuracy performance of over 99% (and a lower limit of 73.4% for the least accurate model/fold) using only 87 meta-training images. During the test phase on complex natural images, performance was slightly degraded due to a sub-optimal segmentation strategy (FastSAM) required to maintain the real-time inference phase (median accuracy 87.25%). This methodology proves to be excellent for applying complex classification models to contexts (such as neonatal resuscitation) that are not available in public databases. Improvements to the automatic segmentation strategy prior to the extraction of conditional information will allow a natural application in simulation and hospital settings. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

19 pages, 13763 KiB

Open AccessArticle

AMTT: An End-to-End Anchor-Based Multi-Scale Transformer Tracking Method

by Yitao Zheng, Honggui Deng, Qiguo Xu and Ni Li

Electronics 2024, 13(14), 2710; https://doi.org/10.3390/electronics13142710 - 11 Jul 2024

Viewed by 1053

Most current trackers utilize only the highest-level features to achieve faster tracking performance, making it difficult to achieve accurate tracking of small and low-resolution objects. To address this problem, we propose an end-to-end anchor-based multi-scale transformer tracking (AMTT) approach to improve the tracking [...] Read more.

Most current trackers utilize only the highest-level features to achieve faster tracking performance, making it difficult to achieve accurate tracking of small and low-resolution objects. To address this problem, we propose an end-to-end anchor-based multi-scale transformer tracking (AMTT) approach to improve the tracking performance of the network for objects of different sizes. First, we design a multi-scale feature encoder based on the deformable transformer, which better fuses the multilayer template features and search features through the self-enhancement module and cross-enhancement module to improve the attention of the whole network to objects of different sizes. Then, to reduce the computational overhead of the decoder while further enhancing the multi-scale features, we design a feature focusing block to compress the number of coded features. Finally, we introduce a feature anchor into the traditional decoder and design an anchor-based decoder, which utilizes the feature anchor to guide the decoder to adapt to changes in object scale and achieve more accurate tracking performance. To confirm the effectiveness of our proposed method, we conduct a series of experiments on different datasets such as UAV123, OTB100 and GOT10k. The results show that our adopted method exhibits highly competitive performance compared to the state-of-the-art methods in recent years. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

12 pages, 3830 KiB

Open AccessArticle

Comparative Evaluation of Convolutional Neural Network Object Detection Algorithms for Vehicle Detection

by Saieshan Reddy, Nelendran Pillay and Navin Singh

J. Imaging 2024, 10(7), 162; https://doi.org/10.3390/jimaging10070162 - 5 Jul 2024

Cited by 7 | Viewed by 3273

The domain of object detection was revolutionized with the introduction of Convolutional Neural Networks (CNNs) in the field of computer vision. This article aims to explore the architectural intricacies, methodological differences, and performance characteristics of three CNN-based object detection algorithms, namely Faster Region-Based [...] Read more.

The domain of object detection was revolutionized with the introduction of Convolutional Neural Networks (CNNs) in the field of computer vision. This article aims to explore the architectural intricacies, methodological differences, and performance characteristics of three CNN-based object detection algorithms, namely Faster Region-Based Convolutional Network (R-CNN), You Only Look Once v3 (YOLO), and Single Shot MultiBox Detector (SSD) in the specific domain application of vehicle detection. The findings of this study indicate that the SSD object detection algorithm outperforms the other approaches in terms of both performance and processing speeds. The Faster R-CNN approach detected objects in images with an average speed of 5.1 s, achieving a mean average precision of 0.76 and an average loss of 0.467. YOLO v3 detected objects with an average speed of 1.16 s, achieving a mean average precision of 0.81 with an average loss of 1.183. In contrast, SSD detected objects with an average speed of 0.5 s, exhibiting the highest mean average precision of 0.92 despite having a higher average loss of 2.625. Notably, all three object detectors achieved an accuracy exceeding 99%. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 1534 KiB

Open AccessArticle

PointBLIP: Zero-Training Point Cloud Classification Network Based on BLIP-2 Model

by Yunzhe Xiao, Yong Dou and Shaowu Yang

Remote Sens. 2024, 16(13), 2453; https://doi.org/10.3390/rs16132453 - 3 Jul 2024

Viewed by 2412

Leveraging the open-world understanding capacity of large-scale visual-language pre-trained models has become a hot spot in point cloud classification. Recent approaches rely on transferable visual-language pre-trained models, classifying point clouds by projecting them into 2D images and evaluating consistency with textual prompts. These [...] Read more.

Leveraging the open-world understanding capacity of large-scale visual-language pre-trained models has become a hot spot in point cloud classification. Recent approaches rely on transferable visual-language pre-trained models, classifying point clouds by projecting them into 2D images and evaluating consistency with textual prompts. These methods benefit from the robust open-world understanding capabilities of visual-language pre-trained models and require no additional training. However, they face several challenges summarized as prompt ambiguity, image domain gap, view weight confusion, and feature deviation. In response to these challenges, we propose PointBLIP, a zero-training point cloud classification network based on the recently introduced BLIP-2 visual-language model. PointBLIP is adept at processing similarities between multi-images and multi-prompts. We separately introduce a novel method for point cloud zero-shot and few-shot classification, which involves comparing multiple features to achieve effective classification. Simultaneously, we enhance the input data quality for both the image and text sides of PointBLIP. In point cloud zero-shot classification tasks, we outperform state-of-the-art methods on three benchmark datasets. For few-shot classification tasks, to the best of our knowledge, we present the first zero-training few-shot point cloud method, surpassing previous works under the same conditions and showcasing comparable performance to full-training methods. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

24 pages, 8813 KiB

Open AccessArticle

MSSD-Net: Multi-Scale SAR Ship Detection Network

by Xi Wang, Wei Xu, Pingping Huang and Weixian Tan

Remote Sens. 2024, 16(12), 2233; https://doi.org/10.3390/rs16122233 - 19 Jun 2024

Cited by 5 | Viewed by 1833

In recent years, the development of neural networks has significantly advanced their application in Synthetic Aperture Radar (SAR) ship target detection for maritime traffic control and ship management. However, traditional neural network architectures are often complex and resource intensive, making them unsuitable for [...] Read more.

In recent years, the development of neural networks has significantly advanced their application in Synthetic Aperture Radar (SAR) ship target detection for maritime traffic control and ship management. However, traditional neural network architectures are often complex and resource intensive, making them unsuitable for deployment on artificial satellites. To address this issue, this paper proposes a lightweight neural network: the Multi-Scale SAR Ship Detection Network (MSSD-Net). Initially, the MobileOne network module is employed to construct the backbone network for feature extraction from SAR images. Subsequently, a Multi-Scale Coordinate Attention (MSCA) module is designed to enhance the network’s capability to process contextual information. This is followed by the integration of features across different scales using an FPN + PAN structure. Lastly, an Anchor-Free approach is utilized for the rapid detection of ship targets. To evaluate the performance of MSSD-Net, we conducted extensive experiments on the Synthetic Aperture Radar Ship Detection Dataset (SSDD) and SAR-Ship-Dataset. Our experimental results demonstrate that MSSD-Net achieves a mean average precision (mAP) of 98.02% on the SSDD while maintaining a compact model size of only 1.635 million parameters. This indicates that MSSD-Net effectively reduces model complexity without compromising its ability to achieve high accuracy in object detection tasks. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

19 pages, 7213 KiB

Open AccessArticle

Embedded Zero-Shot Image Classification Based on Bidirectional Feature Mapping

by Huadong Sun, Zhibin Zhen, Yinghui Liu, Xu Zhang, Xiaowei Han and Pengyi Zhang

Appl. Sci. 2024, 14(12), 5230; https://doi.org/10.3390/app14125230 - 17 Jun 2024

Cited by 1 | Viewed by 1641

The zero-shot image classification technique aims to explore the semantic information shared between seen and unseen classes through visual features and auxiliary information and, based on this semantic information, to complete the knowledge migration from seen to unseen classes in order to complete [...] Read more.

The zero-shot image classification technique aims to explore the semantic information shared between seen and unseen classes through visual features and auxiliary information and, based on this semantic information, to complete the knowledge migration from seen to unseen classes in order to complete the classification of unseen class images. Previous zero-shot work has either not extracted enough features to express the relationship between the sample classes or has only used a single feature mapping method, which cannot fully explore the information contained in the features and the connection between the visual–semantic features. To address the above problems, this paper proposes an embedded zero-shot image classification model based on bidirectional feature mapping (BFM). It mainly contains a feature space mapping module, which is dominated by a bidirectional feature mapping network and supplemented with a mapping network from visual to category label semantic feature space. Attention mechanisms based on attribute guidance and visual guidance are further introduced to weight the features to reduce the difference between visual and semantic features to alleviate the modal difference problem, and then the category calibration loss is utilized to assign a larger weight to the unseen class to alleviate the seen class bias problem. The BFM model proposed in this paper has been experimented on three public datasets CUB, SUN, and AWA2, and has achieved 71.9%, 62.8%, and 69.3% and 61.6%, 33.2%, and 66.6% accuracies under traditional and generalized zero-sample image classification settings, respectively. The experimental results verify the superiority of the BFM model in the field of zero-shot image classification. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 5611 KiB

Open AccessArticle

A Visible and Synthetic Aperture Radar Image Fusion Algorithm Based on a Transformer and a Convolutional Neural Network

by Liushun Hu, Shaojing Su, Zhen Zuo, Junyu Wei, Siyang Huang, Zongqing Zhao, Xiaozhong Tong and Shudong Yuan

Electronics 2024, 13(12), 2365; https://doi.org/10.3390/electronics13122365 - 17 Jun 2024

Viewed by 1550

For visible and Synthetic Aperture Radar (SAR) image fusion, this paper proposes a visible and SAR image fusion algorithm based on a Transformer and a Convolutional Neural Network (CNN). Firstly, in this paper, the Restormer Block is used to extract cross-modal shallow features. [...] Read more.

For visible and Synthetic Aperture Radar (SAR) image fusion, this paper proposes a visible and SAR image fusion algorithm based on a Transformer and a Convolutional Neural Network (CNN). Firstly, in this paper, the Restormer Block is used to extract cross-modal shallow features. Then, we introduce an improved Transformer–CNN Feature Extractor (TCFE) with a two-branch residual structure. This includes a Transformer branch that introduces the Lite Transformer (LT) and DropKey for extracting global features and a CNN branch that introduces the Convolutional Block Attention Module (CBAM) for extracting local features. Finally, the fused image is output based on global features extracted by the Transformer branch and local features extracted by the CNN branch. The experiments show that the algorithm proposed in this paper can effectively achieve the extraction and fusion of global and local features of visible and SAR images, so that high-quality visible and SAR fusion images can be obtained. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 28354 KiB

Open AccessArticle

A Hybrid Domain Color Image Watermarking Scheme Based on Hyperchaotic Mapping

by Yumin Dong, Rui Yan, Qiong Zhang and Xuesong Wu

Mathematics 2024, 12(12), 1859; https://doi.org/10.3390/math12121859 - 14 Jun 2024

Cited by 2 | Viewed by 1326

In the field of image watermarking technology, it is very important to balance imperceptibility, robustness and embedding capacity. In order to solve this key problem, this paper proposes a new color image adaptive watermarking scheme based on discrete wavelet transform (DWT), discrete cosine [...] Read more.

In the field of image watermarking technology, it is very important to balance imperceptibility, robustness and embedding capacity. In order to solve this key problem, this paper proposes a new color image adaptive watermarking scheme based on discrete wavelet transform (DWT), discrete cosine transform (DCT) and singular value decomposition (SVD). In order to improve the security of the watermark, we use Lorenz hyperchaotic mapping to encrypt the watermark image. We adaptively determine the embedding factor by calculating the Bhattacharyya distance between the cover image and the watermark image, and combine the Alpha blending technique to embed the watermark image into the Y component of the YCbCr color space to enhance the imperceptibility of the algorithm. The experimental results show that the average PSNR of our scheme is 45.9382 dB, and the SSIM is 0.9986. Through a large number of experimental results and comparative analysis, it shows that the scheme has good imperceptibility and robustness, indicating that we have achieved a good balance between imperceptibility, robustness and embedding capacity. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

26 pages, 59985 KiB

Open AccessArticle

Depth-Guided Dehazing Network for Long-Range Aerial Scenes

by Yihu Wang, Jilin Zhao, Liangliang Yao and Changhong Fu

Remote Sens. 2024, 16(12), 2081; https://doi.org/10.3390/rs16122081 - 8 Jun 2024

Cited by 1 | Viewed by 1304

Over the past few years, the applications of unmanned aerial vehicles (UAVs) have greatly increased. However, the decrease in clarity in hazy environments is an important constraint on their further development. Current research on image dehazing mainly focuses on normal scenes at close [...] Read more.

Over the past few years, the applications of unmanned aerial vehicles (UAVs) have greatly increased. However, the decrease in clarity in hazy environments is an important constraint on their further development. Current research on image dehazing mainly focuses on normal scenes at close range or mid-range, while ignoring long-range scenes such as aerial perspective. Furthermore, based on the atmospheric scattering model, the inclusion of depth information is essential for the procedure of image dehazing, especially when dealing with images that exhibit substantial variations in depth. However, most existing models neglect this important information. Consequently, these state-of-the-art (SOTA) methods perform inadequately in dehazing when applied to long-range images. For the purpose of dealing with the above challenges, we propose the construction of a depth-guided dehazing network designed specifically for long-range aerial scenes. Initially, we introduce the depth prediction subnetwork to accurately extract depth information from long-range aerial images, taking into account the substantial variance in haze density. Subsequently, we propose the depth-guided attention module, which integrates a depth map with dehazing features through the attention mechanism, guiding the dehazing process and enabling the effective removal of haze in long-range areas. Furthermore, considering the unique characteristics of long-range aerial scenes, we introduce the UAV-HAZE dataset, specifically designed for training and evaluating dehazing methods in such scenarios. Finally, we conduct extensive experiments to test our method against several SOTA dehazing methods and demonstrate its superiority over others. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

20 pages, 28541 KiB

Open AccessArticle

IFSrNet: Multi-Scale IFS Feature-Guided Registration Network Using Multispectral Image-to-Image Translation

by Bowei Chen, Li Chen, Umara Khalid and Shuai Zhang

Electronics 2024, 13(12), 2240; https://doi.org/10.3390/electronics13122240 - 7 Jun 2024

Cited by 4 | Viewed by 1209

Multispectral image registration is the process of aligning the spatial regions of two images with different distributions. One of the main challenges it faces is to resolve the severe inconsistencies between the reference and target images. This paper presents a novel multispectral image [...] Read more.

Multispectral image registration is the process of aligning the spatial regions of two images with different distributions. One of the main challenges it faces is to resolve the severe inconsistencies between the reference and target images. This paper presents a novel multispectral image registration network, Multi-scale Intuitionistic Fuzzy Set Feature-guided Registration Network (IFSrNet), to address multispectral image registration. IFSrNet generates pseudo-infrared images from visible images using Cycle Generative Adversarial Network (CycleGAN), which is equipped with a multi-head attention module. An end-to-end registration network encodes the input multispectral images with intuitionistic fuzzification, which employs an improved feature descriptor—Intuitionistic Fuzzy Set–Scale-Invariant Feature Transform (IFS-SIFT)—to guide its operation. The results of the image registration will be presented in a direct output. For this task we have also designed specialised loss functions. The results of the experiment demonstrate that IFSrNet outperforms existing registration methods in the Visible–IR dataset. IFSrNet has the potential to be employed as a novel image-to-image translation paradigm. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

13 pages, 19038 KiB

Open AccessArticle

Multi-Scale Feature Fusion Point Cloud Object Detection Based on Original Point Cloud and Projection

by Zhikang Zhang, Zhongjie Zhu, Yongqiang Bai, Yiwen Jin and Ming Wang

Electronics 2024, 13(11), 2213; https://doi.org/10.3390/electronics13112213 - 6 Jun 2024

Cited by 1 | Viewed by 1905

Existing point cloud object detection algorithms struggle to effectively capture spatial features across different scales, often resulting in inadequate responses to changes in object size and limited feature extraction capabilities, thereby affecting detection accuracy. To solve this problem, we present a point cloud [...] Read more.

Existing point cloud object detection algorithms struggle to effectively capture spatial features across different scales, often resulting in inadequate responses to changes in object size and limited feature extraction capabilities, thereby affecting detection accuracy. To solve this problem, we present a point cloud object detection method based on multi-scale feature fusion of the original point cloud and projection, which aims to improve the multi-scale performance and completeness of feature extraction in point cloud object detection. First, we designed a 3D feature extraction module based on the 3D Swin Transformer. This module pre-processes the point cloud using a 3D Patch Partition approach and employs a self-attention mechanism within a 3D sliding window, along with a downsampling strategy, to effectively extract features at different scales. At the same time, we convert the 3D point cloud to a 2D image using projection technology and extract 2D features using the Swin Transformer. A 2D/3D feature fusion module is then built to integrate 2D and 3D features at the channel level through point-by-point addition and vector concatenation to improve feature completeness. Finally, the integrated feature maps are fed into the detection head to facilitate efficient object detection. Experimental results show that our method has improved the average precision of vehicle detection by 1.01% on the KITTI dataset over three levels of difficulty compared to Voxel-RCNN. In addition, visualization analyses show that our proposed algorithm also exhibits superior performance in object detection. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 14025 KiB

Open AccessArticle

Point Cloud Registration Algorithm Based on Adaptive Neighborhood Eigenvalue Loading Ratio

by Zhongping Liao, Tao Peng, Ruiqi Tang and Zhiguo Hao

Appl. Sci. 2024, 14(11), 4828; https://doi.org/10.3390/app14114828 - 3 Jun 2024

Viewed by 1971

Traditional iterative closest point (ICP) registration algorithms are sensitive to initial positions and easily fall into the trap of locally optimal solutions. To address this problem, a point cloud registration algorithm is put forward in this study based on adaptive neighborhood eigenvalue loading [...] Read more.

Traditional iterative closest point (ICP) registration algorithms are sensitive to initial positions and easily fall into the trap of locally optimal solutions. To address this problem, a point cloud registration algorithm is put forward in this study based on adaptive neighborhood eigenvalue loading ratios. In the algorithm, the resolution of the point cloud is first calculated and used as an adaptive basis to determine the raster widths and radii of spherical neighborhoods in the raster filtering; then, the adaptive raster filtering is implemented to the point cloud for denoising, while the eigenvalue loading ratios of point neighborhoods are calculated to extract and match the contour feature points; subsequently, sample consensus initial alignment (SAC-IA) is used to carry out coarse registration; and finally, a fine registration is delivered with KD-tree-accelerated ICP. The experimental results of this study demonstrate that the feature points extracted with this method are highly representative while consuming only 35.6% of the time consumed by other feature point extraction algorithms. Additionally, in noisy and low-overlap scenarios, the registration error of this method can be controlled at a level of 0.1 mm, with the registration speed improved by 56% on average over that of other algorithms. Taken together, the method in this study cannot only ensure strong robustness in registration but can also deliver high registration accuracy and efficiency. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

15 pages, 5200 KiB

Open AccessArticle

Few-Shot Image Classification Based on Swin Transformer + CSAM + EMD

by Huadong Sun, Pengyi Zhang, Xu Zhang and Xiaowei Han

Electronics 2024, 13(11), 2121; https://doi.org/10.3390/electronics13112121 - 29 May 2024

Cited by 2 | Viewed by 1452

In few-shot image classification (FSIC), the feature extraction module of the traditional convolutional neural networks is often constrained by the local nature of the convolutional kernel. As a result, it becomes challenging to handle global information and long-distance dependencies effectively. In order to [...] Read more.

In few-shot image classification (FSIC), the feature extraction module of the traditional convolutional neural networks is often constrained by the local nature of the convolutional kernel. As a result, it becomes challenging to handle global information and long-distance dependencies effectively. In order to address this problem, an innovative FSIC method is proposed in this paper, which is the integration of Swin Transformer and CSAM and Earth Mover’s Distance (EMD) technology (STCE). We utilize the Swin Transformer network for image feature extraction, and perform CSAM attention mechanism feature weighting on the output feature map, while we adopt the EMD algorithm to generate the optimal matching flow between the structural units, minimizing the matching cost. This approach allows for a more precise representation of the classification distance between images. We have conducted numerous experiments to validate the effectiveness of our algorithm. On three commonly used few-shot datasets, namely mini-ImageNet, tiered-ImageNet, and FC100, the accuracy of one-shot and five-shot has reached the state of the art (SOTA) in the FSIC; the mini-ImageNet achieves an accuracy of 98.65 ± 0.1% for one-shot and 99.6 ± 0.2% for five-shot tasks, while tiered ImageNet has an accuracy of 91.6 ± 0.1% for one-shot tasks and 96.55 ± 0.27% for five-shot tasks. For FC100, the accuracy is 64.1 ± 0.3% for one-shot tasks and 79.8 ± 0.69% for five-shot tasks. On two commonly used few-shot datasets, namely CUB, CIFAR-FS, CUB achieves an accuracy of 83.1 ± 0.4% for one-shot and 92.88 ± 0.4% for five-shot tasks, while CIFAR-FS achieves an accuracy of 86.95 ± 0.2% for one-shot and 94 ± 0.4% for five-shot tasks. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

17 pages, 6111 KiB

Open AccessArticle

Multi-Scale Target Detection in Autonomous Driving Scenarios Based on YOLOv5-AFAM

by Hang Ma, Wei Zhao, Bosi Liu and Wenbai Chen

Appl. Sci. 2024, 14(11), 4633; https://doi.org/10.3390/app14114633 - 28 May 2024

Cited by 4 | Viewed by 1570

Multi-scale object detection is critically important in complex driving environments within the field of autonomous driving. To enhance the detection accuracy of both small-scale and large-scale targets in complex autonomous driving environments, this paper proposes an improved YOLOv5-AFAM algorithm. Firstly, the Adaptive Fusion [...] Read more.

Multi-scale object detection is critically important in complex driving environments within the field of autonomous driving. To enhance the detection accuracy of both small-scale and large-scale targets in complex autonomous driving environments, this paper proposes an improved YOLOv5-AFAM algorithm. Firstly, the Adaptive Fusion Attention Module (AFAM) and Down-sampling Module (DownC) are introduced to increase the detection precision of small targets. Secondly, the Efficient Multi-scale Attention Module (EMA) is incorporated, enabling the model to simultaneously recognize small-scale and large-scale targets. Finally, a Minimum Point Distance IoU-based Loss Function (MPDIou-LOSS) is introduced to improve the accuracy and efficiency of object detection. Experimental validation on the KITTI dataset shows that, compared to the baseline model, the improved algorithm increased precision by 2.4%, recall by 2.6%, mAP50 by 1.5%, and mAP50-90 by an impressive 4.8%. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

12 pages, 5442 KiB

Open AccessArticle

Image Enhancement of Steel Plate Defects Based on Generative Adversarial Networks

by Zhideng Jie, Hong Zhang, Kaixuan Li, Xiao Xie and Aopu Shi

Electronics 2024, 13(11), 2013; https://doi.org/10.3390/electronics13112013 - 22 May 2024

Cited by 1 | Viewed by 1544

In this study, the problem of a limited number of data samples, which affects the detection accuracy, arises for the image classification task of steel plate surface defects under conditions of small sample sizes. A data enhancement method based on generative adversarial networks [...] Read more.

In this study, the problem of a limited number of data samples, which affects the detection accuracy, arises for the image classification task of steel plate surface defects under conditions of small sample sizes. A data enhancement method based on generative adversarial networks is proposed. The method introduces a two-way attention mechanism, which is specifically designed to improve the model’s ability to identify weak defects and optimize the model structure of the network discriminator, which augments the model’s capacity to perceive the overall details of the image and effectively improves the intricacy and authenticity of the generated images. By enhancing the two original datasets, the experimental results show that the proposed method improves the average accuracy by 8.5% across the four convolutional classification models. The results demonstrate the superior detection accuracy of the proposed method, improving the classification of steel plate surface defects. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 2977 KiB

Open AccessArticle

Feature Maps Need More Attention: A Spatial-Channel Mutual Attention-Guided Transformer Network for Face Super-Resolution

by Zhe Zhang and Chun Qi

Appl. Sci. 2024, 14(10), 4066; https://doi.org/10.3390/app14104066 - 10 May 2024

Viewed by 1562

Recently, transformer-based face super-resolution (FSR) approaches have achieved promising success in restoring degraded facial details due to their high capability for capturing both local and global dependencies. However, while existing methods focus on introducing sophisticated structures, they neglect the potential feature map information, [...] Read more.

Recently, transformer-based face super-resolution (FSR) approaches have achieved promising success in restoring degraded facial details due to their high capability for capturing both local and global dependencies. However, while existing methods focus on introducing sophisticated structures, they neglect the potential feature map information, limiting FSR performance. To circumvent this problem, we carefully design a pair of guiding blocks to dig for possible feature map information to enhance features before feeding them to transformer blocks. Relying on the guiding blocks, we propose a spatial-channel mutual attention-guided transformer network for FSR, for which the backbone architecture is a multi-scale connected encoder–decoder. Specifically, we devise a novel Spatial-Channel Mutual Attention-guided Transformer Module (SCATM), which is composed of a Spatial-Channel Mutual Attention Guiding Block (SCAGB) and a Channel-wise Multi-head Transformer Block (CMTB). SCATM on the top layer (SCATM-T) aims to promote both local facial details and global facial structures, while SCATM on the bottom layer (SCATM-B) seeks to optimize the encoded features. Considering that different scale features are complementary, we further develop a Multi-scale Feature Fusion Module (MFFM), which fuses features from different scales for better restoration performance. Quantitative and qualitative experimental results on various datasets indicate that the proposed method outperforms other state-of-the-art FSR methods. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 8838 KiB

Open AccessArticle

Salient Object Detection via Fusion of Multi-Visual Perception

by Wenjun Zhou, Tianfei Wang, Xiaoqin Wu, Chenglin Zuo, Yifan Wang, Quan Zhang and Bo Peng

Appl. Sci. 2024, 14(8), 3433; https://doi.org/10.3390/app14083433 - 18 Apr 2024

Cited by 2 | Viewed by 1708

Salient object detection aims to distinguish the most visually conspicuous regions, playing an important role in computer vision tasks. However, complex natural scenarios can challenge salient object detection, hindering accurate extraction of objects with rich morphological diversity. This paper proposes a novel method [...] Read more.

Salient object detection aims to distinguish the most visually conspicuous regions, playing an important role in computer vision tasks. However, complex natural scenarios can challenge salient object detection, hindering accurate extraction of objects with rich morphological diversity. This paper proposes a novel method for salient object detection leveraging multi-visual perception, mirroring the human visual system’s rapid identification, and focusing on impressive objects/regions within complex scenes. First, a feature map is derived from the original image. Then, salient object detection results are obtained for each perception feature and combined via a feature fusion strategy to produce a saliency map. Finally, superpixel segmentation is employed for precise salient object extraction, removing interference areas. This multi-feature approach for salient object detection harnesses complementary features to adapt to complex scenarios. Competitive experiments on the MSRA10K and ECSSD datasets place our method in the first tier, achieving 0.1302 MAE and 0.9382 F-measure for the MSRA10K dataset and 0.0783 MAE and and 0.9635 F-measure for the ECSSD dataset, demonstrating superior salient object detection performance in complex natural scenarios. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

13 pages, 3442 KiB

Open AccessArticle

MDP-SLAM: A Visual SLAM towards a Dynamic Indoor Scene Based on Adaptive Mask Dilation and Dynamic Probability

by Xiaofeng Zhang and Zhengyang Shi

Electronics 2024, 13(8), 1497; https://doi.org/10.3390/electronics13081497 - 15 Apr 2024

Cited by 2 | Viewed by 1576

Visual simultaneous localization and mapping (SLAM) algorithms in dynamic scenes will apply the moving feature points to the camera pose’s calculation, which will cause the continuous accumulation of errors. As a target-detection tool, mask R-CNN, which is often used in combination with the [...] Read more.

Visual simultaneous localization and mapping (SLAM) algorithms in dynamic scenes will apply the moving feature points to the camera pose’s calculation, which will cause the continuous accumulation of errors. As a target-detection tool, mask R-CNN, which is often used in combination with the former, due to the limited training datasets, easily results in the semantic mask being incomplete and deformed, which will increase the error. In order to solve the above problems, we propose in this paper a visual SLAM algorithm based on an adaptive mask dilation strategy and the dynamic probability of the feature points, named MDP-SLAM. Firstly, we use the mask R-CNN target-detection algorithm to obtain the initial mask of the dynamic target. On this basis, an adaptive mask-dilation algorithm is used to obtain a mask that can completely cover the dynamic target and part of the surrounding scene. Then, we use the K-means clustering algorithm to segment the depth image information in the mask coverage area into absolute dynamic regions and relative dynamic regions. Combined with the epipolar constraint and the semantic constraint, the dynamic probability of the feature points is calculated, and then, the highly dynamic possible feature points are removed to solve an accurate final pose of the camera. Finally, the method is tested on the TUM RGB-D dataset. The results show that the MDP-SLAM algorithm proposed in this paper can effectively improve the accuracy of attitude estimation and has high accuracy and robustness in dynamic indoor scenes. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

20 pages, 7943 KiB

Open AccessArticle

Pushing the Boundaries of Solar Panel Inspection: Elevated Defect Detection with YOLOv7-GX Technology

by Yin Wang, Jingyong Zhao, Yihua Yan, Zhicheng Zhao and Xiao Hu

Electronics 2024, 13(8), 1467; https://doi.org/10.3390/electronics13081467 - 12 Apr 2024

Cited by 7 | Viewed by 1812

During the maintenance and management of solar photovoltaic (PV) panels, how to efficiently solve the maintenance difficulties becomes a key challenge that restricts their performance and service life. Aiming at the multi-defect-recognition challenge in PV-panel image analysis, this study innovatively proposes a new [...] Read more.

During the maintenance and management of solar photovoltaic (PV) panels, how to efficiently solve the maintenance difficulties becomes a key challenge that restricts their performance and service life. Aiming at the multi-defect-recognition challenge in PV-panel image analysis, this study innovatively proposes a new algorithm for the defect detection of PV panels incorporating YOLOv7-GX technology. The algorithm first constructs an innovative GhostSlimFPN network architecture by introducing GSConv and depth-wise separable convolution technologies, optimizing the traditional neck network structure. Then, a customized 1 × 1 convolutional module incorporating the GAM (Global Attention Mechanism) attention mechanism is designed in this paper to improve the ELAN structure, aiming to enhance the network’s perception and representation capabilities while controlling the network complexity. In addition, the XIOU loss function is introduced in the study to replace the traditional CIOU loss function, which effectively improves the robustness and convergence efficiency of the model. In the training stage, the sample imbalance problem is effectively solved by implementing differentiated weight allocations for different images and categories, which promotes the balance of the training process. The experimental data show that the optimized model achieves 94.8% in the highest mAP value, which is 6.4% higher than the original YOLOv7 network, significantly better than other existing models, and provides solid theoretical and technical support for further research and application in the field of PV-panel defect detection. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

28 pages, 14693 KiB

Open AccessEditor’s ChoiceArticle

Wildlife Real-Time Detection in Complex Forest Scenes Based on YOLOv5s Deep Learning Network

by Zhibin Ma, Yanqi Dong, Yi Xia, Delong Xu, Fu Xu and Feixiang Chen

Remote Sens. 2024, 16(8), 1350; https://doi.org/10.3390/rs16081350 - 11 Apr 2024

Cited by 10 | Viewed by 5571

With the progressively deteriorating global ecological environment and the gradual escalation of human activities, the survival of wildlife has been severely impacted. Hence, a rapid, precise, and reliable method for detecting wildlife holds immense significance in safeguarding their existence and monitoring their status. [...] Read more.

With the progressively deteriorating global ecological environment and the gradual escalation of human activities, the survival of wildlife has been severely impacted. Hence, a rapid, precise, and reliable method for detecting wildlife holds immense significance in safeguarding their existence and monitoring their status. However, due to the rare and concealed nature of wildlife activities, the existing wildlife detection methods face limitations in efficiently extracting features during real-time monitoring in complex forest environments. These models exhibit drawbacks such as slow speed and low accuracy. Therefore, we propose a novel real-time monitoring model called WL-YOLO, which is designed for lightweight wildlife detection in complex forest environments. This model is built upon the deep learning model YOLOv5s. In WL-YOLO, we introduce a novel and lightweight feature extraction module. This module is comprised of a deeply separable convolutional neural network integrated with compression and excitation modules in the backbone network. This design is aimed at reducing the number of model parameters and computational requirements, while simultaneously enhancing the feature representation of the network. Additionally, we introduced a CBAM attention mechanism to enhance the extraction of local key features, resulting in improved performance of WL-YOLO in the natural environment where wildlife has high concealment and complexity. This model achieved a mean accuracy (mAP) value of 97.25%, an F1-score value of 95.65%, and an accuracy value of 95.14%. These results demonstrated that this model outperforms the current mainstream deep learning models. Additionally, compared to the YOLOv5m base model, WL-YOLO reduces the number of parameters by 44.73% and shortens the detection time by 58%. This study offers technical support for detecting and protecting wildlife in intricate environments by introducing a highly efficient and advanced wildlife detection model. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

18 pages, 687 KiB

Open AccessArticle

MHDNet: A Multi-Scale Hybrid Deep Learning Model for Person Re-Identification

by Jinghui Wang and Jun Wang

Electronics 2024, 13(8), 1435; https://doi.org/10.3390/electronics13081435 - 10 Apr 2024

Cited by 2 | Viewed by 2215

The primary objective of person re-identification is to identify individuals from surveillance videos across various scenarios. Conventional pedestrian recognition models typically employ convolutional neural network (CNN) and vision transformer (ViT) networks to extract features, and while CNNs are adept at extracting local features [...] Read more.

The primary objective of person re-identification is to identify individuals from surveillance videos across various scenarios. Conventional pedestrian recognition models typically employ convolutional neural network (CNN) and vision transformer (ViT) networks to extract features, and while CNNs are adept at extracting local features through convolution operations, capturing global information can be challenging, especially when dealing with high-resolution images. In contrast, ViT rely on cascaded self-attention modules to capture long-range feature dependencies, sacrificing local feature details. In light of these limitations, this paper presents the MHDNet, a hybrid network structure for pedestrian recognition that combines convolutional operations and self-attention mechanisms to enhance representation learning. The MHDNet is built around the Feature Fusion Module (FFM), which harmonizes global and local features at different resolutions. With a parallel structure, the MHDNet model maximizes the preservation of local features and global representations. Experiments on two person re-identification datasets demonstrate the superiority of the MHDNet over other state-of-the-art methods. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

11 pages, 1269 KiB

Open AccessArticle

Hybrid-Margin Softmax for the Detection of Trademark Image Similarity

by Chenyang Wang, Guangyuan Zheng and Hongtao Shan

Appl. Sci. 2024, 14(7), 2865; https://doi.org/10.3390/app14072865 - 28 Mar 2024

Viewed by 1280

The detection of image similarity is critical to trademark (TM) legal registration and court judgment on infringement cases. Meanwhile, there are great challenges regarding the annotation of similar pairs and model generalization on rapidly growing data when deep learning is introduced into the [...] Read more.

The detection of image similarity is critical to trademark (TM) legal registration and court judgment on infringement cases. Meanwhile, there are great challenges regarding the annotation of similar pairs and model generalization on rapidly growing data when deep learning is introduced into the task. The research idea of metric learning is naturally suited for the task where similarity of input is given instead of classification, but current methods are not targeted at the task and should be upgraded. To address these issues, loss-driven model training is introduced, and a hybrid-margin softmax (HMS) is proposed exactly based on the peculiarity of TM images. Two additive penalty margins are attached to the softmax to expand the decision boundary and develop greater tolerance for slight differences between similar TM images. With the HMS, a Siamese neural network (SNN) as the feature extractor is further penalized and the discrimination ability is improved. Experiments demonstrate that the detection model trained on HMS can make full use of small numbers of training data and has great discrimination ability on bigger quantities of test data. Meanwhile, the model can reach high performance with less depth of SNN. Extensive experiments indicate that the HMS-driven model trained completely on TM data generalized well on the face recognition (FR) task, which involves another type of image data. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

25 pages, 8266 KiB

Open AccessArticle

Infrared Small Target Detection Based on Tensor Tree Decomposition and Self-Adaptive Local Prior

by Guiyu Zhang, Zhenyu Ding, Qunbo Lv, Baoyu Zhu, Wenjian Zhang, Jiaao Li and Zheng Tan

Remote Sens. 2024, 16(6), 1108; https://doi.org/10.3390/rs16061108 - 21 Mar 2024

Cited by 1 | Viewed by 2390

Infrared small target detection plays a crucial role in both military and civilian systems. However, current detection methods face significant challenges in complex scenes, such as inaccurate background estimation, inability to distinguish targets from similar non-target points, and poor robustness across various scenes. [...] Read more.

Infrared small target detection plays a crucial role in both military and civilian systems. However, current detection methods face significant challenges in complex scenes, such as inaccurate background estimation, inability to distinguish targets from similar non-target points, and poor robustness across various scenes. To address these issues, this study presents a novel spatial–temporal tensor model for infrared small target detection. In our method, we introduce the tensor tree rank to capture global structure in a more balanced strategy, which helps achieve more accurate background estimation. Meanwhile, we design a novel self-adaptive local prior weight by evaluating the level of clutter and noise content in the image. It mitigates the imbalance between target enhancement and background suppression. Then, the spatial–temporal total variation (STTV) is used as a joint regularization term to help better remove noise and obtain better detection performance. Finally, the proposed model is efficiently solved by the alternating direction multiplier method (ADMM). Extensive experiments demonstrate that our method achieves superior detection performance when compared with other state-of-the-art methods in terms of target enhancement, background suppression, and robustness across various complex scenes. Furthermore, we conduct an ablation study to validate the effectiveness of each module in the proposed model. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 5507 KiB

Open AccessArticle

Research on Coaxiality Measurement Method for Automobile Brake Piston Components Based on Machine Vision

by Qinghua Li, Weinan Ge, Hu Shi, Wanting Zhao and Shihong Zhang

Appl. Sci. 2024, 14(6), 2371; https://doi.org/10.3390/app14062371 - 11 Mar 2024

Cited by 2 | Viewed by 1225

Aiming at addressing the problem of the online detection of automobile brake piston components, a non-contact measurement method based on the combination of machine vision and image processing technology is proposed. Firstly, an industrial camera is used to capture an image, and a [...] Read more.

Aiming at addressing the problem of the online detection of automobile brake piston components, a non-contact measurement method based on the combination of machine vision and image processing technology is proposed. Firstly, an industrial camera is used to capture an image, and a series of image preprocessing algorithms is used to extract a clear contour of a test piece with a unit pixel width. Secondly, based on the structural characteristics of automobile brake piston components, the region of interest is extracted, and the test piece is segmented into spring region and cylinder region. Then, based on mathematical morphology techniques, the edges of the image are optimized. We extract geometric feature points by comparing the heights of adjacent pixel points on both sides of the pixel points, so as to calculate the variation of the spring axis relative to the reference axis (centerline of the cylinder). Then, we extract the maximum variation from all images, and calculate the coaxiality error value using this maximum variation. Finally, we validate the feasibility of the proposed method and the stability of extracting geometric feature points through experiments. The experiments demonstrate the feasibility of the method in engineering practice, with the stability in extracting geometric feature points reaching 99.25%. Additionally, this method offers a new approach and perspective for coaxiality measurement of stepped shaft parts. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

19 pages, 11826 KiB

Open AccessArticle

A Convolution with Transformer Attention Module Integrating Local and Global Features for Object Detection in Remote Sensing Based on YOLOv8n

by Kaiqi Lang, Jie Cui, Mingyu Yang, Hanyu Wang, Zilong Wang and Honghai Shen

Remote Sens. 2024, 16(5), 906; https://doi.org/10.3390/rs16050906 - 4 Mar 2024

Cited by 14 | Viewed by 4225

Object detection in remote sensing scenarios plays an indispensable and significant role in civilian, commercial, and military areas, leveraging the power of convolutional neural networks (CNNs). Remote sensing images, captured by crafts and satellites, exhibit unique characteristics including complicated backgrounds, limited features, distinct [...] Read more.

Object detection in remote sensing scenarios plays an indispensable and significant role in civilian, commercial, and military areas, leveraging the power of convolutional neural networks (CNNs). Remote sensing images, captured by crafts and satellites, exhibit unique characteristics including complicated backgrounds, limited features, distinct density, and varied scales. The contextual and comprehensive information in an image can make a detector precisely localize and classify targets, which is extremely valuable for object detection in remote sensing scenarios. However, CNNs, restricted by the essence of the convolution operation, possess local receptive fields and scarce contextual information, even in large models. To address this limitation and improve detection performance by extracting global contextual information, we propose a novel plug-and-play attention module, named Convolution with Transformer Attention Module (CTAM). CTAM is composed of a convolutional bottleneck block and a simplified Transformer layer, which can facilitate the integration of local features and position information with long-range dependency. YOLOv8n, a superior and faster variant of the YOLO series, is selected as the baseline. To demonstrate the effectiveness and efficiency of CTAM, we incorporated CTAM into YOLOv8n and conducted extensive experiments on the DIOR dataset. YOLOv8n-CTAM achieves an impressive 54.2 mAP@50-95, surpassing YOLOv8n (51.4) by a large margin. Notably, it outperforms the baseline by 2.7 mAP@70 and 4.4 mAP@90, showcasing its superiority with stricter IoU thresholds. Furthermore, the experiments conducted on the TGRS-HRRSD dataset validate the excellent generalization ability of CTAM. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

26 pages, 29677 KiB

Open AccessArticle

Development of a Powder Analysis Procedure Based on Imaging Techniques for Examining Aggregation and Segregation Phenomena

by Giuseppe Bonifazi, Paolo Barontini, Riccardo Gasbarrone, Davide Gattabria and Silvia Serranti

J. Imaging 2024, 10(3), 53; https://doi.org/10.3390/jimaging10030053 - 21 Feb 2024

Viewed by 1843

In this manuscript, a method that utilizes classical image techniques to assess particle aggregation and segregation, with the primary goal of validating particle size distribution determined by conventional methods, is presented. This approach can represent a supplementary tool in quality control systems for [...] Read more.

In this manuscript, a method that utilizes classical image techniques to assess particle aggregation and segregation, with the primary goal of validating particle size distribution determined by conventional methods, is presented. This approach can represent a supplementary tool in quality control systems for powder production processes in industries such as manufacturing and pharmaceuticals. The methodology involves the acquisition of high-resolution images, followed by their fractal and textural analysis. Fractal analysis plays a crucial role by quantitatively measuring the complexity and self-similarity of particle structures. This approach allows for the numerical evaluation of aggregation and segregation phenomena, providing valuable insights into the underlying mechanisms at play. Textural analysis contributes to the characterization of patterns and spatial correlations observed in particle images. The examination of textural features offers an additional understanding of particle arrangement and organization. Consequently, it aids in validating the accuracy of particle size distribution measurements. To this end, by incorporating fractal and structural analysis, a methodology that enhances the reliability and accuracy of particle size distribution validation is obtained. It enables the identification of irregularities, anomalies, and subtle variations in particle arrangements that might not be detected by traditional measurement techniques alone. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Graphical abstract

16 pages, 3114 KiB

Open AccessArticle

Underwater Degraded Image Restoration by Joint Evaluation and Polarization Partition Fusion

by Changye Cai, Yuanyi Fan, Ronghua Li, Haotian Cao, Shenghui Zhang and Mianze Wang

Appl. Sci. 2024, 14(5), 1769; https://doi.org/10.3390/app14051769 - 21 Feb 2024

Cited by 2 | Viewed by 1490

Images of underwater environments suffer from contrast degradation, reduced clarity, and information attenuation. The traditional method is the global estimate of polarization. However, targets in water often have complex polarization properties. For low polarization regions, since the polarization is similar to the polarization [...] Read more.

Images of underwater environments suffer from contrast degradation, reduced clarity, and information attenuation. The traditional method is the global estimate of polarization. However, targets in water often have complex polarization properties. For low polarization regions, since the polarization is similar to the polarization of background, it is difficult to distinguish between target and non-targeted regions when using traditional methods. Therefore, this paper proposes a joint evaluation and partition fusion method. First, we use histogram stretching methods for preprocessing two polarized orthogonal images, which increases the image contrast and enhances the image detail information. Then, the target is partitioned according to the values of each pixel point of the polarization image, and the low and high polarization target regions are extracted based on polarization values. To address the practical problem, the low polarization region is recovered using the polarization difference method, and the high polarization region is recovered using the joint estimation of multiple optimization metrics. Finally, the low polarization and the high polarization regions are fused. Subjectively, the experimental results as a whole have been fully restored, and the information has been retained completely. Our method can fully recover the low polarization region, effectively remove the scattering effect and increase an image’s contrast. Objectively, the results of the experimental evaluation indexes, EME, Entropy, and Contrast, show that our method performs significantly better than the other methods, which confirms the feasibility of this paper’s algorithm for application in specific underwater scenarios. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

18 pages, 5785 KiB

Open AccessArticle

Research on Rejoining Bone Stick Fragment Images: A Method Based on Multi-Scale Feature Fusion Siamese Network Guided by Edge Contour

by Jingjing He, Huiqin Wang, Rui Liu, Li Mao, Ke Wang, Zhan Wang and Ting Wang

Appl. Sci. 2024, 14(2), 717; https://doi.org/10.3390/app14020717 - 15 Jan 2024

Cited by 3 | Viewed by 1588

The rejoining of bone sticks holds significant importance in studying the historical and cultural aspects of the Han Dynasty. Currently, the rejoining work of bone inscriptions heavily relies on manual efforts by experts, demanding a considerable amount of time and energy. This paper [...] Read more.

The rejoining of bone sticks holds significant importance in studying the historical and cultural aspects of the Han Dynasty. Currently, the rejoining work of bone inscriptions heavily relies on manual efforts by experts, demanding a considerable amount of time and energy. This paper introduces a multi-scale feature fusion Siamese network guided by edge contour (MFS-GC) model. Constructing a Siamese network framework, it first uses a residual network to extract features of bone sticks, which is followed by computing the L2 distance for similarity measurement. During the extraction of feature vectors using the residual network, the BN layer tends to lose contour detail information, resulting in less conspicuous feature extraction, especially along fractured edges. To address this issue, the Spatially Adaptive DEnormalization (SPADE) model is employed to guide the normalization of contour images of bone sticks. This ensures that the network can learn multi-scale boundary contour features at each layer. Finally, the extracted multi-scale fused features undergo similarity measurement for local matching of bone stick fragment images. Additionally, a Conjugable Bone Stick Dataset (CBSD) is constructed. In the experimental validation phase, the MFS-GC algorithm is compared with classical similarity calculation methods in terms of precision, recall, and miss detection rate. The experiments demonstrate that the MFS-GC algorithm achieves an average accuracy of 95.5% in the Top-15 on the CBSD. The findings of this research can contribute to solving the rejoining issues of bone sticks. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

16 pages, 4061 KiB

Open AccessArticle

EDF-YOLOv5: An Improved Algorithm for Power Transmission Line Defect Detection Based on YOLOv5

by Hongxing Peng, Minjun Liang, Chang Yuan and Yongqiang Ma

Electronics 2024, 13(1), 148; https://doi.org/10.3390/electronics13010148 - 29 Dec 2023

Cited by 10 | Viewed by 1878

Detecting defects in power transmission lines through unmanned aerial inspection images is crucial for evaluating the operational status of outdoor transmission equipment. This paper presents a defect recognition method called EDF-YOLOv5, which is based on the YOLOv5s, to enhance detection accuracy. Firstly, the [...] Read more.

Detecting defects in power transmission lines through unmanned aerial inspection images is crucial for evaluating the operational status of outdoor transmission equipment. This paper presents a defect recognition method called EDF-YOLOv5, which is based on the YOLOv5s, to enhance detection accuracy. Firstly, the EN-SPPFCSPC module is designed to improve the algorithm’s ability to extract information, thereby enhancing the detection performance for small target defects. Secondly, the algorithm incorporates a high-level semantic feature information extraction network, DCNv3C3, which improves its ability to generalize to defects of different shapes. Lastly, a new bounding box loss function, Focal-CIoU, is introduced to enhance the contribution of high-quality samples during training. The experimental results demonstrate that the enhanced algorithm achieves a 2.3% increase in mean average precision (mAP@.5) for power transmission line defect detection, a 0.9% improvement in F1-score, and operates at a detection speed of 117 frames per second. These findings highlight the superior performance of EDF-YOLOv5 in detecting power transmission line defects. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

14 pages, 7400 KiB

Open AccessArticle

Non-Local Means Hole Repair Algorithm Based on Adaptive Block

by Bohu Zhao, Lebao Li and Haipeng Pan

Appl. Sci. 2024, 14(1), 159; https://doi.org/10.3390/app14010159 - 24 Dec 2023

Cited by 2 | Viewed by 1238

RGB-D cameras provide depth and color information and are widely used in 3D reconstruction and computer vision. In the majority of existing RGB-D cameras, a considerable portion of depth values is often lost due to severe occlusion or limited camera coverage, thereby adversely [...] Read more.

RGB-D cameras provide depth and color information and are widely used in 3D reconstruction and computer vision. In the majority of existing RGB-D cameras, a considerable portion of depth values is often lost due to severe occlusion or limited camera coverage, thereby adversely impacting the precise localization and three-dimensional reconstruction of objects. In this paper, to address the issue of poor-quality in-depth images captured by RGB-D cameras, a depth image hole repair algorithm based on non-local means is proposed first, leveraging the structural similarities between grayscale and depth images. Second, while considering the cumbersome parameter tuning associated with the non-local means hole repair method for determining the size of structural blocks for depth image hole repair, an intelligent block factor is introduced, which automatically determines the optimal search and repair block sizes for various hole sizes, resulting in the development of an adaptive block-based non-local means algorithm for repairing depth image holes. Furthermore, the proposed algorithm’s performance are evaluated using both the Middlebury stereo matching dataset and a self-constructed RGB-D dataset, with performance assessment being carried out by comparing the algorithm against other methods using five metrics: RMSE, SSIM, PSNR, DE, and ALME. Finally, experimental results unequivocally demonstrate the innovative resolution of the parameter tuning complexity inherent in-depth image hole repair, effectively filling the holes, suppressing noise within depth images, enhancing image quality, and achieving elevated precision and accuracy, as affirmed by the attained results. Full article

(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

► Show Figures

Figure 1

Show export options Show export options

Select all

Export citation of selected articles as:

Displaying articles 1-70

Submit your Manuscript

Journal Name	Impact Factor	CiteScore	Launched Year	First Decision (median)	APC
Applied Sciences applsci	2.5	5.5	2011	19.8 Days	CHF 2400
Electronics electronics	2.6	6.1	2012	16.8 Days	CHF 2400
Journal of Imaging jimaging	3.3	6.7	2015	15.3 Days	CHF 1800
Mathematics mathematics	2.2	4.6	2013	18.4 Days	CHF 2600
Remote Sensing remotesensing	4.1	8.6	2009	24.9 Days	CHF 2700

Submit your Abstract

Journal Name	Impact Factor	CiteScore	Launched Year	First Decision (median)	APC
Applied Sciences applsci	2.5	5.5	2011	19.8 Days	CHF 2400
Electronics electronics	2.6	6.1	2012	16.8 Days	CHF 2400
Journal of Imaging jimaging	3.3	6.7	2015	15.3 Days	CHF 1800
Mathematics mathematics	2.2	4.6	2013	18.4 Days	CHF 2600
Remote Sensing remotesensing	4.1	8.6	2009	24.9 Days	CHF 2700

Back to TopTop