You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.

Search for Topics:

Title/Keyword

Journal

Submission Status

Category

Submit your Manuscript Submit your Abstract Propose a Topic

Topic Menu

Topic Editors

Prof. Dr. Antonio Fernández-Caballero

E-Mail Website

Instituto de Investigación en Informática de Albacete, Universidad de Castilla-La Mancha, 02071 Albacete, Spain

Prof. Dr. Byung-Gyu Kim

E-Mail Website

Department of IT Engineering, Sookmyung Women’s University, Seoul 04310, Republic of Korea

Applied Computer Vision and Pattern Recognition: 2nd Edition

Abstract submission deadline

30 September 2025

Manuscript submission deadline

31 December 2025

Viewed by

168892

Topic Information

Dear Colleagues,

Computer vision is a field of artificial intelligence that trains computers to interpret and understand the visual world. Computer vision tasks include methods for acquiring digital images (through image sensors), image processing, and image analysis to reach an understanding of digital images. In general, it deals with the extraction of high-dimensional data from the real world in order to produce numerical or symbolic information that a computer can interpret. For interpretation, computer vision is closely related to pattern recognition.

Indeed, pattern recognition is the process of recognizing patterns by using machine learning algorithms. Pattern recognition can be defined as the identification and classification of meaningful patterns of data based on the extraction and comparison of characteristic properties or features of the data. Pattern recognition is a very important area of research and application, underpinning developments in related fields, such as computer vision, image processing, text and document analysis, and neural networks. It is closely related to machine learning and finds applications in rapidly emerging areas, such as biometrics, bioinformatics, multimedia data analysis, and, more recently, data science. Nowdays, a data-driven approach (such as deep learning) is popular to achieve the goal of pattern recognition and classification in many applications.

This Topic, on Applied Computer Vision and Pattern Recognition, invites papers on theoretical and applied issues, including, but not limited to, the following areas:

Statistical, structural, and syntactic pattern recognition;
Neural networks, machine learning, and deep learning;
Computer vision, robot vision, and machine vision;
Multimedia systems and multimedia content;
Biosignal processing, speech processing, image processing, and video processing;
Data mining, information retrieval, big data, and business intelligence.

This Topic will present the results of research describing recent advances in both the computer vision and pattern recognition fields.

Prof. Dr. Antonio Fernández-Caballero
Prof. Dr. Byung-Gyu Kim
Topic Editors

Keywords

pattern recognition
neural networks, machine learning
deep learning, artificial intelligence
computer vision
multimedia
data mining
signal processing
image processing

Participating Journals

Journal Name	Impact Factor	CiteScore	Launched Year	First Decision (median)	APC
Applied Sciences applsci	2.5	5.5	2011	19.8 Days	CHF 2400	Submit
Electronics electronics	2.6	6.1	2012	16.8 Days	CHF 2400	Submit
Machine Learning and Knowledge Extraction make	6.0	9.9	2019	25.5 Days	CHF 1800	Submit
Journal of Imaging jimaging	3.3	6.7	2015	15.3 Days	CHF 1800	Submit
Sensors sensors	3.5	8.2	2001	19.7 Days	CHF 2600	Submit

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

Share your research immediately: disseminate your ideas prior to publication and establish priority for your work.
Safeguard your intellectual contribution: Protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
Boost visibility and impact: Increase the reach and influence of your research by making it accessible to a global audience.
Gain early feedback: Receive valuable input and insights from peers before submitting to a journal.
Ensure broad indexing: Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Applied Computer Vision and Pattern Recognition (87 articles)

Published Papers (67 papers)

Download All Papers

Order results

Result details

Journals

Show export options Show export options

Select all

Export citation of selected articles as:

21 pages, 25577 KB

Open AccessArticle

DFFNet: A Dual-Domain Feature Fusion Network for Single Remote Sensing Image Dehazing

by Huazhong Jin, Zhang Chen, Zhina Song and Kaimin Sun

Sensors 2025, 25(16), 5125; https://doi.org/10.3390/s25165125 - 18 Aug 2025

Viewed by 334

Single remote sensing image dehazing aims to eliminate atmospheric scattering effects without auxiliary information. It serves as a crucial preprocessing step for enhancing the performance of downstream tasks in remote sensing images. Conventional approaches often struggle to balance haze removal and detail restoration [...] Read more.

Single remote sensing image dehazing aims to eliminate atmospheric scattering effects without auxiliary information. It serves as a crucial preprocessing step for enhancing the performance of downstream tasks in remote sensing images. Conventional approaches often struggle to balance haze removal and detail restoration under non-uniform haze distributions. To address this issue, we propose a Dual-domain Feature Fusion Network (DFFNet) for remote sensing image dehazing. DFFNet consists of two specialized units: the Frequency Restore Unit (FRU) and the Context Extract Unit (CEU). As haze primarily manifests as low-frequency energy in the frequency domain, the FRU effectively suppresses haze across the entire image by adaptively modulating low-frequency amplitudes. Meanwhile, to reconstruct details attenuated due to dense haze occlusion, we introduce the CEU. This unit extracts multi-scale spatial features to capture contextual information, providing structural guidance for detail reconstruction. Furthermore, we introduce the Dual-Domain Feature Fusion Module (DDFFM) to establish dependencies between features from FRU and CEU via a designed attention mechanism. This leverages spatial contextual information to guide detail reconstruction during frequency domain haze removal. Experiments on the StateHaze1k, RICE and RRSHID datasets demonstrate that DFFNet achieves competitive performance in both visual quality and quantitative metrics. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 8033 KB

Open AccessArticle

PU-DZMS: Point Cloud Upsampling via Dense Zoom Encoder and Multi-Scale Complementary Regression

by Shucong Li, Zhenyu Liu, Tianlei Wang and Zhiheng Zhou

J. Imaging 2025, 11(8), 270; https://doi.org/10.3390/jimaging11080270 - 12 Aug 2025

Viewed by 369

Point cloud imaging technology usually faces the problem of point cloud sparsity, which leads to a lack of important geometric detail. There are many point cloud upsampling networks that have been designed to solve this problem. However, the existing methods have limitations in [...] Read more.

Point cloud imaging technology usually faces the problem of point cloud sparsity, which leads to a lack of important geometric detail. There are many point cloud upsampling networks that have been designed to solve this problem. However, the existing methods have limitations in local–global relation understanding, leading to contour distortion and many local sparse regions. To this end, PU-DZMS is proposed with two components. (1) the Dense Zoom Encoder (DENZE) is designed to capture local–global features by using ZOOM Blocks with a dense connection. The main module in the ZOOM Block is the Zoom Encoder, which embeds a Transformer mechanism into the down–upsampling process to enhance local–global geometric features. The geometric edge of the point cloud would be clear under the DENZE. (2) The Multi-Scale Complementary Regression (MSCR) module is designed to expand the features and regress a dense point cloud. MSCR obtains the features’ geometric distribution differences across scales to ensure geometric continuity, and it regresses new points by adopting cross-scale residual learning. The local sparse regions of the point cloud would be reduced by the MSCR module. The experimental results on the PU-GAN dataset and the PU-Net dataset show that the proposed method performs well on point cloud upsampling tasks. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 2688 KB

Open AccessArticle

Generalized Hierarchical Co-Saliency Learning for Label-Efficient Tracking

by Jie Zhao, Ying Gao, Chunjuan Bo and Dong Wang

Sensors 2025, 25(15), 4691; https://doi.org/10.3390/s25154691 - 29 Jul 2025

Viewed by 217

Visual object tracking is one of the core techniques in human-centered artificial intelligence, which is very useful for human–machine interaction. State-of-the-art tracking methods have shown their robustness and accuracy on many challenges. However, a large amount of videos with precisely dense annotations are [...] Read more.

Visual object tracking is one of the core techniques in human-centered artificial intelligence, which is very useful for human–machine interaction. State-of-the-art tracking methods have shown their robustness and accuracy on many challenges. However, a large amount of videos with precisely dense annotations are required for fully supervised training of their models. Considering that annotating videos frame-by-frame is a labor- and time-consuming workload, reducing the reliance on manual annotations during the tracking models’ training is an important problem to be resolved. To make a trade-off between the annotating costs and the tracking performance, we propose a weakly supervised tracking method based on co-saliency learning, which can be flexibly integrated into various tracking frameworks to reduce annotation costs and further enhance the target representation in current search images. Since our method enables the model to explore valuable visual information from unlabeled frames, and calculate co-salient attention maps based on multiple frames, our weakly supervised methods can obtain competitive performance compared to fully supervised baseline trackers, using only 3.33% of manual annotations. We integrate our method into two CNN-based trackers and a Transformer-based tracker; extensive experiments on four general tracking benchmarks demonstrate the effectiveness of our method. Furthermore, we also demonstrate the advantages of our method on egocentric tracking task; our weakly supervised method obtains 0.538 success on TREK-150, which is superior to prior state-of-the-art fully supervised tracker by 7.7%. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 2583 KB

Open AccessArticle

Helmet Detection in Underground Coal Mines via Dynamic Background Perception with Limited Valid Samples

by Guangfu Wang, Dazhi Sun, Hao Li, Jian Cheng, Pengpeng Yan and Heping Li

Mach. Learn. Knowl. Extr. 2025, 7(3), 64; https://doi.org/10.3390/make7030064 - 9 Jul 2025

Viewed by 474

The underground coal mine environment is complex and dynamic, making the application of visual algorithms for object detection a crucial component of underground safety management as well as a key factor in ensuring the safe operation of workers. We look at this in [...] Read more.

The underground coal mine environment is complex and dynamic, making the application of visual algorithms for object detection a crucial component of underground safety management as well as a key factor in ensuring the safe operation of workers. We look at this in the context of helmet-wearing detection in underground mines, where over 25% of the targets are small objects. To address challenges such as the lack of effective samples for unworn helmets, significant background interference, and the difficulty of detecting small helmet targets, this paper proposes a novel underground helmet-wearing detection algorithm that combines dynamic background awareness with a limited number of valid samples to improve accuracy for underground workers. The algorithm begins by analyzing the distribution of visual surveillance data and spatial biases in underground environments. By using data augmentation techniques, it then effectively expands the number of training samples by introducing positive and negative samples for helmet-wearing detection from ordinary scenes. Thereafter, based on YOLOv10, the algorithm incorporates a background awareness module with region masks to reduce the adverse effects of complex underground backgrounds on helmet-wearing detection. Specifically, it adds a convolution and attention fusion module in the detection head to enhance the model’s perception of small helmet-wearing objects by enlarging the detection receptive field. By analyzing the aspect ratio distribution of helmet wearing data, the algorithm improves the aspect ratio constraints in the loss function, further enhancing detection accuracy. Consequently, it achieves precise detection of helmet-wearing in underground coal mines. Experimental results demonstrate that the proposed algorithm can detect small helmet-wearing objects in complex underground scenes, with a 14% reduction in background false detection rates, and thereby achieving accuracy, recall, and average precision rates of 94.4%, 89%, and 95.4%, respectively. Compared to other mainstream object detection algorithms, the proposed algorithm shows improvements in detection accuracy of 6.7%, 5.1%, and 11.8% over YOLOv9, YOLOv10, and RT-DETR, respectively. The algorithm proposed in this paper can be applied to real-time helmet-wearing detection in underground coal mine scenes, providing safety alerts for standardized worker operations and enhancing the level of underground security intelligence. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Graphical abstract

25 pages, 4786 KB

Open AccessArticle

Diagnosis by SAM Linked to Machine Vision Systems in Olive Pitting Machines

by Luis Villanueva Gandul, Antonio Madueño-Luna, José Miguel Madueño-Luna, Miguel Calixto López-Gordillo and Manuel Jesús González-Ortega

Appl. Sci. 2025, 15(13), 7395; https://doi.org/10.3390/app15137395 - 1 Jul 2025

Viewed by 522

Computer Vision (CV) has proven to be a powerful tool for automation in agri-food industrial processes, offering high-precision solutions tailored to specific working conditions. Recent advancements in Artificial Neural Networks (ANNs) have revolutionized CV applications, enabling systems to autonomously learn and optimize tasks. [...] Read more.

Computer Vision (CV) has proven to be a powerful tool for automation in agri-food industrial processes, offering high-precision solutions tailored to specific working conditions. Recent advancements in Artificial Neural Networks (ANNs) have revolutionized CV applications, enabling systems to autonomously learn and optimize tasks. However, ANN-based approaches often require complex development and lengthy training periods, making their implementation a challenge. In this study, we explore the use of the Segment Anything Model (SAM), a pre-trained neural network developed by META AI in 2023, as an alternative for industrial segmentation tasks in the table olive (Olea europaea L.) processing industry. SAM’s ability to segment objects regardless of scene composition makes it a promising tool to improve the efficiency of olive pitting machines (DRRs). These machines, widely employed in industrial processing, frequently experience mechanical inefficiencies, including the “boat error,” which arises when olives are improperly oriented, leading to defective pitting and pit splinter contamination. Our approach integrates SAM into n CV workflow to diagnose and quantify boat errors without designing or training an additional task-specific ANN. By analyzing the segmented images, we can determine both the percentage of boat errors and the size distribution of olives during transport. The results validate SAM as a feasible option for industrial segmentation, offering a simpler and more accessible solution compared to traditional ANN-based methods. Moreover, our statistical analysis reveals that improper calibration—manifested as size deviations from the nominal value—does not significantly increase boat error rates. This finding supports the adoption of complementary CV technologies to enhance olive pitting efficiency. Future work could investigate real-time integration and the combination of CV with electromechanical correction systems to fully automate and optimize the pitting process. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

37 pages, 7519 KB

Open AccessReview

Causality and “In-the-Wild” Video-Based Person Re-Identification: A Survey

by Md Rashidunnabi, Kailash Hambarde and Hugo Proença

Electronics 2025, 14(13), 2669; https://doi.org/10.3390/electronics14132669 - 1 Jul 2025

Viewed by 582

Video-based person re-identification (re-identification) remains underused in real-world deployments, despite impressive benchmark performance. Most existing models rely on superficial correlations—such as clothing, background, or lighting—that fail to generalize across domains, viewpoints, and temporal variations. This study examines the emerging role of causal reasoning [...] Read more.

Video-based person re-identification (re-identification) remains underused in real-world deployments, despite impressive benchmark performance. Most existing models rely on superficial correlations—such as clothing, background, or lighting—that fail to generalize across domains, viewpoints, and temporal variations. This study examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based re-identification. We provide a structured and critical analysis of methods that leverage structural causal models (SCMs), interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. This study is organized around a novel taxonomy of causal re-identification methods, spanning generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess the practical challenges—scalability, fairness, interpretability, and privacy—that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This study aims to establish a coherent foundation for causal video-based person re-identification and catalyze the next phase of research in this rapidly evolving domain. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

19 pages, 4332 KB

Open AccessArticle

Development of a Computer Vision-Based Method for Sizing and Boat Error Assessment in Olive Pitting Machines

by Luis Villanueva Gandul, Antonio Madueño-Luna, José Miguel Madueño-Luna, Miguel Calixto López-Gordillo and Manuel Jesús González-Ortega

Appl. Sci. 2025, 15(12), 6648; https://doi.org/10.3390/app15126648 - 13 Jun 2025

Cited by 1 | Viewed by 583

Table olive pitting machines (DRRs) are essential in the agri-food industry but face significant limitations that constrain their performance and compromise process reliability. The main defect, known as the “boat error”, results from improper olive orientation during pitting, leading to bone fragmentation, pulp [...] Read more.

Table olive pitting machines (DRRs) are essential in the agri-food industry but face significant limitations that constrain their performance and compromise process reliability. The main defect, known as the “boat error”, results from improper olive orientation during pitting, leading to bone fragmentation, pulp damage, and potential risks to consumer safety. Traditional quality control methods, such as the use of flotation tanks and expert sensory evaluation, rely on destructive sampling, are time-consuming, and reduce overall productivity. To address these challenges, this study presents a novel computer vision (CV) system integrated into a commercial DRR machine. The system captures high-speed images of Gordal olives (Olea europaea regalis) just before pitting; these are later analyzed offline using a custom MATLAB application that applies HSV-based segmentation and morphological analysis to quantify the olive size and orientation. The method accurately identifies boat error cases based on angular thresholds, without interrupting the production flow or damaging the product. The results show that 97% of olives were correctly aligned, with only 1.1% presenting critical misorientation. Additionally, for the first time, the system allowed a detailed evaluation of the olive size distribution at the machine inlet, revealing an unexpected proportion of off-caliber olives. This contamination in sizing suggests a possible link between calibration deviations and the occurrence of boat errors, introducing a new hypothesis for future investigation. While the current implementation is limited to offline analysis, it represents a non-destructive, low-cost, and highly precise diagnostic tool. This work lays the foundation for a deeper understanding of DRR machine behavior and provides a framework for future developments aimed at optimizing their performance through targeted correction strategies. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

15 pages, 35565 KB

Open AccessArticle

Boosting 3D Object Detection with Adversarial Adaptive Data Augmentation Strategy

by Shihao Li, Jingsong Li, Jianghua Fu and Qiuyue Chen

Sensors 2025, 25(11), 3493; https://doi.org/10.3390/s25113493 - 31 May 2025

Viewed by 686

In real-world applications, autonomous driving systems need to handle a variety of complex scenarios, such as object occlusion and lighting changes. In these scenarios, accurately identifying various objects is crucial for perceiving the surrounding environment and making reliable decisions. In this context, the [...] Read more.

In real-world applications, autonomous driving systems need to handle a variety of complex scenarios, such as object occlusion and lighting changes. In these scenarios, accurately identifying various objects is crucial for perceiving the surrounding environment and making reliable decisions. In this context, the fusion of Lidar and cameras is vital for the accuracy of object detection. To this end, we propose an adversarial adaptive data augmentation strategy that introduces virtual adversarial perturbations during the image feature extraction process, effectively enhancing the robustness of 3D object detection methods and enabling them to maintain stable performance when facing environmental changes and data perturbations. Experimental results on the nuScenes-mini and KITTI datasets show that, compared with previous 3D object detection methods, our method not only improves detection accuracy but also demonstrates stronger stability. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

13 pages, 3467 KB

Open AccessArticle

Pattern Matching-Based Denoising for Images with Repeated Sub-Structures

by Anil Kumar Mysore Badarinarayana, Christoph Pratsch, Thomas Lunkenbein and Florian Jug

Mach. Learn. Knowl. Extr. 2025, 7(2), 34; https://doi.org/10.3390/make7020034 - 7 Apr 2025

Viewed by 725

In electron microscopy, obtaining low-noise images is often difficult, especially when examining biological samples or delicate materials. Therefore, the suppression of noise is essential for the analysis of such noisy images. State-of-the-art image denoising methods are dominated by supervised Convolution neural network (CNN)-based [...] Read more.

In electron microscopy, obtaining low-noise images is often difficult, especially when examining biological samples or delicate materials. Therefore, the suppression of noise is essential for the analysis of such noisy images. State-of-the-art image denoising methods are dominated by supervised Convolution neural network (CNN)-based methods. However, supervised CNNs cannot be used if a noise-free ground truth is unavailable. To address this problem, we propose a method that uses re-occurring patterns in images. Our proposed method does not require noise-free images for the denoising task. Instead, it is based on the idea that averaging images with the same signal having independent noise suppresses the overall noise. In order to evaluate the performance of our method, we compare our results with other state-of-the-art denoising methods that do not require a noise-free image. We show that our method is the best for retaining fine image structures. Additionally, we develop a confidence map for evaluating the denoising quality of the proposed method. Furthermore, we analyze the time complexity of the algorithm to ensure scalability and optimize the algorithm to improve the runtime efficiency. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

23 pages, 8209 KB

Open AccessArticle

Spatio-Temporal Transformer with Kolmogorov–Arnold Network for Skeleton-Based Hand Gesture Recognition

by Pengcheng Han, Xin He, Takafumi Matsumaru and Vibekananda Dutta

Sensors 2025, 25(3), 702; https://doi.org/10.3390/s25030702 - 24 Jan 2025

Viewed by 2099

Manually crafted features often suffer from being subjective, having an inadequate accuracy, or lacking in robustness in recognition. Meanwhile, existing deep learning methods often overlook the structural and dynamic characteristics of the human hand, failing to fully explore the contextual information of joints [...] Read more.

Manually crafted features often suffer from being subjective, having an inadequate accuracy, or lacking in robustness in recognition. Meanwhile, existing deep learning methods often overlook the structural and dynamic characteristics of the human hand, failing to fully explore the contextual information of joints in both the spatial and temporal domains. To effectively capture dependencies between the hand joints that are not adjacent but may have potential connections, it is essential to learn long-term relationships. This study proposes a skeleton-based hand gesture recognition framework, the ST-KT, a spatio-temporal graph convolution network, and a transformer with the Kolmogorov–Arnold Network (KAN) model. It incorporates spatio-temporal graph convolution network (ST-GCN) modules and a spatio-temporal transformer module with KAN (KAN–Transformer). ST-GCN modules, which include a spatial graph convolution network (SGCN) and a temporal convolution network (TCN), extract primary features from skeleton sequences by leveraging the strength of graph convolutional networks in the spatio-temporal domain. A spatio-temporal position embedding method integrates node features, enriching representations by including node identities and temporal information. The transformer layer includes a spatial KAN–Transformer (S-KT) and a temporal KAN–Transformer (T-KT), which further extract joint features by learning edge weights and node embeddings, providing richer feature representations and the capability for nonlinear modeling. We evaluated the performance of our method on two challenging skeleton-based dynamic gesture datasets: our method achieved an accuracy of 97.5% on the SHREC’17 track dataset and 94.3% on the DHG-14/28 dataset. These results demonstrate that our proposed method, ST-KT, effectively captures dynamic skeleton changes and complex joint relationships. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

31 pages, 7619 KB

Open AccessArticle

Multi-Label Text Classification Based on Label-Sentence Bi-Attention Fusion Network with Multi-Level Feature Extraction

by Anqi Li and Lin Zhang

Electronics 2025, 14(1), 185; https://doi.org/10.3390/electronics14010185 - 5 Jan 2025

Viewed by 2457

Multi-label text classification (MLTC) aims to assign the most appropriate label or labels to each input text. Previous studies have focused on mining textual information, ignoring the interdependence of labels and texts, thus leading to the loss of information about labels. In addition, [...] Read more.

Multi-label text classification (MLTC) aims to assign the most appropriate label or labels to each input text. Previous studies have focused on mining textual information, ignoring the interdependence of labels and texts, thus leading to the loss of information about labels. In addition, previous studies have tended to focus on the single granularity of information in documents, ignoring the degree of inclination towards labels in different sentences in multi-labeled texts. In order to solve the above problems, this paper proposes a Label-Sentence Bi-Attention Fusion Network (LSBAFN) with multi-level feature extraction for mining multi-granularity information and label information in documents. Specifically, document-level and sentence-level word embeddings are first obtained. Then, the textual relevance of the labels to these two levels is utilized to construct sentence-level textual representations. Next, a multi-level feature extraction mechanism is utilized to acquire a sentence-level textual representation that incorporates contextual information and a document-level textual representation that reflects label features. Subsequently, the label-sentence bi-attention fusion mechanism is used to learn the feature relationships in the two text representations and fuse them. Label attention identifies text features related to labels from the document-level text representation, while sentence attention focuses on the tendency of sentences towards labels. Finally, the effective portion of the fused features is extracted for classification by a multi-layer perceptron. The experimental findings indicate that the LSBAFN can improve the effectiveness of the MLTC task. Compared with the baseline models, the LSBAFN obtains a significant improvement of 0.6% and 7.81% in Micro-F1 and Macro-F1 on the Article Topic dataset and improvements of 1.03% and 0.47% in P@k and 1.02% and 0.38% in nDCG@k on the Software Category dataset and RCV1 dataset. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

24 pages, 7279 KB

Open AccessArticle

An Accurate Book Spine Detection Network Based on Improved Oriented R-CNN

by Haibo Ma, Chaobo Wang, Ang Li, Aide Xu and Dong Han

Sensors 2024, 24(24), 7996; https://doi.org/10.3390/s24247996 - 14 Dec 2024

Viewed by 2195

Book localization is crucial for the development of intelligent book inventory systems, where the high-precision detection of book spines is a critical requirement. However, the varying tilt angles and diverse aspect ratios of books on library shelves often reduce the effectiveness of conventional [...] Read more.

Book localization is crucial for the development of intelligent book inventory systems, where the high-precision detection of book spines is a critical requirement. However, the varying tilt angles and diverse aspect ratios of books on library shelves often reduce the effectiveness of conventional object detection algorithms. To address these challenges, this study proposes an enhanced oriented R-CNN algorithm for book spine detection. First, we replace the standard 3 × 3 convolutions in ResNet50’s residual blocks with deformable convolutions to enhance the network’s capacity for modeling the geometric deformations of book spines. Additionally, the PAFPN (Path Aggregation Feature Pyramid Network) was integrated into the neck structure to enhance multi-scale feature fusion. To further optimize the anchor box design, we introduce an adaptive initial cluster center selection method for K-median clustering. This allows for a more accurate computation of anchor box aspect ratios that are better aligned with the book spine dataset, enhancing the model’s training performance. We conducted comparison experiments between the proposed model and other state-of-the-art models on the book spine dataset, and the results demonstrate that the proposed approach reaches an mAP of 90.22%, which outperforms the baseline algorithm by 4.47 percentage points. Our method significantly improves detection accuracy, making it highly effective for identifying book spines in real-world library environments. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

21 pages, 14612 KB

Open AccessArticle

Corrupted Point Cloud Classification Through Deep Learning with Local Feature Descriptor

by Xian Wu, Xueyi Guo, Hang Peng, Bin Su, Sabbir Ahamod and Fenglin Han

Sensors 2024, 24(23), 7749; https://doi.org/10.3390/s24237749 - 4 Dec 2024

Cited by 2 | Viewed by 1476

Three-dimensional point cloud recognition is a very fundamental work in fields such as autonomous driving and face recognition. However, in real industrial scenarios, input point cloud data are often accompanied by factors such as occlusion, rotation, and noise. These factors make it challenging [...] Read more.

Three-dimensional point cloud recognition is a very fundamental work in fields such as autonomous driving and face recognition. However, in real industrial scenarios, input point cloud data are often accompanied by factors such as occlusion, rotation, and noise. These factors make it challenging to apply existing point cloud classification algorithms in real industrial scenarios. Currently, most studies enhance model robustness from the perspective of neural network structure. However, researchers have found that simply adjusting the neural network structure has proven insufficient in addressing the decline in accuracy caused by data corruption. In this article, we use local feature descriptors as a preprocessing method to extract features from point cloud data and propose a new neural network architecture aligned with these local features, effectively enhancing performance even in extreme cases of data corruption. In addition, we conducted data augmentation to the 10 intentionally selected categories in ModelNet40. Finally, we conducted multiple experiments, including testing the robustness of the model to occlusion and coordinate transformation and then comparing the model with existing SOTA models. Furthermore, in actual scene experiments, we used depth cameras to capture objects and input the obtained data into the established model. The experimental results show that our model outperforms existing popular algorithms when dealing with corrupted point cloud data. Even when the input point cloud data are affected by occlusion or coordinate transformation, our proposed model can maintain high accuracy. This suggests that our method can alleviate the problem of decreased model accuracy caused by the aforementioned factors. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

20 pages, 9588 KB

Open AccessArticle

Research on Video Monitoring Technology for Galloping of OCS Additional Conductors of High-Speed Railway in Strong Wind Zone

by Wentao Zhang, Wenhao Wang, Shanpeng Zhao, Huayu Yuan, Youpeng Zhang, Xiaotong Yao and Guangwu Chen

Sensors 2024, 24(23), 7521; https://doi.org/10.3390/s24237521 - 25 Nov 2024

Viewed by 1024

The strong wind environment causes the additional conductor of the overhead contact system (OCS) of the Lanzhou–Xinjiang high-speed railway to gallop, significantly impacting the safe operation of the train. This paper presents the design of an online monitoring system for the galloping of [...] Read more.

The strong wind environment causes the additional conductor of the overhead contact system (OCS) of the Lanzhou–Xinjiang high-speed railway to gallop, significantly impacting the safe operation of the train. This paper presents the design of an online monitoring system for the galloping of additional conductors in the OCS, utilizing video monitoring for accurate and real-time assessment. Initially, the dynamics of the OCS additional conductor and its operational environment are examined, leading to the selection of suitable data transmission and power supply methods to finalize the camera configuration. Secondly, a preprocessing method for enhancing images of galloping in OCS additional conductors is developed, effectively reducing noise in edge detection through a region chain code clustering analysis. The video monitoring system effectively extracts wire edges, addressing the issues of splitting, breakage, and edge overlap in edge detection, while accurately identifying wire targets in video images. In conclusion, a galloping monitoring test platform is established to extract galloping data from additional conductors through video monitoring. The analysis of the galloping frequency and amplitude facilitates the comprehensive monitoring and assessment of the galloping status of OCS additional conductors. The video monitoring system effectively extracts and analyzes galloping data of the OCS additional conductor, fulfilling the fundamental requirements for the online monitoring of additional conductor galloping, and possesses significant engineering application value. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 9284 KB

Open AccessArticle

BSDA: Bayesian Random Semantic Data Augmentation for Medical Image Classification

by Yaoyao Zhu, Xiuding Cai, Xueyao Wang, Xiaoqing Chen, Zhongliang Fu and Yu Yao

Sensors 2024, 24(23), 7511; https://doi.org/10.3390/s24237511 - 25 Nov 2024

Cited by 2 | Viewed by 1983

Data augmentation is a crucial regularization technique for deep neural networks, particularly in medical imaging tasks with limited data. Deep learning models are highly effective at linearizing features, enabling the alteration of feature semantics through the shifting of latent space representations—an approach known [...] Read more.

Data augmentation is a crucial regularization technique for deep neural networks, particularly in medical imaging tasks with limited data. Deep learning models are highly effective at linearizing features, enabling the alteration of feature semantics through the shifting of latent space representations—an approach known as semantic data augmentation (SDA). The paradigm of SDA involves shifting features in a specified direction. Current SDA methods typically sample the amount of shifting from a Gaussian distribution or the sample variance. However, excessive shifting can lead to changes in data labels, which may negatively impact model performance. To address this issue, we propose a computationally efficient method called Bayesian Random Semantic Data Augmentation (BSDA). BSDA can be seamlessly integrated as a plug-and-play component into any neural network. Our experiments demonstrate that BSDA outperforms competitive methods and is suitable for both 2D and 3D medical image datasets, as well as most medical imaging modalities. Additionally, BSDA is compatible with mainstream neural network models and enhances baseline performance. The code is available online. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 5541 KB

Open AccessArticle

An Efficient Printing Defect Detection Based on YOLOv5-DCN-LSK

by Jie Liu, Zelong Cai, Kuanfang He, Chengqiang Huang, Xianxin Lin, Zhenyong Liu, Zhicong Li and Minsheng Chen

Sensors 2024, 24(23), 7429; https://doi.org/10.3390/s24237429 - 21 Nov 2024

Cited by 1 | Viewed by 1580

During the production process of inkjet printing labels, printing defects can occur, affecting the readability of product information. The distinctive shapes and subtlety of printing defects present a significant challenge for achieving high accuracy and rapid detection in existing deep learning-based defect detection [...] Read more.

During the production process of inkjet printing labels, printing defects can occur, affecting the readability of product information. The distinctive shapes and subtlety of printing defects present a significant challenge for achieving high accuracy and rapid detection in existing deep learning-based defect detection systems. To overcome this problem, we propose an improved model based on the structure of the YOLOv5 network to enhance the detection performance of printing defects. The main improvements include the following: First, we introduce the C3-DCN module to replace the C3 module in the backbone network, enhancing the model’s ability to detect narrow and elongated defects. Secondly, we incorporate the Large Selective Kernel (LSK) and RepConv modules into the feature fusion network, while also integrating a loss function that combines Normalized Gaussian Wasserstein Distance (NWD) with Efficient IoU (EIoU) to enhance the model’s focus on small targets. Finally, we apply model pruning techniques to reduce the model’s size and parameter count, thereby achieving faster detection. Experimental results demonstrate that the improved YOLOv5 achieved a mAP@0.5 of 0.741 after training, with 323.2 FPS, which is 2.7 and 20.8% higher than that of YOLOv5, respectively. The method meets the requirements of high precision and high efficiency for printing defect detection. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 4619 KB

Open AccessArticle

Efficient Video Compression Using Afterimage Representation

by Minseong Jeon and Kyungjoo Cheoi

Sensors 2024, 24(22), 7398; https://doi.org/10.3390/s24227398 - 20 Nov 2024

Viewed by 1555

Recent advancements in large-scale video data have highlighted the growing need for efficient data compression techniques to enhance video processing performance. In this paper, we propose an afterimage-based video compression method that significantly reduces video data volume while maintaining analytical performance. The proposed [...] Read more.

Recent advancements in large-scale video data have highlighted the growing need for efficient data compression techniques to enhance video processing performance. In this paper, we propose an afterimage-based video compression method that significantly reduces video data volume while maintaining analytical performance. The proposed approach utilizes optical flow to adaptively select the number of keyframes based on scene complexity, optimizing compression efficiency. Additionally, object movement masks extracted from keyframes are accumulated over time using alpha blending to generate the final afterimage. Experiments on the UCF-Crime dataset demonstrated that the proposed method achieved a 95.97% compression ratio. In binary classification experiments on normal/abnormal behaviors, the compressed videos maintained performance comparable to the original videos, while in multi-class classification, they outperformed the originals. Notably, classification experiments focused exclusively on abnormal behaviors exhibited a significant 4.25% improvement in performance. Moreover, further experiments showed that large language models (LLMs) can interpret the temporal context of original videos from single afterimages. These findings confirm that the proposed afterimage-based compression technique effectively preserves spatiotemporal information while significantly reducing data size. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 2236 KB

Open AccessArticle

Flame Combustion State Detection Method of Cement Rotary Furnace Based on Improved RE-DDPM and DAF-FasterNet

by Yizhuo Zhang, Zixuan Gu, Huiling Yu and Shen Shi

Appl. Sci. 2024, 14(22), 10640; https://doi.org/10.3390/app142210640 - 18 Nov 2024

Cited by 1 | Viewed by 1112

It is of great significance to effectively identify the flame-burning state of cement rotary kilns to optimize the calcination process and ensure the quality of cement. However, high-temperature and smoke-filled environments bring about difficulties with respect to accurate feature extraction and data acquisition. [...] Read more.

It is of great significance to effectively identify the flame-burning state of cement rotary kilns to optimize the calcination process and ensure the quality of cement. However, high-temperature and smoke-filled environments bring about difficulties with respect to accurate feature extraction and data acquisition. To address these challenges, this paper proposes a novel approach. First, an improved denoising diffusion probability model (RE-DDPM) is proposed. By applying a mask to the burning area and mixing it with the actual image in the denoising process, local diversity generation in the image was realized, and the problem of limited and uneven data was solved. Secondly, this article proposes the DAF-FasterNet model, which incorporates a deformable attention mechanism (DAS) and replaces the ReLU activation function with FReLU so that it can better focus on key flame features and extract finer spatial details. The RE-DDPM method exhibits faster convergence and lower FID scores, indicating that the generated images are more realistic. DAF-FasterNet achieves 98.9% training accuracy, 98.1% test accuracy, and a 22.3 ms delay, making it superior to existing methods in flame state recognition. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 74988 KB

Open AccessArticle

EDMF: A New Benchmark for Multi-Focus Images with the Challenge of Exposure Difference

by Hui Li, Tianyu Shen, Zeyang Zhang, Xuefeng Zhu and Xiaoning Song

Sensors 2024, 24(22), 7287; https://doi.org/10.3390/s24227287 - 14 Nov 2024

Cited by 3 | Viewed by 1323

The goal of the multi-focus image fusion (MFIF) task is to merge images with different focus areas into a single clear image. In real world scenarios, in addition to varying focus attributes, there are also exposure differences between multi-source images, which is an [...] Read more.

The goal of the multi-focus image fusion (MFIF) task is to merge images with different focus areas into a single clear image. In real world scenarios, in addition to varying focus attributes, there are also exposure differences between multi-source images, which is an important but often overlooked issue. To address this drawback and improve the development of the MFIF task, a new image fusion dataset is introduced called EDMF. Compared with the existing public MFIF datasets, it contains more images with exposure differences, which is more challenging and has a numerical advantage. Specifically, EDMF contains 1000 pairs of color images captured in real-world scenes, with some pairs exhibiting significant exposure difference. These images are captured using smartphones, encompassing diverse scenes and lighting conditions. Additionally, in this paper, a baseline method is also proposed, which is an improved version of memory unit-based unsupervised learning. By incorporating multiple adaptive memory units and spatial frequency information, the network is guided to focus on learning features from in-focus areas. This approach enables the network to effectively learn focus features during training, resulting in clear fused images that align with human visual perception. Experimental results demonstrate the effectiveness of the proposed method in handling exposure difference, achieving excellent fusion results in various complex scenes. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 1520 KB

Open AccessArticle

A Strip Steel Surface Defect Salient Object Detection Based on Channel, Spatial and Self-Attention Mechanisms

by Yange Sun, Siyu Geng, Huaping Guo, Chengyi Zheng and Li Zhang

Electronics 2024, 13(21), 4277; https://doi.org/10.3390/electronics13214277 - 31 Oct 2024

Cited by 2 | Viewed by 1411

Strip steel is extensively utilized in industries such as automotive manufacturing and aerospace due to its superior machinability, economic benefits, and adaptability. However, defects on the surface of steel strips, such as inclusions, patches, and scratches, significantly affect the performance and service life [...] Read more.

Strip steel is extensively utilized in industries such as automotive manufacturing and aerospace due to its superior machinability, economic benefits, and adaptability. However, defects on the surface of steel strips, such as inclusions, patches, and scratches, significantly affect the performance and service life of the product. Therefore, the salient object detection of surface defects on strip steel is crucial to ensure the quality of the final product. Many factors, such as the low contrast of surface defects on strip steel, the diversity of defect types, complex texture structures, and irregular defect distribution, hinder existing detection technologies from accurately identifying and segmenting defect areas against complex backgrounds. To address the above problems, we propose a novel detector called S3D-SOD for the salient object detection of strip steel surface defects. For the encoding stage, a residual self-attention block is proposed to explore semantic information cues of high-level features to locate and guide low-level feature information. In addition, we apply a general residual channel and spatial attention to low-level features, enabling the model to adaptively focus on the key channels and spatial areas of feature maps with high resolutions, thereby enhancing the encoder features and accelerating the convergence of the model. For the decoding stage, a simple residual decoder block with an upsampling operation is proposed to realize the integration and interaction of feature information between different layers. Here, the simple residual decoder block is used for feature integration due to the following observation: backbone networks like ResNet and the Swin Transformer, after being pretrained on the large dataset ImageNet and then fine-tuned on a smaller dataset for strip steel surface defects, are capable of extracting feature maps that contain both general image features and the specific characteristics required for the salient object detection of strip steel surface defects. The experimental results on the SD-saliency-900 dataset show that S3D-SOD is better than advanced methods, and it has strong generalization ability and robustness. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

13 pages, 1761 KB

Open AccessArticle

Leveraging Multi-Modality and Enhanced Temporal Networks for Robust Violence Detection

by Gwangho Na, Jaepil Ko and Kyungjoo Cheoi

Mach. Learn. Knowl. Extr. 2024, 6(4), 2422-2434; https://doi.org/10.3390/make6040119 - 28 Oct 2024

Cited by 1 | Viewed by 2117

In this paper, we present a novel model that enhances performance by extending the dual-modality TEVAD model—originally leveraging visual and textual information—into a multi-modal framework that integrates visual, audio, and textual data. Additionally, we refine the multi-scale temporal network (MTN) to improve feature [...] Read more.

In this paper, we present a novel model that enhances performance by extending the dual-modality TEVAD model—originally leveraging visual and textual information—into a multi-modal framework that integrates visual, audio, and textual data. Additionally, we refine the multi-scale temporal network (MTN) to improve feature extraction across multiple temporal scales between video snippets. Using the XD-Violence dataset, which includes audio data for violence detection, we conduct experiments to evaluate various feature fusion methods. The proposed model achieves an average precision (AP) of 83.9%, surpassing the performance of single-modality approaches (visual: 73.9%, audio: 67.1%, textual: 29.9%) and dual-modality approaches (visual + audio: 78.8%, visual + textual: 78.5%). These findings demonstrate that the proposed model outperforms models based on the original MTN and reaffirm the efficacy of multi-modal approaches in enhancing violence detection compared to single- or dual-modality methods. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

15 pages, 70999 KB

Open AccessArticle

Lightweight Infrared Image Denoising Method Based on Adversarial Transfer Learning

by Wen Guo, Yugang Fan and Guanghui Zhang

Sensors 2024, 24(20), 6677; https://doi.org/10.3390/s24206677 - 17 Oct 2024

Cited by 3 | Viewed by 1850

A lightweight infrared image denoising method based on adversarial transfer learning is proposed. The method adopts a generative adversarial network (GAN) framework and optimizes the model through a phased transfer learning strategy. In the initial stage, the generator is pre-trained using a large-scale [...] Read more.

A lightweight infrared image denoising method based on adversarial transfer learning is proposed. The method adopts a generative adversarial network (GAN) framework and optimizes the model through a phased transfer learning strategy. In the initial stage, the generator is pre-trained using a large-scale grayscale visible light image dataset. Subsequently, the generator is fine-tuned on an infrared image dataset using feature transfer techniques. This phased transfer strategy helps address the problem of insufficient sample quantity and variety in infrared images. Through the adversarial process of the GAN, the generator is continuously optimized to enhance its feature extraction capabilities in environments with limited data. Moreover, the generator structure incorporates structural reparameterization technology, edge convolution modules, and progressive multi-scale attention block (PMAB), significantly improving the model’s ability to recognize edge and texture features. During the inference stage, structural reparameterization further optimizes the network architecture, significantly reducing model parameters and complexity and thereby improving denoising efficiency. The experimental results of public and real-world datasets demonstrate that this method effectively removes additive white Gaussian noise from infrared images, showing outstanding denoising performance. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

20 pages, 36201 KB

Open AccessArticle

CPDet: Circle-Permutation-Aware Object Detection for Heat Exchanger Cleaning

by Jinshuo Liang, Yiqiang Wu, Yu Qin, Haoyu Wang, Xiaomao Li, Yan Peng and Xie Xie

Appl. Sci. 2024, 14(19), 9115; https://doi.org/10.3390/app14199115 - 9 Oct 2024

Cited by 1 | Viewed by 1292

Shell–tube heat exchangers are commonly used equipment in large-scale industrial systems of wastewater heat exchange to reclaim the thermal energy generated during industrial processes. However, the internal surfaces of the heat exchanger tubes often accumulate fouling, which subsequently reduces their heat transfer efficiency. [...] Read more.

Shell–tube heat exchangers are commonly used equipment in large-scale industrial systems of wastewater heat exchange to reclaim the thermal energy generated during industrial processes. However, the internal surfaces of the heat exchanger tubes often accumulate fouling, which subsequently reduces their heat transfer efficiency. Therefore, regular cleaning is essential. We aim to detect circle holes on the end surface of the heat exchange tubes to further achieve automated positioning and cleaning tubes. Notably, these holes exhibit a regular distribution. To this end, we propose a circle-permutation-aware object detector for heat exchanger cleaning to sufficiently exploit prior information of the original inputs. Specifically, the interval prior to the extraction module extracts interval information among circle holes based on prior statistics, yielding prior interval context. The following interval prior fusion module slices original images into circle domain and background domain maps according to the prior interval context. For the circle domain map, prior-guided sparse attention using prior a circle–hole diameter as the step divides the circle domain map into patches and performs patch-wise self-attention. The background domain map is multiplied by a hyperparameter weak coefficient matrix. In this way, our method fully leverages prior information to selectively weigh the original inputs to achieve more effective hole detection. In addition, to adapt the hole shape, we adopt the circle representation instead of the rectangle one. Extensive experiments demonstrate that our method achieves state-of-the-art performance and significantly boosts the YOLOv8 baseline by 5.24% mAP₅₀ and 5.25% mAP_50:95. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 12919 KB

Open AccessArticle

Fast Fault Line Selection Technology of Distribution Network Based on MCECA-CloFormer

by Can Ding, Pengcheng Ma, Changhua Jiang and Fei Wang

Appl. Sci. 2024, 14(18), 8270; https://doi.org/10.3390/app14188270 - 13 Sep 2024

Cited by 2 | Viewed by 1401

When a single-phase grounding fault occurs in resonant ground distribution network, the fault characteristics are weak and it is difficult to detect the fault line. Therefore, a fast fault line selection method based on MCECA-CloFormer is proposed in this paper. Firstly, zero-sequence current [...] Read more.

When a single-phase grounding fault occurs in resonant ground distribution network, the fault characteristics are weak and it is difficult to detect the fault line. Therefore, a fast fault line selection method based on MCECA-CloFormer is proposed in this paper. Firstly, zero-sequence current signals were converted into images using the moving average filter method and motif difference field to construct fault data set. Then, the ECA module was modified to MCECA (MultiCNN-ECA) so that it can accept data input from multiple measurement points. Secondly, the lightweight model CloFormer was used in the back end of MCECA module to further perceive the feature map and complete the establishment of the line selection model. Finally, the line selection model was trained, and the information such as model weight was saved. The simulation results demonstrated that the pre-trained MCECA-CloFormer achieved a line selection accuracy of over 98% under 10 dB noise, with a remarkably low single fault processing time of approximately 0.04 s. Moreover, it exhibited suitability for arc high-resistance grounding faults, data-missing cases, neutral-point ungrounded systems, and active distribution networks. In addition, the method was still valid when tested with actual field recording data. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

13 pages, 1876 KB

Open AccessArticle

Comparative Analysis of Solar Radiation Forecasting Techniques in Zacatecas, Mexico

by Martha Isabel Escalona-Llaguno, Luis Octavio Solís-Sánchez, Celina L. Castañeda-Miranda, Carlos A. Olvera-Olvera, Ma. del Rosario Martinez-Blanco, Héctor A. Guerrero-Osuna, Rodrigo Castañeda-Miranda, Germán Díaz-Flórez and Gerardo Ornelas-Vargas

Appl. Sci. 2024, 14(17), 7449; https://doi.org/10.3390/app14177449 - 23 Aug 2024

Cited by 3 | Viewed by 1457

This work explores the prediction of daily Global Horizontal Irradiance (GHI) patterns in the region of Zacatecas, Mexico, using a diverse range of predictive models, encompassing traditional regressors and advanced neural networks like Evolutionary Neural Architecture Search (ENAS), Convolutional Neural Networks (CNN), Recurrent [...] Read more.

This work explores the prediction of daily Global Horizontal Irradiance (GHI) patterns in the region of Zacatecas, Mexico, using a diverse range of predictive models, encompassing traditional regressors and advanced neural networks like Evolutionary Neural Architecture Search (ENAS), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Meta’s Prophet. This work addressing a notable gap in regional research, and aims to democratize access to accurate solar radiation forecasting methodologies. The evaluations carried out using the time series data obtained by Comisión Nacional del Agua (Conagua) covering the period from 2015 to 2018 reveal different performances of the model in different sky conditions, showcasing strengths in forecasting clear and partially cloudy days while encountering challenges with cloudy conditions. Overall, correlation coefficients (r) ranged between 0.55 and 0.72, with Root Mean Square Error % (RMSE %) values spanning from 20.05% to 20.54%, indicating moderate to good predictive accuracy. This study underscores the need for longer datasets to bolster future predictive capabilities. By democratizing access to these predictive tools, this research facilitates informed decision-making in renewable energy planning and sustainable development strategies tailored to the unique environmental dynamics of the region of Zacatecas and comparable regions. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 8959 KB

Open AccessArticle

Enhanced Detection and Recognition of Road Objects in Infrared Imaging Using Multi-Scale Self-Attention

by Poyi Liu, Yunkang Zhang, Guanlun Guo and Jiale Ding

Sensors 2024, 24(16), 5404; https://doi.org/10.3390/s24165404 - 21 Aug 2024

Cited by 3 | Viewed by 1708

In infrared detection scenarios, detecting and recognizing low-contrast and small-sized targets has always been a challenge in the field of computer vision, particularly in complex road traffic environments. Traditional target detection methods usually perform poorly when processing infrared small targets, mainly due to [...] Read more.

In infrared detection scenarios, detecting and recognizing low-contrast and small-sized targets has always been a challenge in the field of computer vision, particularly in complex road traffic environments. Traditional target detection methods usually perform poorly when processing infrared small targets, mainly due to their inability to effectively extract key features and the significant feature loss that occurs during feature transmission. To address these issues, this paper proposes a fast detection and recognition model based on a multi-scale self-attention mechanism, specifically for small road targets in infrared detection scenarios. We first introduce and improve the DyHead structure based on the YOLOv8 algorithm, which employs a multi-head self-attention mechanism to capture target features at various scales and enhance the model’s perception of small targets. Additionally, to prevent information loss during the feature transmission process via the FPN structure in traditional YOLO algorithms, this paper introduces and enhances the Gather-and-Distribute Mechanism. By computing dependencies between features using self-attention, it reallocates attention weights in the feature maps to highlight important features and suppress irrelevant information. These improvements significantly enhance the model’s capability to detect small targets. Moreover, to further increase detection speed, we pruned the network architecture to reduce computational complexity and parameter count, making the model suitable for real-time processing scenarios. Experiments on our self built infrared road traffic dataset (mainly including two types of targets: vehicles and people) show that compared with the baseline, our method achieves a 3.1% improvement in AP and a 2.5% increase in mAP on the VisDrone2019 dataset, showing significant enhancements in both detection accuracy and processing speed for small targets, with improved robustness and adaptability. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

32 pages, 2074 KB

Open AccessArticle

Symbol Detection in Mechanical Engineering Sketches: Experimental Study on Principle Sketches with Synthetic Data Generation and Deep Learning

by Sebastian Bickel, Stefan Goetz and Sandro Wartzack

Appl. Sci. 2024, 14(14), 6106; https://doi.org/10.3390/app14146106 - 12 Jul 2024

Cited by 2 | Viewed by 4755

Digital transformation is omnipresent in our daily lives and its impact is noticeable through new technologies, like smart devices, AI-Chatbots or the changing work environment. This digitalization also takes place in product development, with the integration of many technologies, such as Industry 4.0, [...] Read more.

Digital transformation is omnipresent in our daily lives and its impact is noticeable through new technologies, like smart devices, AI-Chatbots or the changing work environment. This digitalization also takes place in product development, with the integration of many technologies, such as Industry 4.0, digital twins or data-driven methods, to improve the quality of new products and to save time and costs during the development process. Therefore, the use of data-driven methods reusing existing data has great potential. However, data from product design are very diverse and strongly depend on the respective development phase. One of the first few product representations are sketches and drawings, which represent the product in a simplified and condensed way. But, to reuse the data, the existing sketches must be found with an automated approach, allowing the contained information to be utilized. One approach to solve this problem is presented in this paper, with the detection of principle sketches in the early phase of the development process. The aim is to recognize the symbols in these sketches automatically with object detection models. Therefore, existing approaches were analyzed and a new procedure developed, which uses synthetic training data generation. In the next step, a total of six different data generation types were analyzed and tested using six different one- and two-stage detection models. The entire procedure was then evaluated on two unknown test datasets, one focusing on different gearbox variants and a second dataset derived from CAD assemblies. In the last sections the findings are discussed and a procedure with high detection accuracy is determined. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

20 pages, 3739 KB

Open AccessArticle

Automatic Switching of Electric Locomotive Power in Railway Neutral Sections Using Image Processing

by Christopher Thembinkosi Mcineka, Nelendran Pillay, Kevin Moorgas and Shaveen Maharaj

J. Imaging 2024, 10(6), 142; https://doi.org/10.3390/jimaging10060142 - 11 Jun 2024

Cited by 1 | Viewed by 2285

This article presents a computer vision-based approach to switching electric locomotive power supplies as the vehicle approaches a railway neutral section. Neutral sections are defined as a phase break in which the objective is to separate two single-phase traction supplies on an overhead [...] Read more.

This article presents a computer vision-based approach to switching electric locomotive power supplies as the vehicle approaches a railway neutral section. Neutral sections are defined as a phase break in which the objective is to separate two single-phase traction supplies on an overhead railway supply line. This separation prevents flashovers due to high voltages caused by the locomotives shorting both electrical phases. The typical system of switching traction supplies automatically employs the use of electro-mechanical relays and induction magnets. In this paper, an image classification approach is proposed to replace the conventional electro-mechanical system with two unique visual markers that represent the ‘Open’ and ‘Close’ signals to initiate the transition. When the computer vision model detects either marker, the vacuum circuit breakers inside the electrical locomotive will be triggered to their respective positions depending on the identified image. A Histogram of Oriented Gradient technique was implemented for feature extraction during the training phase and a Linear Support Vector Machine algorithm was trained for the target image classification. For the task of image segmentation, the Circular Hough Transform shape detection algorithm was employed to locate the markers in the captured images and provided cartesian plane coordinates for segmenting the Object of Interest. A signal marker classification accuracy of 94% with 75 objects per second was achieved using a Linear Support Vector Machine during the experimental testing phase. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

21 pages, 5602 KB

Open AccessArticle

EMR-HRNet: A Multi-Scale Feature Fusion Network for Landslide Segmentation from Remote Sensing Images

by Yuanhang Jin, Xiaosheng Liu and Xiaobin Huang

Sensors 2024, 24(11), 3677; https://doi.org/10.3390/s24113677 - 6 Jun 2024

Cited by 9 | Viewed by 2162

Landslides constitute a significant hazard to human life, safety and natural resources. Traditional landslide investigation methods demand considerable human effort and expertise. To address this issue, this study introduces an innovative landslide segmentation framework, EMR-HRNet, aimed at enhancing accuracy. Initially, a novel data [...] Read more.

Landslides constitute a significant hazard to human life, safety and natural resources. Traditional landslide investigation methods demand considerable human effort and expertise. To address this issue, this study introduces an innovative landslide segmentation framework, EMR-HRNet, aimed at enhancing accuracy. Initially, a novel data augmentation technique, CenterRep, is proposed, not only augmenting the training dataset but also enabling the model to more effectively capture the intricate features of landslides. Furthermore, this paper integrates a RefConv and Multi-Dconv Head Transposed Attention (RMA) feature pyramid structure into the HRNet model, augmenting the model’s capacity for semantic recognition and expression at various levels. Last, the incorporation of the Dilated Efficient Multi-Scale Attention (DEMA) block substantially widens the model’s receptive field, bolstering its capability to discern local features. Rigorous evaluations on the Bijie dataset and the Sichuan and surrounding area dataset demonstrate that EMR-HRNet outperforms other advanced semantic segmentation models, achieving mIoU scores of 81.70% and 71.68%, respectively. Additionally, ablation studies conducted across the comprehensive dataset further corroborate the enhancements’ efficacy. The results indicate that EMR-HRNet excels in processing satellite and UAV remote sensing imagery, showcasing its significant potential in multi-source optical remote sensing for landslide segmentation. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

24 pages, 21847 KB

Open AccessArticle

A Learnable Viewpoint Evolution Method for Accurate Pose Estimation of Complex Assembled Product

by Delong Zhao, Feifei Kong and Fuzhou Du

Appl. Sci. 2024, 14(11), 4405; https://doi.org/10.3390/app14114405 - 22 May 2024

Viewed by 1456

Balancing adaptability, reliability, and accuracy in vision technology has always been a major bottleneck limiting its application in appearance assurance for complex objects in high-end equipment production. Data-driven deep learning shows robustness to feature diversity but is limited by interpretability and accuracy. The [...] Read more.

Balancing adaptability, reliability, and accuracy in vision technology has always been a major bottleneck limiting its application in appearance assurance for complex objects in high-end equipment production. Data-driven deep learning shows robustness to feature diversity but is limited by interpretability and accuracy. The traditional vision scheme is reliable and can achieve high accuracy, but its adaptability is insufficient. The deeper reason is the lack of appropriate architecture and integration strategies between the learning paradigm and empirical design. To this end, a learnable viewpoint evolution algorithm for high-accuracy pose estimation of complex assembled products under free view is proposed. To alleviate the balance problem of exploration and optimization in estimation, shape-constrained virtual–real matching, evolvable feasible region, and specialized population migration and reproduction strategies are designed. Furthermore, a learnable evolution control mechanism is proposed, which integrates a guided model based on experience and is cyclic-trained with automatically generated effective trajectories to improve the evolution process. Compared to the

{1.69}^{°}, 55.67

mm of the state-of-the-art data-driven method and the

{1.28}^{°}, 77.67

mm of the classic strategy combination, the pose estimation error of complex assembled product in this study is

{0.23}^{°}, 23.71

mm, which proves the effectiveness of the proposed method. Meanwhile, through in-depth exploration, the robustness, parameter sensitivity, and adaptability to the virtual–real appearance variations are sequentially verified. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 38737 KB

Open AccessArticle

A Computer Vision Framework for Structural Analysis of Hand-Drawn Engineering Sketches

by Isaac Joffe, Yuchen Qian, Mohammad Talebi-Kalaleh and Qipei Mei

Sensors 2024, 24(9), 2923; https://doi.org/10.3390/s24092923 - 3 May 2024

Viewed by 2848

Structural engineers are often required to draw two-dimensional engineering sketches for quick structural analysis, either by hand calculation or using analysis software. However, calculation by hand is slow and error-prone, and the manual conversion of a hand-drawn sketch into a virtual model is [...] Read more.

Structural engineers are often required to draw two-dimensional engineering sketches for quick structural analysis, either by hand calculation or using analysis software. However, calculation by hand is slow and error-prone, and the manual conversion of a hand-drawn sketch into a virtual model is tedious and time-consuming. This paper presents a complete and autonomous framework for converting a hand-drawn engineering sketch into an analyzed structural model using a camera and computer vision. In this framework, a computer vision object detection stage initially extracts information about the raw features in the image of the beam diagram. Next, a computer vision number-reading model transcribes any handwritten numerals appearing in the image. Then, feature association models are applied to characterize the relationships among the detected features in order to build a comprehensive structural model. Finally, the structural model generated is analyzed using OpenSees. In the system presented, the object detection model achieves a mean average precision of 99.1%, the number-reading model achieves an accuracy of 99.0%, and the models in the feature association stage achieve accuracies ranging from 95.1% to 99.5%. Overall, the tool analyzes 45.0% of images entirely correctly and the remaining 55.0% of images partially correctly. The proposed framework holds promise for other types of structural sketches, such as trusses and frames. Moreover, it can be a valuable tool for structural engineers that is capable of improving the efficiency, safety, and sustainability of future construction projects. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

19 pages, 15195 KB

Open AccessArticle

Color and Luminance Separated Enhancement for Low-Light Images with Brightness Guidance

by Feng Zhang, Xinran Liu, Changxin Gao and Nong Sang

Sensors 2024, 24(9), 2711; https://doi.org/10.3390/s24092711 - 24 Apr 2024

Cited by 2 | Viewed by 2402

Existing retinex-based low-light image enhancement strategies focus heavily on crafting complex networks for Retinex decomposition but often result in imprecise estimations. To overcome the limitations of previous methods, we introduce a straightforward yet effective strategy for Retinex decomposition, dividing images into colormaps and [...] Read more.

Existing retinex-based low-light image enhancement strategies focus heavily on crafting complex networks for Retinex decomposition but often result in imprecise estimations. To overcome the limitations of previous methods, we introduce a straightforward yet effective strategy for Retinex decomposition, dividing images into colormaps and graymaps as new estimations for reflectance and illumination maps. The enhancement of these maps is separately conducted using a diffusion model for improved restoration. Furthermore, we address the dual challenge of perturbation removal and brightness adjustment in illumination maps by incorporating brightness guidance. This guidance aids in precisely adjusting the brightness while eliminating disturbances, ensuring a more effective enhancement process. Extensive quantitative and qualitative experimental analyses demonstrate that our proposed method improves the performance by approximately

4.4 %

on the LOL dataset compared to other state-of-the-art diffusion-based methods, while also validating the model’s generalizability across multiple real-world datasets. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 9114 KB

Open AccessArticle

Study on Gesture Recognition Method with Two-Stream Residual Network Fusing sEMG Signals and Acceleration Signals

by Zhigang Hu, Shen Wang, Cuisi Ou, Aoru Ge and Xiangpan Li

Sensors 2024, 24(9), 2702; https://doi.org/10.3390/s24092702 - 24 Apr 2024

Cited by 2 | Viewed by 1833

Currently, surface EMG signals have a wide range of applications in human–computer interaction systems. However, selecting features for gesture recognition models based on traditional machine learning can be challenging and may not yield satisfactory results. Considering the strong nonlinear generalization ability of neural [...] Read more.

Currently, surface EMG signals have a wide range of applications in human–computer interaction systems. However, selecting features for gesture recognition models based on traditional machine learning can be challenging and may not yield satisfactory results. Considering the strong nonlinear generalization ability of neural networks, this paper proposes a two-stream residual network model with an attention mechanism for gesture recognition. One branch processes surface EMG signals, while the other processes hand acceleration signals. Segmented networks are utilized to fully extract the physiological and kinematic features of the hand. To enhance the model’s capacity to learn crucial information, we introduce an attention mechanism after global average pooling. This mechanism strengthens relevant features and weakens irrelevant ones. Finally, the deep features obtained from the two branches of learning are fused to further improve the accuracy of multi-gesture recognition. The experiments conducted on the NinaPro DB2 public dataset resulted in a recognition accuracy of 88.25% for 49 gestures. This demonstrates that our network model can effectively capture gesture features, enhancing accuracy and robustness across various gestures. This approach to multi-source information fusion is expected to provide more accurate and real-time commands for exoskeleton robots and myoelectric prosthetic control systems, thereby enhancing the user experience and the naturalness of robot operation. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

21 pages, 1948 KB

Open AccessArticle

Tensorized Discrete Multi-View Spectral Clustering

by Qin Li, Geng Yang, Yu Yun, Yu Lei and Jane You

Electronics 2024, 13(3), 491; https://doi.org/10.3390/electronics13030491 - 24 Jan 2024

Cited by 1 | Viewed by 1641

Discrete spectral clustering directly obtains the discrete labels of data, but existing clustering methods assume that the real-valued indicator matrices of different views are identical, which is unreasonable in practical applications. Moreover, they do not effectively exploit the spatial structure and complementary information [...] Read more.

Discrete spectral clustering directly obtains the discrete labels of data, but existing clustering methods assume that the real-valued indicator matrices of different views are identical, which is unreasonable in practical applications. Moreover, they do not effectively exploit the spatial structure and complementary information embedded in views. To overcome this disadvantage, we propose a tensorized discrete multi-view spectral clustering model that integrates spectral embedding and spectral rotation into a unified framework. Specifically, we leverage the weighted tensor nuclear-norm regularizer on the third-order tensor, which consists of the real-valued indicator matrices of views, to exploit the complementary information embedded in the indicator matrices of different views. Furthermore, we present an adaptively weighted scheme that takes into account the relationship between views for clustering. Finally, discrete labels are obtained by spectral rotation. Experiments show the effectiveness of our proposed method. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

20 pages, 15144 KB

Open AccessArticle

HRYNet: A Highly Robust YOLO Network for Complex Road Traffic Object Detection

by Lindong Tang, Lijun Yun, Zaiqing Chen and Feiyan Cheng

Sensors 2024, 24(2), 642; https://doi.org/10.3390/s24020642 - 19 Jan 2024

Cited by 18 | Viewed by 3869

Object detection is a crucial component of the perception system in autonomous driving. However, the road scene presents a highly intricate environment where the visibility and characteristics of traffic targets are susceptible to attenuation and loss due to various complex road scenarios such [...] Read more.

Object detection is a crucial component of the perception system in autonomous driving. However, the road scene presents a highly intricate environment where the visibility and characteristics of traffic targets are susceptible to attenuation and loss due to various complex road scenarios such as lighting conditions, weather conditions, time of day, background elements, and traffic density. Nevertheless, the current object detection network must exhibit more learning capabilities when detecting such targets. This also exacerbates the loss of features during the feature extraction and fusion process, significantly compromising the network’s detection performance on traffic targets. This paper presents a novel methodology by which to overcome the concerns above, namely HRYNet. Firstly, a dual fusion gradual pyramid structure (DFGPN) is introduced, which employs a two-stage gradient fusion strategy to enhance the generation of more comprehensive multi-scale high-level semantic information, strengthen the interconnection between non-adjacent feature layers, and reduce the information gap that exists between them. HRYNet introduces an anti-interference feature extraction module, the residual multi-head self-attention mechanism (RMA). RMA enhances the target information by implementing a characteristic channel weighting policy, thereby reducing background interference and improving the attention capability of the network. Finally, the detection performance of HRYNet was evaluated by utilizing three datasets: the horizontally collected dataset BDD1000K, the UAV high-altitude dataset Visdrone, and a custom dataset. Experimental results demonstrate that HRYNet achieves a higher mAP_0.5 compared with YOLOv8s on the three datasets, with increases of 10.8%, 16.7%, and 5.5%, respectively. To optimize HRYNet for mobile devices, this study presents Lightweight HRYNet (LHRYNet), which effectively reduces the number of model parameters by 2 million. The results demonstrate that LHRYNet outperforms YOLOv8s in terms of mAP_0.5, with improvements of 6.7%, 10.9%, and 2.5% observed on the three datasets, respectively. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 10627 KB

Open AccessArticle

ScanGuard-YOLO: Enhancing X-ray Prohibited Item Detection with Significant Performance Gains

by Xianning Huang and Yaping Zhang

Sensors 2024, 24(1), 102; https://doi.org/10.3390/s24010102 - 24 Dec 2023

Cited by 7 | Viewed by 2521

To address the problem of low recall rate in the detection of prohibited items in X-ray images due to the severe object occlusion and complex background, an X-ray prohibited item detection network, ScanGuard-YOLO, based on the YOLOv5 architecture, is proposed to effectively improve [...] Read more.

To address the problem of low recall rate in the detection of prohibited items in X-ray images due to the severe object occlusion and complex background, an X-ray prohibited item detection network, ScanGuard-YOLO, based on the YOLOv5 architecture, is proposed to effectively improve the model’s recall rate and the comprehensive metric F1 score. Firstly, the RFB-s module was added to the end part of the backbone, and dilated convolution was used to increase the receptive field of the backbone network to better capture global features. In the neck section, the efficient RepGFPN module was employed to fuse multiscale information from the backbone output. This aimed to capture details and contextual information at various scales, thereby enhancing the model’s understanding and representation capability of the object. Secondly, a novel detection head was introduced to unify scale-awareness, spatial-awareness, and task-awareness altogether, which significantly improved the representation ability of the object detection heads. Finally, the bounding box regression loss function was defined as the WIOUv3 loss, effectively balancing the contribution of low-quality and high-quality samples to the loss. ScanGuard-YOLO was tested on OPIXray and HiXray datasets, showing significant improvements compared to the baseline model. The mean average precision (mAP@0.5) increased by 2.3% and 1.6%, the recall rate improved by 4.5% and 2%, and the F1 score increased by 2.3% and 1%, respectively. The experimental results demonstrate that ScanGuard-YOLO effectively enhances the detection capability of prohibited items in complex backgrounds and exhibits broad prospects for application. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 6588 KB

Open AccessArticle

Autoencoder-Based Visual Anomaly Localization for Manufacturing Quality Control

by Devang Mehta and Noah Klarmann

Mach. Learn. Knowl. Extr. 2024, 6(1), 1-17; https://doi.org/10.3390/make6010001 - 21 Dec 2023

Cited by 10 | Viewed by 4670

Manufacturing industries require the efficient and voluminous production of high-quality finished goods. In the context of Industry 4.0, visual anomaly detection poses an optimistic solution for automatically controlled product quality with high precision. In general, automation based on computer vision is a promising [...] Read more.

Manufacturing industries require the efficient and voluminous production of high-quality finished goods. In the context of Industry 4.0, visual anomaly detection poses an optimistic solution for automatically controlled product quality with high precision. In general, automation based on computer vision is a promising solution to prevent bottlenecks at the product quality checkpoint. We considered recent advancements in machine learning to improve visual defect localization, but challenges persist in obtaining a balanced feature set and database of the wide variety of defects occurring in the production line. Hence, this paper proposes a defect localizing autoencoder with unsupervised class selection by clustering with k-means the features extracted from a pretrained VGG16 network. Moreover, the selected classes of defects are augmented with natural wild textures to simulate artificial defects. The study demonstrates the effectiveness of the defect localizing autoencoder with unsupervised class selection for improving defect detection in manufacturing industries. The proposed methodology shows promising results with precise and accurate localization of quality defects on melamine-faced boards for the furniture industry. Incorporating artificial defects into the training data shows significant potential for practical implementation in real-world quality control scenarios. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 3683 KB

Open AccessArticle

A Weakly Supervised Semantic Segmentation Model of Maize Seedlings and Weed Images Based on Scrawl Labels

by Lulu Zhao, Yanan Zhao, Ting Liu and Hanbing Deng

Sensors 2023, 23(24), 9846; https://doi.org/10.3390/s23249846 - 15 Dec 2023

Cited by 5 | Viewed by 1853

The task of semantic segmentation of maize and weed images using fully supervised deep learning models requires a large number of pixel-level mask labels, and the complex morphology of the maize and weeds themselves can further increase the cost of image annotation. To [...] Read more.

The task of semantic segmentation of maize and weed images using fully supervised deep learning models requires a large number of pixel-level mask labels, and the complex morphology of the maize and weeds themselves can further increase the cost of image annotation. To solve this problem, we proposed a Scrawl Label-based Weakly Supervised Semantic Segmentation Network (SL-Net). SL-Net consists of a pseudo label generation module, encoder, and decoder. The pseudo label generation module converts scrawl labels into pseudo labels that replace manual labels that are involved in network training, improving the backbone network for feature extraction based on the DeepLab-V3+ model and using a migration learning strategy to optimize the training process. The results show that the intersection over union of the pseudo labels that are generated by the pseudo label module with the ground truth is 83.32%, and the cosine similarity is 93.55%. In the semantic segmentation testing of SL-Net for image seedling of maize plants and weeds, the mean intersection over union and average precision reached 87.30% and 94.06%, which is higher than the semantic segmentation accuracy of DeepLab-V3+ and PSPNet under weakly and fully supervised learning conditions. We conduct experiments to demonstrate the effectiveness of the proposed method. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

13 pages, 4444 KB

Open AccessArticle

Go-Game Image Recognition Based on Improved Pix2pix

by Yanxia Zheng and Xiyuan Qian

J. Imaging 2023, 9(12), 273; https://doi.org/10.3390/jimaging9120273 - 7 Dec 2023

Cited by 1 | Viewed by 2729

Go is a game that can be won or lost based on the number of intersections surrounded by black or white pieces. The traditional method is a manual counting method, which is time-consuming and error-prone. In addition, the generalization of the current Go-image-recognition [...] Read more.

Go is a game that can be won or lost based on the number of intersections surrounded by black or white pieces. The traditional method is a manual counting method, which is time-consuming and error-prone. In addition, the generalization of the current Go-image-recognition methods is poor, and accuracy needs to be further improved. To solve these problems, a Go-game image recognition based on an improved pix2pix was proposed. Firstly, a channel-coordinate mixed-attention (CCMA) mechanism was designed by combining channel attention and coordinate attention effectively; therefore, the model could learn the target feature information. Secondly, in order to obtain the long-distance contextual information, a deep dilated-convolution (DDC) module was proposed, which densely linked the dilated convolution with different dilated rates. The experimental results showed that compared with other existing Go-image-recognition methods, such as DenseNet, VGG-16, and Yolo v5, the proposed method could effectively improve the generalization ability and accuracy of a Go-image-recognition model, and the average accuracy rate was over 99.99%. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 11761 KB

Open AccessArticle

RepECN: Making ConvNets Better Again for Efficient Image Super-Resolution

by Qiangpu Chen, Jinghui Qin and Wushao Wen

Sensors 2023, 23(23), 9575; https://doi.org/10.3390/s23239575 - 2 Dec 2023

Cited by 1 | Viewed by 1972

Traditional Convolutional Neural Network (ConvNet, CNN)-based image super-resolution (SR) methods have lower computation costs, making them more friendly for real-world scenarios. However, they suffer from lower performance. On the contrary, Vision Transformer (ViT)-based SR methods have achieved impressive performance recently, but these methods [...] Read more.

Traditional Convolutional Neural Network (ConvNet, CNN)-based image super-resolution (SR) methods have lower computation costs, making them more friendly for real-world scenarios. However, they suffer from lower performance. On the contrary, Vision Transformer (ViT)-based SR methods have achieved impressive performance recently, but these methods often suffer from high computation costs and model storage overhead, making them hard to meet the requirements in practical application scenarios. In practical scenarios, an SR model should reconstruct an image with high quality and fast inference. To handle this issue, we propose a novel CNN-based Efficient Residual ConvNet enhanced with structural Re-parameterization (RepECN) for a better trade-off between performance and efficiency. A stage-to-block hierarchical architecture design paradigm inspired by ViT is utilized to keep the state-of-the-art performance, while the efficiency is ensured by abandoning the time-consuming Multi-Head Self-Attention (MHSA) and by re-designing the block-level modules based on CNN. Specifically, RepECN consists of three structural modules: a shallow feature extraction module, a deep feature extraction, and an image reconstruction module. The deep feature extraction module comprises multiple ConvNet Stages (CNS), each containing 6 Re-Parameterization ConvNet Blocks (RepCNB), a head layer, and a residual connection. The RepCNB utilizes larger kernel convolutions rather than MHSA to enhance the capability of learning long-range dependence. In the image reconstruction module, an upsampling module consisting of nearest-neighbor interpolation and pixel attention is deployed to reduce parameters and maintain reconstruction performance, while bicubic interpolation on another branch allows the backbone network to focus on learning high-frequency information. The extensive experimental results on multiple public benchmarks show that our RepECN can achieve 2.5∼5× faster inference than the state-of-the-art ViT-based SR model with better or competitive super-resolving performance, indicating that our RepECN can reconstruct high-quality images with fast inference. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 2387 KB

Open AccessArticle

Android Malware Classification Based on Fuzzy Hashing Visualization

by Horacio Rodriguez-Bazan, Grigori Sidorov and Ponciano Jorge Escamilla-Ambrosio

Mach. Learn. Knowl. Extr. 2023, 5(4), 1826-1847; https://doi.org/10.3390/make5040088 - 28 Nov 2023

Cited by 5 | Viewed by 3456

The proliferation of Android-based devices has brought about an unprecedented surge in mobile application usage, making the Android ecosystem a prime target for cybercriminals. In this paper, a new method for Android malware classification is proposed. The method implements a convolutional neural network [...] Read more.

The proliferation of Android-based devices has brought about an unprecedented surge in mobile application usage, making the Android ecosystem a prime target for cybercriminals. In this paper, a new method for Android malware classification is proposed. The method implements a convolutional neural network for malware classification using images. The research presents a novel approach to transforming the Android Application Package (APK) into a grayscale image. The image creation utilizes natural language processing techniques for text cleaning, extraction, and fuzzy hashing to represent the decompiled code from the APK in a set of hashes after preprocessing, where the image is composed of n fuzzy hashes that represent an APK. The method was tested on an Android malware dataset with 15,493 samples of five malware types. The proposed method showed an increase in accuracy compared to others in the literature, achieving up to 98.24% in the classification task. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 1471 KB

Open AccessArticle

Attention-Assisted Feature Comparison and Feature Enhancement for Class-Agnostic Counting

by Liang Dong, Yian Yu, Di Zhang and Yan Huo

Sensors 2023, 23(22), 9126; https://doi.org/10.3390/s23229126 - 11 Nov 2023

Viewed by 2103

In this study, we address the class-agnostic counting (CAC) challenge, aiming to count instances in a query image, using just a few exemplars. Recent research has shifted towards few-shot counting (FSC), which involves counting previously unseen object classes. We present ACECount, an FSC [...] Read more.

In this study, we address the class-agnostic counting (CAC) challenge, aiming to count instances in a query image, using just a few exemplars. Recent research has shifted towards few-shot counting (FSC), which involves counting previously unseen object classes. We present ACECount, an FSC framework that combines attention mechanisms and convolutional neural networks (CNNs). ACECount identifies query image–exemplar similarities, using cross-attention mechanisms, enhances feature representations with a feature attention module, and employs a multi-scale regression head, to handle scale variations in CAC. ACECount’s experiments on the FSC-147 dataset exhibited the expected performance. ACECount achieved a reduction of 0.3 in the mean absolute error (MAE) on the validation set and a reduction of 0.26 on the test set of FSC-147, compared to previous methods. Notably, ACECount also demonstrated convincing performance in class-specific counting (CSC) tasks. Evaluation on crowd and vehicle counting datasets revealed that ACECount surpasses FSC algorithms like GMN, FamNet, SAFECount, LOCA, and SPDCN, in terms of performance. These results highlight the robust dataset generalization capabilities of our proposed algorithm. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

20 pages, 10786 KB

Open AccessArticle

A Binary Fast Image Registration Method Based on Fusion Information

by Huaidan Liang, Chenglong Liu, Xueguang Li and Lina Wang

Electronics 2023, 12(21), 4475; https://doi.org/10.3390/electronics12214475 - 31 Oct 2023

Cited by 3 | Viewed by 1786

In the field of airborne aerial imaging, image stitching is often used to expand the field of view. Registration is the foundation of aerial image stitching and directly affects its success and quality. This article develops a fast binary image registration method based [...] Read more.

In the field of airborne aerial imaging, image stitching is often used to expand the field of view. Registration is the foundation of aerial image stitching and directly affects its success and quality. This article develops a fast binary image registration method based on the characteristics of airborne aerial imaging. This method first integrates aircraft parameters and calculates the ground range of the image for coarse registration. Then, based on the characteristics of FAST (Features from Accelerated Segment Test), a new sampling method, named Weighted Angular Diffusion Radial Sampling (WADRS), and matching method are designed. The method proposed in this article can achieve fast registration while ensuring registration accuracy, with a running speed that is approximately four times faster than SURF (Speed Up Robust Features). Additionally, there is no need to manually select any control points before registration. The results indicate that the proposed method can effectively complete remote sensing image registration from different perspectives. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

14 pages, 8937 KB

Open AccessArticle

A Fabric Defect Segmentation Model Based on Improved Swin-Unet with Gabor Filter

by Haitao Xu, Chengming Liu, Shuya Duan, Liangpin Ren, Guozhen Cheng and Bing Hao

Appl. Sci. 2023, 13(20), 11386; https://doi.org/10.3390/app132011386 - 17 Oct 2023

Cited by 5 | Viewed by 2523

Fabric inspection is critical in fabric manufacturing. Automatic detection of fabric defects in the textile industry has always been an important research field. Previously, manual visual inspection was commonly used; however, there were drawbacks such as high labor costs, slow detection speed, and [...] Read more.

Fabric inspection is critical in fabric manufacturing. Automatic detection of fabric defects in the textile industry has always been an important research field. Previously, manual visual inspection was commonly used; however, there were drawbacks such as high labor costs, slow detection speed, and high error rates. Recently, many defect detection methods based on deep learning have been proposed. However, problems need to be solved in the existing methods, such as detection accuracy and interference of complex background textures. In this paper, we propose an efficient segmentation algorithm that combines traditional operators with deep learning networks to alleviate the existing problems. Specifically, we introduce a Gabor filter into the model, which provides the unique advantage of extracting low-level texture features to solve the problem of texture interference and enable the algorithm to converge quickly in the early stages of training. Furthermore, we design a U-shaped architecture that is not completely symmetrical, making model training easier. Meanwhile, multi-stage result fusion is proposed for precise location of defects. The design of this framework significantly improves the detection accuracy and effectively breaks through the limitations of transformer-based models. Experimental results show that on a dataset with one class, a small amount of data, and complex sample background texture, our method achieved 90.03% and 33.70% in ACC and IoU, respectively, which is almost 10% higher than other previous state of the art models. Experimental results based on three different fabric datasets consistently show that the proposed model has excellent performance and great application potential in the industrial field. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

21 pages, 6594 KB

Open AccessArticle

Enhanced YOLOv5: An Efficient Road Object Detection Method

by Hao Chen, Zhan Chen and Hang Yu

Sensors 2023, 23(20), 8355; https://doi.org/10.3390/s23208355 - 10 Oct 2023

Cited by 31 | Viewed by 6796

Accurate identification of road objects is crucial for achieving intelligent traffic systems. However, developing efficient and accurate road object detection methods in complex traffic scenarios has always been a challenging task. The objective of this study was to improve the target detection algorithm [...] Read more.

Accurate identification of road objects is crucial for achieving intelligent traffic systems. However, developing efficient and accurate road object detection methods in complex traffic scenarios has always been a challenging task. The objective of this study was to improve the target detection algorithm for road object detection by enhancing the algorithm’s capability to fuse features of different scales and levels, thereby improving the accurate identification of objects in complex road scenes. We propose an improved method called the Enhanced YOLOv5 algorithm for road object detection. By introducing the Bidirectional Feature Pyramid Network (BiFPN) into the YOLOv5 algorithm, we address the challenges of multi-scale and multi-level feature fusion and enhance the detection capability for objects of different sizes. Additionally, we integrate the Convolutional Block Attention Module (CBAM) into the existing YOLOv5 model to enhance its feature representation capability. Furthermore, we employ a new non-maximum suppression technique called Distance Intersection Over Union (DIOU) to effectively address issues such as misjudgment and duplicate detection when significant overlap occurs between bounding boxes. We use mean Average Precision (mAP) and Precision (P) as evaluation metrics. Finally, experimental results on the BDD100K dataset demonstrate that the improved YOLOv5 algorithm achieves a 1.6% increase in object detection mAP, while the P value increases by 5.3%, effectively improving the accuracy and robustness of road object recognition. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

17 pages, 4295 KB

Open AccessArticle

Few-Shot Air Object Detection Network

by Wei Cai, Xin Wang, Xinhao Jiang, Zhiyong Yang, Xingyu Di and Weijie Gao

Electronics 2023, 12(19), 4133; https://doi.org/10.3390/electronics12194133 - 4 Oct 2023

Cited by 1 | Viewed by 1548

Focusing on the problem of low detection precision caused by the few-shot and multi-scale characteristics of air objects, we propose a few-shot air object detection network (FADNet). We first use a transformer as the backbone network of the model and then build a [...] Read more.

Focusing on the problem of low detection precision caused by the few-shot and multi-scale characteristics of air objects, we propose a few-shot air object detection network (FADNet). We first use a transformer as the backbone network of the model and then build a multi-scale attention mechanism (MAM) to deeply fuse the W- and H-dimension features extracted from the channel dimension and the local and global features extracted from the spatial dimension with the object features to improve the network’s performance when detecting air objects. Second, the neck network is innovated based on the path aggregation network (PANet), resulting in an improved path aggregation network (IPANet). Our proposed network reduces the information lost during feature transfer by introducing a jump connection, utilizes sparse connection convolution, strengthens feature extraction abilities at all scales, and improves the discriminative properties of air object features at all scales. Finally, we propose a multi-scale regional proposal network (MRPN) that can establish multiple RPNs based on the scale types of the output features, utilizing adaptive convolutions to effectively extract object features at each scale and enhancing the ability to process multi-scale information. The experimental results showed that our proposed method exhibits good performance and generalization, especially in the 1-, 2-, 3-, 5-, and 10-shot experiments, with average accuracies of 33.2%, 36.8%, 43.3%, 47.2%, and 60.4%, respectively. The FADNet solves the problems posed by the few-shot characteristics and multi-scale characteristics of air objects, as well as improving the detection capabilities of the air object detection model. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

23 pages, 17933 KB

Open AccessArticle

Dual Histogram Equalization Algorithm Based on Adaptive Image Correction

by Bowen Ye, Sun Jin, Bing Li, Shuaiyu Yan and Deng Zhang

Appl. Sci. 2023, 13(19), 10649; https://doi.org/10.3390/app131910649 - 25 Sep 2023

Cited by 10 | Viewed by 2991

For the visual measurement of moving arm holes in complex working conditions, a histogram equalization algorithm can be used to improve image contrast. To lessen the problems of image brightness shift, image over-enhancement, and gray-level merging that occur with the traditional histogram equalization [...] Read more.

For the visual measurement of moving arm holes in complex working conditions, a histogram equalization algorithm can be used to improve image contrast. To lessen the problems of image brightness shift, image over-enhancement, and gray-level merging that occur with the traditional histogram equalization algorithm, a dual histogram equalization algorithm based on adaptive image correction (AICHE) is proposed. To prevent luminance shifts from occurring during image equalization, the AICHE algorithm protects the average luminance of the input image by improving upon the Otsu algorithm, enabling it to split the histogram. Then, the AICHE algorithm uses the local grayscale correction algorithm to correct the grayscale to prevent the image over-enhancement and gray-level merging problems that arise with the traditional algorithm. It is experimentally verified that the AICHE algorithm can significantly improve the histogram segmentation effect and enhance the contrast and detail information while protecting the average brightness of the input image, and thus the image quality is significantly increased. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

15 pages, 4788 KB

Open AccessArticle

Saliency-Driven Hand Gesture Recognition Incorporating Histogram of Oriented Gradients (HOG) and Deep Learning

by Farzaneh Jafari and Anup Basu

Sensors 2023, 23(18), 7790; https://doi.org/10.3390/s23187790 - 11 Sep 2023

Cited by 4 | Viewed by 2252

Hand gesture recognition is a vital means of communication to convey information between humans and machines. We propose a novel model for hand gesture recognition based on computer vision methods and compare results based on images with complex scenes. While extracting skin color [...] Read more.

Hand gesture recognition is a vital means of communication to convey information between humans and machines. We propose a novel model for hand gesture recognition based on computer vision methods and compare results based on images with complex scenes. While extracting skin color information is an efficient method to determine hand regions, complicated image backgrounds adversely affect recognizing the exact area of the hand shape. Some valuable features like saliency maps, histogram of oriented gradients (HOG), Canny edge detection, and skin color help us maximize the accuracy of hand shape recognition. Considering these features, we proposed an efficient hand posture detection model that improves the test accuracy results to over 99% on the NUS Hand Posture Dataset II and more than 97% on the hand gesture dataset with different challenging backgrounds. In addition, we added noise to around 60% of our datasets. Replicating our experiment, we achieved more than 98% and nearly 97% accuracy on NUS and hand gesture datasets, respectively. Experiments illustrate that the saliency method with HOG has stable performance for a wide range of images with complex backgrounds having varied hand colors and sizes. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

19 pages, 21026 KB

Open AccessArticle

Detection of Wheat Yellow Rust Disease Severity Based on Improved GhostNetV2

by Zhihui Li, Xin Fang, Tong Zhen and Yuhua Zhu

Appl. Sci. 2023, 13(17), 9987; https://doi.org/10.3390/app13179987 - 4 Sep 2023

Cited by 19 | Viewed by 3186

Wheat production safety is facing serious challenges because wheat yellow rust is a worldwide disease. Wheat yellow rust may have no obvious external manifestations in the early stage, and it is difficult to detect whether it is infected, but in the middle and [...] Read more.

Wheat production safety is facing serious challenges because wheat yellow rust is a worldwide disease. Wheat yellow rust may have no obvious external manifestations in the early stage, and it is difficult to detect whether it is infected, but in the middle and late stages of onset, the symptoms of the disease are obvious, though the severity is difficult to distinguish. A traditional deep learning network model has a large number of parameters, a large amount of calculation, a long time for model training, and high resource consumption, making it difficult to transplant to mobile and edge terminals. To address the above issues, this study proposes an optimized GhostNetV2 approach. First, to increase communication between groups, a channel rearrangement operation is performed on the output of the Ghost module. Then, the first five G-bneck layers of the source model GhostNetV2 are replaced with Fused-MBConv to accelerate model training. Finally, to further improve the model’s identification of diseases, the source attention mechanism SE is replaced by ECA. After experimental comparison, the improved algorithm shortens the training time by 37.49%, and the accuracy rate reaches 95.44%, which is 2.24% higher than the GhostNetV2 algorithm. The detection accuracy and speed have major improvements compared with other lightweight model algorithms. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

19 pages, 6677 KB

Open AccessArticle

A Long Skip Connection for Enhanced Color Selectivity in CNN Architectures

by Oscar Sanchez-Cesteros, Mariano Rincon, Margarita Bachiller and Sonia Valladares-Rodriguez

Sensors 2023, 23(17), 7582; https://doi.org/10.3390/s23177582 - 31 Aug 2023

Cited by 1 | Viewed by 2189

Some recent studies show that filters in convolutional neural networks (CNNs) have low color selectivity in datasets of natural scenes such as Imagenet. CNNs, bio-inspired by the visual cortex, are characterized by their hierarchical learning structure which appears to gradually transform the representation [...] Read more.

Some recent studies show that filters in convolutional neural networks (CNNs) have low color selectivity in datasets of natural scenes such as Imagenet. CNNs, bio-inspired by the visual cortex, are characterized by their hierarchical learning structure which appears to gradually transform the representation space. Inspired by the direct connection between the LGN and V4, which allows V4 to handle low-level information closer to the trichromatic input in addition to processed information that comes from V2/V3, we propose the addition of a long skip connection (LSC) between the first and last blocks of the feature extraction stage to allow deeper parts of the network to receive information from shallower layers. This type of connection improves classification accuracy by combining simple-visual and complex-abstract features to create more color-selective ones. We have applied this strategy to classic CNN architectures and quantitatively and qualitatively analyzed the improvement in accuracy while focusing on color selectivity. The results show that, in general, skip connections improve accuracy, but LSC improves it even more and enhances the color selectivity of the original CNN architectures. As a side result, we propose a new color representation procedure for organizing and filtering feature maps, making their visualization more manageable for qualitative color selectivity analysis. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

19 pages, 2992 KB

Open AccessArticle

MCMNET: Multi-Scale Context Modeling Network for Temporal Action Detection

by Haiping Zhang, Fuxing Zhou, Conghao Ma, Dongjing Wang and Wanjun Zhang

Sensors 2023, 23(17), 7563; https://doi.org/10.3390/s23177563 - 31 Aug 2023

Cited by 2 | Viewed by 1925

Temporal action detection is a very important and challenging task in the field of video understanding, especially for datasets with significant differences in action duration. The temporal relationships between the action instances contained in these datasets are very complex. For such videos, it [...] Read more.

Temporal action detection is a very important and challenging task in the field of video understanding, especially for datasets with significant differences in action duration. The temporal relationships between the action instances contained in these datasets are very complex. For such videos, it is necessary to capture information with a richer temporal distribution as much as possible. In this paper, we propose a dual-stream model that can model contextual information at multiple temporal scales. First, the input video is divided into two resolution streams, followed by a Multi-Resolution Context Aggregation module to capture multi-scale temporal information. Additionally, an Information Enhancement module is added after the high-resolution input stream to model both long-range and short-range contexts. Finally, the outputs of the two modules are merged to obtain features with rich temporal information for action localization and classification. We conducted experiments on three datasets to evaluate the proposed approach. On ActivityNet-v1.3, an average mAP (mean Average Precision) of 32.83% was obtained. On Charades, the best performance was obtained, with an average mAP of 27.3%. On TSU (Toyota Smarthome Untrimmed), an average mAP of 33.1% was achieved. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

16 pages, 9000 KB

Open AccessArticle

Infrared Dim and Small Target Sequence Dataset Generation Method Based on Generative Adversarial Networks

by Leihong Zhang, Weihong Lin, Zimin Shen, Dawei Zhang, Banglian Xu, Kaimin Wang and Jian Chen

Electronics 2023, 12(17), 3625; https://doi.org/10.3390/electronics12173625 - 28 Aug 2023

Cited by 7 | Viewed by 2288

With the development of infrared technology, infrared dim and small target detection plays a vital role in precision guidance applications. To address the problems of insufficient dataset coverage and huge actual shooting costs in infrared dim and small target detection methods, this paper [...] Read more.

With the development of infrared technology, infrared dim and small target detection plays a vital role in precision guidance applications. To address the problems of insufficient dataset coverage and huge actual shooting costs in infrared dim and small target detection methods, this paper proposes a method for generating infrared dim and small target sequence datasets based on generative adversarial networks (GANs). Specifically, first, the improved deep convolutional generative adversarial network (DCGAN) model is used to generate clear images of the infrared sky background. Then, target–background sequence images are constructed using multi-scale feature extraction and improved conditional generative adversarial networks. This method fully considers the infrared characteristics of the target and the background, which can achieve effective expansion of the image data and provide a test set for the infrared small target detection and recognition algorithm. In addition, the classifier’s performance can be improved by expanding the training set, which enhances the accuracy and effect of infrared dim and small target detection based on deep learning. After experimental evaluation, the dataset generated by this method is similar to the real infrared dataset, and the model detection accuracy can be improved after training with the latest deep learning model. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

23 pages, 9230 KB

Open AccessArticle

Unification of Road Scene Segmentation Strategies Using Multistream Data and Latent Space Attention

by August J. Naudé and Herman C. Myburgh

Sensors 2023, 23(17), 7355; https://doi.org/10.3390/s23177355 - 23 Aug 2023

Viewed by 1714

Road scene understanding, as a field of research, has attracted increasing attention in recent years. The development of road scene understanding capabilities that are applicable to real-world road scenarios has seen numerous complications. This has largely been due to the cost and complexity [...] Read more.

Road scene understanding, as a field of research, has attracted increasing attention in recent years. The development of road scene understanding capabilities that are applicable to real-world road scenarios has seen numerous complications. This has largely been due to the cost and complexity of achieving human-level scene understanding, at which successful segmentation of road scene elements can be achieved with a mean intersection over union score close to 1.0. There is a need for more of a unified approach to road scene segmentation for use in self-driving systems. Previous works have demonstrated how deep learning methods can be combined to improve the segmentation and perception performance of road scene understanding systems. This paper proposes a novel segmentation system that uses fully connected networks, attention mechanisms, and multiple-input data stream fusion to improve segmentation performance. Results show comparable performance compared to previous works, with a mean intersection over union of 87.4% on the Cityscapes dataset. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 5485 KB

Open AccessArticle

Vision Transformer Customized for Environment Detection and Collision Prediction to Assist the Visually Impaired

by Nasrin Bayat, Jong-Hwan Kim, Renoa Choudhury, Ibrahim F. Kadhim, Zubaidah Al-Mashhadani, Mark Aldritz Dela Virgen, Reuben Latorre, Ricardo De La Paz and Joon-Hyuk Park

J. Imaging 2023, 9(8), 161; https://doi.org/10.3390/jimaging9080161 - 15 Aug 2023

Cited by 6 | Viewed by 3261

This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the [...] Read more.

This paper presents a system that utilizes vision transformers and multimodal feedback modules to facilitate navigation and collision avoidance for the visually impaired. By implementing vision transformers, the system achieves accurate object detection, enabling the real-time identification of objects in front of the user. Semantic segmentation and the algorithms developed in this work provide a means to generate a trajectory vector of all identified objects from the vision transformer and to detect objects that are likely to intersect with the user’s walking path. Audio and vibrotactile feedback modules are integrated to convey collision warning through multimodal feedback. The dataset used to create the model was captured from both indoor and outdoor settings under different weather conditions at different times across multiple days, resulting in 27,867 photos consisting of 24 different classes. Classification results showed good performance (95% accuracy), supporting the efficacy and reliability of the proposed model. The design and control methods of the multimodal feedback modules for collision warning are also presented, while the experimental validation concerning their usability and efficiency stands as an upcoming endeavor. The demonstrated performance of the vision transformer and the presented algorithms in conjunction with the multimodal feedback modules show promising prospects of its feasibility and applicability for the navigation assistance of individuals with vision impairment. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 4489 KB

Open AccessArticle

Center Deviation Measurement of Color Contact Lenses Based on a Deep Learning Model and Hough Circle Transform

by Gi-nam Kim, Sung-hoon Kim, In Joo, Gui-bae Kim and Kwan-hee Yoo

Sensors 2023, 23(14), 6533; https://doi.org/10.3390/s23146533 - 19 Jul 2023

Cited by 6 | Viewed by 2347

Ensuring the quality of color contact lenses is vital, particularly in detecting defects during their production since they are directly worn on the eyes. One significant defect is the “center deviation (CD) defect”, where the colored area (CA) deviates from the center point. [...] Read more.

Ensuring the quality of color contact lenses is vital, particularly in detecting defects during their production since they are directly worn on the eyes. One significant defect is the “center deviation (CD) defect”, where the colored area (CA) deviates from the center point. Measuring the extent of deviation of the CA from the center point is necessary to detect these CD defects. In this study, we propose a method that utilizes image processing and analysis techniques for detecting such defects. Our approach involves employing semantic segmentation to simplify the image and reduce noise interference and utilizing the Hough circle transform algorithm to measure the deviation of the center point of the CA in color contact lenses. Experimental results demonstrated that our proposed method achieved a 71.2% reduction in error compared with existing research methods. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

28 pages, 4274 KB

Open AccessArticle

Improving Small-Scale Human Action Recognition Performance Using a 3D Heatmap Volume

by Lin Yuan, Zhen He, Qiang Wang, Leiyang Xu and Xiang Ma

Sensors 2023, 23(14), 6364; https://doi.org/10.3390/s23146364 - 13 Jul 2023

Cited by 8 | Viewed by 3630

In recent years, skeleton-based human action recognition has garnered significant research attention, with proposed recognition or segmentation methods typically validated on large-scale coarse-grained action datasets. However, there remains a lack of research on the recognition of small-scale fine-grained human actions using deep learning [...] Read more.

In recent years, skeleton-based human action recognition has garnered significant research attention, with proposed recognition or segmentation methods typically validated on large-scale coarse-grained action datasets. However, there remains a lack of research on the recognition of small-scale fine-grained human actions using deep learning methods, which have greater practical significance. To address this gap, we propose a novel approach based on heatmap-based pseudo videos and a unified, general model applicable to all modality datasets. Leveraging anthropometric kinematics as prior information, we extract common human motion features among datasets through an ad hoc pre-trained model. To overcome joint mismatch issues, we partition the human skeleton into five parts, a simple yet effective technique for information sharing. Our approach is evaluated on two datasets, including the public Nursing Activities and our self-built Tai Chi Action dataset. Results from linear evaluation protocol and fine-tuned evaluation demonstrate that our pre-trained model effectively captures common motion features among human actions and achieves steady and precise accuracy across all training settings, while mitigating network overfitting. Notably, our model outperforms state-of-the-art models in recognition accuracy when fusing joint and limb modality features along the channel dimension. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

16 pages, 3234 KB

Open AccessArticle

Unsupervised Vehicle Re-Identification Based on Cross-Style Semi-Supervised Pre-Training and Feature Cross-Division

by Guowei Zhan, Qi Wang, Weidong Min, Qing Han, Haoyu Zhao and Zitai Wei

Electronics 2023, 12(13), 2931; https://doi.org/10.3390/electronics12132931 - 3 Jul 2023

Cited by 2 | Viewed by 1832

Vehicle Re-Identification (Re-ID) based on Unsupervised Domain Adaptation (UDA) has shown promising performance. However, two main issues still exist: (1) existing methods that use Generative Adversarial Networks (GANs) for domain gap alleviation combine supervised learning with hard labels of the source domain, resulting [...] Read more.

Vehicle Re-Identification (Re-ID) based on Unsupervised Domain Adaptation (UDA) has shown promising performance. However, two main issues still exist: (1) existing methods that use Generative Adversarial Networks (GANs) for domain gap alleviation combine supervised learning with hard labels of the source domain, resulting in a mismatch between style transfer data and hard labels; (2) pseudo label assignment in the fine-tuning stage is solely determined by similarity measures of global features using clustering algorithms, leading to inevitable label noise in generated pseudo labels. To tackle these issues, this paper proposes an unsupervised vehicle re-identification framework based on cross-style semi-supervised pre-training and feature cross-division. The framework consists of two parts: cross-style semi-supervised pre-training (CSP) and feature cross-division (FCD) for model fine-tuning. The CSP module generates style transfer data containing source domain content and target domain style using a style transfer network, and then pre-trains the model in a semi-supervised manner using both source domain and style transfer data. A pseudo-label reassignment strategy is designed to generate soft labels assigned to the style transfer data. The FCD module obtains feature partitions through a novel interactive division to reduce the dependence of pseudo-labels on global features, and the final similarity measurement combines the results of partition features and global features. Experimental results on the VehicleID and VeRi-776 datasets show that the proposed method outperforms existing unsupervised vehicle re-identification methods. Compared with the last best method on each dataset, the method proposed in this paper improves the mAP by 0.63% and the Rank-1 by 0.73% on the three sub-datasets of VehicleID on average, and it improves mAP by 0.9% and Rank-1 by 1% on VeRi-776 dataset. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 595 KB

Open AccessArticle

Transfer Learning for Sentiment Classification Using Bidirectional Encoder Representations from Transformers (BERT) Model

by Ali Areshey and Hassan Mathkour

Sensors 2023, 23(11), 5232; https://doi.org/10.3390/s23115232 - 31 May 2023

Cited by 20 | Viewed by 5476

Sentiment is currently one of the most emerging areas of research due to the large amount of web content coming from social networking websites. Sentiment analysis is a crucial process for recommending systems for most people. Generally, the purpose of sentiment analysis is [...] Read more.

Sentiment is currently one of the most emerging areas of research due to the large amount of web content coming from social networking websites. Sentiment analysis is a crucial process for recommending systems for most people. Generally, the purpose of sentiment analysis is to determine an author’s attitude toward a subject or the overall tone of a document. There is a huge collection of studies that make an effort to predict how useful online reviews will be and have produced conflicting results on the efficacy of different methodologies. Furthermore, many of the current solutions employ manual feature generation and conventional shallow learning methods, which restrict generalization. As a result, the goal of this research is to develop a general approach using transfer learning by applying the “BERT (Bidirectional Encoder Representations from Transformers)”-based model. The efficiency of BERT classification is then evaluated by comparing it with similar machine learning techniques. In the experimental evaluation, the proposed model demonstrated superior performance in terms of outstanding prediction and high accuracy compared to earlier research. Comparative tests conducted on positive and negative Yelp reviews reveal that fine-tuned BERT classification performs better than other approaches. In addition, it is observed that BERT classifiers using batch size and sequence length significantly affect classification performance. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

22 pages, 5873 KB

Open AccessArticle

Lightweight Multiscale CNN Model for Wheat Disease Detection

by Xin Fang, Tong Zhen and Zhihui Li

Appl. Sci. 2023, 13(9), 5801; https://doi.org/10.3390/app13095801 - 8 May 2023

Cited by 27 | Viewed by 5227

Wheat disease detection is crucial for disease diagnosis, pesticide application optimization, disease control, and wheat yield and quality improvement. However, the detection of wheat diseases is difficult due to their various types. Detecting wheat diseases in complex fields is also challenging. Traditional models [...] Read more.

Wheat disease detection is crucial for disease diagnosis, pesticide application optimization, disease control, and wheat yield and quality improvement. However, the detection of wheat diseases is difficult due to their various types. Detecting wheat diseases in complex fields is also challenging. Traditional models are difficult to apply to mobile devices because they have large parameters, and high computation and resource requirements. To address these issues, this paper combines the residual module and the inception module to construct a lightweight multiscale CNN model, which introduces the CBAM and ECA modules into the residual block, enhances the model’s attention to diseases, and reduces the influence of complex backgrounds on disease recognition. The proposed method has an accuracy rate of 98.7% on the test dataset, which is higher than classic convolutional neural networks such as AlexNet, VGG16, and InceptionresnetV2 and lightweight models such as MobileNetV3 and EfficientNetb0. The proposed model has superior performance and can be applied to mobile terminals to quickly identify wheat diseases. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

19 pages, 8938 KB

Open AccessArticle

Development of an Accurate and Automated Quality Inspection System for Solder Joints on Aviation Plugs Using Fine-Tuned YOLOv5 Models

by Junwei Sha, Junpu Wang, Huanran Hu, Yongqiang Ye and Guili Xu

Appl. Sci. 2023, 13(9), 5290; https://doi.org/10.3390/app13095290 - 23 Apr 2023

Cited by 13 | Viewed by 3363

The quality inspection of solder joints on aviation plugs is extremely important in modern manufacturing industries. However, this task is still mostly performed by skilled workers after welding operations, posing the problems of subjective judgment and low efficiency. To address these issues, an [...] Read more.

The quality inspection of solder joints on aviation plugs is extremely important in modern manufacturing industries. However, this task is still mostly performed by skilled workers after welding operations, posing the problems of subjective judgment and low efficiency. To address these issues, an accurate and automated detection system using fine-tuned YOLOv5 models is developed in this paper. Firstly, we design an intelligent image acquisition system to obtain the high-resolution image of each solder joint automatically. Then, a two-phase approach is proposed for fast and accurate weld quality detection. In the first phase, a fine-tuned YOLOv5 model is applied to extract the region of interest (ROI), i.e., the row of solder joints to be inspected, within the whole image. With the sliding platform, the ROI is automatically moved to the center of the image to enhance its imaging clarity. Subsequently, another fine-tuned YOLOv5 model takes this adjusted ROI as input and realizes quality assessment. Finally, a concise and easy-to-use GUI has been designed and deployed in real production lines. Experimental results in the actual production line show that the proposed method can achieve a detection accuracy of more than 97.5% with a detection speed of about 0.1 s, which meets the needs of actual production Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

15 pages, 7610 KB

Open AccessArticle

FM-STDNet: High-Speed Detector for Fast-Moving Small Targets Based on Deep First-Order Network Architecture

by Xinyu Hu, Defeng Kong, Xiyang Liu, Junwei Zhang and Daode Zhang

Electronics 2023, 12(8), 1829; https://doi.org/10.3390/electronics12081829 - 12 Apr 2023

Cited by 8 | Viewed by 2063

Identifying objects of interest from digital vision signals is a core task of intelligent systems. However, fast and accurate identification of small moving targets in real-time has become a bottleneck in the field of target detection. In this paper, the problem of real-time [...] Read more.

Identifying objects of interest from digital vision signals is a core task of intelligent systems. However, fast and accurate identification of small moving targets in real-time has become a bottleneck in the field of target detection. In this paper, the problem of real-time detection of the fast-moving printed circuit board (PCB) tiny targets is investigated. This task is very challenging because PCB defects are usually small compared to the whole PCB board, and due to the pursuit of production efficiency, the actual production PCB moving speed is usually very fast, which puts higher requirements on the real-time of intelligent systems. To this end, a new model of FM-STDNet (Fast Moving Small Target Detection Network) is proposed based on the well-known deep learning detector YOLO (You Only Look Once) series model. First, based on the SPPNet (Spatial Pyramid Pooling Networks) network, a new SPPFCSP (Spatial Pyramid Pooling Fast Cross Stage Partial Network) spatial pyramid pooling module is designed to adapt to the extraction of different scale size features of different size input images, which helps retain the high semantic information of smaller features; then, the anchor-free mode is introduced to directly classify the regression prediction information and do the structural reparameterization construction to design a new high-speed prediction head RepHead to further improve the operation speed of the detector. The experimental results show that the proposed detector achieves 99.87% detection accuracy at the fastest speed compared to state-of-the-art depth detectors such as YOLOv3, Faster R-CNN, and TDD-Net in the fast-moving PCB surface defect detection task. The new model of FM-STDNet provides an effective reference for the fast-moving small target detection task. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

20 pages, 19138 KB

Open AccessArticle

Improved Mask R-CNN Multi-Target Detection and Segmentation for Autonomous Driving in Complex Scenes

by Shuqi Fang, Bin Zhang and Jingyu Hu

Sensors 2023, 23(8), 3853; https://doi.org/10.3390/s23083853 - 10 Apr 2023

Cited by 51 | Viewed by 9304

Vision-based target detection and segmentation has been an important research content for environment perception in autonomous driving, but the mainstream target detection and segmentation algorithms have the problems of low detection accuracy and poor mask segmentation quality for multi-target detection and segmentation in [...] Read more.

Vision-based target detection and segmentation has been an important research content for environment perception in autonomous driving, but the mainstream target detection and segmentation algorithms have the problems of low detection accuracy and poor mask segmentation quality for multi-target detection and segmentation in complex traffic scenes. To address this problem, this paper improved the Mask R-CNN by replacing the backbone network ResNet with the ResNeXt network with group convolution to further improve the feature extraction capability of the model. Furthermore, a bottom-up path enhancement strategy was added to the Feature Pyramid Network (FPN) to achieve feature fusion, while an efficient channel attention module (ECA) was added to the backbone feature extraction network to optimize the high-level low resolution semantic information graph. Finally, the bounding box regression loss function smooth L1 loss was replaced by CIoU loss to speed up the model convergence and minimize the error. The experimental results showed that the improved Mask R-CNN algorithm achieved 62.62% mAP for target detection and 57.58% mAP for segmentation accuracy on the publicly available CityScapes autonomous driving dataset, which were 4.73% and 3.96%% better than the original Mask R-CNN algorithm, respectively. The migration experiments showed that it has good detection and segmentation effects in each traffic scenario of the publicly available BDD autonomous driving dataset. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

15 pages, 5355 KB

Open AccessArticle

Insights into Batch Selection for Event-Camera Motion Estimation

by Juan L. Valerdi, Chiara Bartolozzi and Arren Glover

Sensors 2023, 23(7), 3699; https://doi.org/10.3390/s23073699 - 3 Apr 2023

Cited by 2 | Viewed by 2532

Event cameras measure scene changes with high temporal resolutions, making them well-suited for visual motion estimation. The activation of pixels results in an asynchronous stream of digital data (events), which rolls continuously over time without the discrete temporal boundaries typical of frame-based cameras [...] Read more.

Event cameras measure scene changes with high temporal resolutions, making them well-suited for visual motion estimation. The activation of pixels results in an asynchronous stream of digital data (events), which rolls continuously over time without the discrete temporal boundaries typical of frame-based cameras (where a data packet or frame is emitted at a fixed temporal rate). As such, it is not trivial to define a priori how to group/accumulate events in a way that is sufficient for computation. The suitable number of events can greatly vary for different environments, motion patterns, and tasks. In this paper, we use neural networks for rotational motion estimation as a scenario to investigate the appropriate selection of event batches to populate input tensors. Our results show that batch selection has a large impact on the results: training should be performed on a wide variety of different batches, regardless of the batch selection method; a simple fixed-time window is a good choice for inference with respect to fixed-count batches, and it also demonstrates comparable performance to more complex methods. Our initial hypothesis that a minimal amount of events is required to estimate motion (as in contrast maximization) is not valid when estimating motion with a neural network. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

10 pages, 1549 KB

Open AccessArticle

Self-Supervised Facial Motion Representation Learning via Contrastive Subclips

by Zheng Sun, Shad A. Torrie, Andrew W. Sumsion and Dah-Jye Lee

Electronics 2023, 12(6), 1369; https://doi.org/10.3390/electronics12061369 - 13 Mar 2023

Cited by 2 | Viewed by 1865

Facial motion representation learning has become an exciting research topic, since biometric technologies are becoming more common in our daily lives. One of its applications is identity verification. After recording a dynamic facial motion video for enrollment, the user needs to show a [...] Read more.

Facial motion representation learning has become an exciting research topic, since biometric technologies are becoming more common in our daily lives. One of its applications is identity verification. After recording a dynamic facial motion video for enrollment, the user needs to show a matched facial appearance and make a facial motion the same as the enrollment for authentication. Some recent research papers have discussed the benefits of this new biometric technology and reported promising results for both static and dynamic facial motion verification tasks. Our work extends the existing approaches and introduces compound facial actions, which contain more than one dominant facial action in one utterance. We propose a new self-supervised pretraining method called contrastive subclips that improves the model performance with these more complex and secure facial motions. The experimental results show that the contrastive subclips method improves upon the baseline approaches, and the model performance for test data can reach 89.7% average precision. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

30 pages, 752 KB

Open AccessReview

The Challenges of Recognizing Offline Handwritten Chinese: A Technical Review

by Lu Shen, Bidong Chen, Jianjing Wei, Hui Xu, Su-Kit Tang and Silvia Mirri

Appl. Sci. 2023, 13(6), 3500; https://doi.org/10.3390/app13063500 - 9 Mar 2023

Cited by 12 | Viewed by 7393

Offline handwritten Chinese recognition is an important research area of pattern recognition, including offline handwritten Chinese character recognition (offline HCCR) and offline handwritten Chinese text recognition (offline HCTR), which are closely related to daily life. With new deep learning techniques and the combination [...] Read more.

Offline handwritten Chinese recognition is an important research area of pattern recognition, including offline handwritten Chinese character recognition (offline HCCR) and offline handwritten Chinese text recognition (offline HCTR), which are closely related to daily life. With new deep learning techniques and the combination with other domain knowledge, offline handwritten Chinese recognition has gained breakthroughs in methods and performance in recent years. However, there have yet to be articles that provide a technical review of this field since 2016. In light of this, this paper reviews the research progress and challenges of offline handwritten Chinese recognition based on traditional techniques, deep learning methods, methods combining deep learning with traditional techniques, and knowledge from other areas from 2016 to 2022. Firstly, it introduces the research background and status of handwritten Chinese recognition, standard datasets, and evaluation metrics. Secondly, a comprehensive summary and analysis of offline HCCR and offline HCTR approaches during the last seven years is provided, along with an explanation of their concepts, specifics, and performances. Finally, the main research problems in this field over the past few years are presented. The challenges still exist in offline handwritten Chinese recognition are discussed, aiming to inspire future research work. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

18 pages, 3434 KB

Open AccessArticle

Hybrid of Deep Learning and Word Embedding in Generating Captions: Image-Captioning Solution for Geological Rock Images

by Agus Nursikuwagus, Rinaldi Munir and Masayu Leylia Khodra

J. Imaging 2022, 8(11), 294; https://doi.org/10.3390/jimaging8110294 - 22 Oct 2022

Cited by 5 | Viewed by 4025

Captioning is the process of assembling a description for an image. Previous research on captioning has usually focused on foreground objects. In captioning concepts, there are two main objects for discussion: background object and foreground object. In contrast to the previous image-captioning research, [...] Read more.

Captioning is the process of assembling a description for an image. Previous research on captioning has usually focused on foreground objects. In captioning concepts, there are two main objects for discussion: background object and foreground object. In contrast to the previous image-captioning research, generating captions from the geological images of rocks is more focused on the background of the images. This study proposed image captioning using a convolutional neural network, long short-term memory, and word2vec to generate words from the image. The proposed model was constructed by a convolutional neural network (CNN), long short-term memory (LSTM), and word2vec and gave a dense output of 256 units. To make it properly grammatical, a sequence of predicted words was reconstructed into a sentence by the beam search algorithm with K = 3. An evaluation of the pre-trained baseline model VGG16 and our proposed CNN-A, CNN-B, CNN-C, and CNN-D models used BLEU score methods for the N-gram. The BLEU scores achieved for BLEU-1 using these models were 0.5515, 0.6463, 0.7012, 0.7620, and 0.5620, respectively. BLEU-2 showed scores of 0.6048, 0.6507, 0.7083, 0.8756, and 0.6578, respectively. BLEU-3 performed with scores of 0.6414, 0.6892, 0.7312, 0.8861, and 0.7307, respectively. Finally, BLEU-4 had scores of 0.6526, 0.6504, 0.7345, 0.8250, and 0.7537, respectively. Our CNN-C model outperformed the other models, especially the baseline model. Furthermore, there are several future challenges in studying captions, such as geological sentence structure, geological sentence phrase, and constructing words by a geological tagger. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

12 pages, 4018 KB

Open AccessArticle

Face Anti-Spoofing Method Based on Residual Network with Channel Attention Mechanism

by Yueping Kong, Xinyuan Li, Guangye Hao and Chu Liu

Electronics 2022, 11(19), 3056; https://doi.org/10.3390/electronics11193056 - 25 Sep 2022

Cited by 13 | Viewed by 4135

The face recognition system is vulnerable to spoofing attacks by photos or videos of a valid user face. However, edge degradation and texture blurring occur when non-living face images are used to attack the face recognition system. With this in mind, a novel [...] Read more.

The face recognition system is vulnerable to spoofing attacks by photos or videos of a valid user face. However, edge degradation and texture blurring occur when non-living face images are used to attack the face recognition system. With this in mind, a novel face anti-spoofing method combines the residual network and the channel attention mechanism. In our method, the residual network extracts the texture differences of features between face images. In contrast, the attention mechanism focuses on the differences of shadow and edge features located on nasal and cheek areas between living and non-living face images. It can assign weights to different filter features of the face image and enhance the ability of network extraction and expression of different key features in the nasal and cheek regions, improving detection accuracy. The experiments were performed on the public face anti-spoofing datasets of Replay-Attack and CASIA-FASD. We found the best value of the parameter r suitable for face anti-spoofing research is 16, and the accuracy of the method is 99.98% and 97.75%, respectively. Furthermore, to enhance the robustness of the method to illumination changes, the experiment was also performed on the datasets with light changes and achieved a good result. Full article

(This article belongs to the Topic Applied Computer Vision and Pattern Recognition: 2nd Edition)

► Show Figures

Figure 1

Show export options Show export options

Select all

Export citation of selected articles as:

Displaying articles 1-67

Submit your Manuscript

Journal Name	Impact Factor	CiteScore	Launched Year	First Decision (median)	APC
Applied Sciences applsci	2.5	5.5	2011	19.8 Days	CHF 2400	Submit
Electronics electronics	2.6	6.1	2012	16.8 Days	CHF 2400	Submit
Machine Learning and Knowledge Extraction make	6.0	9.9	2019	25.5 Days	CHF 1800	Submit
Journal of Imaging jimaging	3.3	6.7	2015	15.3 Days	CHF 1800	Submit
Sensors sensors	3.5	8.2	2001	19.7 Days	CHF 2600	Submit

Submit your Abstract

Journal Name	Impact Factor	CiteScore	Launched Year	First Decision (median)	APC
Applied Sciences applsci	2.5	5.5	2011	19.8 Days	CHF 2400	Submit
Electronics electronics	2.6	6.1	2012	16.8 Days	CHF 2400	Submit
Machine Learning and Knowledge Extraction make	6.0	9.9	2019	25.5 Days	CHF 1800	Submit
Journal of Imaging jimaging	3.3	6.7	2015	15.3 Days	CHF 1800	Submit
Sensors sensors	3.5	8.2	2001	19.7 Days	CHF 2600	Submit

Back to TopTop