Article

Enhanced YOLO- and Wearable-Based Inspection System for Automotive Wire Harness Assembly

1 Guangzhou Industrial Intelligence Research Institute, Guangzhou 511458, China
2 College of Information Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China
3 Guangdong Machine Vision Industrial Inspection Engineering and Technology Research Centre, Guangzhou 511458, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2942; https://doi.org/10.3390/app14072942
Submission received: 1 February 2024 / Revised: 8 March 2024 / Accepted: 15 March 2024 / Published: 30 March 2024

Abstract

In response to the challenges of misassembly, omission, and low manual inspection efficiency in automobile wiring harness relay assembly, a novel online detection system has been engineered. The system combines a wearable, mobile visual imaging device that follows the worker's movements to acquire images and video with an improved YOLOv5-based detection algorithm, providing deep-learning-driven real-time detection and recognition to error-proof the installation of automotive wiring harness relays. The YOLOv5s model is augmented with an Adaptive Spatial Feature Fusion (ASFF) module, enhancing its multi-scale feature integration capability. A Global Context Network (GCNet) is incorporated into the C3 module to emphasize target information from a global perspective. Additionally, replacing standard convolution (Conv) modules with Global Sparse Convolution (GSConv) modules in the Neck effectively reduces computational cost while sustaining overall performance. The experimental results show that the detection system achieves an overall accuracy of 99.2% and an F1 score of 99.29. The system's high accuracy and stability enable flexible and intelligent object detection applications in the automotive industry.

1. Introduction

1.1. Motivation

As a crucial component of a vehicle’s structure, the automotive wiring harness is analogous to the human nervous system and vascular network, weaving throughout the entirety of the vehicle to coordinate and monitor the operation of various components. It consists of elements such as connectors, terminals, wires, fuse boxes, relays, cable ties, rubber parts, covering materials, and protective casings. As core control components in automobiles, relays serve roles as both circuit switches and protective devices. In traditional internal combustion engine vehicles, relays are extensively used for various functions such as control, ignition, air conditioning, lighting, windshield wipers, electronic fuel injection systems, fuel pumps, electric doors and windows, power seats, electronic instruments, and fault diagnosis systems. In new energy vehicles, the usage of wiring harnesses and relays significantly surpasses that in traditional vehicles. Particularly, the intricate interplay of power and information currents within new energy vehicles places higher demands on the reliability and electrical performance of wiring harnesses. Consequently, the assembly and quality requirements for wiring harnesses and relay products in new energy vehicles are expected to see a substantial increase.
In automotive wiring harness relay assembly, pre-fitted relays and wiring harnesses are delivered to the production line for installation. To improve efficiency and save installation time, workers install the harness relays while picking up and moving the assembly materials. As automotive technology advances and vehicle functionality expands, the wiring harness assembly process grows more complex and requires more supporting components, such as connectors, sensors, controllers, and relays. Most wiring harness assembly tasks still rely on manual completion [1,2,3]. Today, most automotive production lines use co-line manufacturing and mixed-flow assembly modes. During assembly, workers must inspect various types of information on the wiring harnesses for different vehicle models, including the wiring harness number, relay box labeling, and the type, number, size, and color of the relays; the inspection covers characters, colors, shapes, and more. Workers must also repeatedly confirm the unique identification text on the wiring harness to ensure the accuracy of the assembly work. However, given the multitude of relay types, relays for different car models often share identical dimensions and can be installed in the same location, creating a risk of missed or incorrect installations when visual verification alone is relied upon to match production instructions with actual markings. In a high-load production environment, the limited capacity of humans to process information forces assembly workers to concentrate on the assembly task itself. As a result, the inspection process is not only time consuming but also prone to errors. Particularly after prolonged periods of work, a worker's attention may wane, and visual inspection alone cannot guarantee product quality [4]. As new car models continue to proliferate, missed and erroneous inspections are becoming increasingly severe. Once defective products pass downstream in the production line, they can cause the automobile's electronic control system to malfunction. Consequently, there is an urgent need for a computer-vision-based detection system to help workers complete the inspection tasks in this phase efficiently. Intelligent inspection of automotive wiring harness assembly nevertheless poses many challenges: in complex motion scenarios, acquired images may suffer from blurring and scale variation, which leads to missed and incorrect detections and reduces inspection accuracy. To cope with these challenges, this study designs a deep-learning-based visual inspection system and optimizes the detection algorithm specifically to meet the practical requirements of industrial sites.

1.2. Related Work

In recent years, with the rapid advancement of artificial intelligence [5] and computer vision [6], the integration of machine vision and deep learning algorithms has spearheaded an era of innovative upgrades.
In the field of industrial production, an increasing number of visual detection systems employing deep learning are being applied [7,8,9]. The study presented in [10] explores the application of deep learning in the Automated Optical Inspection (AOI) of ceramic substrates in circuit images, aiming to develop an automated defect detection system capable of identifying types and locations of defects. Ref. [11] developed a novel automated computer vision algorithm and system for inspecting component replacements in printed circuit boards. To enhance the efficiency and accuracy of the assembly process, more automotive and wiring harness manufacturers are beginning to incorporate machine vision hardware and software, replacing traditional manual operations. Research studies [12,13,14] investigated new methods and systems of computer vision applied in the assembly process of automotive wiring harnesses. Typical industrial computer vision systems utilize stationary or motion modules and robotic arm-mounted industrial cameras to capture image information of the target objects. With the continuous development of smart manufacturing, wearable image acquisition devices are becoming increasingly popular in industrial applications due to their higher flexibility and stronger adaptability.
The vision inspection system for automotive wiring harness assembly needs to perform detection in tandem with the movements of the assembly personnel, requiring the system’s detection algorithm to adapt well to dynamic scenes. In the realm of deep-learning-based detection of targets in motion, the industrial sector primarily employs two types of object detection algorithms: the two-stage object detection algorithms, exemplified by Faster Region-Convolutional Neural Network (R-CNN) as proposed by Ren et al. [15]; and the single-stage object detection algorithms, represented by YOLO, as introduced by Redmon et al. [16].
The single-stage object detection algorithm, with its streamlined network architecture, achieves a remarkably high detection speed. It reaches a frame rate of 45 on VOC2007, significantly surpassing Faster R-CNN, which only achieves a frame rate of 7. This efficiency has led to its extensive application across various scenarios [17,18,19]. Regarding enhanced single-stage target detection algorithms tailored for specific tasks, refs. [20,21,22] introduce the attention module and feature fusion module, enhance the multi-scale feature network, modify the Backbone network, and employ other measures to improve the YOLO model and enhance the performance of the detection algorithm.
The primary contribution of this paper lies in the design of a wearable vision inspection system for automotive wiring harness assembly. This system assists workers in efficiently completing the inspection tasks of wiring harness assembly without adding extra workload. It is specifically intended for the detection of various assembly components, including the automotive wiring harness number, relay box labeling, relay types, the number of relays, and the color and size of the relays, focusing on aspects such as characters, colors, and shapes. This paper investigates the performance of various mainstream networks and finds that the YOLOv5s [23] network has significant advantages in handling complex motion scenarios and dynamic multi-object detection; YOLOv5s was therefore chosen as the base network. To address issues specific to automotive wiring harness assembly scenarios, such as image motion blur and significant variations in target image sizes, key modules of the YOLOv5s network were optimized. Firstly, the Detect module of YOLOv5s was integrated with the Adaptive Spatial Feature Fusion (ASFF) network [24], effectively utilizing features at different scales. This integration better captures the intricate details of the assembly components, significantly mitigating the loss of target feature information. Secondly, the C3 module incorporates the Global Context Network (GCNet) [25], which draws on the strengths of Non-Local Networks and Squeeze-and-Excitation Networks (SENet) [26]. This enhancement bolsters the model's understanding of global relationships, expands its perceptual range, and adapts more effectively to changes in scale. Finally, to balance the overall performance of the model, Global Sparse Convolution (GSConv) [27] replaces traditional convolution modules in the Neck, reducing the model's computational and parameter requirements. Additionally, data augmentation techniques were employed to increase the diversity of training samples, further enhancing the network's generalization capabilities. Through these improvements, the overall detection effectiveness of the YOLOv5s network for wiring harness relay assembly has been significantly enhanced. The system accurately identifies the characteristics of assembly components such as automotive wiring harness numbers, relay box labels, types of relays, number of relays, and the size and color of the relays. In practical applications, the system has demonstrated excellent performance, increasing production efficiency and product quality. Whereas manual assembly of a wiring harness previously took an average of 55 s, assembly with the inspection system in place takes an average of 43 s, improving assembly efficiency by 21.8% while achieving 100% full inspection.
The remainder of this paper is structured as follows: Section 2 details the system architecture of the automotive wiring harness relay detection system, with a special emphasis on the improvements made to the YOLO network and data collection and processing efforts. Section 3 presents the experimental data and performance evaluation of the model improvements through various experiments. Section 4 analyzes the results and demonstrates the effectiveness of the algorithmic model in detection. Section 5 concludes the paper and offers an outlook on potential future research directions.

2. Materials and Methods

2.1. Detection System Device

2.1.1. Detection System Introduction

To meet the demands of rapid and accurate relay installation on automotive assembly lines, the automotive wiring harness assembly vision inspection system employs electronic glasses as a wearable image acquisition device for the real-time monitoring of workers’ relay installation process. Under the premise of not interfering with the workers’ normal assembly workflow and operations, the system incorporates “electronic eyes”. This allows workers to focus on their manual assembly actions, delegating the more detailed and efficient inspection tasks to the “electronic eyes”. The system, based on the assembler’s first-person perspective, captures images of the assembly state, actions, and materials during the process. These images are then transmitted wirelessly to a vision inspection server for processing and analysis. Ultimately, the system outputs assembly prompts and quality assessment results, verifying whether the workers’ operations conform to the technical requirements. The workers can then use these results to make corrections and standardize their assembly actions, ensuring the completion of the assembly correctly. The implementation of this system helps prevent missed and incorrect installations, enhances quality assurance, and achieves flexible inspection in industrial assembly operations.
The work cycle for the automotive wiring harness assembly station is 55 s, characterized by a high frequency of operations, tight working hours, and limited surrounding space. The wiring harnesses for relay installation are in irregular states, and the installation positions vary. The automotive wiring harnesses and relays are shown in Figure 1.
The vision detection system for automotive wiring harness assembly utilizes electronic glasses as the imaging device. This device, weighing a total of 120 g (including the battery), employs advanced optical imaging technology. It is designed with various wearing options such as a duckbill cap, safety helmet, and protective goggles, offering high comfort for all-day wear and adaptable to the needs of various work environments.
The electronic glasses use a hot-swappable battery power supply, meeting 24 h working requirements with one lithium battery in use and one as backup. Equipped with up to 4K 30 fps video streaming and a 12-megapixel autofocus camera, they fulfill the needs of image information collection. With an 8-core 2.52 GHz CPU, 64 GB of storage, 6 GB of RAM, and the Android 9.0+ operating system, they can handle high-performance processing demands in industrial scenarios. A 28° field of view with two-axis vision adjustment meets the requirement for a wide collection range. The WiFi network connection, combined with a light waveguide screen and voice, touchpad, and button controls for display and operation, makes the device user friendly. The wearable electronic glasses image acquisition device is shown in Figure 2.

2.1.2. Detection System Composition Framework

Figure 3 shows the complete framework of the vision inspection system for automotive wiring harness assembly. The physical system hardware is shown in Figure 4.
The main core components of the detection system include:
  • Electronic Glasses Imaging Device: three sets of electronic glasses image acquisition devices, meeting the usage requirements of three assembly stations simultaneously, are used to capture image data of automotive wiring harnesses and relay assemblies.
  • Manufacturing Execution System (MES): issues production data and tasks.
  • Server: Facilitates data exchange with the Manufacturing Execution System (MES) on the production line. This server is responsible for receiving, processing, and storing product data from the MES system and image data from client-side inspections. It also provides relevant data and inspection results for the MES system on the production line.
  • Client: Consists of an industrial control computer, a monitor, and an alarm. Built on Qt 6.5, it provides an intuitive, easy-to-use human–computer interaction interface for the detection and recognition of assembly images, system operation, and the display of inspection results and assembly operation warnings. Through the user-friendly interface design, the operator can easily operate the system, view the inspection results, and receive warning messages.
  • Peripheral Hardware Support: consists of a wireless router and industrial power supply, responsible for network signal exchange and transmission and reception of wireless signals, ensuring the full functionality of the system.

2.2. Model Selection

2.2.1. YOLOv5 Network

YOLOv5 is an upgraded version of the target detection algorithm of YOLOv4 [28], released by Ultralytics LLC. Compared with YOLOv4, YOLOv5 has a smaller weight file, shorter training and inference time, and performs well in terms of average detection accuracy. Depending on the width and depth of the network, YOLOv5 includes multiple versions of models such as YOLOv5s, YOLOv5l, YOLOv5m, and YOLOv5x. Among them, YOLOv5s has the smallest model, the fastest detection speed, and high detection accuracy; so, it was selected as the basic detection algorithm.
The YOLOv5s algorithm structure comprises four main parts: the input, Backbone, Neck, and Head. The input utilizes a Mosaic data augmentation strategy, enhancing the detection generalization capability for background information and multi-scale targets through random arrangement, random cropping, and color adjustment. The optimal anchor box values are calculated using the K-means adaptive clustering algorithm, effectively reducing the computational load on the model. The Backbone network uses the CSPDarknet53 framework, consisting of Conv, C3, and Spatial Pyramid Pooling Fast (SPPF) segments. The Conv and C3 layers enhance feature extraction through connectivity, while the C3 module avoids gradient vanishing by separating and fusing features of different scales. The SPPF structure, with its three maximum feature pooling layers, strengthens the network's perception of images and its ability to distinguish feature information. The Neck network employs the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) feature pyramid structures: FPN merges semantic feature information along a top–down path, whereas PANet transfers strong localization features from lower to higher levels. Finally, the prediction part of YOLOv5s includes three prediction layers at different scales, tailored for detecting large, medium, and small objects. After object detection, the final results are obtained through non-maximum suppression processing.
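To make the Backbone description concrete, the following is a minimal PyTorch sketch of an SPPF-style block of the kind used in YOLOv5-type networks: three successive max-pooling layers whose outputs are concatenated with the input projection. The ConvBNSiLU helper, channel reduction, and kernel size of 5 are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv -> BatchNorm -> SiLU, the basic convolution unit used throughout YOLOv5-style models."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three successive max-pool layers whose outputs
    are concatenated with the projected input, enlarging the receptive field."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, 1, 1)
        self.cv2 = ConvBNSiLU(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

For a 512-channel Backbone output, `SPPF(512, 512)` keeps the spatial size unchanged while fusing pooled context from progressively larger receptive fields.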

2.2.2. Improved YOLOv5s Network

Although YOLOv5s demonstrates excellent performance in various object detection tasks, it struggles with multi-scale image changes in motion scenarios and the high accuracy demands of industrial applications. Therefore, improvements were made to its network model. Firstly, an adaptive feature fusion network was added to Detect, effectively handling objects and scenes of different scales. This network assigns appropriate weights to different scale feature maps, enabling more comprehensive semantic information fusion, and improving the algorithm’s detection accuracy without increasing computational load or model parameters. Secondly, to enhance the model’s understanding of global relationships and its perceptual range, and to better adapt to varying image scales, the C3 layer in the Backbone network was upgraded to a C3GC module, integrating a global context network module. Finally, the Conv convolution is replaced with the lightweight convolutional neural network GSConv at the Neck of the network to mitigate the amount of model computation and number of parameters and balance the overall model performance. The improved YOLOv5s network structure is shown in Figure 5.

2.2.3. Adaptive Spatial Feature Fusion Seeks Better Fusion Solutions

During the image acquisition process using wearable electronic glasses, the assembly workers continually engage in actions such as bending, reaching, and plugging, causing the distance of the hand in the field of view to vary. This leads to changes in image size and the formation of target images at different scales. The Adaptively Spatial Feature Fusion (ASFF) network utilizes spatial filtering to suppress inconsistencies in spatial features across different scales during fusion, retaining only useful information for combination. By adaptively fusing deep and shallow features, ASFF effectively utilizes feature information at different scales, significantly mitigating the loss of target feature information. In YOLOv5, three different levels of feature information, each with different resolutions and channel numbers, are integrated using ASFF. ASFF_Detect, which replaces the original Detect module of YOLOv5, first aligns the features of other layers to the same resolution and channel number before fusing and training them together, aiming to find the optimal fusion solution. These adaptively fused features from different layers ensure that conflicting information is filtered out, while more discriminative information is retained and dominates. The general workflow of ASFF is illustrated in Figure 6.
ASFF requires the integration of feature information from three different levels, and the first step is to adjust them to the same size. Therefore, changes are needed in both up-sampling and down-sampling for each scale. For up-sampling, 1 × 1 convolution is initially used to convert the channel numbers of other layers to that of the current layer, followed by interpolation to increase the resolution. For 1/2 down-sampling, a 3 × 3 convolution with stride = 2 is used to modify the channel numbers. For 1/4 down-sampling, a layer of maximum pooling with stride = 2 is added before convolution. Once adjusted to a uniform size, they need to be integrated and trained. The formula for integrating features from the three different levels is as follows:
$$ y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \to l} + \beta_{ij}^{l}\, x_{ij}^{2 \to l} + \gamma_{ij}^{l}\, x_{ij}^{3 \to l}, \qquad \alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1] $$
In the formula, $x_{ij}^{n \to l}$ denotes the feature at position $(i, j)$ of the level-$n$ feature map after it has been resized to the resolution and channel number of level $l$; $y_{ij}^{l}$ is the final output feature map; and $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$ are the weights for integrating the features of the three levels, satisfying $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$.
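As a concrete illustration of the fusion formula above, the sketch below shows a simplified ASFF-style fusion layer in PyTorch. It assumes the three level features have already been resized to a common resolution and channel count, as described for the up- and down-sampling steps, and that the per-pixel weights α, β, and γ are produced by 1 × 1 convolutions followed by a softmax over the three levels so that they sum to 1. Module and variable names are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusion(nn.Module):
    """Adaptively fuses three same-sized feature maps with learned per-pixel weights
    alpha, beta, gamma that are non-negative and sum to 1 (softmax over the levels)."""
    def __init__(self, channels, weight_channels=16):
        super().__init__()
        # 1x1 convs compress each level into a small embedding used to predict its weight map.
        self.weight_level = nn.ModuleList(
            [nn.Conv2d(channels, weight_channels, kernel_size=1) for _ in range(3)]
        )
        # Predict one weight logit per level at every spatial position.
        self.weight_logits = nn.Conv2d(weight_channels * 3, 3, kernel_size=1)

    def forward(self, x1, x2, x3):
        # x1, x2, x3: (B, C, H, W) features from three levels, already resized to match.
        levels = [x1, x2, x3]
        embeddings = [conv(x) for conv, x in zip(self.weight_level, levels)]
        logits = self.weight_logits(torch.cat(embeddings, dim=1))      # (B, 3, H, W)
        weights = F.softmax(logits, dim=1)                              # alpha, beta, gamma
        fused = sum(weights[:, i:i + 1] * levels[i] for i in range(3))  # per-pixel weighted sum
        return fused
```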

2.2.4. Global Context Network Enhances Feature Extraction Modules

Traditional convolutional neural networks use local receptive fields for convolution operations in image processing. A limitation of this approach is that it can only model a limited region of an image, leading to a restricted receptive field. This is particularly problematic in scenarios involving targets with distant correlations, as the convolutional kernel can only observe a portion of the image within its convolution range, making it challenging to enhance the detection capability for long-range dependencies. To strengthen the model’s understanding of global relationships and expand its perceptual range, thereby better adapting to changes in scale, this paper integrates the Global Context Network (GCNet) into the C3 module, forming a combined C3GC module to replace the original C3 module.
GCNet is a global context modeling framework inspired by the advantages of Non-Local Networks and Squeeze-and-Excitation Networks (SENet). It combines the global context modeling capabilities of Non-Local Networks with the computational efficiency of SENet.
Non-local uses non-local operations to establish remote dependency relationships between pixels in an image, enabling global attention. This approach allows the model to transcend the limitations of local association points. Additionally, it leverages an attention mechanism to generate globally attentive features, thereby acquiring more comprehensive information. This structure aims to enhance the model’s perceptual range of global relationships, making it more suitable for processing image features of varying scales. The output of a Non-Local block can be expressed as follows:
$$ Z_i = X_i + W_z \sum_{j=1}^{N_p} \frac{f(X_i, X_j)}{C(x)}\, W_v X_j , $$
where $i$ is the index of the query location and $j$ enumerates all possible locations, $f(X_i, X_j)$ represents the relationship between positions $i$ and $j$ with normalization factor $C(x)$, and $W_z$ and $W_v$ are linear transformation matrices (e.g., 1 × 1 convolutions). For simplicity, $f(X_i, X_j)/C(x)$ denotes the normalized pairwise relationship between locations $i$ and $j$.
However, the Non-local structure calculates an attention distribution for each pixel, wasting a large amount of computing resources. GCNet therefore retains its advantages while simplifying the structure, which can be expressed as follows:
$$ Z_i = X_i + W_z \sum_{j=1}^{N_p} \frac{\exp(W_k X_j)}{\sum_{m=1}^{N_p} \exp(W_k X_m)}\, X_j , $$
The simplified Non-local block structure is shown in Figure 7a. The entire GCNet block can be simply expressed as follows:
$$ Z_i = F\!\left( X_i,\; \delta\!\left( \sum_{j=1}^{N_p} a_j X_j \right) \right) , $$
where $\sum_{j=1}^{N_p} a_j X_j$ is the context modeling (weighting) module, $\delta(\cdot)$ is the feature transform, and $F(\cdot,\cdot)$ aggregates the global context features with the features at each position.
GCNet also incorporates the advantages of the SENet structure, optimizing the computational load while maintaining the basic speed of the model, thereby enhancing its accuracy. The architecture of the SE block is illustrated in Figure 7b. Ultimately, the detailed architecture of the Global Context (GC) block is shown in Figure 7c, with the formula being as follows:
$$ Z_i = X_i + W_{v2}\, \mathrm{ReLU}\!\left( \mathrm{LN}\!\left( W_{v1} \sum_{j=1}^{N_p} \frac{\exp(W_k X_j)}{\sum_{m=1}^{N_p} \exp(W_k X_m)}\, X_j \right) \right) , $$
where $\frac{\exp(W_k X_j)}{\sum_{m=1}^{N_p} \exp(W_k X_m)}$ is the weight of the context modeling module and $W_{v2}\,\mathrm{ReLU}(\mathrm{LN}(W_{v1}\,\cdot))$ is the channel transform module, in which ReLU is the nonlinear activation function and LN denotes layer normalization. In the transformation part of the simplified Non-local module, the Bottleneck structure from the SE module is integrated, and layer normalization is used to address optimization issues. The context modeling part retains the structure of the simplified Non-local module. This approach not only captures the long-range dependencies of Non-local for adapting features but also, like the SE module, reduces the computational load. It addresses the loss of feature diversity during extraction, thereby enhancing detection accuracy.
Figure 7. GCNet modules and components (C is the number of feature map channels, r denotes the reduction ratio, H is the height, W is the width; $W_z$ and $W_k$ denote linear transformation matrices; $W_{v1}$ and $W_{v2}$ are the weight parameters of the two convolutional layers). (a) Simplified Non-local module structure. (b) SE module structure. (c) GCNet module structure.
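Following the GC block formula and the structure in Figure 7c, the sketch below is a minimal PyTorch rendering of a Global Context block: softmax attention pooling over all positions for context modeling, followed by a 1 × 1 convolution, layer normalization, ReLU, and a second 1 × 1 convolution (the bottleneck transform), whose output is added back to the input. The reduction ratio and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Global Context (GC) block: softmax attention pooling over all positions,
    then a bottleneck channel transform (1x1 conv -> LayerNorm -> ReLU -> 1x1 conv),
    added back to the input feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = channels // reduction
        self.context_mask = nn.Conv2d(channels, 1, kernel_size=1)   # W_k: one attention logit per position
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),              # W_v1
            nn.LayerNorm([hidden, 1, 1]),                            # LN on the pooled context vector
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),              # W_v2
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Context modeling: softmax over all H*W positions, then weighted sum of features.
        mask = self.context_mask(x).view(b, 1, h * w)                # (B, 1, HW)
        mask = torch.softmax(mask, dim=-1).unsqueeze(-1)             # (B, 1, HW, 1)
        feats = x.view(b, c, h * w).unsqueeze(1)                     # (B, 1, C, HW)
        context = torch.matmul(feats, mask).view(b, c, 1, 1)         # (B, C, 1, 1) global context
        # Transform and fuse: broadcast-add the transformed context to every position.
        return x + self.transform(context)
```

One plausible way to form the C3GC module described above is to insert such a block into the C3 module's residual path; the exact placement used by the authors is not specified here.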

2.2.5. Network Lightweight

Global Sparse Convolution (GSConv) integrates the advantages of both dense and sparse convolutions. By introducing global sparsity into feature maps, it effectively aggregates global information. Compared with traditional convolutions, GSConv exploits global sparsity and performs convolution calculations only at sparse locations, significantly reducing the computational load and the number of parameters. However, if GSConv were used in all stages of the algorithm, the network layers would become deeper, slowing down the network's operation. After passing through the Backbone network, the input image reaches its maximum channel dimension and minimum width and height; using GSConv at this point increases the network's inference speed. Therefore, GSConv is used only in the Neck layer. The structure of GSConv is shown in Figure 8. Let the number of input channels be C1 and the number of output channels be C2. The input, after passing through Conv, generates a feature map with C2/2 channels, and then, through DepthWise Convolution (DWConv), another feature map with C2/2 channels is obtained. These two feature maps are concatenated and then shuffled so that the generated information permeates every part of the feature map.
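To illustrate the GSConv structure described above and shown in Figure 8 (a standard convolution producing C2/2 channels, a depth-wise convolution producing another C2/2 channels, concatenation, then a channel shuffle), here is a minimal PyTorch sketch; the ConvBNSiLU unit, kernel sizes, and shuffle implementation are illustrative assumptions rather than the reference code.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Standard convolution unit: Conv -> BatchNorm -> SiLU."""
    def __init__(self, c_in, c_out, k=3, s=1, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class GSConv(nn.Module):
    """GSConv as described in the text: a standard conv producing C2/2 channels,
    a depth-wise conv producing another C2/2 channels, concatenation, then a
    channel shuffle so information permeates both halves of the feature map."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = ConvBNSiLU(c_in, c_half, k, s)                        # standard convolution branch
        self.depthwise = ConvBNSiLU(c_half, c_half, 5, 1, groups=c_half)   # depth-wise convolution branch

    def forward(self, x):
        x1 = self.dense(x)
        x2 = self.depthwise(x1)
        y = torch.cat([x1, x2], dim=1)                                     # (B, C2, H, W)
        # Channel shuffle: interleave the two halves of the concatenated channels.
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```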

2.3. Dataset

While the assembly operation was in progress, staff used the eyewear camera to video the entire process. The e-glasses captured images with an original size of 600 × 600 pixels at 30 frames per second. Frames were then selected from the video and labeled to ensure image quality, producing the required image dataset. The images contain five items to be inspected, covering 12 labels such as the relay harness sticky notes and the colors and characters marked on the main box. The Labeling tool was used for annotation, and 20,947 images were labeled. To simulate the data patterns of real assembly motion scenes, this paper uses the Albumentations data augmentation library to apply motion blur, rotation, brightness adjustment, and other transformations to the labeled dataset, improving the model's feature extraction for target detection in motion scenes and its adaptability to motion-blurred images. The augmented dataset consists of 83,788 images, of which 90% are used as the training set and the remaining 10% as the validation set. A portion of the dataset images is shown in Figure 9.
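The augmentation step can be sketched with the Albumentations library as follows; the specific transform parameters, probabilities, and file names are illustrative assumptions, not the exact settings used to produce the 83,788-image dataset.

```python
import albumentations as A
import cv2

# A hypothetical augmentation pipeline mirroring the transforms named in the text:
# motion blur, rotation, and brightness adjustment, with YOLO-format bounding boxes
# kept consistent with the augmented images.
transform = A.Compose(
    [
        A.MotionBlur(blur_limit=9, p=0.5),                 # simulate camera/hand motion blur
        A.Rotate(limit=15, border_mode=cv2.BORDER_CONSTANT, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.2, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("harness_frame.jpg")                    # hypothetical input frame
bboxes = [[0.48, 0.52, 0.20, 0.15]]                        # YOLO format: x_center, y_center, w, h (normalized)
class_labels = ["relay_label"]                             # hypothetical label name

augmented = transform(image=image, bboxes=bboxes, class_labels=class_labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]
```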

3. Results

3.1. Model Training

The experimental environment is based on the Ubuntu 20.04.5 LTS operating system, with a high-performance 12th Gen Intel(R) Core(TM) i9-12900K processor at 2.60 GHz, 126 GB of RAM, a total of 40 TB of disk capacity, and four NVIDIA RTX 3090 GPUs with a combined 96 GB of video memory, providing powerful computational support for training the deep learning models. The PyTorch 1.12.0 framework was used as the main development tool for training the network model. A batch size of 40 was used, and training ran for a total of 300 epochs. To ensure the stability and efficiency of training, the initial learning rate was set to 0.01 and the momentum to 0.8; apart from these parameters, all hyperparameters were kept consistent across experiments. The SGD optimizer was chosen to optimize the network weights in order to achieve better training results.
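For reference, the stated optimizer settings correspond to a standard PyTorch SGD training loop such as the hedged sketch below; `model`, `train_dataset`, and `compute_loss` are placeholders for the improved YOLOv5s network, the harness dataset, and the YOLO loss, and only the batch size of 40, the 300 epochs, the learning rate of 0.01, and the SGD momentum come from the text.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, compute_loss, device="cuda"):
    # Placeholders: the dataset is assumed to yield (image batch, target) pairs
    # already collated for the detection loss.
    loader = DataLoader(train_dataset, batch_size=40, shuffle=True, num_workers=8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.8)  # settings from Section 3.1

    model.to(device).train()
    for epoch in range(300):                       # 300 training epochs
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = compute_loss(model(images), targets)
            loss.backward()
            optimizer.step()
```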

3.2. Evaluation Metrics

To measure the effectiveness of the algorithm, the experiments in this paper use precision (P), recall (R), mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5 (mAP@0.5), and the F1 score as evaluation metrics to assess the model's performance. Here, P is the proportion of correctly predicted positive samples (True Positives, TP) out of all predicted positive samples (TP + False Positives, FP). R is the proportion of correctly predicted positive samples (TP) out of all labeled positive samples (TP + False Negatives, FN). Mean Average Precision (mAP) is related to both precision and recall, with mAP@0.5 calculated as the average of the AP of each category across all images at an IoU threshold of 0.5. The specific calculation methods for these metrics are as follows:
$$ P = \frac{TP}{TP + FP} , $$
$$ R = \frac{TP}{TP + FN} , $$
$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i , $$
$$ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} . $$
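As a worked illustration of these formulas, the short sketch below computes P, R, and F1 from raw TP/FP/FN counts and averages per-class AP values into mAP; the counts and AP values are hypothetical, not results from this paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from detection counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

def mean_average_precision(ap_per_class):
    """mAP@0.5: average of per-class AP values computed at an IoU threshold of 0.5."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical example: 992 true positives, 8 false positives, 6 false negatives.
p, r, f1 = precision_recall_f1(tp=992, fp=8, fn=6)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")
print(f"mAP@0.5={mean_average_precision([0.995, 0.991, 0.994]):.3f}")
```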

3.3. Comparison of Detection Performance of Different Algorithms

To validate the impact of ASFF_Detect and GSConv on the algorithm, ablation experiments were designed to analyze different improvements through three experimental setups. Table 1 presents a comparative experiment adding the C3GC module, ASFF_Detect module, and GSConv to the YOLOv5s network structure. Here, YOLOv5s_AS denotes replacing the Detect module of YOLOv5s with the ASFF_Detect module. YOLOv5s_ASG indicates replacing the C3 module in the Backbone network of YOLOv5s_AS with the C3GC module. YOLOv5s_ASGG refers to replacing the Conv module in the Neck of YOLOv5s_ASG with the GSConv module.
The original YOLOv5s model achieved 97.6% accuracy, 98.9% mAP, and a 97.99 F1 score on our dataset. By improving the model in a targeted manner, we achieve an overall improvement in performance. Specifically, introducing the ASFF_Detect module raised accuracy by 0.4%, mAP by 0.2%, and the F1 score by 0.46; fusing the C3GC module on top of this raised accuracy by a further 1.3%, mAP by a further 0.4%, and the F1 score by a further 0.85. These two stages of enhancement demonstrate the effectiveness of both the ASFF_Detect module and the C3GC module in strengthening the network's feature extraction and global context understanding. Further, replacing the Conv module in the Neck with the GSConv module reduces the number of parameters and the model size by 2% and 3.8%, respectively, effectively optimizing the model's computational efficiency and resource consumption, even though the accuracy decreases slightly by 0.1% to 99.2%.
Combining all the improvements, the final network achieves a mean average precision (mAP) of 99.4%, an accuracy (P) of 99.2%, and an F1 score of 99.29, which are improvements of 0.5%, 1.6%, and 1.3, respectively, over the base YOLOv5s model. These experimental results fully demonstrate that introducing the C3GC and ASFF_Detect modules significantly improves the network's detection performance in complex scenarios, while GSConv achieves significant savings in model size and computation without compromising performance. These results validate the effectiveness and practicality of our proposed improved algorithm in motion-scenario tasks such as automotive wire harness assembly error detection.
To further validate the advantages and performance of the improved YOLOv5s algorithm proposed in this paper, a comparison was conducted with other common algorithms. Under the same experimental conditions, a performance comparison was made with classic algorithms in the field of object detection (YOLOv3, YOLOv5s, YOLOv7, and YOLOv5x). The experimental results are shown in Table 2.
According to Table 2, the YOLOv5s-ASGG algorithm shows higher performance compared to the following five algorithms: YOLOv3, YOLOv7, YOLOv5s, YOLOv5x, and YOLOv8n. Specifically, the accuracy (P) of YOLOv5s-ASGG is higher by 3.3%, 13%, 1.6%, 1.7%, and 7%, respectively. The mean Average Precision (mAP) is higher by 0.8%, 4.2%, 0.5%, 0.3%, and 1.7%, respectively. The recall rate (R) is higher by 0.9%, 5.2%, 1%, 1.6%, and 1.8%, respectively. The F1 scores are higher by 2.11, 9.27, 1.3, 1.64, and 4.47, respectively. Overall, the performance metrics of YOLOv5s-ASGG are significantly better than those of these five algorithms.

3.4. Comparison of Actual Detection Results

From the original dataset, six types of images were randomly selected for detection using both YOLOv5s and the improved algorithm under the same experimental conditions. The comparison of the detection results is shown in Figure 10. As can be seen from the figure, the YOLOv5s algorithm misses detections of the relay base number: the mildly blurred characters "BD" cannot be detected, whereas the improved YOLOv5s-ASGG algorithm detects them well. In the detection of the wiring harness number, YOLOv5s confuses the T6 and T5 labels, while the improved YOLOv5s-ASGG algorithm achieves better detection results. The comparison of actual detection results shows that the improved YOLOv5s-ASGG algorithm has better detection ability and adaptability when facing blurred images and images that are difficult to detect due to changes in viewing angle, achieving an accuracy of 99.2% on the 8379 images of the validation set.

4. Discussion

In response to issues such as incorrect assembly, missing components, and low efficiency of manual inspections in the automotive wiring harness relay assembly process, a vision inspection system based on an improved YOLOv5s model was designed. The enhanced YOLOv5s model used in this study significantly improved the capability to recognize types, colors, and characters of relays in motion scenarios. This enhancement has led to improved assembly quality and reduced labor intensity. This section focuses on discussing the research results, potential limitations of this study, and suggestions for future research.
In this paper, the ASFF_Detect module replaces the original Detect module of YOLOv5s, making full use of feature information at different scales to address the issue of lost target feature information. By adaptively fusing deep and shallow features, the overall performance of the model is enhanced. To improve the model’s understanding of global relationships and expand its perceptual range, this paper integrates the GCNet with the C3 module, proposing using the C3GC module to replace the original C3 module. This effectively increases the model’s detection accuracy. Finally, to meet the criteria for industrial applications, GSConv is used in place of the original Conv method in the Neck, reducing the model’s parameter count and increasing detection speed.
The experimental results demonstrate that the use of ASFF_Detect achieved an accuracy of 98.0%, a 0.4% increase compared to YOLOv5s; the mAP reached 99.1%, a 0.2% improvement; and the F1 score improved to 98.45, an increase of 0.46. This indicates that ASFF_Detect can comprehensively enhance network performance. Building upon this, the use of C3GC achieved an even higher accuracy of 99.3%, a 1.7% improvement over YOLOv5s; the mAP increased to 99.5%, up by 0.6%; and the F1 score rose to 99.30, an increase of 1.31. These results show that YOLOv5s-ASG delivers the highest raw accuracy, significantly enhancing the baseline model. With the replacement of the Neck Conv with GSConv, the accuracy was 99.2%. Although this is a slight decrease of 0.1% compared to YOLOv5s-ASG, the model's parameter count and size were reduced by 2% and 3.8%, respectively. Therefore, YOLOv5s-ASGG achieves a reduction in model size and computational load with a minimal decrease in accuracy, further enhancing the overall performance of the network.
The algorithm proposed in this paper is thoroughly compared with the original YOLOv5s algorithm and other mainstream algorithms. While YOLOv3 performs well in multi-target detection tasks, its accuracy is relatively low when dealing with small targets such as automotive wire harness relays. On the other hand, although YOLOv5x provides high accuracy as a large model in the YOLOv5 family, it does not match the improved YOLOv5s algorithm in this paper in real-time performance, especially in environments with limited hardware resources. In addition, YOLOv7 and YOLOv8n, as newer members of the YOLO family, demonstrate excellent detection speed and accuracy; however, in the specific automotive harness assembly scenario, the improved YOLOv5s-ASGG algorithm provides a more suitable solution through targeted optimization. The YOLOv5s-ASGG algorithm shows superior performance in key metrics such as mAP, P, and R. Compared with the original YOLOv5s algorithm, the mAP of YOLOv5s-ASGG improves by 0.5% to 99.4% and the accuracy improves by 1.6% to 99.2%, while the per-class AP values and the F1 score also improve. In comparison with the other mainstream algorithms (YOLOv3, YOLOv7, YOLOv5x, and YOLOv8n), YOLOv5s-ASGG is higher in accuracy P by 3.3%, 13%, 1.7%, and 7%; in mean average precision mAP by 0.8%, 4.2%, 0.3%, and 1.7%; in recall R by 0.9%, 5.2%, 1.6%, and 1.8%; and in F1 score by 2.11, 9.27, 1.64, and 4.47, respectively. In the comparison of actual detection results, YOLOv5s-ASGG also shows excellent performance. In summary, the algorithmic model proposed in this paper fully meets the requirements of target detection in automotive wire harness assembly scenarios and shows clear advantages across all metrics.

5. Conclusions

Distinguished from traditional stationary collection methods, this system’s wearable mobile acquisition approach offers high flexibility and adaptability for industrial applications. Without adding extra assembly or inspection processes, it ensures accurate detection of every wiring harness and relay. The effectiveness and reliability of this system have been verified through result analysis. In terms of detection methods, the system has transitioned from random sampling to 100% comprehensive inspection, and with an overall detection accuracy rate of 99.2%, it accurately identifies various quality issues, effectively preventing quality problems from escalating. Moreover, manual inspection consumes a significant amount of time and effort and is susceptible to fatigue and human error. This system substantially reduces the workload and intensity of manual inspections. Beyond improving detection efficiency and accuracy, the use of this system also represents a digital upgrade in enterprise production management and control. The digital reporting of inspection results allows businesses to better understand production conditions, promptly identify and resolve issues, optimize production processes and management methods, and enhance production efficiency and product quality.
Looking ahead, we will continue to upgrade and optimize the system algorithms, for example, by introducing cutting-edge deep learning architectures such as Transformers and Capsule Networks to further improve detection accuracy and overall system performance. At the same time, multimodal data fusion will be actively explored, such as combining infrared, radar, and wearable sensor data, to enhance the robustness and reliability of the system in complex environments. In addition, to enhance the flexibility and scalability of the system and lower the threshold for adopting new technology, we will explore modular deployment options; for example, by packaging the training tool module, new models can be trained and deployed quickly when they are introduced. Finally, in terms of technology upgrades and innovation, combining the real-time data-processing capabilities of edge devices with the complex analysis and optimization capabilities of cloud resources is expected to enable faster decision making and higher system performance. This should provide automotive manufacturers with a vision inspection solution that is both cost effective and technologically advanced, helping them achieve technological innovation and improve production intelligence.

Author Contributions

Conceptualization, M.Y. and S.L.; methodology, S.L.; software, F.C.; validation, M.Y., W.W. and S.L.; formal analysis, S.L.; investigation, S.L., Y.Z. and X.M.; resources, W.W.; data curation, Y.Z. and S.L.; writing—original draft preparation, S.L.; writing—review and editing, M.Y., W.W. and H.S.; visualization, S.L.; supervision, M.Y.; project administration, W.W. and F.C.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Trommnau, J.; Kühnle, J.; Siegert, J.; Inderka, R.; Bauernhansl, T. Overview of the state of the art in the production process of automotive wire harnesses, current research and future trends. Procedia CIRP 2019, 81, 387–392. [Google Scholar] [CrossRef]
  2. Gannon, M. Connector Tips. 2019. Available online: https://www.connectortips.com/making-connector-assembly-safer-andmore-efficient-with-workplace-ergonomics/ (accessed on 9 June 2022).
  3. Heisler, P.; Utsch, D.; Kuhn, M.; Franke, J. Optimization of wire harness assembly using human–robot-collaboration. Procedia CIRP 2021, 97, 260–265. [Google Scholar] [CrossRef]
  4. Zheng, L.; Liu, X.; An, Z.; Li, S.; Zhang, R. A smart assistance system for cable assembly by combining wearable augmented reality with portable visual inspection. Virtual Real. Intell. Hardw. 2020, 2, 12–27. [Google Scholar] [CrossRef]
  5. Goodfellow, I.; Bengio, Y.; Courville, A.; Bengio, Y. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  6. Szeliski, R. Computer Vision: Algorithms and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  7. Pérez, L.; Rodríguez, Í.; Rodríguez, N.; Usamentiaga, R.; García, D.F. Robot guidance using machine vision techniques in industrial environments: A comparative review. Sensors 2016, 16, 335. [Google Scholar] [CrossRef] [PubMed]
  8. Kostal, P.; Prajova, V.; Vaclav, S.; Stan, S.-D. An Overview of the Practical Use of the CCTV System in a Simple Assembly in a Flexible Manufacturing System. Appl. Syst. Innov. 2022, 5, 52. [Google Scholar] [CrossRef]
  9. Abagiu, M.M.; Cojocaru, D.; Manta, F.; Mariniuc, A. Detecting Machining Defects inside Engine Piston Chamber with Computer Vision and Machine Learning. Sensors 2023, 23, 785. [Google Scholar] [CrossRef] [PubMed]
  10. Huang, C.-Y.; Lin, I.-C.; Liu, Y.-L. Applying deep learning to construct a defect detection system for ceramic substrates. Appl. Sci. 2022, 12, 2269. [Google Scholar] [CrossRef]
  11. Chung, S.-T.; Hwang, W.-J.; Tai, T.-M. Keypoint-Based Automated Component Placement Inspection for Printed Circuit Boards. Appl. Sci. 2023, 13, 9863. [Google Scholar] [CrossRef]
  12. Beck, T.; Langhoff, W. Kabelbaumfertigung und Einrichtung zur Kabelbaumfertigung. German Patent DE102016123976B3, G01M 11/00, 2016. [Google Scholar]
  13. Nguyen, T.P.; Yoon, J. A novel vision-based method for 3D profile extraction of wire harness in robotized assembly process. J. Manuf. Syst. 2021, 61, 365–374. [Google Scholar] [CrossRef]
  14. Yumbla, F.; Abeyabas, M.; Luong, T.; Yi, J.-S.; Moon, H. Preliminary connector recognition system based on image processing for wire harness assembly tasks. In Proceedings of the 2020 20th International Conference on Control, Automation and Systems (ICCAS), Busan, Republic of Korea, 13–16 October 2020; pp. 1146–1150. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Lai, H.; Chen, L.; Liu, W.; Yan, Z.; Ye, S. STC-YOLO: Small object detection network for traffic signs in complex environments. Sensors 2023, 23, 5307. [Google Scholar] [CrossRef] [PubMed]
  18. Huang, X.; Zhang, Y. ScanGuard-YOLO: Enhancing X-ray Prohibited Item Detection with Significant Performance Gains. Sensors 2023, 24, 102. [Google Scholar] [CrossRef] [PubMed]
  19. Sun, R.; Wu, C.; Zhao, X.; Zhao, B.; Jiang, Y. Object Recognition and Grasping for Collaborative Robots Based on Vision. Sensors 2023, 24, 195. [Google Scholar] [CrossRef] [PubMed]
  20. Cui, Y.; Guo, D.; Yuan, H.; Gu, H.; Tang, H. Enhanced YOLO Network for Improving the Efficiency of Traffic Sign Detection. Appl. Sci. 2024, 14, 555. [Google Scholar] [CrossRef]
  21. Yu, G.; Wang, T.; Guo, G.; Liu, H. SFHG-YOLO: A Simple Real-Time Small-Object-Detection Method for Estimating Pineapple Yield from Unmanned Aerial Vehicles. Sensors 2023, 23, 9242. [Google Scholar] [CrossRef] [PubMed]
  22. Shi, J.; Bai, Y.; Zhou, J.; Zhang, B. Multi-Crop Navigation Line Extraction Based on Improved YOLO-v8 and Threshold-DBSCAN under Complex Agricultural Environments. Agriculture 2023, 14, 45. [Google Scholar] [CrossRef]
  23. Ultralytics, YOLOv5 (2020) [EB/OL]. 10 June 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 10 June 2020).
  24. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  25. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  26. Jie, H.; Li, S.; Gang, S. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE Press: New York, NY, USA, 2018; pp. 7132–7141. [Google Scholar]
  27. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
  28. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Figure 1. Wiring harness and relay physical diagram.
Figure 2. Wearable electronic glasses image acquisition equipment.
Figure 3. The vision inspection system for automotive wiring harness assembly composition framework.
Figure 4. Physical diagram of the detection system hardware.
Figure 5. Improved YOLOv5s network structure.
Figure 6. ASFF network structure ($X_{i}^{l}$ denotes the output of the feature map at one location; $\alpha_1$, $\beta_2$, $\gamma_3$ denote learnable weights).
Figure 8. GSConv Network structure.
Figure 9. Example of part of the dataset.
Figure 10. Detection effect comparison. (a) is the original image, (b) is the detection result of YOLOv5s, (c) is the detection result of YOLOv5s-ASGG.
Table 1. Comparison experiments of the different improved modules (ablation study).

Model | P/% | R/% | mAP/% | F1 | Parameter Quantity/GFLOPs | Model Size/MB
YOLOv5s | 97.6 | 98.4 | 98.9 | 97.99 | 15.8 | 14.4
YOLOv5s-AS | 98.0 | 98.9 | 99.1 | 98.45 | 24.3 | 25.4
YOLOv5s-ASG | 99.3 | 99.3 | 99.5 | 99.30 | 24.6 | 26.1
YOLOv5s-ASGG | 99.2 | 99.4 | 99.4 | 99.29 | 24.1 | 25.1
Table 2. Performance comparison of different detection algorithms.

Model | P/% | R/% | mAP/% | F1
YOLOv5s | 97.6 | 98.4 | 98.9 | 97.99
YOLOv5x | 97.5 | 97.8 | 99.1 | 97.65
YOLOv3 | 95.9 | 98.5 | 98.6 | 97.18
YOLOv7 | 86.2 | 94.2 | 95.2 | 90.02
YOLOv8n | 92.2 | 97.6 | 97.7 | 94.82
YOLOv5s-ASGG | 99.2 | 99.4 | 99.4 | 99.29