Article

A Common Knowledge-Driven Generic Vision Inspection Framework for Adaptation to Multiple Scenarios, Tasks, and Objects

by Delong Zhao, Feifei Kong, Nengbin Lv, Zhangmao Xu and Fuzhou Du *
School of Mechanical Engineering and Automation, Beihang University, 37 College Road, Haidian District, Beijing 100191, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(13), 4120; https://doi.org/10.3390/s24134120
Submission received: 5 June 2024 / Revised: 21 June 2024 / Accepted: 22 June 2024 / Published: 25 June 2024

Abstract: The industrial manufacturing model is undergoing a transformation from a product-centric model to a customer-centric one. Driven by customized requirements, the complexity of products and the requirements for quality have increased, which pose a challenge to the applicability of traditional machine vision technology. Extensive research demonstrates the effectiveness of AI-based learning and image processing on specific objects or tasks, but few publications focus on the composite task of the integrated product, the traceability and improvability of methods, as well as the extraction and communication of knowledge between different scenarios or tasks. To address this problem, this paper proposes a common, knowledge-driven, generic vision inspection framework, targeted at standardizing product inspection into a process of information decoupling and adaptive metrics. Task-related object perception is planned as a multi-granularity and multi-pattern progressive alignment based on industry knowledge and structured tasks. Inspection is abstracted as a reconfigurable process of multi-sub-pattern space combination mapping and difference metric under appropriate high-level strategies and experiences. Finally, strategies for knowledge improvement and accumulation based on historical data are presented. The experiment demonstrates the process of generating a detection pipeline for complex products and continuously improving it through failure tracing and knowledge improvement. Compared to the (1.767°, 69.802 mm) and 0.883 obtained by state-of-the-art deep learning methods, the generated pipeline achieves a pose estimation error ranging from (2.771°, 153.584 mm) to (1.034°, 52.308 mm) and a detection rate ranging from 0.462 to 0.927. Through verification on other imaging methods and industrial tasks, we prove that the key to adaptability lies in the mining of inherent commonalities of knowledge, multi-dimensional accumulation, and reapplication.

1. Introduction

With the revival of computing power resources, connectionism, particularly in the form of deep convolutional neural networks (CNNs), has emerged as a powerful tool that provides automatic vision inspection (AVI) with stronger expressive power. This has greatly enriched the application of vision in automation and promoted intelligent manufacturing [1,2], especially in industries with batch and standardized production. The data-driven approach allows us to quickly construct models through annotations for specified tasks and achieve adaptation to feature diversity. However, against this backdrop of prosperous development, the advancement of emerging intelligent perception and detection methods still faces great resistance in some high-end manufacturing industries with discrete attributes. Taking the assembly appearance quality inspection of complex integrated products in the small-batch production mode as an example, the potential challenges can be summarized in the following three points:
1. Detectability of composite vision tasks under weak hardware constraints.
The customization and integration of products give them attributes, such as multiple objects, multiple states, extreme sizes, and varying features, that extend ordinary appearance detection into a composite vision task involving the adaptive extraction of task-related features and differentiated demand guidance for pattern comparison. From the perspective of information capture, as shown in Figure 1f, constructing a collection method that allows data to be captured from multiple angles, as humans do, is a prerequisite for ensuring object coverage. In terms of feature extraction, it is essential to overcome abnormal states, internal and external feature variations, and noise interference while achieving high-precision positioning of all elements (even though perspective effects can further amplify the size differences). In terms of pattern comparison, it is necessary to comprehensively analyze whether each object has anomalies based on the reference pattern under different degrees of occlusion, from both qualitative and quantitative perspectives; if so, further identification of the type of anomaly is also needed. This demand for flexible capture makes many traditional industrial vision solutions [3,4,5] (as shown in Figure 1a–d) no longer applicable, and the demand for intelligent judgement means that some portable solutions (as shown in Figure 1e) are only available for auxiliary visualization and active projection [6,7], while the analysis still relies on humans;
2. Traceability of internal failures and improvability of the inspection process.
To address diversity, we can train a scene-understanding model based on CNNs [8], even if the initial sample is small. However, the reliability of this data-driven approach is concerning. Specifically, the favorable performance of the model on prefabricated datasets may mask numerous latent threats. It is important to ask whether good performance on the test set reliably indicates that errors will not occur in actual operation, and to determine what understandable adjustments can be made for possible prediction errors, to prevent repeating the same mistakes. Essentially, this touches on two important factors that are valued in high-end equipment manufacturing processes: traceability and improvability. There have been many excellent studies exploring processing, production, scheduling, assembly, and workshops [9,10], etc., from the perspectives of digital twins and human-centered manufacturing. As an important aspect, however, research in AVI currently rarely discusses reliability issues, and rarely involves tracing the causes of errors or improving inspection methods accordingly;
3. Low-cost transferability to other scenarios, tasks, and objects.
In addition to the appearance inspection of the product itself, the assembly stage also includes pre-assembly inspection of some components, such as the appearance inspection of connectors. This additional work is a completely new task for the established method. Obviously, it is inconvenient to configure additional dedicated hardware imaging systems or specialized algorithms on the job site. In other words, assuming that we have devoted significant effort and specialized skills to a detection task, it is important for the established method to spare practitioners from starting from scratch when facing new applications. Furthermore, established methods and strategies should transfer easily to provide knowledge for new scenarios and to facilitate re-adaptation [11,12].
To address these challenges, this article proposes a common, knowledge-driven, generic vision inspection framework, which allows us to generate an inspection pipeline suitable for composite vision tasks, accumulate knowledge and experience, and expand to other scenarios and applications. The framework consists of three stages, namely, pattern connection based on industry common sense, the task-driven progressive perception of multi-objects, and knowledge-driven adaptive inspection. In addition, a knowledge improvement strategy is designed that allows for the interactive transformation of data into knowledge based on observation completeness, providing guidance on how to generate an inspection pipeline, improve inspection, and build a knowledge base with weak annotations for the actual manufacturing process. The main contributions of this study can be summarized as follows.
  • We propose a generic inspection framework that includes pattern connection, progressive alignment, and adaptive detection to adapt to composite vision tasks and achieve universality and traceability;
  • We introduce embedding strategies that encompass knowledge of industry common sense, field-task knowledge, and expert experience to integrate learning models and knowledge models, ultimately enhancing adaptability and accuracy;
  • This study allows us to construct an inspection pipeline and accumulate data–knowledge–experience for different scenarios, tasks, and objects under weak annotation, and transfer to other applications;
  • Experimental results demonstrate that compared to end-to-end learning strategies, an inspection pipeline continuously optimized through fault tracking and knowledge improvements has higher performance potential and controllability.
The rest of this article is organized as follows. Section 2 reviews the significantly related works. Section 3 describes the proposed framework in three stages. Next, the experimental results and discussions are presented in Section 4. Finally, Section 5 concludes this study and presents future work ideas.

2. Literature Review

2.1. Generic Vision Inspection Framework

Tracing the progress of Industry 3.0 towards 4.0, the study of system engineering is essential for the informatization and intelligence of complex manufacturing tasks with knowledge-intensive characteristics [13,14]. In terms of AVI, ref. [15] proposed a framework based on 3D vision to solve the problem of structural optimization of tunnels, and ref. [16] extended it through hierarchical planning and the structure-from-motion pipeline. A component-aware abnormal detection framework was designed based on DL in [17], which can perform multiple adaptive and logical checks on some simple components. The AR-based projection was reconstructed by [18] into a three-stage structure for pose estimation, tracking, and correction. The software and its development framework for visual detection were designed in [19], which specifies the resources involved in AVI. In addition, DL-based detection pipelines for specific parts can be found in [20,21]. A general scheme was proposed by [22] for the appearance quality inspection of various types of small-sized electronic products. However, these studies have not emphasized the role and commonality of knowledge, which limits the reusability of their frameworks in other scenarios or tasks.

2.2. Appearance Quality Assurance of Complex Product

The product appearance inspection generally refers to the detection of surface defects, which is a widely studied topic and can be regarded as a task of region sampling and local context pattern comparison [23]. However, the composite vision inspection task of complex assembled products complicates this process and transforms it into a cascading problem of high-accuracy pose alignment and multi-scale object pattern comparison.
In terms of alignment, the most advanced research emphasizing adaptability is represented in the field of pose estimation based on a learning paradigm with a 3D model. Research in this field can be divided into direct estimation [24,25,26,27,28,29] and indirect estimation [30,31,32,33,34]. According to the supervision strategy, it can also be divided into fully supervised [24,25,26,27,28,30,31,32,33,34], semi-supervised [35,36], and unsupervised [37,38] directions. In the unsupervised direction, some emerging topics focus on few-shot learning [39], multi-modal learning [40], and virtual–real domain adaptation [41,42], etc. Pose estimation is a difficult regression task, and the training is not easy to converge. The output of the CNN is usually adopted as the initial solution, and a further exact solution can only be obtained by manual feature matching and optimization [29,43]. Other studies focus on offline viewpoint planning so that the alignment can be simplified through hardware, increasing the practical feasibility [44,45]. However, in the actual implementation phase, its non-lightweight capacity and the deviations caused by motion errors and mechanism errors limit its popularization.
The pattern comparison described here essentially belongs to the cross-domain matching and difference metric of different information sources. The related research utilizes contrastive, matching, or metric learning strategies to perform comparisons in an encoded feature space [46,47], a multi-level space [48], or the source/target domain [49,50,51] in an end-to-end manner. When physical samples are sufficient, large models, such as high-performance target recognizers [52] and segmentation networks [53], can be directly migrated and applied through knowledge distillation [54]. Nevertheless, regardless of the learning method, it is bound to face black-box problems, annotation performance bottlenecks, and poor consistency of results, which has led many applications in manufacturing to still maintain knowledge-based manual models [55]. In addition, there is still a lack of research on standardized integration solutions for the two stages.

2.3. Knowledge Application for Vision-Based Methods

In recent years, it has gradually been discovered that the key to the successful application of this data-driven learning approach lies in the description of domain knowledge and the strategy of combining it with a learning paradigm. A review of relevant research indicates that the application of knowledge mainly includes the following aspects: (1) Datasets: simulation augmentation [56], empirically controlled sample construction [57], and automatic hard sample generation [58,59], etc.; (2) Network architectures: fusion learning of multi-modal data with complementary characteristics [60], contrastive, matching or metric learning with manually designed reference patterns [49,50,61], and designing special network layers to represent knowledge models [62,63], etc.; (3) Tasks: auxiliary tasks in multi-task learning [64,65], a priori based multi-task branch balance [66], etc.; and (4) Loss: physical model constraint terms [67], e.g., differential equations based on oscillation criteria [68], empirical model constraint terms [69], loss regularization terms [70], etc. In terms of knowledge representation, current research is primarily centered on the field of natural language processing (NLP) and its intersection with image understanding, focusing on solving knowledge graph modeling, recommendation systems, image text descriptions [71,72], etc.
Overall, existing knowledge applications are dedicated to describing specific objects or tasks and their combination with the components of the DL module, while few studies focus on common knowledge in industrial manufacturing and solve the integration of the experience paradigm and the learning paradigm from the perspective of reliable system engineering.

2.4. Research Gaps and Motivation

Overall, visual technology can be abstracted as a process of processing perceptual information gradually and extracting task-related information through experience, rules, and external constraints. From this perspective, the core of adaptability to different scenarios, objects, and tasks lies in the extraction of knowledge commonalities, multidimensional expression, accumulation, and re-application. In a sense, this is not just a single technical issue. However, by reviewing existing work in the field of vision, it can be observed that there is a lack of mining common knowledge between scenes, as well as the lack of a generic framework that allows for the accumulation of knowledge and experience. The following observations can also be made: (1) In terms of unconstrained composite vision inspection for complex products, there is a lack of research on a cascaded overall framework for connection, alignment, and comparison. More emphasis is placed on point-based applications, such as pose estimation or defect recognition in local regions; (2) Inspection schemes need to be collaboratively improved in a controllable and reliable manner as data accumulates. However, the lack of a generic framework and standardized mechanisms makes it difficult to accumulate knowledge and experience. Therefore, methods are often limited to data preparation and model training, neglecting fault tracing and interactive improvement; and (3) The application mode of knowledge is still limited to the constraint design of specific networks and has not risen to a high-dimensional strategy to guide low-cost transfer between scenarios and tasks.
Motivated by these observations, this study proposes a generic, cascaded, visual perception and information processing framework driven by common sense, knowledge, and experience, which integrates learning modules and manual modules. Additionally, the application strategies, representation forms, improvement methods, and accumulation mechanisms of knowledge are explicitly summarized for different stages and methods. From the perspective of long-term practical significance, we hope that the proposed study and validation can stimulate more attention to framework research, multi-dimensional knowledge extraction, and knowledge improvement and accumulation.

3. Proposed Method

The proposed framework is shown in Figure 2. The design of the framework execution mechanism is based on common strategies extracted from knowledge at different stages. Stage 1 is planned as a correspondence learning process anchored in a referential pattern, where common sense is seamlessly embedded through the provided form into the DL-based module via task-wise and geometric-wise constraints, etc., thus strengthening the learning purpose. Stage 2 is represented as a hierarchical pose refinement problem. To ensure accuracy, estimation is further extended to a multi-granularity object iterative matching and optimization model, assisted by multi-pattern alignment according to the task-related knowledge. Stage 3 is abstracted as a reconfigurable process of multi-sub-pattern space combination mapping and difference identification, through which expert knowledge and experience can be accumulated in the form of sub-pattern space combination strategies, processing parameter configurations and methods and data for failure simulation or augmentation. The numbers 1 to 7 represent the application of knowledge at different steps.

3.1. Virtual–Real Connection Based on Industry Common Sense

This stage aims to discover the commonalities between a given pattern (e.g., virtual model, template image, feature dictionary) and the physical space; establish a bridge for the exchange of non-homologous information; and facilitate the analysis of the physical object. To reduce the disturbance of multi-source noise on information processing and communication in this stage, it is crucial to identify the appropriate constraints and develop embedding strategies.

3.1.1. Significant Elements and Feature Space

Significant elements S refer to objects that have advantages in size, visual information, and contextual distinguishability within the given scene, such as boundaries in PCBs [73], holes of aircraft skins [74], markers on scenarios [75], and designated contours in parts [76]. Determining the feature space means that significant elements are expected to be transformed into a domain-independent public space to build correspondence with a given pattern by, for example, encoding elements into 2D, 3D [77], or pose [43] geometric space to perform matching with a given pattern (e.g., pointset, 3D model) [55]. Alternatively, semantic features of elements can be extracted and mapped to the target pattern space [49] or a common space [78] to complete recognition, alignment, positioning, etc.

3.1.2. Constraints Method and Embedding Strategy

The core of this step lies in the mining of rules with inherent invariance; we recommend rigid body structure invariance in industrial manufacturing. For the DL part, constraints are applied to the loss in the form of tasks or regular terms, while for the non-DL part, constraints are employed to adjust the execution process in the form of empirical parameters.
For the sake of illustration, as shown in Figure 3, we designed a significant 3D box estimation model based on [30] that integrates shape self-constraint. In the multitasking mode, according to the concept of focusing attention, quadrant classification, target recognition, and point estimation can be selected for constraint construction to aggregate information. The loss $L$ can be formulated as $L = w_m L_m + w_c L_c + w_r L_r$, where $w$ is the branch balance coefficient, and $L_c$, $L_r$ can be derived based on common classification and recognition tasks [46]; $L_m$ is defined as follows:
$$L_m = \sum_{i}^{S} \sum_{j}^{9} \left\| f_t\left(I, \Theta\right) - gt_t \right\| + \mathbb{1}_m \left\| f_m - gt_c^{cell} \right\| + \left(1 - \mathbb{1}_m\right) \left\| f_m - gt_c^{cell} \right\|. \tag{1}$$
Point regression loss and cell-oriented confidence loss are adopted in (1), where $s = f_t\left(I, \Theta\right)$ represents the coordinate obtained through neural network $f$ with parameter $\Theta$, $gt$ denotes coordinate annotations, and $cell$ depends on the size of the output. In terms of regular terms, $L_m$ can be further extended as follows:
$$L_m' = L_m + \left\{ \lambda_P L_P + \lambda_L L_L + \lambda_C L_C \right\}, \tag{2}$$
where $\lambda$ is a set of trade-off coefficients. $L_P$ considers the feature of close size and can be formulated as follows:
$$L_P = \sum_{S} \frac{1}{|set|} \sum_{G \in set} \sum_{\left(g_i, g_j\right) \in prio} \max\left( \arccos\left(g_i, g_j\right) - \tau_P,\ 0 \right), \tag{3}$$
where $g \in G$ denotes the manually specified segment between two $s$'s, $set$ describes the set of selected segments, and $prio$ represents prior knowledge that depends on the shape of the product. $L_L$ emphasizes the approximate parallelism of specific structures and can be described as follows:
$$L_L = \sum_{S} \frac{1}{|set|} \sum_{G \in set} \max\left( \sqrt{\sum_{g \in G} \left\| g - \bar{g}_G \right\|^2 / \left(|G| - 1\right)} - \tau_L,\ 0 \right), \quad \bar{g}_G = \frac{\sum_{g \in G} g}{|G|}, \tag{4}$$
$L_C$ explores the relationship between the centroid of a rigid body and its shape as follows:
$$L_C = \sum_{S} \frac{1}{|set|} \sum_{G \in set} \sum_{g \in prio} \max\left( d\left(c, g\right) - \tau_C,\ 0 \right), \tag{5}$$
where $d\left(c, g\right)$ measures the distance between the center $c$ and a segment $g$, and $\tau_P, \tau_L, \tau_C > 0$ are used to characterize the tolerance degree. In addition, these knowledge terms can also be modified according to the actual situation, as in [79]. If physical samples are insufficient, domain adaptation [41] or dataset augmentation [58] can be supplemented. Additionally, when it is difficult to simplify the representation of geometric invariance, operators can refer to an end-to-end learning approach [25] or implicitly embed constraints in the form of multi-modal learning [37].
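To make the embedding strategy concrete, the following is a minimal PyTorch-style sketch of how the regular terms in Equations (2)–(5) could be implemented; the segment definitions, tolerance values, and branch weights are illustrative assumptions rather than the configuration used in this study.

```python
import torch
import torch.nn.functional as F

def angle_penalty(segs_a, segs_b, tau_p=0.1):
    """One reading of L_P (Eq. 3): penalize the angle between prior-paired segments beyond a tolerance."""
    cos = F.cosine_similarity(segs_a, segs_b, dim=-1).clamp(-1 + 1e-6, 1 - 1e-6)
    return torch.clamp(torch.arccos(cos) - tau_p, min=0).mean()

def parallelism_penalty(segs, tau_l=0.05):
    """L_L (Eq. 4): penalize the spread of segment directions within a group (approximate parallelism)."""
    dirs = F.normalize(segs, dim=-1)
    return torch.clamp(dirs.std(dim=0).norm() - tau_l, min=0)

def centroid_penalty(center, seg_midpoints, tau_c=0.2):
    """L_C (Eq. 5), simplified: distance from the predicted center to each prior segment,
    approximated here by segment midpoints, beyond a shape-dependent tolerance."""
    d = torch.cdist(center.unsqueeze(0), seg_midpoints).squeeze(0)
    return torch.clamp(d - tau_c, min=0).mean()

def total_loss(l_points, l_cls, l_rec, l_p, l_l, l_c,
               w=(1.0, 0.5, 0.5), lam=(0.1, 0.1, 0.1)):
    """L = w_m*L_m + w_c*L_c + w_r*L_r, with L_m extended by the regular terms (Eq. 2)."""
    return (w[0] * l_points + w[1] * l_cls + w[2] * l_rec
            + lam[0] * l_p + lam[1] * l_l + lam[2] * l_c)
```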
In the non-DL part, post-processing includes, but is not limited to, outlier filtering and the correction of missed detections [80]. Then, the refined significant feature $s$ is fed into pattern matching, and this work can be empirically guided by weight coefficients. For example, the $conf$ of $s$ estimated by DL and the visibility $v$ of $s$ can be combined to comprehensively initialize the weights to assist in virtual–real pose alignment as follows:
$$R, t = \underset{R, t}{\operatorname{argmin}} \sum_{i} g_{s_i}\left(v, conf\right) \left\| T_{R, t}\left(s_i\right) - p_i \right\|, \tag{6}$$
where $g_{s_i}\left(\cdot\right)$ is an empirical function and $p_i \in p$ represents the $i$-th item in the reference. In 2D matching, $R \in \mathbb{R}^{2 \times 2}$, $t \in \mathbb{R}^{2 \times 1}$, $T_{R, t}\left(s_i\right) = R s_i + t$, and (6) can be solved by [81]. For pose estimation, $R \in \mathbb{R}^{3 \times 3}$, $t \in \mathbb{R}^{3 \times 1}$, $T_{R, t}\left(s_i\right) = \pi\left(K\left(R s_i + t\right)\right)$, and (6) can be solved using [82], where $K$ refers to the camera matrix and $\pi$ denotes the de-homogenization operation.
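For the 2D branch of Equation (6), a closed-form weighted alignment suffices; a minimal sketch is given below, assuming the weights come from an empirical combination of visibility and DL confidence. The 3D/PnP branch would instead be handed to a pose solver such as the one cited in [82].

```python
import numpy as np

def empirical_weight(visibility, conf, alpha=0.5):
    """Assumed form of g_{s_i}(v, conf): a convex combination of visibility and confidence."""
    return alpha * visibility + (1.0 - alpha) * conf

def weighted_rigid_align_2d(src, ref, weights):
    """Weighted Kabsch/Procrustes solution of Eq. (6) in 2D:
    R, t = argmin sum_i w_i * ||R s_i + t - p_i||^2 for src (N,2) and ref (N,2)."""
    w = weights / weights.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_r = (w[:, None] * ref).sum(axis=0)
    S, P = src - mu_s, ref - mu_r
    H = (w[:, None] * S).T @ P                      # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # keep a proper rotation (det = +1)
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_r - R @ mu_s
    return R, t
```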

3.2. Task-Driven Progressive Perception of Multi-Object

This section aims to construct a task-driven pose adjustment model for complex assembled products to achieve multi-scale object localization and accuracy adaptation under background and local abnormal disturbances.

3.2.1. Structured Representation of Task

As shown in Figure 4, we design a structured task representation model based on a quadtree mode that endows the inspection framework with a task-driven mechanism. Object provides “name” and “spatial relation”. “name” can be used to create learning labels and build associations in structured data management. “spatial relation” describes the relationship between the object and the scene, which can be expressed as a coordinate system transformation, a topological relationship, etc. The expected detection task is arranged in Item. Criterion defines the alignment and detection scheme for an Item. For example, assume a method pool containing multiple DL models and image processing algorithms has been constructed; then, for “*-B”, a classification DL model, a color algorithm, and a geometric algorithm are manually configured. The “+” in the “Execution strategy” indicates the execution order of the three. Next, for each method, “Method Para.” in Parameter defines the internal parameters of the execution, “Evaluation Para.” is the validity parameter adopted to evaluate whether there is a need to continue between “H” and “M” or “M” and “M”, and $\cdot_e$ is employed to determine whether there are omissions.
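A minimal sketch of how this quadtree-style task representation could be held in code is given below; the field names and the hypothetical omission threshold are illustrative, since the paper only fixes the Object–Item–Criterion–Parameter structure and serializes it to XML.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Parameter:
    method_para: Dict[str, float] = field(default_factory=dict)   # internal execution parameters
    eval_para: Dict[str, float] = field(default_factory=dict)     # validity checks between H/M or M/M steps
    omission_check: float = 0.5                                    # hypothetical threshold for detecting omissions

@dataclass
class Criterion:
    execution_strategy: str = ""                                   # e.g., "D2+A1+A3" encodes method order

@dataclass
class Item:
    name: str
    criterion: Criterion
    parameter: Parameter

@dataclass
class ObjectNode:
    name: str                                                      # used for labels and data management
    spatial_relation: Dict[str, str]                               # e.g., coordinate transform, topology
    items: List[Item] = field(default_factory=list)
    children: List["ObjectNode"] = field(default_factory=list)     # quadtree-style decomposition of the scene
```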

3.2.2. Multi-Granularity Pattern Alignment

Accurate extraction is essentially a high-accuracy geometric estimation problem. The estimation can be reconstructed as an alignment problem under given reference pattern constraints. Furthermore, as the required accuracy increases, the reference pattern also needs to be improved (re-rendering images, expanding the template library), at which point the focus of alignment shifts to multi-spatial invariance mining and collaborative optimization. To this end, taking the connected virtual–real state as the initial state, a multi-granularity pattern alignment pipeline for the positioning of multi-scale objects is designed as shown in Figure 5 (3D and 2D can be regarded as simplifications of this problem).
Assume that the objects involved in the appearance inspection of a product include large-sized components (LSC), medium-sized parts (MSP), small-sized accessories (SSA), and specific details (SD) (e.g., holes, faces). The perception process can be summarized based on task characteristics as follows:
1. Mainly LSC, with a few MSP (e.g., robotic assembly, robot grasping, AR/VR). A low accuracy requirement ($>5°$, 10 cm) means that more effort can be allocated to cope with feature variation, and the corresponding data-driven learning strategies are listed in Figure 5. If the geometric feature is difficult to define explicitly, end-to-end direct estimation [24,26,27,28] can be adopted to infer the pose or relative pose as follows:
$$L = \sum_{t \in T_a} f_{T_a}^{a}\left( \left\{ f^{e}\left(x, \Theta\right) \mid x \in X_{T_a}, \Theta \in \Omega_{T_a} \right\}, gt_{T_a} \right) + f_{T_m}^{a}\left( \left\{ f^{e}\left(x, \Theta\right) \mid x \in X_{T_m}, \Theta \in \Omega_{T_m} \right\}, \mathcal{T}\left(gt_{T_m}\right) \right), \tag{7}$$
where $T_m$, $X_{T_m}$ denote the main task and its input set (e.g., image, depth); $T_a$, $X_{T_a}$ consist of auxiliary tasks (e.g., depth prediction, target recognition, foreground segmentation [30], etc.) and their inputs; $f^{e}$, $f^{a}$ are the feature encoder and aggregator, respectively; and $\mathcal{T}$ is a converter used to re-parameterize the pose for the CNN (e.g., quaternion [29], etc.). Conversely, the estimation can also be achieved indirectly by learning sparse or dense geometric correspondence, in which $\mathcal{T}\left(gt_{T_m}\right)$ in (7) can be replaced by offsets [25], vector distributions [31], 3D coordinates [32,33], intermediate representations [34], etc. On the other hand, considering the difficulty of preparing a high-quality dataset, research combining self-supervision and domain adaptation has gradually drawn more attention.
The self-supervision promotes alignment learning by manually designing domain-independent consistency evaluation rules [37], which is summarized in “4” of Figure 2 and can be described as $L_{self} = L_{geo} + L_{feat} + L_{sem}$, with
$$L_{geo} = D_{geo}^{2D}\left(G^{2D}, rf^{2D}\right) + D_{depth}\left(G^{depth}, rf^{depth}\right)_{v} + CD\left(G^{3D}, rf^{3D}\right), \tag{8}$$
$$L_{feat} = D_{feat}^{co}\left(F^{co}, rf^{co}\right) + D_{feat}^{tex}\left(F^{tex}, rf^{tex}\right), \quad co \in \left\{ Rgb, Lab, \ldots \right\}, \tag{9}$$
$$L_{sem} = \sum_{layer} \left\{ D_{sem}\left(ft^{layer}, rf^{layer}\right) \right\}. \tag{10}$$
In the different representation spaces, $\left(G, F, ft\right)$ are the generated results of the input pattern, and $rf$ belongs to the reference pattern. Thus, (8) includes 2D rules (e.g., contour, mask), a depth rule, and 3D rules, where $D_{depth}\left(\cdot\right)_{v}$ indicates that only the visible body is considered, and $CD$ is used to measure point cloud differences (e.g., chamfer distance). The similarity of the Lab space [83], structural similarity in the RGB space [84], and texture consistency are modeled in (9). As expressed in (10), it is also reasonable that the semantics of different patterns in the corresponding network layers should be similar. The domain adaptation performs the distribution alignment of a given pattern to the target pattern in the semantic space through technologies such as GANs [38]. This research can be seen as an extension of the former, aiming to replace manually embedded rules through end-to-end learning.
2. Mainly LSC and MSP, with a few SSA and SD (e.g., electronic/electromechanical products). In some large scenarios, the results of DL adjustment in 1) may still not satisfy the requirements, which means that improving alignment accuracy requires more fine-grained element support and synchronous improvement of the current reference. Therefore, in the second stage of Figure 5, three basic alignment strategies are provided. When an object appears frequently but its features are not prominent, the CNN-based learning paradigm can be maintained (e.g., metric learning [85], matching learning [86]) to identify the differences between the extracted region of interest (ROI) and its reference (e.g., virtual ROI). Otherwise, for ROIs with low information, manual image features and geometric features will be better choices.
Define a complete alignment process between an input ROI $x_k$ and its reference pattern $rf_k$ as follows:
$$ref \leftarrow w_C f_C\left(x_k, rf_k, \Theta_C\right) \oplus \left(1 - w_C\right) f_M\left(x_k, rf_k, \Theta_M\right), \tag{11}$$
$$R, t, s = \underset{R, t, s}{\operatorname{argmin}} \sum_{i} D_T\left( T_{\Omega}\left(p_i\right), T_{\Omega}\left(t_i\right) \right), \tag{12}$$
$$p_i \in G^{k, 2D}, \quad t_i \in rf^{k, 2D'}, \quad rf^{k, 2D'} = rf^{k, 2D} \oplus ref, \quad R \in \mathbb{R}^{2 \times 2}, \ t \in \mathbb{R}^{2 \times 1}. \tag{13}$$
$ref$ defines a state correction strategy, where $w_C$ is the attention and $f_C$, $f_M$, $\Theta_C$, $\Theta_M$ represent the methods and their internal parameters. For example, the region most similar to $rf_k$ within $x_k$ can be extracted by the matching CNN $f_C$, and the matching vector can be obtained. Similarly, $f_M$ is the template matching. In this way, $ref$ is the weighting of two potential directions, by which we can adjust the geometric feature $rf^{k, 2D}$ of $rf_k$ to form $rf^{k, 2D'}$. Equation (12) describes a generalized model for point set registration, where $\left(R, t, s\right)$ is a set of 2D transformation parameters, $T_{\Omega}$ represents a mapping approach with an implicit parameter $\Omega$ (e.g., weight mapping [81], kernel function mapping [87], mixed Gaussian mapping [88], etc.), and $D_T$ refers to the measurement established on $T$ (e.g., Euclidean distance, kernel correlation, probability likelihood, etc.).
In this way, define any nearest-neighbor point of $t_i \in rf^{k, 2D''}$, $rf^{k, 2D''} = R\left( s\left( rf^{k, 2D'} - \overline{rf}^{k, 2D'} \right) + \overline{rf}^{k, 2D'} \right) + t$, as $p_i$. $\left\{ P_k, P_{k, 3D} \right\}$ is the set of all $p_i$ and their 3D coordinates. Then, in the ROI-wise internal loop, the input pose and reference pattern can be updated through the $rf^{k, 2D'}$, $P_{k, 3D}$ of all ROIs. In the external loop, this stage can be used to synergistically adjust the labels of the previous stage.
3. LSC, MSP, SSA, and SD are all required (e.g., this case or a similar product). This type of task means that no detail of interest may be allowed to deviate from the alignment process, even if the current performance has significantly exceeded the common indicators (5°, 5 cm) adopted in the CV field. In other words, the internal loop optimization in 2) needs to be performed again on the details associated with inspection items and manually selected auxiliary features (e.g., holes, vertices). It should be noted that we recommend $ref \leftarrow f_M\left(x_{k, j}, rf_{k, j}, \Theta_M\right)$, as DL is not suitable for small objects. Equation (12) can be simplified as an isolated point-to-point relationship $\left(rf^{k, j, 2D}, p_{k, j}, p_{k, j, 3D}\right)$, where $j$ is the $j$-th detail of ROI $x_k$. Then, each ROI will be bound to its details, further achieving joint optimization.
A complete process is presented below and explained in Figure 6:
$$\mathrm{I}: \quad op_{ali}^{N_I}\left( \left\{ \left(rf_k, P_k\right) \right\}_{k \in LSC \cup MSP} \right), \tag{14}$$
$$\mathrm{II}: \quad op_{ali}^{1}\left( \left\{ \left(rf_{k_1}, P_{k_1}\right);\ \left(rf', P'\right)_{k_2} \ \middle|\ k_1, k_2 \in \left\{ LSC \cup MSP \right\} \right\} \right), \tag{15}$$
$$\left(rf', P'\right)_{k_2} = \left\{ \left( rf_{k_2} \oplus ref_m,\ rf_m \oplus ref_m,\ P \right) \ \middle|\ m \in SSA_{k_2} \right\}, \tag{16}$$
$$ref_m = op_{2D}^{N_{II}}\left( G_m, rf_m \right), \quad m \in SSA_{k_2}. \tag{17}$$
$op_{ali}^{1}$ represents performing one round of pose optimization, updating the reference pattern, and aligning features; $op_{2D}^{N_{II}}$ denotes applying Formula (12). Numeral III is similar to II, except that $ref$ is directly determined by single-point matching. The above work establishes an integrated association of task–object–perception and ensures the accuracy and reliability of extraction at different granularities.
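As a concrete instance of the ROI-wise inner loop, the sketch below solves Equations (12)–(13) with plain Euclidean nearest neighbours and the closed-form Umeyama similarity estimate; the kernel- or GMM-based mappings $T_\Omega$, robust measures $D_T$, and the correction term $ref$ discussed above are omitted, so this is an assumption-laden simplification rather than the exact procedure.

```python
import numpy as np

def umeyama_2d(src, dst):
    """Closed-form similarity (R, t, s) minimizing sum_i ||s*R*src_i + t - dst_i||^2."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    X, Y = src - mu_s, dst - mu_d
    U, sig, Vt = np.linalg.svd(X.T @ Y)             # 2x2 cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # reflection guard
    D = np.array([1.0, d])
    R = Vt.T @ np.diag(D) @ U.T
    s = (sig * D).sum() / (X ** 2).sum()
    t = mu_d - s * (R @ mu_s)
    return R, t, s

def roi_registration_2d(ref_pts, obs_pts, n_iter=20):
    """ICP-style loop for one ROI: match the (corrected) 2D reference pattern to observed
    geometric features, then re-estimate (R, t, s); corresponds to applying Eq. (12) N_II times."""
    R, t, s = np.eye(2), np.zeros(2), 1.0
    for _ in range(n_iter):
        moved = s * (ref_pts @ R.T) + t
        nn = ((moved[:, None, :] - obs_pts[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        R, t, s = umeyama_2d(ref_pts, obs_pts[nn])  # nearest observed point per reference point
    return R, t, s
```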

3.3. Knowledge-Driven Adaptive Inspection

Theoretically, all the components to be detected participate in alignment, which means that their processing data will provide the basis for inspection. Some potential anomalies that were overlooked during alignment due to inappropriate results will be further confirmed in this section. Therefore, as shown in Figure 7, this section introduces a knowledge-based adaptive inspection scheme for perceptual objects.

3.3.1. Method Pool

A multi-dimensional method pool is constructed based on industrial vision knowledge (e.g., objects, features, and possible methods), which includes geometry, statistics, template, and semantics. In each mode, we provide general strategies for information with different dimensions. Due to the mature development of manual feature processing, only DL-based strategies listed in the semantic mode are emphasized here.
“D1” assumes a monotonic reference pattern and constructs data-mapping learning to establish a matching bridge between the perception and the target information. According to the type of mapping space, it can be further divided into an encoding space [89] and a reconstructed space [49], as shown in “5” in Figure 2. Fusion-based perceptual enhancement has proven to be an effective strategy for improving CNN performance. There are many sources of manufacturing information that can be integrated, among which the most commonly used are images (“D2”) and point clouds (“D3”).
According to different fusion strategies, it can be divided into data-wise fusion (e.g., mixed virtual real dataset), and feature space fusion [78,90]. In practice, it is particularly important to note that the fused information should be able to improve intra-class consistency and inter-class distinguishability.
For rare or unseen anomalies, “D4” suggests a way to learn from positive samples [91], with a focus on whether the learned prototype under imbalanced categories is close enough to the true distribution. For industrial inspections that emphasize reliability, it is recommended to strictly control the information received by “D4” (e.g., accurate positioning) to prevent unexpected shifts in distribution caused by interference. In addition, CNN models for general tasks also have high application value, such as [92]. These approaches encode perception into an information space that humans can understand, enabling many traditional algorithms to cope with complex contexts.
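In code, the method pool can be as simple as a tag-indexed registry so that a Criterion's “Execution strategy” string can be executed directly; the sketch below is an assumed minimal interface, with the registered functions reduced to placeholders.

```python
from typing import Callable, Dict

METHOD_POOL: Dict[str, Callable] = {}

def register(tag: str):
    """Register an inspection/alignment method under a pool tag such as 'D2' or 'A1'."""
    def deco(fn: Callable) -> Callable:
        METHOD_POOL[tag] = fn
        return fn
    return deco

@register("D2")
def virtual_real_metric(roi, reference):
    """Placeholder for image-fusion metric learning (semantic mode)."""
    return {"score": 0.0, "anomaly": False}

@register("A1")
def geometric_check(roi, reference):
    """Placeholder for a geometry-mode check (e.g., contour or position consistency)."""
    return {"score": 0.0, "anomaly": False}

def run_strategy(strategy: str, roi, reference):
    """Execute an 'Execution strategy' such as 'D2+A1' in the configured order."""
    return [METHOD_POOL[tag](roi, reference) for tag in strategy.split("+")]
```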

3.3.2. Inspection under Different Situations

In theory, as long as the object has not been omitted or obviously incorrectly installed, its matching results will exhibit high consistency across the multiple stages of alignment. Otherwise, situations such as low similarity or no matching point may indicate anomalies. To verify these suspected anomalies, as shown in the second line of Figure 7, we summarize the following representative cases based on field knowledge, from aspects such as information volume, internal pattern, contextual consistency, regularity, etc.:
1. Components with surface treatment. The internal texture of surface-treated components is difficult to recognize, and distinction from the context can be achieved through contour features or external information. “D5” will be a good strategy to provide robust edges when the contour is difficult to extract manually; otherwise, conventional operators like Canny can be directly applied [55]. When the context is difficult to distinguish, “D3” enhances its discriminative ability by incorporating 3D information, enabling end-to-end estimation of the target state;
2. Fasteners/Accessories. As small objects, fastener-like targets typically exhibit high texture clustering and limited feature variation. These attributes permit the examination from two distinct aspects: color and shape, as well as pattern similarity;
3. Parts with rotational similarity. To ensure inspection reliability, a robust Criterion is provided here. Virtual–real metric learning in “D2” can generally provide coarse-grained presence recognition. If the local pattern changes of an object exhibit continuity and regularity across different global viewpoints, a strategy of reference pattern sampling and multi-space fusion can be adopted. Specifically, semantics from different viewpoints can be fused in the semantic space, or multiple templates can be predesigned in the feature space. In addition, to address the contradiction between the estimated state and the abnormality in alignment, a more targeted geometric strategy “A1” is assembled to identify any improper installation, as shown in Figure 7;
4. Defects on small-sized objects. Precisely extracted small objects or regions have limited information content, and any abnormality will cause significant pattern differences. In such cases, an acceptable defect recognition result can be obtained by template-based mapping learning with few-shot learning [49]. In addition, multi-directional gradient information extraction and signal matching can also provide a reliable basis for defect localization;
5. Defects on large surfaces. Defects that occur on large flat surfaces on a component or a specific area of an assembled product are generally unpredictable in location and appearance. To prevent small defects from being indistinguishable, joint sampling and comparison is a widely adopted approach [93]. In this study, the occurrence distribution obtained from production experience can be used to guide sampling, and the texture consistency model can assist in judgment from aspects such as gradient, frequency domain [94], etc. The initially adopted CNN can be learned through “D4”. With the accumulation of historical data in the framework, the CNN can gradually be adjusted to a classification model targeting negative samples;
6. Foreign object debris. There is limited research on this topic, mainly due to the challenges in vision-based locating of foreign object debris in a complex assembled product, and its recognition falls under the category of open-set problems [3]. Nevertheless, this task can be transformed into a joint task of empirical sampling and pattern comparison within this framework. For comparison, we suggest choosing “D5+A1”, “D2+A3”, “D3” or a combination of them.
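The six field cases above can be distilled into starting Criterion presets; the mapping below only illustrates that idea, and the non-“D” tags (A*) are assumed placeholders for the geometric/statistical/template modes rather than identifiers defined in the paper.

```python
# Illustrative starting strategies per field case (see Section 3.3.2); tag choices are assumptions.
CASE_CRITERIA = {
    "surface_treated_component": "D5+A1",   # robust edge prediction + geometric check
    "fastener_accessory":        "A2+A3",   # color/shape statistics + pattern similarity (assumed tags)
    "rotational_similarity":     "D2+A1",   # coarse presence recognition + targeted geometry
    "small_object_defect":       "D1",      # template-based mapping learning with few-shot support
    "large_surface_defect":      "D4",      # positive-sample learning, later a negative-sample classifier
    "foreign_object_debris":     "D5+A1",   # alternatives per the text: "D2+A3", "D3", or combinations
}

def starting_criterion(case: str) -> str:
    """Return an initial Execution strategy for a field case; unknown cases default to manual review."""
    return CASE_CRITERIA.get(case, "manual_review")
```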

3.3.3. Knowledge Improvement Strategies

The framework presented in this paper can be regarded as an intermediary operating within the current manufacturing phase and knowledge space. The processing results of various stages on perception are continuously accumulated in historical data. Essentially, these results reveal the evolutionary regularity of information in the pattern spaces associated with various methods. To improve the observation of the regularities generated in each pattern space, it is crucial to collect alignment and detection data during the trial operation phase.
Here, as shown in Figure 8, we demonstrate an interactive hybrid iteration strategy that follows the principles of technological improvement, experience accumulation, strategy adjustment, and adaptation (TESA), converting accumulated data into knowledge for inspection. Specifically, the data generated by the current framework can be divided into two parts based on performance and the descriptive ability of each sub-pattern space, namely, complete observation, and incomplete observation:
  • Complete observation. This means that the current observation can be well-represented by the configured sub-pattern (or combination), while the subsequent addition of data has little impact. For example, the presence or absence of fasteners can be determined using typical templates or area-based conditions. Due to good visibility and fewer unpredictable changes, some objects can be modeled by CNN and existing datasets to improve their sub-pattern (semantic);
  • Incomplete observation. Essentially, incomplete observation is a challenge that less constrained vision-based applications will inevitably face. In other words, any changes in Man–Machine–Material–Method–Environment (4M1E) may lead to deviations in perception that are difficult to cover with the pattern space constructed from existing knowledge, such as an unseen viewpoint. In this situation, if the difference from the internal sub-pattern space is too large, then manual intervention is necessary. For the semantic space, difficult case mining and augmentation [57,59] has been proven to be an effective strategy. For other spaces, exploring new sub-patterns or reconstructing existing patterns can be considered;
  • Intermediate state. More objects belong to the intermediate state between (1) and (2). For these data, a combined disturbance strategy oriented towards perception and pattern space is adopted here to simulate potential variation based on existing knowledge, as follows: (a) Perception. Perform pose perturbation around the viewpoint of the existing sample and simulate possible geometric matching differences using virtual rendering. For example, in the small part recognition, geometric dictionaries under different states can be constructed to improve the perception by perturbing the local virtual viewpoint; For foreign object debris, normal patch collection can be carried out around the frequently occurring areas on the data after viewpoint perturbation, to improve observation of reference pattern and enhance metric ability; and (b) Pattern. For high-dimensional semantic features, it is recommended to use data augmentation strategies and feature layer Gaussian noise, while for low-dimensional manual features and modeling functions, it is recommended to use internal parameter perturbations. For example, in the fine-defect detection task of small objects, recognition results obtained through manual rules or empirical functions based on multiple factors, such as shape, texture, and geometric attributes, may fail due to insufficient consideration of abnormal changes in these factors. Therefore, establishing disturbance and simulation (e.g., shape deformation, feature space noise) based on intermediate results at each stage can enhance the adaptability of the method to new patterns. Then, re-train or refit (cluster) the data after perturbation and embed the discovered inherent invariance as constraints into the learning process.
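A minimal sketch of the combined disturbance strategy for the intermediate state is given below: viewpoint perturbation on the perception side and Gaussian noise on the pattern side. The perturbation magnitudes and the uniform/Gaussian choices are assumptions; in practice they would be set from the task tolerances before re-rendering, re-training, or re-fitting.

```python
import numpy as np

def perturb_viewpoints(euler_deg, trans_mm, n=20, rot_jitter=3.0, trans_jitter=20.0, seed=None):
    """Perception-side disturbance: jitter an existing viewpoint (Euler angles in degrees,
    translation in mm) to simulate unseen observations for virtual re-rendering."""
    rng = np.random.default_rng(seed)
    d_rot = rng.uniform(-rot_jitter, rot_jitter, size=(n, 3))
    d_trans = rng.uniform(-trans_jitter, trans_jitter, size=(n, 3))
    return [(euler_deg + dr, trans_mm + dt) for dr, dt in zip(d_rot, d_trans)]

def perturb_features(features, sigma=0.05, seed=None):
    """Pattern-side disturbance: additive Gaussian noise in a feature space before
    re-training or re-fitting (clustering) the corresponding sub-pattern model."""
    rng = np.random.default_rng(seed)
    return features + rng.normal(0.0, sigma, size=np.shape(features))
```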
Given the updated pattern library (including fine-tuned DL models, re-fitted functions, expanded feature libraries or dictionaries, supplementary template images, and adjusted parameter sets), Criterion and Parameter are then locally adjusted object by object, as shown in number 7 of Figure 2. Subsequently, the improved Criterion and Parameter, as well as the expanded pattern library, are integrated into the entire framework for joint refinement with Connection and Alignment. For example, in the new alignment process, we can incorporate small-sized objects that were not previously considered to enhance alignment accuracy.

4. Case Study

4.1. Preliminary Work

4.1.1. Scenarios and Datasets

Overall idea: As shown in Figure 9, to simulate the real-world problems faced in the actual manufacturing process, we start from an experimental platform containing a small amount of available data. Performance is then enhanced and knowledge accumulated by continuously introducing new samples, tracing failures, and making improvements. During this period, we gradually demonstrate its performance in comparison with state-of-the-art deep learning methods under ideal conditions. Furthermore, based on existing knowledge, low-cost migration to other imaging approaches, scenarios, and objects can be achieved through Criterion planning and Parameter fine-tuning.
Scenarios: The construction of scenarios, tasks, imaging, datasets, etc., is shown in Figure 9. Scenario I: a complex assembled product with 113 components was constructed, where the objects were further divided by size (ES~ED), as shown in Figure 10. The product was placed in a complex background, and its appearance is inconsistent from different viewpoints. Some equipment and parts were surface-treated to exhibit low texture and high color consistency. To verify universality, two common imaging pathways (I-A, I-B) were arranged here; Scenario II: a robot with a fixed station was deployed to perform multi-view imaging of a structural component, targeting identification of the positions of its holes; Scenario III: 46 kinds of electrical connectors were prepared, which include 6 types of standard parts (pins), more than 40 different arrangements, and 4 installation approaches. In addition, the experimental platform simply adopted binocular vision and was equipped with low-angle ring lights, without any special design or servo control system; and Scenario IV: products of Scenario III were captured by an industrial tablet.
Inspection Task: The task is to identify the assembly status of all objects in Scenarios I-A and I-B, and check for any omissions, wrong types, or incorrect positions (misalignment, not tightened, or skewed). For I-A, this task involves overcoming disturbances posed by the background, lighting conditions, and self-occlusion while addressing high-accuracy positioning based on monocular images and virtual–real comparison of multi-scale objects. In II, rapid and minimally costly deployment and accuracy assurance are the focus of this common AR task. For III and IV, it is necessary to demonstrate adaptability to variations in appearance and imaging posture across different models, and to achieve accurate identification of products and their densely packed small targets.
Dataset: In I-A, $D_1$~$D_3$ were built by capturing images with a handheld portable device around the platform at different pitch angles. According to the artificial markers, the ground truth of the pose and ROIs was calculated. $D_1$ contains 89 images with no abnormalities, while $D_2$ contains 112 images, which are configured with defects for specific EA–ED objects. After modifying the illumination and background, 80 more images were captured as $D_3$. The number of visible defects was manually counted and is listed in Table 1.
$D_4$ contains 20 images captured by a mobile robot; $D_5$ contains 60 images captured by collaborative robots. The configuration of $D_6$ is shown in Figure 9, and two types with artificially set missing and skewed pins are arranged in $D_7$. Color images of each connector model were prepared in $D_8$.
Indicators: Euler angle error $\bar{e}_R$, translation error $\bar{e}_t$, positioning performance $\overline{IoU}$, normal state $\bar{r}_n$, missing part $\bar{r}_m$, wrong type $\bar{r}_w$, and incorrect location $\bar{r}_i$, where the superscript can denote a dataset $D$ or a specific object, etc.
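For reference, these indicators can be computed as below; since the exact Euler convention is not specified here, the rotation error is written as the geodesic distance between rotation matrices as a stand-in, and the IoU is the standard axis-aligned definition.

```python
import numpy as np

def rotation_error_deg(pred_R, gt_R):
    """Stand-in for the angular error e_R: geodesic distance between two 3x3 rotation matrices, in degrees."""
    cos = (np.trace(pred_R @ gt_R.T) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error_mm(pred_t, gt_t):
    """Translation error e_t in millimetres."""
    return float(np.linalg.norm(np.asarray(pred_t) - np.asarray(gt_t)))

def iou(box_a, box_b):
    """Axis-aligned IoU for (x1, y1, x2, y2) boxes, used as the positioning indicator."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```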

4.1.2. Criterion and Method Pool

Figure 10 depicts the configurations of Criterion for different scenarios, which are based on the knowledge of the intersection between vision and manufacturing summarized in Section 3.3.2.
It should be noted that only the inspection logic is presented here; based on it, different specific methods are arranged during the experimental phase to demonstrate the universality and scalability of the proposed framework. Additionally, the initial Parameter is set based on the actual processing results of each stage and is not elaborated here. The Criterion, Parameter, and the above tasks are then encapsulated as quads and associated in XML with the given process 3D model.
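As an illustration of this quad encapsulation, the snippet below serializes one Object–Item–Criterion–Parameter quad and associates it with a process 3D model file using Python's standard XML library; the element and attribute names are assumptions, not the schema actually used.

```python
import xml.etree.ElementTree as ET

def quad_to_xml(obj_name, item_name, strategy, parameters, model_ref):
    """Serialize one (Object, Item, Criterion, Parameter) quad linked to the process 3D model."""
    root = ET.Element("Object", name=obj_name, model=model_ref)
    item = ET.SubElement(root, "Item", name=item_name)
    ET.SubElement(item, "Criterion", strategy=strategy)
    para = ET.SubElement(item, "Parameter")
    for key, value in parameters.items():
        ET.SubElement(para, "Para", name=key, value=str(value))
    return ET.tostring(root, encoding="unicode")

# Hypothetical example: a connector item checked by a DL model followed by color and geometric checks.
print(quad_to_xml("EC1", "EC1-B", "D2+A1+A3",
                  {"eval_threshold": 0.8, "omission_check": 0.5}, "product_model.step"))
```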

4.2. Evaluation on Complex Assembled Product

4.2.1. Deployment Work

In this section, for I-A, datasets $D_1$~$D_3$ were, respectively, used for simulating prototype system construction (PSC), trial operation (TO), and practical application (PT). Unlike the prevalent practice in the computer vision field of relying on extensive pre-prepared datasets, this study initially assumed that only images of PSC and labels of ES0 and ES1 (which may contain errors caused by occlusion and intuitive annotation) could be used. The framework construction details are listed in Table 2. The network used in D2 refers to [61], where the task branch is modified to object recognition to assist in identifying abnormal states. D3 adopts a depth-data processing branch and performs feature fusion according to [89], maintaining consistency with D2 in the task. In D5, a lightweight contour prediction network is constructed based on [95].
Based on the TESA concept, the experiment is divided into four stages, as shown in Table 3. Each stage is required to be tested on PT to demonstrate the current potential.

4.2.2. Validation under Possible Non-Ideal Conditions in Actual Production

Stage-A. Five representative pose estimation methods were used as control groups. Specifically, Group I learned indirect relationships through [30] and EPnP; Groups II and III, respectively, improved the results of Connection and Group I using poseNet2 [29] and the pose ground truth. Group IV employed the advanced estimator PVNet [31], in which semantic labels were generated from the minimum bounding rectangle of significant geometric features. Group V arranged the unsupervised learning method self6D [37], where the depth map was replaced by simulation data with pose ground truth, and the differentiable renderer was pytorch3D [100].
The performance of each group on different datasets is listed in Table 4, and some examples can be found in Figure 11. The comparison of the performance between Connection and Group I indicates that embedding prior knowledge or industry-specific rules into the CNN in the form of constraints can yield significant benefits. The results of Groups II and III demonstrate the fine-tuning ability of the supervised mode. However, although the accuracy achieved is sufficient for grasping, tracking, or AR, it is still slightly insufficient for a complex inspection task. Group IV is not constrained by the limitations of manual geometric representation, thereby achieving better results. Nevertheless, the unsatisfactory alignment performance in the surrounding region shown in Figure 11 shows that accurate estimation based solely on a few elements is not sufficient for such a complex case. Although the performance of unsupervised learning is inadequate, the constraint strategy emphasized in self6D highlights the potential of this approach in industrial manufacturing. In PSC, for applications where actual datasets are difficult to obtain, the Connection or the first phase of Alignment in this framework can be modified to an unsupervised form with knowledge constraints. On the other hand, the performance changes on PSC and PT demonstrate the stronger adaptability of learning paradigms to pattern changes, which also implies the necessity of the transition from a knowledge-based agent to a learning agent emphasized in TESA.
Stage-B. Inspection initialization and knowledge improvement based on the data generated by Stage-A are shown in Figure 12. The failure cases summarize the detection failures caused by alignment deviation, virtual–real differences, and local matching errors.
It can be observed that in practical industrial vision applications, a series of unpredictable situations may be hidden in seemingly good results (e.g., 0.542°, 27.413 mm in Figure 12), which further confirms the necessity of emphasizing interpretability, traceability, and improvability. Therefore, knowledge improvement based on TESA was performed as shown in Figure 12. For the manual check step, only those cases with significant deviations were manually corrected. In the pattern part, each sub-pattern space typically has its own dedicated measurement strategy, such as shape residual, cross entropy, Euclidean distance, etc. According to the field knowledge summarized in Section 3.3.2, the Criterion was updated from 10-(a) to 10-(b), as shown in Figure 10. Additionally, an alignment balance strategy for improving case c was provided here. Specifically, this strategy can be achieved by increasing the optimization weights of objects located at the far end or around, or by reducing the geometric feature filtering of highly occluded objects. After the aforementioned work, the experimental results of Stage-B are summarized in Table 5, where $\bar{r}_n^{PT}$ consists of the true positive rate $\bar{r}_{nr}^{PT}$ and the false positive rate $\bar{r}_{nf}^{PT}$ (recall rate).
The performance improvement in PSC is foreseeable, while the improvement in PT (from 0.712 to 0.845) proves that knowledge improvement has a positive effect on the practical application stage. However, according to our observations, pattern expansion and parameter adjustment on PSC increase the detection tolerance for pattern mismatch, resulting in $\bar{r}_n^{PSC}$ increasing from 0.778 to 0.894 while $\bar{r}_{nf}^{PT}$ also increased from 0.571 to 0.597. This phenomenon is actually due to overfitting in a specific pattern space when the observed information is limited. To address this issue, based on this study, it can be improved by accumulating experience in the target pattern space and adjusting the Criterion to increase the dimension of information perception.
Next, in the transitional Stage-C, while collecting experience with new patterns, we reviewed the feedback of this system engineering on various types of anomalies. Simultaneously, we selected [49] as the control group DL1, updated the templates of [50] as DL2, and arranged the state-of-the-art object recognizer Yolo7 [52] as DL3. To demonstrate the upper performance limit of the DL-based approaches, the dataset adopted here was generated from the pose ground truth of Stage-B (virtual–real ROIs and object bounding rectangles). Additionally, the determination of whether a case is an anomaly is based on the average confidence given on the test dataset.
Unexpectedly, compared to the results reported in $\bar{r}_n$ in Table 6, the performance of high-quality label-supervised learning models does not seem to have an absolute advantage over the previously encountered patterns. In summary, learning with reference patterns demonstrates superior discrimination ability for low-granularity position-sensitive anomalies ($\bar{r}_i$ on ED), while mature recognition baselines demonstrate better potential performance for semantic-sensitive anomalies with high information volume ($\bar{r}_m$ on EA/B). Despite the effectiveness being verified, Figure 13 demonstrates that there still exists a significant number of logical errors and unexpected errors hidden within the actual process data. More seriously, the analysis of the causes of these errors and the construction of avoidance methods are quite difficult. However, in the framework of this study, as shown in the second part of Figure 13, we can easily trace back these errors.
Similar to the previous stage, knowledge improvement work was carried out based on the TESA strategy. Abnormal cases were employed to construct hard samples and retrain D2, as well as to expand the pattern library. The Criterion was updated from (b) to (c) to increase the dimension of information observation and avoid the misjudgments caused by incorrect matching in Figure 13, especially by incorporating D2 into alignment and inspection to alleviate the pressure of evaluation parameter fitting. Finally, parameter debugging and orthogonal experiments on the improved framework were conducted. Correspondingly, the datasets of DL1 and DL2 were supplemented with OT, and abnormal states were given greater weight during training.
After these careful modifications, the methods made significant progress compared to before, as shown in Table 7. In our framework, due to the control of the Criterion and Parameter constraints, the results obtained at each stage are observable and interpretable, and some abnormal results given by the D* modules can be well-controlled. For DL-based approaches, the entire inspection of a complex assembled product relies heavily on positioning accuracy and dataset quality. Comparing the $\bar{r}_i$ values in the table, system engineering based on industry experience still holds an advantage in identifying abnormal patterns related to highly specialized knowledge in industrial vision. The provided Criterion and methods may not be the optimal solution and can be further improved through PT or future accumulation. The reasons and basis for the revision of the Criterion are summarized in Table 8.

4.2.3. Validation on a Mobile Robot Capture System

The system model is AUBO AMR300-E5, in which the positioning of the mobile base relies on laser SLAM and the indoor map is pre-built. The pose ground truth of $D_4$ is converted into control commands recognizable by the system. The error sources of the system include joint calibration error, system error, map construction and positioning error, and motion control error. In this scenario, we observed that the real-time positioning error and the motion error caused by the idle movement of the base are the primary sources of error. Therefore, in Table 9 and Figure 14, the first view captured after the vehicle reaches each target point was selected as the representative example. Because real-time status is difficult to obtain, IoU was used to describe the potential deviation of the current system.
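As a reference for how this deviation proxy can be computed, a minimal IoU helper is sketched below: the overlap between the ROI expected from the model and the ROI actually observed in the captured view. The box coordinates are illustrative placeholders, not measurements from this system.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

# Expected vs. observed ROI of one element; a partially drifted view gives ~0.59.
print(iou((100, 100, 300, 260), (130, 120, 330, 280)))
```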
Upon tracing $(\overline{IoU}_{EC}, \overline{IoU}_{ED})$, it can be observed that SLAM possesses the ability to perform real-time correction. However, as the system continues to operate for an extended period, error accumulation is inevitable and deviations tend to amplify. The results of Group II demonstrate the compatibility and transferability of the proposed framework. The comparison between Groups I and II shows that, although Connection can in practice be simplified through multi-device combination, complete alignment and inspection remain indispensable.

4.3. Validation on AR Projection under Different Objects

The core problem of AR-assisted assembly tasks is virtual–real alignment, which essentially still falls within the scope of this framework. The suggested Criterion is shown in Figure 10a,b. D0 was retrained on dataset $D_5$. EAx was selected as the significant element, and the five points associated with its recognition box were used as geometric features. The method pool remains unchanged, but Parameter needs to be debugged again on $D_5$. Through the trial run of (a), it was found that the participation of EBx had a limited impact on overall accuracy, and it was therefore excluded in (b). Additionally, the pattern library (shape, gradient) of ECx was expanded in (b) to improve alignment performance. After the above work, two different Connection strategies were verified; the results are summarized in Table 10 and Figure 15.
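For reference, the "D0 + EPnP" route in Table 10 can be illustrated as follows: the 2D points tied to the EAx recognition box are paired with their counterparts on the 3D model and passed to an EPnP solver to obtain the virtual–real alignment pose. All coordinates and the camera intrinsics in this sketch are placeholder values, not data from this study.

```python
import cv2
import numpy as np

object_pts = np.array([[0, 0, 0], [80, 0, 0], [80, 50, 0], [0, 50, 0], [40, 25, 15]],
                      dtype=np.float64)          # keypoints on the 3D model (mm)
image_pts = np.array([[312, 240], [498, 252], [492, 371], [305, 360], [402, 306]],
                     dtype=np.float64)           # matched 2D detections (pixels)
K = np.array([[1200, 0, 640], [0, 1200, 360], [0, 0, 1]], dtype=np.float64)

ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)                        # rotation matrix used for projection
print(ok, tvec.ravel())
```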
Compared with Section 4.2, this is a common and simple case in industry. As can be seen in Figure 15, after alignment, the performance of the two Connection strategies is similar, and both achieve excellent projection effects. Considering that D5 can be replaced by many excellent feature extraction algorithms from the tracking field, the Connection with B4 + C3 may be the better choice from an efficiency perspective. To further improve accuracy, knowledge improvement can be carried out by, for example, incorporating the contour of EBx into the alignment process and expanding the pattern library of holes in various postures.

4.4. Validation on a Different Scenario and Task

In Criterion-(a), a manual proxy based on multiple templates was constructed to preliminarily create a dataset. In Criterion-(b), ES0/EA0/EBx were selected as the significant elements. For D0, the embedded constraints include (1) auxiliary tasks for recognizing ES0/EA0; (2) that the ordinates of the centroids of the elements are close; and (3) that the elements are approximately parallel. These explicit constraints enable us to quickly train a specialized model without the risk of transferring large baseline models that are not applicable. The product model can be determined from the distribution of estimated key points, and the corresponding 3D structure is loaded to perform Connection and Alignment. When the pattern library is insufficient, to prevent missed identification in the perceived ROI, field knowledge of pin-tip reflection characteristics was leveraged to design a more reliable but less efficient method such as D5 + A2 + A1. After the expansion of the C3 and A2 pattern libraries, as well as the retraining of D0, the entire process was improved and simplified in Criterion-(c). Considering the local positional uncertainty of a skewed pin, an empirical sampling + D1 Criterion was assembled based on (5) in Section 3.3.2, in which the template used in D1 was replaced with a normal pin-tip image.
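The row and parallelism constraints above can, for instance, be attached to D0's training as auxiliary loss terms. The sketch below penalizes spread in the centroid ordinates and in the element axis angles; the tensor shapes and weights are assumptions for illustration only, not the constraints exactly as implemented in this study.

```python
import torch

def constraint_loss(centroids: torch.Tensor, axis_angles: torch.Tensor,
                    w_row: float = 1.0, w_par: float = 1.0) -> torch.Tensor:
    """centroids: (N, 2) predicted element centroids; axis_angles: (N,) in radians."""
    row_term = centroids[:, 1].var()   # ordinates of the centroids should be close
    par_term = axis_angles.var()       # element axes should be approximately parallel
    return w_row * row_term + w_par * par_term

centroids = torch.tensor([[10.0, 50.2], [30.0, 49.8], [50.0, 50.1]], requires_grad=True)
angles = torch.tensor([0.02, -0.01, 0.03], requires_grad=True)
loss = constraint_loss(centroids, angles)   # added to the recognition loss of D0
loss.backward()
```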
The comparison between the method constructed according to this framework and the advanced inspection schemes MTCI1 [80] and MTCI2 [22] for multi-type electrical connectors is summarized in Table 11. $\overline{IoU_1}_{EC}$ is employed to compare the performance of each group's initial matching work (recognition and registration in MTCI, Connection in this study). $\overline{IoU_2}_{EC}$ is adopted to evaluate the strategies each group implements to prevent missed recognition (ROI expansion and rematching in MTCI, Alignment + Detection in this study). The indicator $\overline{IoU_e}_{EC}$ refers to the IoU between the detected abnormal pin and the local search region.
The first row in Figure 16 shows the inapplicability of 2D-matching modes under large rotation angles. Combining registration with direct retraining of a large baseline model, as in MTCI1 and MTCI2, does not fully exploit task-related prior knowledge. This limitation creates a significant need for effort directed at improving and correcting outliers and missed recognitions, particularly when the target is imaged at a large rotation angle. Conversely, the proposed framework enables an easier and faster construction of a robust inspection scheme, leading to more accurate perceptual results. With the above foundation, we can generate a lightweight model suitable for real-time recognition from D0 through knowledge distillation. The image can be converted to grayscale, and the above works can be applied directly after Parameter debugging. Additionally, D5 and D1 can be replaced with B1, making it easier to determine the status of the pins in the color space. The tracking and recognition effects are shown in the second row of Figure 16.
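The distillation step mentioned above can be sketched as a standard teacher–student setup in which a lightweight student mimics the softened outputs of the heavier D0 model. The temperature, loss weighting, and toy networks below are assumptions, not the configuration used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 32), nn.ReLU(), nn.Linear(32, 10))

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # match softened teacher outputs
    hard = F.cross_entropy(student_logits, labels)     # keep the original task labels
    return alpha * soft + (1 - alpha) * hard

x = torch.randn(8, 3, 32, 32); y = torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)
loss = kd_loss(student(x), t_logits, y)
loss.backward()
```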

4.5. Knowledge Integration and Reapplication

Through the aforementioned works, the pattern libraries, Criterion, and Parameter associated with task-objects, as well as the intermediate process data of each step, can be accumulated across different capture approaches and scenarios. Criterion combinations that have been successfully validated on previous objects can be stored as new sub-patterns in the method pool. The original dataset, the disturbance strategies of each stage, and the perturbed dataset are prepared to provide a foundation for more advanced DL models. Validated Parameter will be employed as additional expert knowledge to support similar tasks.
Overall, when migrating to new scenarios or tasks, the practical process can be summarized as follows:
  • Perform Criterion design (or directly call based on experience, or automatically recommend based on established pattern space (e.g., knowledge graph));
  • Build pattern connection and alignment;
  • Conduct internal debugging of Criterion + Parameter;
  • Implement joint testing of Connection + Alignment;
  • Determine if repetition is necessary.
With the continuous accumulation of experience, the above work will be progressively simplified and even directly applied in future encounters with new tasks.
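As a rough illustration of this reapplication idea, validated Criterion/Parameter combinations could be stored per task signature and recalled, or recommended by nearest match, when a new task arrives. The signature fields and stored entries below are hypothetical and only indicate the intended lookup behavior, not the paper's actual knowledge base.

```python
validated = {
    ("rigid-assembly", "fixed-camera"): {"criterion": "10-(c)", "connection": "D0 + PnP",
                                         "alignment": ["A1", "A2", "C3"]},
    ("electrical-connector", "portable"): {"criterion": "pin-tip", "connection": "D0-distilled",
                                           "alignment": ["B1", "C3"]},
}

def recommend(task_type: str, capture: str):
    key = (task_type, capture)
    if key in validated:                        # direct reuse of accumulated knowledge
        return validated[key]
    for (t, _), entry in validated.items():     # fall back to the closest task type,
        if t == task_type:                      # then debug Parameter locally
            return entry
    return None                                 # otherwise design a new Criterion

print(recommend("rigid-assembly", "mobile-robot"))
```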

5. Conclusions

We believe that liberating people from complex physical and mental labor so that they can play a role in more critical positions, such as innovation and high-level decision-making, is one of the most important propositions for next-generation intelligent vision and sensor technology. This paper focuses on the adaptability challenges that the transformation of manufacturing modes poses to traditional AVI and proposes a common, knowledge-driven, generic vision inspection framework comprising three stages to standardize the inspection process, enhance the detection ability for composite vision tasks, and promote knowledge accumulation and reapplication within the industry. In the first stage, we demonstrated how to utilize common sense, such as the rigid structural stability of industrial products, to construct constraints and seamlessly integrate them into both learning and non-learning processes for pattern connection. In the second stage, a multi-granularity, multi-pattern, hierarchical iterative alignment mechanism driven by structured tasks was designed to facilitate traceable and interpretable extraction of task-related objects. In the third stage, general detection items were normalized into information mapping and processing over different sub-pattern spaces or their combinations, and the TESA knowledge improvement and accumulation strategy was designed to ensure internal improvability and transferability between applications.
For a composite appearance inspection task on an assembled product consisting of 113 objects, the inspection pipeline generated based on the proposed framework demonstrates the advantages of traceability, improvability, and weak annotation dependency compared with advanced deep learning methods. Moreover, compared with the state-of-the-art recognition model trained with truth labels, the better potential and performance demonstrated in this study on positioning accuracy and position-sensitive anomaly detection indicate its effectiveness. In addition, the validation on other capture approaches, tasks, and objects preliminarily verifies the adaptability of this study, and the significantly reduced deployment effort proves the importance and necessity of knowledge accumulation.
It should be noted that this study is only applicable to situations where the common knowledge holds, including cases where reference patterns can be provided (3D models, drawings), rigid products, and non-precision measurements. In terms of practical application, it is particularly suitable for the customized, single-piece, small-batch production processes of the aerospace field. On the other hand, it is undeniable that the adjustment of Criterion and Parameter during knowledge improvement still relies on humans, and the improvement workload may even exceed that of data annotation. Therefore, in future work, we plan to conduct validation in more practical scenarios to enrich our experience in Criterion design and to encapsulate validated Parameter, sub-patterns, and intermediate data. Furthermore, we will construct a knowledge graph for AVI, allowing us to establish a recommendation system that assists operators in improving design and decision-making when dealing with new tasks.

Author Contributions

Conceptualization, D.Z.; methodology, D.Z.; software, D.Z. and F.K.; validation, D.Z., F.K. and N.L.; formal analysis, Z.X. and F.K.; investigation, D.Z. and F.D.; resources, F.D.; data curation, F.K. and N.L.; writing—original draft preparation, D.Z.; writing—review and editing, N.L. and F.K.; visualization, D.Z. and Z.X.; supervision, F.D.; project administration, F.D.; funding acquisition, F.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work is financially supported by the National Natural Science Foundation of China (grant number 52375478).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pang, G.; Shen, C.; Cao, L.; Hengel, A.V.D. Deep learning for anomaly detection: A review. ACM Comput. Surv. 2021, 54, 38. [Google Scholar] [CrossRef]
  2. Wang, J.; Ma, Y.; Zhang, L.; Gao, R.X.; Wu, D. Deep learning for smart manufacturing: Methods and applications. J. Manuf. Syst. 2018, 48, 144–156. [Google Scholar] [CrossRef]
  3. Kong, F.; Zhao, D.; Du, F. A doubt–confirmation-based visual detection method for foreign object debris aided by assembly models. Trans. Can. Soc. Mech. Eng. 2023, 47, 508–520. [Google Scholar] [CrossRef]
  4. Realyvásquez-Vargas, A.; Arredondo-Soto, K.C.; García-Alcaraz, J.L.; Márquez-Lobato, B.Y.; Cruz-García, J. Introduction and configuration of a collaborative robot in an assembly task as a means to decrease occupational risks and increase efficiency in a manufacturing company. Robot. Comput. Manuf. 2018, 57, 315–328. [Google Scholar] [CrossRef]
  5. Guo, S.; Diao, Q.; Xi, F. Vision based navigation for omni-directional mobile industrial robot. Procedia Comput. Sci. 2017, 105, 20–26. [Google Scholar] [CrossRef]
  6. Rentzos, L.; Papanastasiou, S.; Papakostas, N.; Chryssolouris, G. Augmented reality for human-based assembly: Using product and process semantics. IFAC Proc. 2013, 46, 98–101. [Google Scholar] [CrossRef]
  7. Hořejší, P.; Novikov, K.; Šimon, M. A smart factory in a Smart City: Virtual and augmented reality in a Smart assembly line. IEEE Access 2020, 8, 94330–94340. [Google Scholar] [CrossRef]
  8. Yang, S.; Wang, W.; Liu, C.; Deng, W. Scene understanding in deep learning-based end-to-end controllers for autonomous vehicles. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 53–63. [Google Scholar] [CrossRef]
  9. Zhang, C.; Zhou, G.; Hu, J.; Li, J. Deep learning-enabled intelligent process planning for digital twin manufacturing cell. Knowl.-Based Syst. 2020, 191, 105247. [Google Scholar] [CrossRef]
  10. Sharfuddin, A.K.; Iram, N.; Simonov, K.-S.; Himanshu, G.; Ashraf, R.I. A knowledge-based experts’ system for evaluation of digital supply chain readiness. Knowl.-Based Syst. 2021, 228, 107262. [Google Scholar] [CrossRef]
  11. Wang, M.; Deng, W. Deep visual domain adaptation: A survey. Neurocomputing 2018, 312, 135–153. [Google Scholar] [CrossRef]
  12. Wuest, T.; Weimer, D.; Irgens, C.; Thoben, K.D. Machine learning in manufacturing: Advantages, challenges, and applications. Prod. Manuf. Res. 2016, 4, 23–45. [Google Scholar] [CrossRef]
  13. Zheng, P.; Wang, H.; Sang, Z.; Zhong, R.Y.; Liu, Y.; Liu, C.; Khamdi, M.; Yu, S.; Xu, X. Smart manufacturing systems for Industry 4.0: Conceptual framework, scenarios, and future perspectives. Front. Mech. Eng. 2018, 13, 137–150. [Google Scholar] [CrossRef]
  14. Kamble, S.S.; Gunasekaran, A.; Gawankar, S.A. Sustainable Industry 4.0 framework: A systematic literature review identifying the current trends and future perspectives. Process Saf. Environ. Prot. 2018, 117, 408–425. [Google Scholar] [CrossRef]
  15. Insa-Iglesias, M.; Jenkins, M.D.; Morison, G. 3D visual inspection system framework for structural condition monitoring and analysis. Autom. Constr. 2021, 128, 103755. [Google Scholar] [CrossRef]
  16. Xu, Z.; Chen, B.; Zhan, X.; Xiu, Y.; Suzuki, C.; Shimada, K. A vision-based autonomous UAV inspection framework for unknown tunnel construction sites with dynamic obstacles. arXiv 2023. [Google Scholar] [CrossRef]
  17. Liu, T.; Li, B.; Du, X.; Jiang, B.; Jin, X.; Jin, L.; Zhao, Z. Component-aware anomaly detection framework for adjustable and logical industrial visual inspection. arXiv 2023. [Google Scholar] [CrossRef]
  18. Yang, X.; Cai, J.; Li, K.; Fan, X.; Cao, H. A monocular-based tracking framework for industrial augmented reality applications. Int. J. Adv. Manuf. Technol. 2023, 128, 2571–2588. [Google Scholar] [CrossRef]
  19. Zhu, Q.; Zhang, Y.; Luan, J.; Hu, L. A Machine Vision Development Framework for Product Appearance Quality Inspection. Appl. Sci. 2022, 12, 11565. [Google Scholar] [CrossRef]
  20. Singh, S.A.; Desai, K.A. Automated surface defect detection framework using machine vision and convolutional neural networks. J. Intell. Manuf. 2023, 34, 1995–2011. [Google Scholar] [CrossRef]
  21. Hridoy, M.W.; Rahman, M.M.; Sakib, S. A Framework for Industrial Inspection System using Deep Learning. Ann. Data Sci. 2022, 11, 445–478. [Google Scholar] [CrossRef]
  22. Zhao, D.; Xue, D.; Wang, X.; Du, F. Adaptive vision inspection for multi-type electronic products based on prior knowledge. J. Ind. Inf. Integr. 2022, 27, 100283. [Google Scholar] [CrossRef]
  23. Xiao, M.; Yang, B.; Wang, S.; Mo, F.; He, Y.; Gao, Y. GRA-Net: Global receptive attention network for surface defect detection. Knowl.-Based Syst. 2023, 280, 111066. [Google Scholar] [CrossRef]
  24. Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv 2017. [Google Scholar] [CrossRef]
  25. Hu, Y.; Hugonot, J.; Fua, P.; Salzmann, M. Segmentation-driven 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3385–3394. [Google Scholar] [CrossRef]
  26. Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; Fox, D. DeepIM: Deep iterative matching for 6D pose estimation. In Computer Vision—ECCV 2018, Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 695–711. [Google Scholar] [CrossRef]
  27. Hu, Y.; Fua, P.; Wang, W.; Salzmann, M. Single-stage 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2930–2939. [Google Scholar] [CrossRef]
  28. Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. Cosypose: Consistent multi-view multi-object 6D pose estimation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVII; Springer: Cham, Switzerland, 2020; pp. 574–591. [Google Scholar] [CrossRef]
  29. Kendall, A.; Cipolla, R. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5974–5983. [Google Scholar] [CrossRef]
  30. Tekin, B.; Sinha, S.N.; Fua, P. Real-Time Seamless Single Shot 6D Object Pose Prediction. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 292–301. [Google Scholar] [CrossRef]
  31. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3212–3223. [Google Scholar] [CrossRef] [PubMed]
  32. Park, K.; Patten, T.; Vincze, M. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7668–7677. [Google Scholar] [CrossRef]
  33. Li, Z.; Wang, G.; Ji, X. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-dof object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7678–7687. [Google Scholar] [CrossRef]
  34. Song, C.; Song, J.; Huang, Q. HybridPose: 6D object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 431–440. [Google Scholar] [CrossRef]
  35. Mariotti, O.; Bilen, H. Semi-supervised Viewpoint Estimation with Geometry-Aware Conditional Generation. In Computer Vision—ECCV 2020 Workshops, Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12536. [Google Scholar] [CrossRef]
  36. Zhou, G.; Wang, D.; Yan, Y.; Chen, H.; Chen, Q. Semi-Supervised 6D Object Pose Estimation Without Using Real Annotations. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5163–5174. [Google Scholar] [CrossRef]
  37. Wang, G.; Manhardt, F.; Shao, J.; Ji, X.; Navab, N.; Tombari, F. Self6D: Self-supervised monocular 6D object pose estimation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I; Springer: Cham, Switzerland, 2020; pp. 108–125. [Google Scholar] [CrossRef]
  38. Langerman, J.; Qiu, Z.; Sörös, G.; Sebők, D.; Wang, Y.; Huang, H. Domain Adaptation of Networks for Camera Pose Estimation: Learning Camera Pose Estimation without Pose Labels. arXiv 2021. [Google Scholar] [CrossRef]
  39. Ito, S.; Aizawa, H.; Kato, K. Few-Shot NeRF-Based View Synthesis for Viewpoint-Biased Camera Pose Estimation. In Artificial Neural Networks and Machine Learning—ICANN 2023, Proceedings of the ICANN 2023, Heraklion, Crete, Greece, 26–29 September 2023; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14255. [Google Scholar] [CrossRef]
  40. Shu, Q.; Luan, Z.; Poslad, S.; Bourguet, M.L.; Xu, M. MCAPR: Multi-modality Cross Attention for Camera Absolute Pose Regression. In Artificial Neural Networks and Machine Learning—ICANN 2023, Proceedings of the ICANN 2023, Heraklion, Crete, Greece, 26–29 September 2023; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14255. [Google Scholar] [CrossRef]
  41. Lee, T.; Lee, B.U.; Shin, I.; Choe, J.; Shin, U.; Kweon, I.S.; Yoon, K.J. UDA-COPE: Unsupervised domain adaptation for category-level object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14891–14900. [Google Scholar] [CrossRef]
  42. Zhang, D.; Barbot, A.; Seichepine, F.; Lo, F.P.-W.; Bai, W.; Yang, G.-Z.; Lo, B. Micro-object pose estimation with sim-to-real transfer learning using small dataset. Commun. Phys. 2022, 5, 80. [Google Scholar] [CrossRef]
  43. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar] [CrossRef]
  44. Manon, P.J.; Arnaud, P.; Dominique, N.; Jean-Luc, M.; Jean-Philippe, P. Survey on the View Planning Problem for Reverse Engineering and Automated Control Applications. Comput.-Aided Des. 2021, 141, 103094. [Google Scholar] [CrossRef]
  45. Mehdi, M.; MohammadReza, H.; Soohwan, S.; Shirin, M.; Mohammad, S. A Review on Viewpoints and Path Planning for UAV-Based 3-D Reconstruction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5026–5048. [Google Scholar] [CrossRef]
  46. Youkachen, S.; Ruchanurucks, M.; Phatrapomnant, T.; Kaneko, H. Defect Segmentation of Hot-rolled Steel Strip Surface by using Convolutional Auto-Encoder and Conventional Image processing. In Proceedings of the 2019 10th International Conference of Information and Communication Technology for Embedded Systems (IC-ICTES), Bangkok, Thailand, 25–27 March 2019; pp. 1–5. [Google Scholar] [CrossRef]
  47. Wang, K. Contrastive learning-based semantic segmentation for In-situ stratified defect detection in additive manufacturing. J. Manuf. Syst. 2023, 68, 465–476. [Google Scholar] [CrossRef]
  48. Hu, X.; Yang, J.; Jiang, F.; Amir, H.; Kia, D.; Mandar, G. Steel surface defect detection based on self-supervised contrastive representation learning with matching metric. Appl. Soft Comput. 2023, 145, 110578. [Google Scholar] [CrossRef]
  49. Kim, J.; Oh, T.H.; Lee, S.; Pan, F.; Kweon, I.S. Variational prototyping-encoder: One-shot learning with prototypical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9462–9470. [Google Scholar] [CrossRef]
  50. Zhou, Y.; Zhang, Y. SiamET: A Siamese based visual tracking network with enhanced templates. Appl. Intell. 2020, 52, 9782–9794. [Google Scholar] [CrossRef]
  51. Xia, X.; Pan, X.; Li, N.; He, X.; Ma, L.; Zhang, X.; Ding, N. GAN-based anomaly detection: A review. Neurocomputing 2022, 493, 497–535. [Google Scholar] [CrossRef]
  52. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  53. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. arXiv 2023. [Google Scholar] [CrossRef]
  54. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  55. Ben Abdallah, H.; Jovančević, I.; Orteu, J.J.; Brèthes, L. Automatic inspection of aeronautical mechanical assemblies by matching the 3D CAD model and real 2D images. J. Imaging 2019, 5, 81. [Google Scholar] [CrossRef] [PubMed]
  56. Li, D.-C.; Lin, L.-S.; Chen, C.-C.; Yu, W.-H. Using virtual samples to improve learning performance for small datasets with multimodal distributions. Soft Comput. 2019, 23, 11883–11900. [Google Scholar] [CrossRef]
  57. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond empirical risk minimization. arXiv 2017. [Google Scholar] [CrossRef]
  58. Siu, C.; Wang, M.; Cheng, J.C. A framework for synthetic image generation and augmentation for improving automatic sewer pipe defect detection. Autom. Constr. 2022, 13, 104213. [Google Scholar] [CrossRef]
  59. Wang, X.; Shrivastava, A.; Gupta, A. A-Fast-RCNN: Hard positive generation via adversary for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2606–2615. [Google Scholar]
  60. Zhou, Q.; Chen, R.; Huang, B.; Xu, W.; Yu, J. DeepInspection: Deep learning based hierarchical network for specular surface inspection. Measurement 2020, 160, 107834. [Google Scholar] [CrossRef]
  61. Wang, C.; Ge, S.; Jiang, Z.; Hao, H.; Gu, Q. SiamFuseNet: A pseudo-siamese network for detritus detection from polarized microscopic images of river sands. Comput. Geosci. 2021, 156, 104912. [Google Scholar] [CrossRef]
  62. Chen, B.; Parra, A.; Cao, J.; Li, N.; Chin, T.-J. End-to-End Learnable Geometric Vision by Backpropagating PnP Optimization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8097–8106. [Google Scholar] [CrossRef]
  63. Xu, C.; Wang, J.; Tao, J.; Zhang, J.; Zheng, P. A knowledge augmented deep learning method for vision-based yarn contour detection. J. Manuf. Syst. 2022, 63, 317–328. [Google Scholar] [CrossRef]
  64. Xu, Y.; Qiao, W.; Zhao, J.; Zhang, Q.; Li, H. Vision-based multi-level synthetical evaluation of seismic damage for RC structural components: A multi-task learning approach. Earthq. Eng. Eng. Vib. 2023, 22, 69–85. [Google Scholar] [CrossRef]
  65. Dong, X.; Taylor, C.J.; Cootes, T.F. Defect Classification and Detection Using a Multitask Deep One-Class CNN. IEEE Trans. Autom. Sci. Eng. 2022, 19, 1719–1730. [Google Scholar] [CrossRef]
  66. Wu, H.; Li, B.; Tian, L.; Feng, J.; Dong, C. An adaptive loss weighting multi-task network with attention-guide proposal generation for small size defect inspection. Vis. Comput. 2023, 40, 681–698. [Google Scholar] [CrossRef]
  67. Wright, L.G.; Onodera, T.; Stein, M.M.; Wang, T.; Schachter, D.T.; Hu, Z.; McMahon, P.L. Deep physical neural networks trained with backpropagation. Nature 2022, 601, 549–555. [Google Scholar] [CrossRef] [PubMed]
  68. Bazighifan, O.; Cesarano, C. A Philos-Type Oscillation Criteria for Fourth-Order Neutral Differential Equations. Symmetry 2020, 12, 379. [Google Scholar] [CrossRef]
  69. Chang, A.; Zhang, Y.; Zhang, S.; Zhong, L.; Zhang, L. Detecting prohibited objects with physical size constraint from cluttered X-ray baggage images. Knowl.-Based Syst. 2022, 237, 107916. [Google Scholar] [CrossRef]
  70. Wang, X.; Peng, Z.; Kong, D.; Zhang, P.; He, Y. Infrared dim target detection based on total variation regularization and principal component pursuit. Imaging Vis. Comput. 2017, 63, 1–9. [Google Scholar] [CrossRef]
  71. Zhang, W.; Şerban, O.; Sun, J.; Guo, Y. Conflict-aware multilingual knowledge graph completion. Knowl.-Based Syst. 2023, 281, 111070. [Google Scholar] [CrossRef]
  72. Ge, Y.; Ma, J.; Zhang, L.; Li, X.; Lu, H. Trustworthiness-aware knowledge graph representation for recommendation. Knowl.-Based Syst. 2023, 278, 110865. [Google Scholar] [CrossRef]
  73. Li, X.; Liu, G.; Sun, S.; Bai, C. Contour detection and salient feature line regularization for printed circuit board in point clouds based on geometric primitives. Measurement 2021, 185, 109978. [Google Scholar] [CrossRef]
  74. Zhang, Q.; Liu, J.; Zheng, S.; Yu, C. A novel accurate positioning method of reference hole for complex surface in aircraft assembly. Int. J. Adv. Manuf. Technol. 2021, 119, 571–586. [Google Scholar] [CrossRef]
  75. Koch, C.; Neges, M.; König, M.; Abramovici, M. Natural markers for augmented reality-based indoor navigation and facility maintenance. Autom. Constr. 2014, 48, 18–30. [Google Scholar] [CrossRef]
  76. Vázquez Nava, A. Vision System for Quality Inspection of Automotive Parts Based on Non-Defective Samples. Master’s Thesis, Instituto Tecnológico y de Estudios Superiores de Monterrey, Monterrey, Mexico, 2021. Available online: https://hdl.handle.net/11285/648442 (accessed on 11 June 2021).
  77. Yuan, G.; Fu, Q.; Mi, Z.; Luo, Y.; Tao, W. SSRNet: Scalable 3D Surface Reconstruction Network. IEEE Trans. Vis. Comput. Graph. 2022, 29, 4906–4919. [Google Scholar] [CrossRef]
  78. Xing, C.; Rostamzadeh, N.; Oreshkin, B.; Pinheiro, P.O. Adaptive cross-modal few-shot learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Article number 436; pp. 4847–4857. Available online: https://dl.acm.org/doi/10.5555/3454287.3454723 (accessed on 11 June 2021).
  79. Du, F.; Kong, F.; Zhao, D. A Knowledge Transfer Method for Unsupervised Pose Keypoint Detection Based on Domain Adaptation and CAD Models. Adv. Intell. Syst. 2023, 5, 2200214. [Google Scholar] [CrossRef]
  80. Zhao, D.; Kong, F.; Du, F. Vision-based adaptive stereo measurement of pins on multi-type electrical connectors. Meas. Sci. Technol. 2019, 30, 105002. [Google Scholar] [CrossRef]
  81. Bergström, P.; Edlund, O. Robust registration of point sets using iteratively reweighted least squares. Comput. Optim. Appl. 2014, 58, 543–561. [Google Scholar] [CrossRef]
  82. Yang, S. A high-precision linear method for camera pose determination. In Proceedings of the 2010 IEEE International Conference on Mechatronics and Automation, Xi’an, China, 4–7 August 2010; pp. 595–600. [Google Scholar] [CrossRef]
  83. Leon, K.; Mery, D.; Pedreschi, F.; Leon, J. Color measurement in L* a* b* units from RGB digital images. Food Res. Int. 2006, 39, 1084–1091. [Google Scholar] [CrossRef]
  84. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57. [Google Scholar] [CrossRef]
  85. Deng, C.; Wang, B.; Lin, W.; Huang, G.; Zhao, B. Effective visual tracking by pairwise metric learning. Neurocomputing 2017, 261, 266–275. [Google Scholar] [CrossRef]
  86. Li, P.; Chen, B.; Wang, D.; Lu, H. Visual tracking by dynamic matching-classification network switching. Pattern Recognit. 2020, 107, 107419. [Google Scholar] [CrossRef]
  87. Tsin, Y.; Kanade, T. A correlation-based approach to robust point set registration. In Computer Vision-ECCV 2004, Proceedings of the 8th European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2004; pp. 558–569. [Google Scholar] [CrossRef]
  88. Myronenko, A.; Song, X. Point Set Registration: Coherent Point Drift. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 2262–2275. [Google Scholar] [CrossRef]
  89. Shi, X.; Zhang, S.; Cheng, M.; He, L.; Tang, X.; Cui, Z. Few-shot semantic segmentation for industrial defect recognition. Comput. Ind. 2023, 148, 103901. [Google Scholar] [CrossRef]
  90. Danzer, A.; Griebel, T.; Bach, M.; Dietmayer, K. 2D car detection in radar data with pointnets. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 61–66. [Google Scholar] [CrossRef]
  91. Zhao, Z.; Li, B.; Dong, R.; Zhao, P. A Surface Defect Detection Method Based on Positive Samples. In PRICAI 2018: Trends in Artificial Intelligence, Proceedings of the PRICAI 2018, Nanjing, China, 28–31 August 2018; Geng, X., Kang, B.H., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11013. [Google Scholar] [CrossRef]
  92. Fang, Y.; Zeng, T. Learning deep edge prior for image denoising. Comput. Vis. Image Underst. 2020, 200, 103044. [Google Scholar] [CrossRef]
  93. Park, S.; Bang, S.; Kim, H.; Kim, H. Patch-Based Crack Detection in Black Box Images Using Convolutional Neural Networks. J. Comput. Civ. Eng. 2019, 33, 04019017. [Google Scholar] [CrossRef]
  94. Tsai, D.M.; Wu, S.C.; Li, W.C. Defect detection of solar cells in electroluminescence images using Fourier image reconstruction. Sol. Energy Mater. Sol. Cells 2012, 99, 250–262. [Google Scholar] [CrossRef]
  95. Duan, J.; Liu, X.; Wu, X.; Mao, C. Detection and segmentation of iron ore green pellets in images using lightweight U-net deep learning network. Neural Comput. Appl. 2020, 32, 5775–5790. [Google Scholar] [CrossRef]
  96. Wang, J.; Bai, X.; You, X.; Liu, W.; Latecki, L.J. Shape Matching and Classification Using Height Functions. Pattern Recognit. Lett. 2012, 33, 134–143. [Google Scholar] [CrossRef]
  97. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef]
  98. Zhao, D.; Du, F. A novel approach for scale and rotation adaptive estimation based on time series alignment. Vis. Comput. 2020, 36, 175–189. [Google Scholar] [CrossRef]
  99. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  100. Ravi, N.; Reizenstein, J.; Novotny, D.; Gordon, T.; Lo, W.-Y.; Johnson, J.; Gkioxari, G. Accelerating 3D Deep Learning with PyTorch3D. arXiv 2020. [Google Scholar] [CrossRef]
Figure 1. The challenges that current vision-based inspection will face in future manufacturing models are already emerging in the appearance quality assurance of the complex product assembly in high-end equipment manufacturing industry.
Figure 2. Overview of the proposed common knowledge-driven adaptive inspection framework. * abnormal.
Figure 3. Example of constraint-embedding strategy for pattern connection.
Figure 4. Structured task representation model.
Figure 5. Multi-granularity Pattern Alignment Pipeline.
Figure 6. A complete joint optimization of objects with different granularities.
Figure 7. A Knowledge-based adaptive inspection scheme for perceived multi-size objects.
Figure 8. TESA knowledge improvement strategy. The marker × is used to indicate whether the observations in the corresponding feature space are complete.
Figure 9. System configuration and related datasets for each scenario.
Figure 10. Object ID and Criterion evolution: (a) connection, alignment, and data accumulation; (b) initialization of inspection system combined with DL module; and (c) a complete solution that emphasizes reliability.
Figure 11. Examples of the estimation performance of different groups. The red/blue boxes are used to visually display accuracy.
Figure 12. Inspection initialization, failure cases, and knowledge improvement: (a) mismatch caused by alignment deviation; (b) difference between the reference pattern and the actual product (red cable); (c) mismatch caused by local negligence in alignment; and (d) errors may be overwhelmed by other objects during alignment but are exposed in the inspection.
Figure 13. Failure cases of first contact with abnormal patterns. In the DL part, rectangular boxes of different colors represent the recognition results of different groups, and the corresponding relationships are detailed in Table 6. * abnormal.
Figure 14. Localization and detection performance of mobile vision robot system with or without alignment. The dark blue mark was used to describe pose deviation, undetected normal objects were marked with bright green, and red and bright yellow indicate recognized and unrecognized defects, respectively.
Figure 15. Examples of different Connection strategies and their final AR projection results.
Figure 16. Validation on electronic product: (a) comparison of pin extraction performance and skewed pin inspection strategy, where the yellow box indicates the feasible sampling region for the skewed pin; and (b) tracking and inspection effects after migration to portable imaging, where the red box represents the recognition results.
Table 1. The number of times an abnormal object can be recognized by the naked eye in the dataset.

| ID | EB2 | EA6 | ED49 | ED50 | ED13 | ED14 | EC11 | ED43 | ED44 | EB5 | ED25 | ED26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $D_2$ | 58 | 97 | 73 | 45 | 41 | 35 | 88 | 43 | 69 | 77 | 52 | 53 |
| $D_3$ | 41 | 72 | 36 | 53 | 31 | 30 | 61 | 35 | 48 | 61 | 41 | 39 |
Table 2. Framework configuration.

| Item | Configurations |
|---|---|
| Significant elements | ES0, ES1 |
| Feature design | Projected cuboid described by 8 + 1 2D points |
| Constraints | Auxiliary recognition task; approximate parallel constraint |
| Post processing | Weight adjustment [82] |
| A1 | CPD [88] |
| A2 | Edge-based [96] |
| A3 | Weighted-ICP [81] |
| A4 | Shape fitting algorithm set |
| B1 | Color segmentation, K-means |
| B2 | SLIC [97] |
| B4 | HOG, Canny, Sobel |
| C3 | [98] with multiple templates |
| C4 | SURF [99] with parallel constraints |
Table 3. Deployment and execution of detection experiments.

| Stage | Description | Datasets | Criterion | Indicators |
|---|---|---|---|---|
| A | Connection and alignment | PSC | 10-(a) | $\bar{e}_R$, $\bar{e}_t$, $\overline{IoU}_{EC}$, $\overline{IoU}_{ED}$ on PSC, PT |
| B | Inspection initialization; knowledge improvement; connection and alignment | PSC | 10-(b) | $\bar{e}_R$, $\bar{e}_t$, $\overline{IoU}_{EC}$, $\overline{IoU}_{ED}$, and $\bar{r}_n$ on PSC, PT |
| C | Connection, alignment and inspection | TO | 10-(b) | $\overline{IoU}_{EC}$, $\overline{IoU}_{ED}$ on TO; $\bar{r}$ on TO, PT |
| D | Knowledge improvement; connection, alignment and inspection | PSC, TO | 10-(c) | all $\bar{r}$ on ALL |
Table 4. Performance of Stage-A of the experiment.

| | Connection | Alignment | Group I | Group II | Group III | Group IV | Group V |
|---|---|---|---|---|---|---|---|
| $\bar{e}_R^{PSC}$ (°) | 2.678 | 0.817 | 3.357 | 1.594 | 1.613 | 2.181 | 3.895 |
| $\bar{e}_t^{PSC}$ (mm) | 142.196 | 35.304 | 163.623 | 64.725 | 62.014 | 97.864 | 165.074 |
| $\overline{IoU}_{EC}^{PSC}$ | 0.553 | 0.805 | 0.515 | 0.700 | 0.688 | 0.612 | 0.394 |
| $\overline{IoU}_{ED}^{PSC}$ | 0.374 | 0.643 | 0.336 | 0.469 | 0.460 | 0.405 | 0.177 |
| $\bar{e}_R^{PT}$ (°) | 2.771 | 1.342 | 3.429 | 1.767 | 1.838 | 2.475 | 4.023 |
| $\bar{e}_t^{PT}$ (mm) | 153.584 | 59.641 | 169.240 | 69.802 | 74.125 | 114.523 | 181.451 |
| $\overline{IoU}_{EC}^{PT}$ | 0.547 | 0.721 | 0.497 | 0.673 | 0.662 | 0.577 | 0.367 |
| $\overline{IoU}_{ED}^{PT}$ | 0.369 | 0.489 | 0.314 | 0.432 | 0.419 | 0.398 | 0.113 |
Table 5. Performance of Stage-B of the experiment.

| Dataset | Before | After Improvement |
|---|---|---|
| PSC | $\bar{r}_n^{PSC}$ = 0.778 | $\bar{e}_R^{PSC}$ = 0.517°, $\bar{e}_t^{PSC}$ = 27.304 mm, $\overline{IoU}_{EC}^{PSC}$ = 0.882, $\overline{IoU}_{ED}^{PSC}$ = 0.730, $\bar{r}_n^{PSC}$ = 0.894 |
| PT | $\bar{r}_{nr}^{PT}$ = 0.712, $\bar{r}_{nf}^{PT}$ = 0.571 | $\bar{e}_R^{PT}$ = 1.034°, $\bar{e}_t^{PT}$ = 52.308 mm, $\overline{IoU}_{EC}^{PT}$ = 0.755, $\overline{IoU}_{ED}^{PT}$ = 0.492, $\bar{r}_{nr}^{PT}$ = 0.845, $\bar{r}_{nf}^{PT}$ = 0.597 |
Table 6. State recognition performance of Stage-C of the experiment.

| Method | $\bar{r}_n^{TO}$ | $\bar{r}_m^{TO}$ (EA/B) | $\bar{r}_m^{TO}$ (ED) | $\bar{r}_w^{TO}$ | $\bar{r}_i^{TO}$ (EB) | $\bar{r}_i^{TO}$ (ED) | $\bar{r}_n^{PT}$ | $\bar{r}_m^{PT}$ (EA/B) | $\bar{r}_m^{PT}$ (ED) | $\bar{r}_w^{PT}$ | $\bar{r}_i^{PT}$ (EB) | $\bar{r}_i^{PT}$ (ED) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed (Green) | 0.856 | 0.671 | 0.736 | 0.585 | 0.104 | 0.244 | 0.845 | 0.425 | 0.642 | 0.538 | 0.049 | 0.270 |
| DL1 (Red) | 0.797 | 0.665 | 0.704 | 0.724 | 0.247 | 0.387 | 0.766 | 0.681 | 0.692 | 0.670 | 0.213 | 0.325 |
| DL2 (Blue) | 0.812 | 0.697 | 0.821 | 0.805 | 0.260 | 0.401 | 0.773 | 0.664 | 0.808 | 0.714 | 0.213 | 0.399 |
| Yolo7 (Cyan) | 0.914 | 0.742 | 0.805 | 0.862 | 0.026 | 0.069 | 0.901 | 0.735 | 0.783 | 0.835 | 0.016 | 0.074 |
Table 7. State recognition performance of Stage-D of the experiment.

| Method | $\bar{r}_n^{TO+PSC}$ | $\bar{r}_m^{TO}$ | $\bar{r}_w^{TO}$ | $\bar{r}_i^{TO}$ | $\bar{r}_n^{PT}$ | $\bar{r}_m^{PT}$ | $\bar{r}_w^{PT}$ | $\bar{r}_i^{PT}$ |
|---|---|---|---|---|---|---|---|---|
| Proposed+ | 0.981 | 0.984 | 0.967 | 0.908 | 0.932 | 0.936 | 0.956 | 0.884 |
| DL2+ | 0.917 | 0.930 | 0.927 | 0.786 | 0.825 | 0.910 | 0.945 | 0.741 |
| Yolo7+ | 0.943 | 0.962 | 0.959 | 0.721 | 0.938 | 0.944 | 0.967 | 0.683 |
Table 8. Reasons for modifying Criterion.

| Stage | Criterion Scheme | Object ID | Expert Experience |
|---|---|---|---|
| A | | | At first, there were only a few samples, and the framework built through experience and expertise served as the agent for data acquisition. |
| | (a)-Alignment | ES0/ES1/EBx/EAx | Extracting domain-independent geometric information in the case of a small number of samples and no annotation. |
| | | EA3/EA18/EA16 | As the significant feature, the color region can reduce the amount of information received by A2. |
| | | ECx | The small silver-gray plug has low information volume and consistent texture. Its simple shape allows us to pre-build a shape library. |
| | | EDx | The materials of standard parts are uniform. The small size of the fastener makes its features clustered and its pattern single. |
| B, C | | | After knowledge improvement, the state recognition pipeline is built according to the collected samples, and the general task model and various pattern libraries are enabled. |
| | (b)-Alignment | ES0/ES1/EBx | A1 can be maintained under the constraint of Connection. |
| | | ECx | It is found that B2 tends to gather the plug and cable (bracket) together, so B4 is adopted to extract the gradient, and A2's shape library is enriched. |
| | | EDx | A4 is added to prevent the shadow part from being miscalculated as part of the shape, which would cause centroid offset. |
| | (b)-Detection | EA0/EA1/…/EA14 | D5 is used to mitigate noise. Another purpose of A2 besides matching is to enrich the shape library. |
| | | EA5/EB3 | The texture is complex but regular, which makes geometric matching prone to bias, but analysis in gradient mode benefits from it. |
| | | EA8/EA12/EB1/EB2 | Significant geometric features (box structure). |
| | | EA3/EA16/EA18 | Significant visual features (color). |
| | | EA10/EA15/EA17 | Significant geometric features (quasi-circular structure). |
| | | EB4/EB5 | The pattern change under visible conditions is single, so C3 is configured. |
| | | ECx | C3 is added to prevent some missed detections. |
| | | EDx | A2 replaces the A4 used in alignment to determine the position of the object, and C3 is assembled to assist in determining whether the object exists. |
| D | (c)-Alignment | EBx | By observing the Alignment of (b), it is found that a defect of EBx causes large matching deviations, so C4 is configured to limit the search of A1. |
| | | EAx/ECx | D2 is introduced to provide observation of the semantic space and prevent local matching errors. |
| | (c)-Detection | EA0/EA1/…/EA14 | Configure D3 to enrich the observation space. |
| | | EA5/EB3 | Configure A1 to add position sensitivity. |
| | | EA12 | Due to its special shape and invisibility (based on experience), recognition can be completed by C3 alone. |
| | | EB4/EB5 | Although significant shapes are easy to fit, in practice matching errors occur easily under the influence of cables and brackets, so D2 and D5 are added. |
| | | EB1/EB2 | Configure D3 to enrich the observation space. |
Table 9. Performance on a planning-based mobile vision robot system.

| | | $D_4^1$ | $D_4^2$ | $D_4^3$ | $D_4^4$ | $D_4^5$ |
|---|---|---|---|---|---|---|
| Potential deviation ($\overline{IoU}_{EC}$, $\overline{IoU}_{ED}$) | | (0.623, 0.403) | (0.636, 0.407) | (0.579, 0.364) | (0.659, 0.411) | (0.655, 0.412) |
| Visible number | Normal | 54 | 60 | 61 | 50 | 49 |
| | Defects | 4 | 8 | 7 | 7 | 6 |
| Group I | Normal | 42 | 44 | 35 | 37 | 36 |
| | Defects | 3 | 3 | 6 | 3 | 5 |
| Group II | Alignment | (0.907, 0.812) | (0.918, 0.808) | (0.911, 0.815) | (0.933, 0.820) | (0.929, 0.812) |
| | Normal | 52 | 58 | 58 | 50 | 47 |
| | Defects | 4 | 8 | 7 | 6 | 6 |
Table 10. Validation on actual AR-assisted assembly cases.

| | $\bar{e}_R^{D_5}$ (°) | $\bar{e}_t^{D_5}$ (mm) | $\overline{IoU}_{EB}^{D_5}$ | $\overline{IoU}_{EC}^{D_5}$ |
|---|---|---|---|---|
| Proposed_(a) D5 + C3 | 0.986 | 43.423 | 0.874 | 0.561 |
| Proposed_(a) D0 + EPnP | 1.021 | 52.786 | 0.833 | 0.515 |
| Proposed_(b) | 0.753 | 37.665 | 0.927 | 0.722 |
Table 11. Performance comparison on the pin-tip extraction of multi-type electrical connectors.

| | $\overline{IoU_1}_{EC}^{D_6}$ | $\overline{IoU_2}_{EC}^{D_6}$ | $\overline{IoU_1}_{EC}^{D_7}$ | $\overline{IoU_2}_{EC}^{D_7}$ | $\overline{IoU_e}_{EC}^{D_7}$ |
|---|---|---|---|---|---|
| MTCI1 | 0.625 | 0.854 | 0.691 | 0.837 | 0.219 |
| MTCI2 | 0.723 | 0.878 | 0.648 | 0.844 | 0.286 |
| Proposed | 0.705 | 0.963 | 0.722 | 0.955 | 0.425 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
