3.1. General Overview
The detection of laminated panels can be framed as a multi-task learning problem in which classification and bounding-box regression are optimized together. This joint formulation helps to accurately identify and locate objects, as is common in high-performing single-stage detectors. Specifically, for an input image of size H × W × C, the single-stage detector is trained and performs inference to produce a classification score of dimension N × 1, a confidence score of dimension N × 1, and position coordinates of dimension N × 2, where N denotes the total number of predicted boxes.
As shown in Figure 2, the whole detection pipeline consists of five parts: a backbone network with ADown and RepNCSPELAN4 modules for efficient feature extraction; a feature-processing module with MRPE; YOLONeck for further feature processing; a bottleneck with a resolution-extended network (REN) that performs high-resolution enhancement and stitching of the extracted features; and a detection head with six independent branches that outputs the final results. The feature maps at all scales share the same detection head to increase efficiency. The shared detection head has two components: a regression branch and a classification branch. Like other popular single-stage detectors, YOLOv9's loss consists of three components: bounding-box regression loss, confidence loss, and category classification loss. Assuming that the loss function of YOLOv9 is
$L$, we can express it as a weighted sum of the three main components:

$L = \lambda_{box} L_{box} + \lambda_{conf} L_{conf} + \lambda_{cls} L_{cls}$,   (1)

where $L_{box}$ is the bounding-box regression loss (in our case, the IoU loss is used); $L_{conf}$ is the confidence loss, to which the binary cross-entropy loss is applied; and $L_{cls}$ is the category classification loss. Specifically, BBox represents the bounding box of the target: in object detection, the model must predict the position of each target, usually represented by a rectangular box defined by four coordinates, and it predicts these values through the regression task. Conf is the confidence score, which represents the confidence that the predicted bounding box contains a target. It reflects how certain the model is that the current prediction box contains an object; the value usually lies between 0 and 1, where 1 indicates that the model is very certain a target is present within the box and 0 indicates that it is almost certain there is none, and it is usually computed from the probability of the target's existence. Cls is the class score, which represents the probability that each predicted box belongs to each category; for multi-class detection, the model outputs a category score vector for each bounding box. Since there are eight categories of objects to be distinguished, the multi-class cross-entropy loss is utilized. In summary, the bounding-box regression loss (BBox loss) measures how accurately the model predicts the location and size of the object by comparing the predicted bounding-box coordinates with the ground truth; the confidence loss (Conf loss), measured with binary cross-entropy, reflects the model's ability to predict whether a given region contains an object; and the classification loss (Cls loss), computed with multi-class cross-entropy, evaluates how well the model assigns the correct category to each detected object. The coefficients $\lambda_{box}$, $\lambda_{conf}$, and $\lambda_{cls}$ balance the contribution of each loss term in the total loss function and can be adjusted according to the needs of the task. In our case, they are set to 5, 1, and 1, respectively; the higher weight on the bounding-box regression loss emphasizes the importance of precise localization. This balancing is essential for good detection performance, as it ensures that the model attends adequately to all three aspects: localization, detection confidence, and classification.
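For illustration, this weighted combination can be sketched in a few lines of PyTorch. The weights (5, 1, 1) follow the setting described above, but the loss terms and their inputs are simplified stand-ins, not the exact YOLOv9 implementation:

import torch.nn.functional as F

# Loss weights for bounding-box regression, confidence, and classification.
LAMBDA_BOX, LAMBDA_CONF, LAMBDA_CLS = 5.0, 1.0, 1.0

def total_loss(pred_iou, conf_logits, conf_target, cls_logits, cls_target):
    box_loss = (1.0 - pred_iou).mean()                                         # IoU loss: 1 - IoU per matched box
    conf_loss = F.binary_cross_entropy_with_logits(conf_logits, conf_target)   # objectness, binary cross-entropy
    cls_loss = F.cross_entropy(cls_logits, cls_target)                         # 8-way classification, cross-entropy
    return LAMBDA_BOX * box_loss + LAMBDA_CONF * conf_loss + LAMBDA_CLS * cls_loss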
In the original YOLOv9 architecture, the classification branch and the regression branch share the same structure. Specifically, YOLOv9 utilizes six interdependent detection branches that work together to produce the final detection results. In our approach, we retain most of the modules within these six detection branches, keeping them consistent with the primary YOLOv9 architecture. However, to enhance detection efficiency, especially in scenarios involving both large and extremely small objects, we introduce a novel framework called the Multi-Class Perceptual Prior-Enhanced Network (MRPE). This framework is designed to improve the detector's ability to perceive and differentiate between categories by incorporating prior information. Additionally, MRPE enhances the features extracted from images, thereby improving overall detection accuracy. We replace some of the ADown modules in YOLOv9 with the MRPE module; while functional, these ADown modules have been found to negatively impact execution speed in industrial settings, making them less suitable for high-performance applications.
Furthermore, using high-resolution input images has been shown to improve detection in specialized datasets, particularly for identifying small-scale objects. However, directly feeding higher-resolution images into the detection process significantly increases the computational load, which can degrade the overall performance of the detection algorithm. To address this challenge, we propose a resolution-extended network (REN), which integrates a super-resolution approach into the feature-processing pipeline of the object detection model while adding minimal computational overhead. In our implementation, we replace all of the upsampling modules within the YOLOv9 network with REN. The original YOLOv9 upsampling relied on a nearest-neighbor interpolation strategy, which fails to consider contextual relationships within the image. In contrast, our REN module preserves these contextual relationships, leading to better feature resolution and, consequently, more accurate detection results. Detailed explanations of the MRPE and REN modules are provided in Section 3.2 and Section 3.3, respectively.
3.2. Multi-Class Perceptual Prior Enhanced Networks
In inspection tasks, the classification subtask focuses on finding a way to distinguish between the category being inspected and other background elements. For example, in a laminated panel inspection project, this means identifying categories like reinforcement bars, outer reinforcement bars, and other backgrounds such as the panel boundary. The main challenges here are dealing with dense occlusions and the small size of objects. Additionally, in traditional object detection tasks, features from different locations and scales do not equally contribute to the final result. This, along with the fact that much of the regular pattern information in a specific scene is often overlooked, leads to the network not fully utilizing important features.
These issues lead to varying levels of false detections, which can become more pronounced when using single-stage object detectors, because single-stage detectors omit the candidate-proposal generation step and therefore tend to produce less accurate results. To tackle these challenges, a common strategy is to use fine-tuned attention modules, which aim to enhance the use of global information by adjusting the focus of the neurons' receptive fields.
However, this solution is not always effective for laminated panel detection tasks. The performance of an attention mechanism must be validated through extensive experiments with different models and datasets, and in practice, mainstream attention-based methods have not shown significant improvements for tasks like laminated panel detection. Specifically, conventional attention modules have difficulty managing the large pixel-scale gap between categories such as long rebar and line, and in some cases these categories can even conflict with each other. As a result, even when attention is incorporated, the detector often remains focused on localized information. This limitation can lead to false detections of objects with similar features, reducing the overall accuracy and performance of the detector. The persistent reliance on localized information means that important contextual features are not fully utilized, further impairing the detector's ability to distinguish between categories in complex scenarios.
Considering the above problems, we propose exploring more effective attention modules to guide the model towards better localization. Inspired by the concept of scene graph enhancement used in recurrent neural networks and the idea of utilizing prior information, we aim to improve feature representation by creating and aligning a scene graph with the training data. By integrating prior information and attention mechanisms, we enhance the model’s ability to capture and utilize key features.
Our approach involves generating prior information specific to the laminated panel scenario to assist in feature extraction, as illustrated in Figure 3. The architecture includes a main feature-processing module and an auxiliary branch. In the main branch, the original feature map is compressed to a one-dimensional vector using a fully connected (FC) layer; fully connected layers are widely used in deep learning models, especially in classification tasks, and they linearly transform the input features to achieve feature fusion and dimensionality reduction. A pixelwise product (element-wise multiplication between tensors) is then used to refine the feature maps and focus attention on crucial image areas. The scale of these operations depends on the input image size, which in this case is 640 × 640 pixels, and the resulting predictions are made at this resolution. An activation function is then applied to provide a more flexible representation of the features.
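The main branch can be read as a squeeze-and-excitation-style channel reweighting. The sketch below is a minimal, hypothetical PyTorch rendering of that idea (module and parameter names are ours, not the paper's): the feature map is compressed by pooling and an FC layer, passed through an activation, and the resulting weights rescale the original features via a pixelwise product.

import torch
import torch.nn as nn

class MainBranch(nn.Module):
    # Channel-reweighting sketch of the MRPE main branch (illustrative only).
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # compress H x W to 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),                                   # excitation-style activation
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # pixelwise product with the input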
In the auxiliary branch, we generate prior information to reflect the existing scene graph. This involves two types of prior constraints applied during training to improve detection accuracy. The first type aligns prior information with ground truth (GT) features, representing different categories as nodes in the scene graph. The second type represents relationships between objects in 2D space as edges in the scene graph. The scene graph is then fed into the Feature Aggregation module for further processing. This process results in enhanced feature representations that improve overall detection performance.
Without the aid of pre-generated candidate boxes, one-stage object detectors may be “less confident” in identifying similar features, especially in scenes with a single color and similar textures. Moreover, when learning to annotate the features of a region, a single-stage detector is limited to local receptive fields and ignores the relationships between contexts, which leads to a certain degree of false detection. We therefore design an attention module that focuses on channel information and introduces spatial information from the scene graph, aiming to allow the model to attend fully to spatial feature relationships while also capturing the differences and similarities in information across channels.
Specifically, we first change the format of the feature map through a tensor transformation while retaining all of the feature information of the image; we then perform an activation transformation and a linear mapping on the feature map to bring in spatial feature information, and we subsequently use the obtained results as weights for feature aggregation with the original features, so that the model learns coarse but information-rich features. On this basis, we reuse the above modules with tighter parameter constraints to give the model a greater opportunity to refine the features. Over the course of training, the learned features become progressively more refined, and object features with large scale differences become easier to distinguish. This strategy allows the model to maintain a favorable feature-update direction, so that objects with similar features but belonging to different classes can be regressed well, because the resulting bias is eliminated in the next, more stringent refinement layer. We name this object feature-processing module, which applies prior information and is category-sensitive, the Multi-Class Perceptual Prior-Enhanced Network (MRPE).
The MRPE block adaptively recalibrates the channel feature responses, allowing the model to focus more on the relationship between local and global features; here, recalibration refers to adjusting the feature-map responses at different resolutions so that the model prioritizes important regions of the input image. By modeling the interdependencies between channels, the MRPE block enhances the CNN representation, and it uses multiple activation and normalization operations so that the model learns similar features in the laminated panel detection scenario and generalizes better. We also invoke the scene-graph-generation branch to explore the impact of the module's prior-information learning on model accuracy. More specifically, recalibration in this context involves fine-tuning predicted bounding boxes, class probabilities, and confidence scores; modeling refers to developing and training the model, selecting appropriate architectures, and validating performance; normalization ensures that input and intermediate data are scaled to a consistent range, enhancing training stability and convergence; the detection scenario defines the real-world conditions, such as lighting, object types, and environmental challenges, that influence model design; and prior-information learning allows the model to leverage previously learned knowledge, enhancing its generalization and accuracy, especially when data are limited or conditions are complex. Together, these processes make the model more accurate, stable, and adaptable to diverse real-world situations. We implement our MRPE by merging the attention branch and attaching the scene graph branch.
To make this process easier to understand: the MRPE block is designed to improve the model's ability to focus on relevant features by adaptively recalibrating the channel feature responses, i.e., by adjusting how the network weighs and processes the different channels (features) in the CNN (convolutional neural network). The MRPE block enables the model to focus on both local and global feature relationships, which is essential for tasks that require detailed, context-aware analysis, such as laminated panel detection.
By modeling the interdependencies between channels, the MRPE block enhances the CNN’s ability to represent more complex features. It achieves this by utilizing multiple activation and normalization operations, which help the model learn similar features more effectively, improving its generalization ability. This is crucial in real-world detection scenarios where variations across data are common, allowing the model to better adapt to new, unseen data.
Furthermore, the MRPE block includes a branch dedicated to generating a scene graph, which represents the relationships and context between objects in a scene. By exploring the impact of prior information (knowledge learned from previous tasks or data), the scene graph branch enhances the model's decision-making and improves accuracy. The final implementation of the MRPE merges the attention mechanism (which focuses the model's attention on relevant features) with the scene graph branch, allowing the model to learn not only about individual features but also about how these features relate to one another in a broader context, thus improving overall model performance. Specifically, we design three restriction rules as the prior information, as shown in Figure 3. We derived these rules statistically and analytically from two pieces of prior knowledge: laminated panels are mass-produced from a uniform mold, and the spacing between the outer reinforcement bars and the intersection points of the laminated panels is constant. Concretely, most categories always lie in the interior of the bounding box, and neighboring outer reinforcement bars and lines always have the same spacing between them. To this end, an accurate, task-specific structure must be designed to encode this prior information. Therefore, a simple and effective method is introduced: the prior information is digitized and normalized into the entries of a matrix. During the update phase, the prior information is regarded as the edges of the scene graph, while objects are treated as its nodes.
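As a rough illustration of how such priors could be digitized, the sketch below turns the constant-spacing rule into a normalized, adjacency-style matrix over detected objects; the category names, the expected spacing, and the scoring rule are hypothetical choices of ours, not the exact constraints used here.

import numpy as np

# Hypothetical scene-graph nodes: (category, center_x, center_y) in normalized image units.
nodes = [("outer_rebar", 0.10, 0.50), ("outer_rebar", 0.30, 0.50), ("line", 0.20, 0.50)]
EXPECTED_SPACING = 0.20   # assumed constant-spacing prior
TOLERANCE = 0.05

def prior_adjacency(nodes):
    # Encode the spacing prior as edge weights between scene-graph nodes.
    n = len(nodes)
    A = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            dist = np.hypot(nodes[u][1] - nodes[v][1], nodes[u][2] - nodes[v][2])
            # Strong edge when the observed spacing matches the expected spacing.
            A[u, v] = max(0.0, 1.0 - abs(dist - EXPECTED_SPACING) / TOLERANCE)
    # Row-normalize so the matrix can be used directly for message passing.
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

A = prior_adjacency(nodes)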
We first define a basic block, which consists of two parts:

$Y = F \odot H$,   (2)

where $F$ represents the feature map generated by the main branch of MRPE and $H$ is the output of the auxiliary branch; the block output $Y$ is obtained through a simple dot product of the two feature maps. $F$ can be formulated as:

$F = X \odot \delta(\mathrm{FC}(X))$,   (3)

where $X$ represents the original features and $\mathrm{FC}(\cdot)$ represents the fully connected layer. A fully connected (FC) layer is a layer in which each input is connected to every output; it computes a weighted sum of its inputs followed by an activation function and is typically used in the final stages of a neural network to make decisions or classifications based on the learned features. $\delta(\cdot)$ represents the excitation function designed for this module, which implements the learning of nonlinear relations in the network by improving on the SiLU activation function [42]. The result of the main branch, $F$, is generated through these steps. In the meantime, another process is carried out in the auxiliary branch, with the following formulation:

$H = P(X, A)$,   (4)

where $X$ is the initial feature map, $P(\cdot)$ is the prior enhancement function, which is realized with the scene graph, and $A$ is the adjusted adjacency matrix in the GNN [43]. The inner function of the GNN module can be formulated as the following equations:

$A = \{ a_{uv} \mid u, v \in \mathcal{V} \}$,   (5)

$\alpha_{uv} = \mathrm{softmax}\big(\mathrm{SiLU}(W[h_u, h_v])\big)$,   (6)

$m_v = \sum_{u \in \mathcal{N}(v)} \alpha_{uv} h_u$,   (7)

$h_v' = \mathrm{ReLU}\big(W_o[h_v, m_v] + b\big)$.   (8)
Equations (5)–(8) represent the matrix generation, weight generation, message calculation, and feature update modules within the GNN network, respectively. In Equation (5), the adjusted adjacency matrix $A$ is defined, which encodes the relationships between nodes in the scene graph. This matrix $A$ contains elements $a_{uv}$ that represent the similarity or dependency between nodes $u$ and $v$, and it is used for information propagation in the subsequent GNN steps. In Equation (6), the attention score $\alpha_{uv}$ between nodes $u$ and $v$ is computed. First, the feature vectors $h_u$ and $h_v$ of the two nodes are combined using a weighted operation with the weight matrix $W$. This combined result is then passed through a non-linear activation function, such as SiLU, to obtain a transformed feature representation, and the softmax function is applied to normalize the attention scores over a node's neighbors, indicating how much importance should be given to each neighboring node during information propagation. Next, Equation (7) calculates the message $m_v$ for node $v$ by aggregating information from its neighbors: the attention scores $\alpha_{uv}$ computed in the previous step are used to weight the feature vectors of the neighboring nodes $u$, and these weighted feature vectors are summed to produce the message $m_v$ that node $v$ receives from its neighbors, thus aggregating the information relevant to the update. Finally, in Equation (8), the updated feature vector $h_v'$ for node $v$ is computed by combining its original feature $h_v$ with the aggregated message $m_v$. This combined feature vector is then passed through a linear transformation using the weight matrix $W_o$ and the bias term $b$, followed by an activation function such as ReLU. The result is an updated feature vector that captures both the local features of the node and the global context provided by its neighbors, much like a forward propagation step in a traditional neural network.
Together, these steps—matrix generation, attention-based weight calculation, message aggregation, and feature updating—enable the GNN to effectively propagate information across nodes and update their features based on both local and global context, which enhances the model’s performance in tasks such as detection and classification.
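The message-passing steps of Equations (6)–(8) can be sketched as follows; this is a minimal illustration under our own naming and concatenation choices, not the exact GNN used here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MessagePassingLayer(nn.Module):
    # Attention-weighted message passing sketch of Eqs. (6)-(8) (illustrative only).
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(2 * dim, 1)      # W: scores concatenated pairs [h_u, h_v]
        self.update = nn.Linear(2 * dim, dim)  # W_o and bias b: feature update

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, dim) node features; adj: (N, N) adjacency from the scene-graph prior.
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.silu(self.attn(pairs)).squeeze(-1)           # Eq. (6), before softmax
        scores = scores.masked_fill(adj == 0, float('-inf'))    # restrict to graph edges
        alpha = torch.nan_to_num(torch.softmax(scores, dim=0))  # normalize over sender nodes u
        m = alpha.transpose(0, 1) @ h                           # Eq. (7): m_v = sum_u alpha_uv * h_u
        return F.relu(self.update(torch.cat([h, m], dim=-1)))   # Eq. (8): updated node features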
3.3. Resolution-Extended Network
The MRPE aims to enhance feature updates by leveraging both channel attention and spatial information from scene graphs to reduce false detections across different objects. This method focuses the model's attention on specific regions of the image, which helps minimize incorrect detections. However, the approach has some drawbacks. In particular, objects that are small or have unusual shapes might not be well represented: because the image is often resized to a fixed dimension to manage computational costs, some features can become elongated or distorted, and this distortion particularly affects objects with small scales or extreme aspect ratios. In the context of laminated panel scenes, categories like rebar and outer rebar face significant challenges because their features can become too compressed to be accurately detected.
To address this issue, we propose the resolution-extended network (REN), which integrates high-resolution feature information at various scales into the original feature map to produce a more accurate image representation. The REN leverages features that have been fused by the MRPE and ELAN [44] modules as input and replaces the upsampling module in YOLOv9 to enhance upsampling performance.
The MRPE module captures instance-specific features in targeted regions, while the ELAN module preserves and integrates feature processing across different stages, enabling seamless collaboration among modules. This approach ensures that architectures designed for various detection tasks contribute rich feature information from diverse perspectives. Although some information loss is unavoidable, it is mitigated by the high-resolution upsampling provided by the REN.
Figure 4 illustrates the REN's complete pipeline, showcasing its operation across three different resolution scales. In Figure 4, the reducer downsamples feature maps using a small kernel (e.g., 2 × 2 or 3 × 3) with a stride of 2, reducing the spatial dimensions (e.g., from 640 × 640 to 320 × 320 pixels) while retaining essential information. The resolution adapter adjusts the image or feature-map resolution using interpolation (e.g., bilinear or nearest-neighbor) to match the model's input size, such as resizing from 2688 × 1792 to 640 × 640 pixels. The shuffler enhances feature learning by permuting channels to improve feature diversity and generalization, especially in deeper network layers, helping the model capture complex patterns.
To reduce computational complexity, the original feature map is first scaled by the encoder to a size of $H \times W \times \alpha C$, where $\alpha$ represents the reduction factor; in our experiments it is fixed at 0.5. Instead of directly reshuffling all pixels into a super-resolution tensor, which would disregard the alignment between the initial and input image sizes, we utilize a resolution adaptor. This adaptor generates three intermediate feature maps by expanding the feature map in the horizontal and vertical directions: one with a size of $H \times 2W$ (horizontal expansion), another with $2H \times W$ (vertical expansion), and a third with $2H \times 2W$ (simultaneous horizontal and vertical expansion). In the detection scenario of composite panels, the objects to be detected, such as rebar and external rebar, are oriented either horizontally or vertically. Performing feature stretching in both directions allows the model to learn the key features of these categories more easily. The adaptor can therefore provide additional information in the horizontal or vertical direction to the detector before the low-resolution features are transformed into high-resolution features. These intermediate features are then processed by a pixel muxer, which combines them at different levels of resolution to form a final feature map of size $sH \times sW \times C$. This approach enables the model to capture multi-scale information effectively and ensures that the alignment between the initial and final image sizes is preserved. In the final step, the upsampled feature map is pixel-wise accumulated with the original upsampled map, producing the final output. This pixel-level accumulation not only improves the spatial accuracy of the feature map but also maintains computational efficiency. To be more specific, in Figure 4b, $\alpha$ represents the coefficient in the encoder, which is used to reduce the number of original channels and thus the computational complexity, and $s$ represents the upsampling coefficient, indicating the factor by which the final upsampled feature map is magnified.
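As a rough sketch of the adaptor idea (the symbol alpha, the 1 × 1 convolution used for channel reduction, and the use of interpolation for the directional expansions are our assumptions, not the exact operators used here), the following code produces the three expanded intermediate maps from a channel-reduced feature.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResolutionAdaptor(nn.Module):
    # Sketch: channel reduction followed by three directional expansions (illustrative only).
    def __init__(self, channels: int, alpha: float = 0.5):
        super().__init__()
        self.encoder = nn.Conv2d(channels, int(channels * alpha), kernel_size=1)

    def forward(self, x: torch.Tensor):
        x = self.encoder(x)                                                   # H x W, alpha*C channels
        h, w = x.shape[-2:]
        horizontal = F.interpolate(x, size=(h, 2 * w), mode='nearest')        # H x 2W
        vertical = F.interpolate(x, size=(2 * h, w), mode='nearest')          # 2H x W
        both = F.interpolate(x, size=(2 * h, 2 * w), mode='nearest')          # 2H x 2W
        return horizontal, vertical, both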
The conventional upsampling module is implemented by simply multiplying the width and height by a fixed factor, which can be formulated as:

$X_{up} = f_{up}(X), \quad X \in \mathbb{R}^{H \times W \times C}, \quad X_{up} \in \mathbb{R}^{sH \times sW \times C}$.   (9)

Given the original feature $X$, the width and height of the image are multiplied by a fixed factor to achieve upsampling while maintaining the same number of channels. However, a large amount of feature information is not fully utilized in this process. For this reason, we design a resolution-extended module as shown in Figure 4b: its output $F_{up}$ can be formulated as the pixel-by-pixel dot product of two super-resolution feature maps:

$F_{up} = F_{nn} \odot F_{sr}$,   (10)

where $F_{nn}$ and $F_{sr}$ represent the feature maps obtained in different ways. Specifically, $F_{nn}$ is obtained with the same upsampling method used by most detectors, in which each element is assigned weights from its neighboring elements and a simple scalar multiplication based on these weights is performed to achieve upsampling.
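For reference, this conventional nearest-neighbor upsampling is a one-liner in PyTorch (the scale factor of 2 matches the expansion factor assumed throughout this section):

import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 40, 40)                            # example feature map (N, C, H, W)
f_nn = F.interpolate(x, scale_factor=2, mode='nearest')   # -> (1, 64, 80, 80), channels unchanged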
We have designed a complete process for the generation of the feature $F_{sr}$ that allows it to fully exploit the image information while maintaining high performance. Broadly speaking, after the original feature $X$ is input into the encoder, three tensor transformations with different specifications are performed; two of the resulting tensors are selected for pixel extension and scale renormalization, and these two tensors are then multiplied pixel by pixel to generate the feature $F_{sr}$, which is used in the subsequent processing. This can be formulated as:

$F_{sr} = \mathcal{R}\big(\mathrm{PS}(T_1)\big) \odot \mathcal{R}\big(\mathrm{PS}(T_2)\big)$.   (12)
In Equation (12), $\mathrm{PS}(\cdot)$ represents the pixel-shuffle operation from the field of image super-resolution, which breaks up the feature-map pixels of $T_i$ and tiles them regularly into a one-dimensional tensor to enrich the spatial-domain information of the image; $\mathcal{R}(\cdot)$ recombines the results of $\mathrm{PS}(\cdot)$ into a specific size to meet the upsampling requirements of the algorithm module; and $T_1$ and $T_2$ represent two different tensor transformation results, which are generated as in Equation (13):

$T_i = \mathcal{T}_{m_i}\big(\mathrm{Red}(X)\big), \quad m_i \in \{(2H, W), (H, 2W), (2H, 2W)\}$.   (13)
Finally, $H$, $W$, and $C$ represent the height, width, and number of channels of the feature map, respectively, and $m_i$ stands for the mode of the primary tensor transformation $\mathcal{T}$; the mode can be interpreted as multiplying the height by 2, multiplying the width by 2, or multiplying both the height and the width by 2. Thus, there are three cases for the generation of $T_i$, and we ablate these three modes in the subsequent experiments. In addition, $\mathrm{Red}(\cdot)$ denotes the dimensionality reduction of the original feature, i.e., reducing the number of channels of the feature input into the REN in order to reduce the amount of computation. Once $F_{sr}$ is generated, the features $F_{nn}$ and $F_{sr}$ are combined by a pixel-by-pixel dot product to obtain the new feature $F_{up}$, which replaces the traditionally upsampled feature and is provided to the model for subsequent processing.
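To make the whole pipeline concrete, here is a rough PyTorch sketch of the F_sr branch and its combination with the conventional path; the channel-reduction ratio, the one-dimensional shuffles, and the 1 × 1 convolutions standing in for the scale renormalization are our own assumptions for illustration, not the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def shuffle_width(x):
    # 1D pixel shuffle along the width: (N, 2c, H, W) -> (N, c, H, 2W).
    n, c2, h, w = x.shape
    return x.view(n, c2 // 2, 2, h, w).permute(0, 1, 3, 4, 2).reshape(n, c2 // 2, h, 2 * w)

def shuffle_height(x):
    # 1D pixel shuffle along the height: (N, 2c, H, W) -> (N, c, 2H, W).
    n, c2, h, w = x.shape
    return x.view(n, c2 // 2, 2, h, w).permute(0, 1, 3, 2, 4).reshape(n, c2 // 2, 2 * h, w)

class RENSketch(nn.Module):
    # Illustrative REN-style upsampler: F_up = F_nn * F_sr (not the exact module).
    def __init__(self, channels: int, alpha: float = 0.5):
        super().__init__()
        mid = int(channels * alpha)                    # Red(.): channel reduction; assumed even
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)
        # R(.): recombine each shuffled tensor to the target channel count.
        self.renorm1 = nn.Conv2d(mid // 2, channels, kernel_size=1)
        self.renorm2 = nn.Conv2d(mid // 2, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        f_nn = F.interpolate(x, scale_factor=2, mode='nearest')              # conventional path
        t = self.reduce(x)
        t1, t2 = shuffle_width(t), shuffle_height(t)                         # T_1: H x 2W, T_2: 2H x W
        s1 = self.renorm1(F.interpolate(t1, size=(2 * h, 2 * w), mode='nearest'))
        s2 = self.renorm2(F.interpolate(t2, size=(2 * h, 2 * w), mode='nearest'))
        f_sr = s1 * s2                                                       # Eq. (12): pixelwise product
        return f_nn * f_sr                                                   # Eq. (10): REN output

ren = RENSketch(channels=64)
out = ren(torch.randn(1, 64, 40, 40))   # upsamples 40 x 40 features to 80 x 80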