1. Introduction
Fabric defect detection is a crucial step in the quality control process in textile production, aiming to identify and locate defects occurring in fabrics [1,2]. Through fabric defect detection, defects can be detected and repaired early, effectively improving the overall quality of the fabric, reducing resource waste and lowering labor costs [3,4]. Surveys indicate that fabric defects cause major losses to companies: if a piece of fabric has defects, its value decreases by approximately 45–65% [5]. At the same time, fabric defects lead to downtime, rework and repair in the production process, thereby increasing production time and reducing production efficiency. Companies may also have to invest more human, material and financial resources to correct these deficiencies, resulting in higher production costs. In addition, hidden factors such as reduced product competitiveness and increased production risks due to fabric defects also negatively impact companies' profitability. Hence, it is of utmost urgency and significance to put forward a method for detecting fabric defects automatically that is both highly efficient and precise [6].
Previous research on fabric defect detection falls into two main categories: traditional methods and computer vision methods. Traditional methods comprise three main approaches: visual detection [7], optical detection [8], and manual detection [9]. For example, early researchers identified fabric defects using image processing and analysis techniques [10]. Researchers can also use optical instruments to detect defects in textiles, for instance by illuminating fabrics with infrared, ultraviolet and other specific wavelengths of light. Manual detection involves examining textiles through visual observation and tactile assessment. However, these traditional methods have several limitations. Fabric image processing techniques are often designed and optimized for specific defect types or image conditions, which can result in poor performance under different defect types or changing conditions. Optical detection methods impose strict requirements on the light source, temperature, humidity and vibration, which increases the complexity and cost of detection. In manual inspection, inexperienced inspectors face problems such as low detection speed (only 12 m per minute) and low accuracy. Furthermore, inspector fatigue reduces inspection efficiency and quality [11,12] and significantly lowers overall production efficiency. Therefore, computer vision methods that enable timely and automated detection of fabric defects are widely used [13].
Research in computer vision has focused on two main directions: traditional machine learning methods on the one hand and deep learning methods on the other. Traditional machine learning encompasses a variety of techniques, including the Gray Level Co-Occurrence Matrix (GLCM), Local Binary Pattern (LBP), Support Vector Machine (SVM) and Artificial Neural Network (ANN) [14,15,16]. Li et al. [17] developed a pattern-free fabric defect detection method based on GLCM, which adapts to different textile image features and has lower computational complexity. Ghosh et al. [18] developed a pattern defect detection system that identifies various types of fabric defects using a multi-class SVM algorithm, achieving 98% testing accuracy and exceptional computational efficiency compared to other machine learning methods. Pourkaramdel et al. [19] introduced a new LBP variant, the completed quadruple pattern, to extract image features of fabric defects, achieving a detection rate of 97.66%. Anami et al. [20] used two different classifiers, SVM and ANN, to classify fabric images with and without defects and achieved accuracies of 94% and 86.5%, respectively, with SVM performing better. However, traditional machine learning methods rely on manually designed features, perform poorly on new types of defects, and impose high computational demands, limiting their generalization abilities. To overcome these problems, deep learning methods provide a more robust and automated way to handle fabric defect detection tasks [21,22].
There are two kinds of deep learning methods: two-stage and single-stage. Two-stage deep learning methods mainly comprise a "region proposal" stage and a "defect recognition" stage. Common two-stage methods are mainly R-CNN series models, such as the original R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN. Ren et al. [23] generated high-quality region proposals through end-to-end training combined with Fast R-CNN for detection and achieved a frame rate of 5 fps on a GPU with good results. Revathy et al. [24] proposed the Improved Mask R-CNN (IM-RCNN) for fabric defect detection and achieved a high precision of 0.978, exceeding MobileNet-2, U-Net, LeNet-5, and DenseNet by 6.45%, 1.66%, 4.70% and 3.86%, respectively. Zhou et al. [25] optimized Faster R-CNN by integrating a Feature Pyramid Network (FPN), Deformable Convolution (DC) and the Distance-IoU loss function, achieving 62.07% mAP and 97.37% AP50 on the DAGM 2007 dataset. However, two-stage deep learning methods rely on prior knowledge and require the classifier to be retrained for new defect types or unbalanced samples, which increases the workload and complexity. In addition, these methods require multiple forward and backward passes, which leads to high computational requirements and limits real-time performance and application scope. Therefore, flexible and fast single-stage deep learning methods have become an important approach in detecting fabric defects.
Single-stage object detection methods include the You Only Look Once (YOLO) series, the Single Shot MultiBox Detector (SSD) and RetinaNet. Cheng et al. [26] replaced the original convolution layers in YOLOv3 with depthwise-separable convolutions and residual blocks and improved the spatial pyramid pooling module, obtaining better detection results while reducing parameters and processing load. However, YOLOv3 has shortcomings in fabric defect detection, namely insufficient small-target detection, slower detection speed, and high computational load, which require further improvement and optimization. Jing et al. [27] combined YOLOv3 with the k-means algorithm for dimension clustering of target boxes, which improved its ability to detect small targets and made its application to plain weave fabrics more effective. YOLOv5 further improved on YOLOv3, providing higher detection accuracy, faster detection speed and a wider application range. Hu et al. [28] evaluated detection performance on generated fabric defect and non-defect images and showed that the average precision of the recursive convolutional neural network model and of YOLOv5 improved by 6.13% and 14.57%, respectively. However, for fabric defect detection, directly applying the original YOLOv5 model does not work well: defect types are numerous, distributions are sparse, samples are unbalanced, and the defects are usually small and difficult to detect. In view of these deficiencies, Li et al. [29] introduced FD-YOLOv5, an improved YOLOv5 fabric defect detection model with enhanced detection capacity for smaller defects. The accuracy of the improved model increased by 8.3% and 3.2%, while the parameters, computation and weight size were reduced by 8.4%, 11.2% and 14.3%, respectively.
In addition, compared to YOLOv3 and YOLOv5, YOLOv7 has a more optimized network structure and parameter design, which further enhances the effectiveness and speed of fabric defect detection. However, YOLOv7 still has limitations in detection performance and feature extraction. To address these problems, Kang et al. [30] developed the AYOLOv7-tiny network, in which restructured convolution layers improved feature extraction ability, simplified model complexity, and reduced computational effort. Moreover, the appearance of YOLOv8 further improved detection performance. Talaat and Zain Eldin [31] found that an object detection method based on the YOLOv8 model significantly improved accuracy and speed, with an accuracy of up to 97.1% across all categories.
Previous studies have contributed to the refinement of the YOLOv8 model. However, they did not sufficiently consider the variability of fabric defect types, and they overlooked the training inefficiencies caused by increased network depth and parameter count. Therefore, this paper proposes an improved YOLOv8n model named YOLO-BGS. First, by integrating the Bi-Directional Feature Pyramid Network (BiFPN), YOLOv8n achieves cross-scale feature integration, significantly improving detection accuracy while maintaining model speed. Second, the integration of the Shuffle Attention (SA) mechanism combines spatial and channel attention, strengthening the feature extraction capabilities. Finally, the introduction of the Global Attention Mechanism (GAM) improves the global interactivity and detection accuracy of the model, making it more suitable for multi-scale target detection.
The remainder of this paper is organized as follows. Section 2 elaborates on the improvements to YOLOv8n. Section 3 introduces the dataset, experimental environment, parameter settings, evaluation indices and experimental results. Section 4 analyzes the experimental results. Section 5 summarizes the paper and outlines plans for further research in this area.
  2. Theory and Methods
  2.1. YOLOv8n
YOLOv8, the most recent iteration of the YOLO series of object detection algorithms, significantly improves detection performance and has proven its effectiveness as a fast single-stage detection method. The algorithm family includes five network architectures: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x, covering various application scenarios. Given the stringent hardware performance and real-time requirements of fabric defect detection tasks, the lightweight YOLOv8n model was selected and optimized in this study. As seen in Figure 1, the four primary components of the YOLOv8n model are the input layer, backbone layer, neck layer, and output layer.
The main task of the input layer is to accept the input data and pass it to the next layer. The input layer contains input nodes, each corresponding to a feature of the input data. In the image classification task, the input layer can convert the pixel values of the image into a tensor form that the neural network can process.
The backbone layer is responsible for feature extraction from the processed data; the Conv, C2f and SPPF modules are its commonly used structures. The Conv module expands the receptive field of the network through successive convolution operations and applies nonlinear transformations through activation functions (such as ReLU) to extract high-level features. The C2f module is a special connection structure introduced in YOLOv8, building on the CSP bottleneck design; while preserving low-level features, it increases the expressiveness of high-level features and improves the accuracy and speed of the network. The SPPF module is a fast spatial pyramid pooling structure used to capture feature information at different scales: it applies several small max-pooling operations in series and concatenates the results, matching the receptive fields of larger parallel pools at lower cost. All three play an important role in YOLOv8n and together contribute to the performance improvement of the network.
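To make the SPPF idea concrete, the following NumPy sketch shows the pooling pattern only (not the YOLOv8 implementation, which wraps the pooling in learned convolutions): three serial 5 × 5 stride-1 max pools whose outputs are concatenated with the input along the channel dimension. The helper `maxpool2d_same` and all tensor shapes are illustrative assumptions.

```python
import numpy as np

def maxpool2d_same(x, k=5):
    """Stride-1 max pooling over a (C, H, W) array with 'same' padding (pad = k // 2)."""
    pad = k // 2
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sppf(x, k=5):
    """SPPF pooling pattern: three serial k x k max pools, then channel concatenation."""
    y1 = maxpool2d_same(x, k)
    y2 = maxpool2d_same(y1, k)       # two serial 5x5 pools behave like one 9x9 pool
    y3 = maxpool2d_same(y2, k)       # three behave like one 13x13 pool
    return np.concatenate([x, y1, y2, y3], axis=0)  # channel count grows 4x

x = np.random.rand(8, 16, 16)
print(sppf(x).shape)  # (32, 16, 16)
```

Serial small pools are what makes SPPF "fast": each reuses the previous result instead of recomputing a large pooling window from scratch.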
The main function of the neck layer is to fuse the feature information. FPN and PAN are both network structures used to handle multiscale features. By adopting side-joining and path aggregation strategies, they can effectively merge different levels of feature information and improve the performance of object detection and semantic segmentation tasks.
The output layer is used to generate the final result of target detection, which usually includes three sub-layers: anchor layer, prediction layer and detection layer. The anchor layer is used to generate the default box, the prediction layer is used to predict the target category and location, and the detection layer is used to generate the final target detection result.
  2.2. Bi-Directional Feature Pyramid Network (BiFPN)
During the detection process, fabric defects pose significant challenges due to their varying scales and sizes, as well as the fact that they are easily affected by interference from neighboring areas. To address these problems, YOLOv8n combines Path Aggregation Network (PAN) and Feature Pyramid Network (FPN) features, improving the capability to locate and detect defects of different sizes. This approach not only effectively integrates multiscale feature information, but also successfully models contextual information at different levels within the image. However, FPN and PAN also increase the network's computational complexity, as well as the complexity of hyperparameter tuning during model design and optimization. Therefore, this study used the Bi-Directional Feature Pyramid Network (BiFPN), as shown in Figure 2. BiFPN achieves an excellent balance between accuracy and efficiency through weighted feature fusion and bidirectional cross-scale connections [32]. In addition, this network can be flexibly combined with various mainstream backbone networks (such as ResNet and EfficientNet), making it a powerful feature fusion architecture for target detection tasks.
Compared to the PANet architecture, BiFPN's basic modules have a more comprehensive and flexible design. To address limitations in the effectiveness of feature fusion, BiFPN first removes nodes with only a single input or output path during the design phase. This simplifies the network structure and optimizes the overall architecture of the bidirectional network, thereby increasing its efficiency. Next, a skip connection is added between the original input and output nodes at the same level. This design strategy greatly facilitates the fusion of cross-layer features and improves detection accuracy. Finally, through multiple iterations, BiFPN integrates features from different paths into a single feature layer, achieving the fusion of more complex features. Figure 3 shows the process.
BiFPN, with its unique bidirectional channel design, achieves cross-scale connections by integrating features from the feature extraction network with those from the bottom-up path. This design preserves surface semantic information while preventing excessive loss of deep semantic information. Unlike traditional feature fusion methods, BiFPN does not apply uniform weights to features of different scales. Uniform weighting can result in unbalanced output feature images when input features have different resolutions. Therefore, BiFPN assigns different weights depending on the importance of the input features and strengthens the feature fusion by repeatedly applying this structure.
During the weighted fusion process, BiFPN uses Fast Normalized Fusion, a technique that constrains the normalized weights to the range 0 to 1, thereby increasing training speed. The implementation of cross-scale connections is based on skip connections and bidirectional paths, which together achieve an optimal combination of weighted fusion and bidirectional cross-scale connections. Equation (1) illustrates this particular computation procedure:

O = Σ_i [ w_i / (ε + Σ_j w_j) ] · I_i,  (1)

where O represents the output feature, I_i represents the i-th input feature, and w_i represents the weight of each node in the network, kept non-negative by a ReLU. Notably, the small constant ε is set to 0.0001 to ensure the stability of the training process and avoid unnecessary fluctuations in the results.
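As a quick illustration of Equation (1), the NumPy sketch below implements fast normalized fusion for a single BiFPN node. The feature shapes and example weight values are arbitrary assumptions; in the network, the weights are learned scalars per input edge.

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with weights kept non-negative via ReLU."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU keeps each w_i >= 0
    norm = w / (eps + w.sum())                             # normalized weights lie in [0, 1]
    return sum(wi * f for wi, f in zip(norm, features))

p_td = np.random.rand(64, 32, 32)   # e.g. a top-down (upsampled) feature map
p_in = np.random.rand(64, 32, 32)   # the same-level input feature map
fused = fast_normalized_fusion([p_td, p_in], weights=[0.7, 1.3])
print(fused.shape)  # (64, 32, 32)
```

Unlike a softmax over the weights, this normalization needs no exponentials, which is the source of the training-speed benefit noted above.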
  2.3. Global Attention Mechanism (GAM)
Within the domain of fabric defect detection, many variables, including the type of fabric, the production process, and changes in the state of the machine, can affect the type and severity of defects. These factors are complex and closely related. The Squeeze and Excitation (SE) attention mechanism improves the model’s focus on critical channels by thoroughly exploring the internal connections of the channel attention modules, thereby strengthening the feature representation capabilities. However, it only considers the feature relationship in the channel dimension and may not be able to refine the information in the spatial dimension. To address this limitation, this paper introduces the Global Attention Mechanism (GAM), which significantly improves the perception of global features of the network by effectively capturing global context information. The Channel Attention Mechanism (CAM) and the Spatial Attention Mechanism (SAM) are the two main modules that make up GAM. A two-layer Multi-Layer Perceptron (MLP) is used by the channel attention submodule to reinforce dimensional channel-spatial dependencies and uses a three-dimensional arrangement strategy to ensure information integrity across three dimensions. 
Figure 4 depicts the complete processing flow, and the specific calculation process is shown in Equations (2) and (3):

F2 = Mc(F1) ⊗ F1,  (2)

F3 = Ms(F2) ⊗ F2,  (3)

where F1 ∈ R^(C×H×W) denotes the initial state of the input feature map, F2 the intermediate state after processing, and F3 the result. Mc and Ms denote the channel and spatial attention maps, respectively, and ⊗ indicates element-wise multiplication.
For CAM, the input feature map first undergoes a dimensional transformation. The transformed feature map is then fed into the MLP, which restores its original dimensions, and the result is processed with the sigmoid function to produce the output. This process is shown in Figure 5.
For SAM, GAM primarily uses convolutional processing, which bears some resemblance to the SE attention mechanism. First, a 7 × 7 convolution reduces the number of channels and the computational load. Then, a second 7 × 7 convolution restores the number of channels, keeping the output channel count consistent with the input. Finally, the sigmoid function produces the output. This process is shown in Figure 6.
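The two GAM submodules can be sketched in NumPy as follows. This is a simplified, untrained illustration of Equations (2) and (3): the CAM uses random (untrained) MLP weights, and the two 7 × 7 convolutions of the SAM are replaced by a single 7 × 7 mean filter as a stand-in; the layer sizes and reduction ratio r are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gam_channel_attention(f, r=4, rng=np.random.default_rng(0)):
    """CAM sketch: permute (C,H,W) -> (H,W,C), two-layer MLP over channels, sigmoid gate."""
    c, h, w = f.shape
    w1 = rng.standard_normal((c, c // r)) * 0.1   # toy MLP weights (learned in practice)
    w2 = rng.standard_normal((c // r, c)) * 0.1
    x = f.transpose(1, 2, 0).reshape(-1, c)        # (H*W, C)
    gate = sigmoid(np.maximum(x @ w1, 0) @ w2)     # ReLU then linear, per spatial position
    gate = gate.reshape(h, w, c).transpose(2, 0, 1)
    return gate * f                                # element-wise multiplication (Eq. (2))

def gam_spatial_attention(f, k=7):
    """SAM stand-in: a 7x7 per-channel mean filter replaces the two conv layers."""
    pad = k // 2
    c, h, w = f.shape
    fp = np.pad(f, ((0, 0), (pad, pad), (pad, pad)))
    m = np.empty_like(f)
    for i in range(h):
        for j in range(w):
            m[:, i, j] = fp[:, i:i + k, j:j + k].mean(axis=(1, 2))
    return sigmoid(m) * f                          # element-wise multiplication (Eq. (3))

f1 = np.random.rand(8, 12, 12)
f3 = gam_spatial_attention(gam_channel_attention(f1))  # F3 = Ms(F2) ⊗ F2, F2 = Mc(F1) ⊗ F1
print(f3.shape)  # (8, 12, 12)
```

Because both gates lie in (0, 1), the output preserves the input shape while rescaling each value, which is what allows GAM to be dropped into an existing network without changing tensor dimensions.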
  2.4. Shuffle Attention (SA)
Deep learning makes extensive use of traditional attention mechanisms to capture potential correlations between various locations in the input sequence, thereby significantly improving task performance. The Convolutional Block Attention Module (CBAM) is one of these mechanisms that helps the network focus more intently on target regions by efficiently extracting pertinent features in both spatial and channel dimensions. However, this mechanism requires high computational resources, which can increase the overall complexity of the network.
Therefore, this study presents a more efficient and lightweight Shuffle Attention (SA) mechanism. To increase computational efficiency, the SA mechanism divides the channel dimension into several sub-features and processes them simultaneously. Shuffle units are used by SA to accurately characterize dependencies in both spatial and channel dimensions while processing each sub-feature. Then, all sub-features are combined, and the channel shuffle operator is applied to enable data sharing and merging among various sub-features. This mechanism achieves better performance without increasing the computational cost.
As shown in Figure 7, the SA attention mechanism consists of four core sections:
(1) Feature Grouping: For a given feature map X ∈ R^(C×W×H), where C, W, and H denote the channels, width, and height of the feature map, X is first split into G groups X = [X1, X2, …, XG], with Xi ∈ R^((C/G)×W×H). Each sub-feature Xi is then further divided along the channel dimension into two branches, Xi1 and Xi2 ∈ R^((C/2G)×W×H). One branch focuses on the relationships between channels to create a channel attention map; the other builds a spatial attention map by focusing on the spatial relationships between features.
(2) Channel Attention: Traditional methods such as SE modules and ECA networks are not suitable for lightweight attention modules due to their excessive parameters; therefore, this study adopts a lighter design. First, channel statistics are generated and global information is embedded using Global Average Pooling (GAP). Then, features are enhanced through a gating mechanism with a sigmoid activation function. Equations (4) and (5) show the specific calculation procedure:

s = Fgp(Xk1) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} Xk1(i, j),  (4)

X′k1 = σ(W1·s + b1) · Xk1,  (5)

where s denotes the mean pooling feature, Fgp represents the global average pooling function, σ indicates the sigmoid activation function, and W1 and b1 ∈ R^((C/2G)×1×1) are used as scaling and shifting parameters, respectively.
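Equations (4) and (5) amount to a per-channel gate computed from globally pooled statistics. The NumPy sketch below makes this explicit; the branch width and parameter values are illustrative assumptions (W1 and b1 would be learned during training).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sa_channel_branch(x_k1, w1, b1):
    """Eqs. (4)-(5): s = GAP(X_k1); X'_k1 = sigmoid(W1 * s + b1) * X_k1."""
    s = x_k1.mean(axis=(1, 2), keepdims=True)  # global average pooling -> (C', 1, 1)
    return sigmoid(w1 * s + b1) * x_k1          # per-channel scale/shift, then sigmoid gate

c_sub = 4                                       # C/(2G) channels in this branch (assumed)
w1 = np.ones((c_sub, 1, 1))                     # scaling parameter
b1 = np.zeros((c_sub, 1, 1))                    # shifting parameter
x = np.random.rand(c_sub, 10, 10)
print(sa_channel_branch(x, w1, b1).shape)  # (4, 10, 10)
```

Note the parameter count: only 2 × C/(2G) scalars per group, which is what makes this branch far lighter than an SE-style bottleneck MLP.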
(3) Spatial Attention: Spatial attention can be considered a complement to channel attention. The first step in obtaining spatial statistics from Xk2 is to apply Group Normalization (GN). In this process, it is necessary to evaluate the error introduced by GN, which requires considering the group partitioning strategy, the accuracy of the statistical estimates, adaptability to different network structures and tasks, and the ultimate impact on model generalization; this evaluation minimizes the error range and optimizes model performance. A linear fusion function is then used to enhance GN(Xk2). This function can dynamically modify weights, enhancing feature integration while minimizing complexity; it also ensures efficient computation and lowers the demands of model training. Finally, input features are mapped onto query, key and value vectors through linear projection, thereby capturing various aspects of the input. Attention weights are calculated as dot products between query and key vectors over positions, quantifying the relevance of information; Softmax normalizes these weights into probabilities, which are then used to weight the value vectors, producing an output feature map weighted over all positions and highlighting important areas. Equation (6) illustrates the specific calculation procedure:

X′k2 = σ(W2·GN(Xk2) + b2) · Xk2,  (6)

where GN(Xk2) represents the normalized feature, and W2 and b2 ∈ R^((C/2G)×1×1) are two parameters that can be continuously trained through the network.
(4) Aggregation: After the two branches complete attention learning and feature recalibration, they need to be spliced and aggregated. The resulting matrix X′k = [X′k1, X′k2] ∈ R^((C/G)×H×W) is obtained by simple concatenation. The sub-features are then combined, and channel shuffle is applied to create an information flow between groups along the channel dimension. The SA module's final output matches the input size, which makes integrating it into other structures straightforward.
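To illustrate Equation (6) and the aggregation step, the sketch below pairs a single-group GN stand-in with the gating of Equation (6), plus a channel shuffle that interleaves the G groups; the group count and tensor shapes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_norm(x, eps=1e-5):
    """Single-group GN stand-in: normalize one branch over its whole (C', H, W) extent."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sa_spatial_branch(x_k2, w2, b2):
    """Eq. (6): X'_k2 = sigmoid(W2 * GN(X_k2) + b2) * X_k2."""
    return sigmoid(w2 * group_norm(x_k2) + b2) * x_k2

def channel_shuffle(x, g):
    """Interleave channels across the g groups so information flows between groups."""
    c, h, w = x.shape
    return x.reshape(g, c // g, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

x = np.random.rand(16, 8, 8)        # concatenated output of all groups (assumed sizes)
print(channel_shuffle(x, g=4).shape)  # (16, 8, 8)
```

Channel shuffle is a pure permutation: it adds no parameters or FLOPs, yet after it each group of the next layer sees channels originating from every group of the previous one.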
  4. Discussion
Fabric defects such as broken yarn, holes, and wear directly affect the appearance and quality of products [34], leading to a decrease in production efficiency and an increase in costs [35]. In view of this, this paper presents an improved detection method based on YOLOv8n, which has certain advantages in fabric defect detection.
YOLOv8 is characterized by flexible and efficient performance in object detection. However, further optimization is required in computational load, small-object detection, and the trade-off between real-time processing and accuracy. On this basis, this study chose YOLOv8n, the lightest-weight variant, which reduces computation and storage requirements while maintaining a high level of performance. The improved YOLOv8n incorporates an enhanced BiFPN structure to improve multi-scale feature fusion efficiency, thereby strengthening the detection of small objects and complex scenes. GAM was added to improve the network's understanding of global structure and relationships, enhancing target perception at multiple scales and raising detection accuracy. In addition, the SA mechanism allows the model to process input data more efficiently and reduces unnecessary calculations, thereby increasing the efficiency of object detection. The experimental data recorded in Table 1, Table 2 and Table 3 strongly support the excellent performance of the model developed in this study in detecting fabric defects.
In the comparative analysis discussed in this study, the Faster R-CNN model is not suitable for real-time fabric inspection due to its lower accuracy and higher computational cost. The SSD model strikes a better balance between speed and precision than the original YOLO and Faster R-CNN; however, because it relies on fixed anchor box dimensions and aspect ratios to predict target bounding boxes, it does not adapt well to fabric defects of different sizes and shapes. The YOLOv3-tiny model is easier to deploy on embedded or mobile devices due to its smaller size and faster inference speed, significantly reducing computational costs; however, its lightweight nature poses challenges when dealing with small targets and complex scenes. YOLOv5 has a more optimized network structure than YOLOv3 and uses multi-scale feature maps for object detection, which better handles complex scenes and varied targets. YOLOv7 integrates advanced optimization strategies, such as the Mish activation function and the Cross-Stage Partial (CSP) structure, to enhance model accuracy and generalization ability. Nevertheless, balancing low computational cost with efficient and accurate object detection at multiple levels remains a challenge. The YOLO-BGS model proposed in this paper has advantages in lightweight design, real-time processing, accuracy, flexibility and scalability, and integrates new backbone architectures and loss functions. These strengths enable YOLO-BGS to deliver excellent performance in various fabric defect detection scenarios.
However, this study has certain limitations. Although the integration of BiFPN significantly improves model performance in object detection tasks, BiFPN includes several hyperparameters, such as weights and fusion modes, between different scale features. The values of these hyperparameters have a large impact on model performance. Inappropriate hyperparameter settings can lead to reduced performance or instability. When introducing the GAM mechanism, care must be taken to ensure that the model focuses on global structures without neglecting local details. This requires careful consideration and adjustments in model design, requiring extensive experimentation and debugging work. In addition, although the SA mechanism effectively captures more sequence information and improves model capability, it primarily focuses on the relative positional relationships of elements within sequences, which may poorly model long-term dependencies. There are various disruptive factors when detecting fabric defects. For example, uneven lighting, material properties of the fabric, and physical, chemical and environmental factors (such as temperature, humidity and gas composition) are likely to cause defects to be falsely detected.
In future work, a diverse dataset comprising various types of fabric images under complex conditions will be collected. This dataset will facilitate the study of the correlation between fabric defect characteristics and their surrounding environment, thereby enhancing the model’s capabilities in defect detection and interference resistance. Additionally, the model will be selected and adjusted based on actual application scenarios to maximize its advantages and address its limitations. Improvements will focus on model optimization, lightweight design, real-time performance, generalization ability, and integration with automated equipment. These efforts aim to improve its fabric defect detection performance, bringing more innovation and value to the fabric manufacturing industry.