1. Introduction
In the context of global climate change and environmental protection, fossil fuel power generation faces significant challenges related to resource depletion and environmental degradation. Particularly under the push for carbon neutrality and peak carbon goals, there is a growing global demand for the development and utilization of renewable energy. Biomass fuels, as an important renewable resource (including wood, crop residues, and energy crops), have gained widespread attention due to their renewability and carbon neutrality characteristics [
1,
2]. Among these, straw is considered a key renewable energy source due to its abundant supply and low cost. As an agricultural powerhouse, China possesses rich crop straw resources. The rational development and utilization of these straw resources not only help reduce fossil fuel consumption and alleviate energy shortages, but also effectively lower environmental pollution and greenhouse gas emissions, thereby contributing to the achievement of peak carbon and carbon neutrality goals. Combined heat and power generation (CHPG) technology serves as an efficient utilization method for straw resources, significantly improving economic benefits while bringing positive ecological and social effects [
3,
4].
In the combined heat and power generation (CHPG) process, the heat used for power generation is obtained by burning different types of straw fuels, which are fed into the boiler, such as the circulating fluidized bed boiler (CFBB) shown in
Figure 1. The straw fuel is mixed with desulfurizing agents and introduced from the bottom of the boiler. At this point, there are many fluidized combustion materials in the boiler, which facilitates rapid fuel combustion. Subsequently, primary and secondary air are introduced into the boiler from the bottom and the sidewalls, respectively. An upward airflow is formed in the boiler, causing the materials to move toward the upper part of the boiler. Most of the fuel continues to burn in the dense phase region, while a smaller portion is carried into the dilute phase region along with the flue gas. In the dilute phase region, the fuel burns in a suspended state, releasing heat to the heating surfaces and water walls inside the boiler, raising their temperature. Under the influence of gravity and external forces, the movement of fuel in the dilute phase changes; as the velocity slows down, the direction of material movement gradually shifts away from the main airflow, forming a particle flow that adheres to the wall and is carried out of the airflow into a separator. These materials are then collected and sent back to the boiler for repeated combustion, ultimately achieving complete combustion through multiple cycles. Some particles enter the flue gas duct, releasing heat to the surfaces at the rear of the boiler. After cooling, the flue gas is treated to meet environmental protection requirements, such as dust removal and desulfurization, before being discharged into the atmosphere, completing the entire combustion process.
However, the heating value of straw fuel is influenced by various factors such as type, moisture content, and combustion state, leading to fluctuations in the heating value and causing uncertainty in the heat and power generated per unit time [
5]. This not only affects the stable operation of the CHPG system, but also directly impacts power generation efficiency and economic benefits. Additionally, since straw is often stacked in multiple layers in practical applications, it is difficult to estimate its heating value directly through traditional elemental composition analysis methods. Moreover, the moisture content of straw varies significantly, and the evaporation of moisture consumes some of the heat, further complicating the estimation of fuel heating value. In advanced control strategies for the CHPG process, real-time measurement of the fuel’s moisture content and heating value is crucial for ensuring smooth system operations under long delays and large time constants [
6,
7]. Therefore, proposing effective solutions for real-time estimation of the heating value in multi-fuel CHPG processes based on straw can not only improve the utilization efficiency of straw resources and promote the development of renewable energy, but also significantly reduce carbon emissions, contributing to the fight against global climate change. Additionally, accurate heating value estimation can enhance the economic benefits of the CHPG system, ensuring the stability and sustainability of energy supply. Furthermore, with the advancement of carbon neutrality goals, the adoption of renewable energy has become a pressing priority, making real-time heating value estimation an important support for sustainable development goals.
Traditional methods for estimating the heating value of multi-fuel straw mainly rely on chemical analysis [
8]. These methods typically require manual classification of the straw, collection of water vapor produced during the drying process to determine moisture content, and the use of chemical reagents to analyze the composition of the dried straw. Subsequently, a heating value model is established for the current batch of straw to calculate the heating values of various fuels. Although this offline estimation method is relatively accurate, it cannot respond promptly to fluctuations in heating value within the CHPG system and requires significant human effort, is time-consuming, and incurs high costs. Estiati et al. [
9] used artificial neural networks to estimate the heating values of different types of biomass residues from proximate analysis data.
With the development of optical sensing technology, methods for estimating heating values based on digital image processing have received increasing attention. These methods perform semantic segmentation on fuel images to obtain the proportions of different types of biomass fuels and combine these proportions with relevant parameters provided by sensors. Because the estimated heating value depends directly on the segmented proportions, the segmentation must be highly accurate. Traditional image segmentation methods mainly rely on gray value thresholds and similarity, such as threshold segmentation [
10,
11,
12], edge detection [
13,
14], and region-based methods [
15,
16]. However, due to the complexity of object structures and textures, as well as the influence of environmental lighting and image noise, these methods require parameter adjustments for different images, limiting their applicability. The diversity of straw, variations in lighting, similarities in color and texture, and the complex characteristics of stacking and fragmentation, pose challenges for traditional image segmentation methods in accurately distinguishing between different types of straw.
In recent years, deep learning has made significant advancements in image processing. Convolutional Neural Networks (CNNs), in particular, have been widely applied in semantic segmentation of images [
17,
18,
19]. Semantic segmentation assigns a class label to every pixel of an image. CNN-based semantic segmentation architectures use convolution operations to capture image features. For instance, Shelhamer et al. [
20] proposed the Fully Convolutional Network (FCN), which utilizes CNNs as one of its modules to generate hierarchical features, replacing fully connected layers with convolutional layers to restore the feature maps to the same size as the input. Thus, FCNs can be applied to inputs of arbitrary sizes, generating predictions for each pixel and retaining the spatial information of the original input for pixel-wise classification. Yao et al. [
21] introduced the Adaptive Deep Convolutional Neural Network (ADCNN) for object detection and semantic segmentation tasks in specific scenarios. This method first employs transfer learning to select effective convolution kernels from a general CNN classifier, and then effectively learns local and global contextual information in monitored scenes through a specific architecture, thereby improving the accuracy of object location predictions. Zheng et al. [
22] converted Conditional Random Fields (CRFs) [
23] into Recurrent Neural Networks (RNNs) [
24] and connected them to the last convolution layer of FCN. This method addresses the issues of large receptive fields and weak edge constraints in FCN, refining the results to achieve more specific and smooth segmentation. Badrinarayanan et al. [
25] proposed the SegNet model, which features an encoder–decoder structure similar to FCN. Unlike FCN, SegNet upsamples in the decoder using the max-pooling indices stored by the encoder, thereby reducing the number of training parameters and memory usage. Ronneberger et al. [
26] built upon FCN to propose the U-Net model, extending the idea of fusing high-level and low-level features in FCN and adding skip connections to concatenate feature maps from the encoder with those from the decoder, preserving more detailed information and enhancing classification accuracy.
Transformers [
27], due to their excellent global context modeling capabilities, have seen widespread development and application in the field of image processing in recent years, and real-time semantic segmentation methods based on Transformers have also begun to attract attention. Zheng et al. [
28] proposed SETR, which treats semantic segmentation as a sequence-to-sequence prediction task, using pure Transformers as the encoder to encode images and employing a simple decoder. This approach avoids the convolution operations and resolution reduction found in traditional fully convolutional networks, significantly improving segmentation performance by leveraging the global context modeling capabilities of Transformers. Wang et al. [
29] introduced the Pyramid Vision Transformer (PVT), which incorporates Feature Pyramid Networks (FPN) [
30] into Transformers to capture features at different scales, enhancing the model’s ability to process multi-scale information.
Based on the above discussion, this paper proposes a real-time system for estimating the calorific value of mixed straw fuels based on an improved U-Net semantic segmentation model. The study uses Python (version 3.8.0) as the primary programming language and the PyTorch framework to construct and train the improved U-Net model. Compared to existing semantic segmentation methods, the proposed improved U-Net network makes the following key contributions:
- (1) It introduces a self-attention mechanism in the skip connections to enhance the extraction of key information from deep features;
- (2) It replaces traditional convolutions with depthwise separable convolutions to reduce the model’s computational complexity and improve inference speed;
- (3) It substitutes the bottleneck layer with a Transformer encoder to leverage the global modeling capabilities of Transformers, allowing the model to better understand contextual information within the image.
This paper is organized as follows. Section 1 introduces the background and related work, focusing on real-time heating value estimation methods and existing image segmentation techniques. Section 2 describes the design and implementation of the real-time estimation system for mixed straw fuel heating values, as well as the structure of the improved U-Net model. Section 3 presents the experimental setup and the analysis of the results, including comparisons with existing methods and ablation studies that validate the impact of each component. Finally, Section 4 summarizes the research findings and discusses future research directions.
2. Straw Calorific Value Estimation Method
2.1. Real-Time Heating Value Estimation System
In the combined heat and power generation (CHPG) process, the heterogeneity of the multi-fuel straw fed into the Circulating Fluidized Bed Boiler (CFBB) presents significant challenges for real-time estimation of the calorific value of the current straw batch. To effectively address this issue, a configuration system has been designed, as shown in
Figure 2, aimed at real-time estimation of the calorific value of multiple fuels. This system comprises several key components, including an industrial camera, image analysis host, server, industrial network bus, moisture detector, and mass sensor. The industrial camera is responsible for capturing high-resolution images of the multi-fuel straw for subsequent processing, providing essential data for later analysis.
During the image processing phase, an improved U-Net semantic segmentation algorithm is applied to precisely segment the captured straw images. This algorithm identifies the different types of straw and calculates their proportions, thereby providing detailed information about the raw material composition. In this way, the distribution of various straws in the current batch can be accurately reflected, laying the groundwork for calorific value estimation. Simultaneously, the system integrates a moisture detector and a mass sensor, which measure the moisture content and total mass of the multi-fuel in real time. These parameters are crucial for calorific value calculation, as the moisture content of the straw directly affects its combustion characteristics and calorific value. By combining these real-time measurement data with the results of image analysis, the system can more comprehensively determine the elemental composition of the straw.
Finally, the server estimates the calorific value based on the measured elemental composition and its relationship with the calorific values of each dry element. This process not only enhances the accuracy and timeliness of calorific value estimation, but also provides robust support for the optimized operation of the CHPG system. Through real-time monitoring and analysis, the system can dynamically adjust fuel usage strategies, achieving a more efficient and environmentally friendly power generation process.
2.2. Improved U-Net Semantic Segmentation Network
We propose an improved U-Net semantic segmentation model, as illustrated in
Figure 3. This model consists of encoder layers, a bottleneck layer, and decoder layers. To improve the real-time performance of the model without compromising the accuracy of semantic segmentation, we set the model’s input size to 3 × 256 × 256. This size retains sufficient spatial information to ensure the model can extract key features. At the same time, the smaller input size significantly reduces computational load and memory usage, allowing the model to run efficiently while maintaining strong performance (detailed parameter information is listed in
Table 1). To reduce the number of parameters without significantly sacrificing performance, we replace traditional convolution with depthwise separable convolution in both the encoder and decoder. This method, which is derived from MobileNet [
31], decomposes traditional convolutions into depthwise and pointwise convolutions, significantly lowering computational complexity while maintaining effective feature extraction capabilities.
Additionally, to enhance the model’s ability to capture critical features, we introduce a self-attention mechanism in the skip connections of the final layer of the encoder. This self-attention mechanism enables the model to focus on the relationships between different regions in the image when processing feature maps, thereby more effectively extracting important information, especially in the segmentation of straw images against complex backgrounds.
Furthermore, we replace the U-Net bottleneck with a Transformer encoder to improve the extraction of key features. The Transformer architecture is better suited for modeling global contextual information, allowing the model to more accurately understand the structure and characteristics of highly heterogeneous straw. In this process, we concatenate the output from the previous layer with the output from the Transformer to preserve the original key features, avoiding information loss and enhancing the model’s expressive capability.
To further optimize the model, we removed the max pooling layers and replaced them with convolutions that have a stride of 2. Because the downsampling filters are learned rather than fixed, this change enhances the model’s ability to capture and retain details during feature extraction, thereby contributing positively to more accurate segmentation results. This design supports the model’s efficiency and accuracy in complex environments, providing strong support for real-time calorific value estimation.
To verify the impact of various modules on model performance, we designed and conducted ablation experiments. By gradually removing or modifying different components of the model (such as depthwise separable convolutions, skip connections, and the Transformer Encoder-based bottleneck), we analyzed the effects of these changes on the model’s accuracy and computational efficiency in the semantic segmentation task. This allowed us to quantitatively assess the contribution of each module to the model’s performance, further optimizing the model architecture.
2.2.1. Encoder
The encoder structure is shown in the left half of
Figure 3, and uses Depthwise Separable Convolution (DSC) to replace traditional convolution. Compared to traditional convolution, DSC features smaller convolution kernels and lower computational complexity. It significantly reduces the number of parameters and improves computational speed by decomposing the convolution operation into Depthwise Convolution (DW) and Pointwise Convolution (PW). Depthwise Convolution applies a convolution kernel independently to each input channel to extract spatial features. Although this process greatly reduces the number of parameters and computational complexity, it processes information only on a single channel, and cannot interact between channels. Pointwise Convolution uses a 1 × 1 convolution kernel to perform a weighted sum of all channels at each spatial location, enabling the fusion of information between channels. This compensates for the lack of channel interaction in Depthwise Convolution, and enhances feature representation capability.
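The parameter saving described above can be checked with simple arithmetic. The following is a minimal sketch, not the authors' implementation: the channel widths (64 in, 128 out) and the 3 × 3 kernel are illustrative assumptions, and bias terms are omitted by default.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Number of weights in a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def dsc_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Number of weights in a depthwise (k x k per channel) plus pointwise (1 x 1) pair."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 mixing across channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen only for illustration.
standard = conv_params(64, 128, 3)    # 64 * 128 * 9 = 73728
separable = dsc_params(64, 128, 3)    # 64 * 9 + 64 * 128 = 8768
ratio = separable / standard          # equals 1 / c_out + 1 / k**2 when bias is omitted
```

For a 3 × 3 kernel the ratio approaches 1/9 as the number of output channels grows, which is why the saving is largest in the wide, deep layers of the network.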
Additionally, the Softplus activation function is introduced after DSC. Its mathematical expression is $\mathrm{Softplus}(x) = \ln(1 + e^{x})$. Compared to ReLU, $\mathrm{ReLU}(x) = \max(0, x)$, the Softplus function is smoother and avoids the “dead zone” problem that can occur with ReLU. When combined with DSC, Softplus further enhances the network’s non-linearity, making the model more robust in handling complex patterns and detailed information, while providing more stable gradient flow during training and accelerating model convergence.
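The difference between the two activations is easiest to see in their gradients for negative inputs. The sketch below is a dependency-free illustration under our own naming; the overflow-safe rewriting of Softplus is a standard numerical trick and is not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    # ln(1 + e^x), rewritten as max(x, 0) + ln(1 + e^{-|x|}) to avoid overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    # The derivative of Softplus is the logistic sigmoid, which is never exactly
    # zero in exact arithmetic (computed here in an overflow-safe form)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x: float) -> float:
    return max(x, 0.0)


def relu_grad(x: float) -> float:
    # Zero gradient for all negative inputs: the "dead zone"
    return 1.0 if x > 0 else 0.0


for x in (-3.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x), round(softplus(x), 4), round(softplus_grad(x), 4))
```

At x = -3 the ReLU gradient is exactly zero, while the Softplus gradient is small but non-zero, so the corresponding weights can still be updated.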
We also use convolutions with stride = 2 instead of max pooling operations. This not only allows the model to perform downsampling with learnable filters, but also enables it to retain more contextual features, thereby improving its feature extraction ability.
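As a rough illustration of how stride-2 convolutions shrink the feature maps, the sketch below applies the standard convolution output-size formula to the 256 × 256 input. The 3 × 3 kernel, padding of 1, and four downsampling stages are assumptions made for the example; the actual layer settings are those listed in Table 1.

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical encoder with four stride-2 stages starting from a 256 x 256 input.
sizes = [256]
for _ in range(4):
    sizes.append(conv_out_size(sizes[-1]))

print(sizes)  # [256, 128, 64, 32, 16]
```

Each stage halves the resolution exactly as a 2 × 2 max pooling would, but the reduction is performed by trainable weights.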
2.2.2. Bottleneck
In the U-Net model, the bottleneck section is typically composed of convolutional layers. While convolutional networks excel at handling local features, they have limitations in capturing long-range dependencies and global context. To enhance the model’s ability to process complex backgrounds or detailed images, we replaced the traditional convolutional bottleneck with a Transformer encoder, as shown in
Figure 4. This encoder consists of three layers of attention encoding modules. The Transformer encoder effectively utilizes global context information to extract key features, allowing it to capture more complex feature relationships and long-range dependencies. Through the self-attention mechanism, the Transformer encoder focuses attention on the most relevant input areas, improving the understanding of complex backgrounds and detailed images.
The computational process is as follows. First, the query, key, and value tensors $Q$, $K$, and $V$ are computed by Equation (1), and the corresponding attention scores are then calculated by Equation (2). To capture different representations of the input, a multi-head attention mechanism is used to obtain various relationships and features within the input sequence, and the final self-attention output tensor is obtained from Equation (3).
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V \tag{1}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{2}$$
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O \tag{3}$$
where $X$ is the input tensor, $W_Q$, $W_K$, and $W_V$ are learnable weight matrices, and $d_k$ is the dimension of $K$. The similarity between the queries and keys is computed using $Q K^{T}$. After the result is scaled by $\sqrt{d_k}$ and normalized with the softmax, it is multiplied by the values $V$ to generate the attention output. Each $\mathrm{head}_i$ is the attention output of Equation (2) computed with its own weight matrices, and $W_O$ is the linear projection matrix that maps the concatenated multi-head output back to the original dimension. Given the self-attention output from Equation (3), the output tensor after the residual connection and normalization is obtained by Equation (4).
$$X' = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(X)) \tag{4}$$
The output of the encoder layer is then obtained from Equation (5), where the FFN passes its input through a feed-forward neural network, enabling the model to learn a non-linear mapping that further transforms the data and extracts more complex features.
$$Y = \mathrm{LayerNorm}(X' + \mathrm{FFN}(X')) \tag{5}$$
The formula is $\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$, where $W_1$ and $W_2$ are the learnable weight matrices, and $b_1$ and $b_2$ are the learnable bias vectors.
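To make the attention computation described above concrete, the following is a minimal single-head sketch in plain Python. It is an illustration only: the three-token input and the identity projection matrices are toy assumptions, and the multi-head concatenation, residual connection, normalization, and FFN steps are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # query-key similarity
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: 3 tokens with 2 features each; identity projections (assumed).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, eye, eye, eye)
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors; in the bottleneck, the tokens would be the spatial positions of the deepest feature map.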
While leveraging the advantages of Transformers and retaining original features, we introduced a feature concatenation strategy into the model. Specifically, we concatenate the output of the Transformer encoder with the output from the previous layer of the U-Net encoder. This concatenation operation not only preserves the local features extracted by traditional convolutional layers, but also integrates the global contextual information captured by the Transformer, resulting in a richer information representation during the feature fusion phase. This combination allows the model to understand image content at a higher level while avoiding the loss of feature information, thus enhancing overall segmentation performance.
Through this design, we have increased the model’s sensitivity to complex and fine-grained features while maintaining the original features, thereby improving the recognition capability of key regions in the image. This approach not only enhances the model’s expressive power but also demonstrates higher accuracy and robustness in practical applications.
2.2.3. Decoder
The decoder structure is shown in the right half of
Figure 3. To improve efficiency and reduce computational costs, the first four layers of the decoder use the same skip connection approach as U-Net, which maximizes the preservation of multi-scale feature information passed from the encoder. These skip connections effectively merge local and global feature information by transferring high-resolution feature maps to the corresponding layers in the decoder. To further enhance the model’s ability to capture important information, we introduced a self-attention mechanism in the skip connection of the last encoder layer. This mechanism adaptively adjusts features from different spatial positions, ensuring that the model captures more discriminative features across the entire image, thereby improving the final segmentation accuracy. The last layer of the encoder contains the deepest semantic information and plays a crucial role in semantic feature extraction. By applying the self-attention mechanism, the model can more precisely capture these deep semantic features, reducing the risk of information loss during upsampling. This approach is particularly effective for complex scenes and fine details. In contrast, using self-attention on shallow feature maps with higher resolution would consume a large amount of computational resources, and yield less significant improvements. Therefore, we chose to apply the self-attention mechanism to the last encoder layer to maximize the model’s efficiency and effectiveness. By combining skip connections with the self-attention mechanism, the decoder not only effectively integrates features from the encoder, but also enhances the ability to extract and retain key information. This approach maintains a relatively small number of model parameters while increasing feature richness and segmentation accuracy, achieving a balance between performance and efficiency.
2.2.4. Loss Function
The goal of semantic segmentation is to assign a category label to each pixel, making it a pixel-level multi-class classification task. To measure the difference between the model’s predicted probability distribution and the actual category distribution, we use cross-entropy as the loss function. The cross-entropy loss is defined in Equation (6) and quantifies the discrepancy between the target probability distribution and the predicted probability distribution. By minimizing this loss function, the model can better fit the target distribution, thereby improving the performance of semantic segmentation.
$$L = -\sum_{c=1}^{C} p_c \log q_c \tag{6}$$
where $p_c$ represents the true probability distribution, $q_c$ represents the predicted distribution, and $C$ represents the number of classes. Then, the gradients of the loss function with respect to each parameter are calculated through backpropagation, as shown in the following equation:
$$\nabla_{\theta} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} L^{(i)}(\theta) \tag{7}$$
where $J(\theta)$ represents the average loss over all samples in the model, $L^{(i)}$ is the loss of the $i$-th sample, $\nabla_{\theta} J(\theta)$ denotes the gradient vector with respect to the model parameter vector $\theta$, and $m$ represents the number of samples. Once the gradient of the loss function has been obtained from Equation (7), an optimization algorithm updates the model’s parameters to improve recognition accuracy. This process is repeated over all samples until the loss function converges.
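The pixel-wise cross-entropy described above can be sketched without any framework. The example below is illustrative only: it assumes the network outputs one vector of raw scores (logits) per pixel, which are turned into probabilities with a softmax, and the tiny 1 × 2 "image" with four classes is made up.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(logits, label):
    """Cross-entropy for one pixel with a one-hot target: -log q[label]."""
    return -math.log(softmax(logits)[label])


def mean_cross_entropy(logit_map, label_map):
    """Average the per-pixel loss over an H x W map."""
    losses = [
        pixel_cross_entropy(logits, label)
        for logit_row, label_row in zip(logit_map, label_map)
        for logits, label in zip(logit_row, label_row)
    ]
    return sum(losses) / len(losses)


def grad_wrt_logits(logits, label):
    """Gradient of the per-pixel loss with respect to the logits: q - one_hot."""
    return [q - (1.0 if c == label else 0.0) for c, q in enumerate(softmax(logits))]


# Hypothetical 1 x 2 prediction over 4 classes.
logit_map = [[[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]]
label_map = [[2, 0]]
loss = mean_cross_entropy(logit_map, label_map)
```

The first pixel is predicted uniformly and contributes ln 4 to the loss, while the second is confidently correct and contributes almost nothing.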
2.3. Calorific Value Estimation
The composition of the fuel and the moisture content of each material have a significant impact on the overall calorific value. If the straw fuel contains a large amount of moisture, the water will evaporate first during combustion. The evaporation process requires a substantial amount of energy, which is absorbed from the heat released by the fuel, thus reducing the available heat. Therefore, the moisture content of the straw fuel directly affects its effective calorific value. The calorific value of dry straw can be calculated using the following formula [
32,
33,
34]:
where $C$ is the carbon content in the straw (%), $H$ is the hydrogen content (%), $S$ is the sulfur content (%), $N$ is the nitrogen content (%), and $O$ is the oxygen content (%). These components are closely related to the type of straw, and the compositional proportions of different types of straw are shown in
Table 2 [
35]. In practical applications, straw fuel typically contains a certain amount of moisture and, therefore, the calorific value needs to be adjusted based on the moisture content using the following formula:
Here,
w represents the moisture content of the straw-based fuels (%), and
is the effective calorific value of dry straw, which could be obtained from Equation (
9). Before Equation (8) can be applied, the proportion of each type of straw in the mixed fuel must be determined from the semantic segmentation results of the straw images. Suppose there are $n$ types of straw fuels fed into the circulating fluidized bed boiler (CFBB). The proportion of each straw type in the mixed fuel is calculated from the segmentation results using the following formula:
$$r_i = \frac{P_i}{\sum_{j=1}^{n} P_j} \tag{10}$$
where $n$ is the number of straw types, $r_i$ is the proportion of the $i$-th type of straw, and $P_i$ represents the number of pixels corresponding to the $i$-th type of straw. Subsequently, the total mass $G$ of the mixed straw fuel can be obtained using a mass sensor. Based on the proportion data from Equation (10), the mass of each type of straw can be determined as $G_i = r_i G$.
Finally, by combining the mass of each fuel and its effective calorific value, the total calorific value of the entire fuel can be calculated as $Q_{total} = \sum_{i=1}^{n} G_i Q_i$, where $Q_{total}$ represents the total calorific value of the straw fuel, $Q_i$ is the effective calorific value of the $i$-th type of straw, and $n$ is the number of straw types.
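The last three steps (pixel proportions, per-type mass, and total calorific value) can be sketched as follows. This is an illustration under stated assumptions: the pixel counts, the total mass, and the per-type effective calorific values are made-up numbers, the function names are ours, and, following the text, the pixel share of each straw type is used directly as its mass share.

```python
def straw_proportions(pixel_counts):
    """Share of each straw type among all straw pixels (background excluded)."""
    total = sum(pixel_counts.values())
    if total == 0:
        raise ValueError("no straw pixels were segmented")
    return {name: count / total for name, count in pixel_counts.items()}


def total_calorific_value(pixel_counts, total_mass_kg, q_eff_mj_per_kg):
    """Sum of (mass of each type) x (its effective calorific value), in MJ."""
    proportions = straw_proportions(pixel_counts)
    masses = {name: share * total_mass_kg for name, share in proportions.items()}
    return sum(masses[name] * q_eff_mj_per_kg[name] for name in masses)


# Hypothetical segmentation result and sensor readings.
counts = {"wheat": 30000, "corn": 20000, "sesame": 10000}
q_eff = {"wheat": 14.0, "corn": 15.0, "sesame": 16.0}  # assumed values in MJ/kg
total = total_calorific_value(counts, 120.0, q_eff)    # 60*14 + 40*15 + 20*16 MJ
```

In a deployed system, the per-type effective calorific values would come from the elemental composition and the measured moisture content rather than from constants.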
Through this method, using the segmentation results of straw images, data from the mass sensor, and the calorific value formulas for each type of straw, the overall calorific value of the mixed straw fuel can be accurately calculated. This method provides a scientific and effective technical approach for real-time estimation of the calorific value of mixed fuels, offering an important basis for optimizing the utilization of straw fuel and improving the efficiency of combined heat and power systems.
3. Experiments and Analysis
The experiments in this paper were conducted on an Ubuntu 20.04 operating system using the PyTorch 2.2.0 framework. The hardware configuration of the experimental platform is as follows: Intel Core™ i7-13790F processor with a clock speed of 3.4 GHz, 32 GB of RAM, NVIDIA RTX 4070 Ti Super GPU with 16 GB of VRAM. The image resolution for training and testing was fixed at 256 × 256, and the optimization algorithm used was AdamW [
36] with a linear learning rate scheduler that includes a warm-up phase. The maximum learning rate was set to $1 \times 10^{-5}$, the batch size was set to 50, and the training was conducted for 100 epochs.
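A linear schedule with warm-up can be written as a small function of the training step. The sketch below is one common form (linear ramp from zero to the peak, then linear decay to zero) and is not necessarily the exact scheduler used in the experiments. The step count assumes 9600 training images (80% of the 12,000-image dataset) at a batch size of 50, and the 5% warm-up fraction is purely an assumption.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_steps: int,
                        max_lr: float = 1e-5) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warm-up: ramp linearly from 0 to max_lr
        return max_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from max_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 9600 // 50          # 192 steps, assuming 9600 training images
total_steps = steps_per_epoch * 100   # 100 epochs
warmup_steps = total_steps // 20      # assumed 5% warm-up

peak = linear_warmup_decay(warmup_steps, total_steps, warmup_steps)
```

The warm-up phase keeps early updates small while the optimizer statistics in AdamW are still unreliable.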
3.1. Data Augmentation
The straw images were captured by the MV-GE1400C-T industrial camera (MindVision, Shenzhen, China) under natural light conditions. The camera has a maximum resolution of 4384 × 3288 pixels, a shooting speed of 8 fps, and is equipped with a 6 mm fixed-focus lens. The straw dataset consists of images with a resolution of 4032 × 3024 pixels, each annotated at the pixel level with four categories: wheat, corn, sesame, and background. Since the original dataset is relatively small and each image is too large to use directly, it is necessary to augment the original dataset.
Data augmentation is a common technique that has been proven beneficial for training deep learning models. Proper augmentation can accelerate model convergence, improve robustness, help avoid overfitting, and enhance generalization capabilities. In this study, we expanded the dataset by performing random cropping, adding noise, and applying affine transformations such as rotation, vertical flipping, and horizontal flipping to the straw images. As a result, we obtained a new dataset consisting of 12,000 images with a size of 256 × 256, which was then divided into training, validation, and test sets in an 8:1:1 ratio.
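For segmentation data, every geometric augmentation must be applied identically to the image and to its label mask. The following sketch shows random cropping and flipping on nested lists; it is a simplified illustration with hypothetical sizes, it omits the noise and rotation steps, and it is not the authors' pipeline.

```python
import random


def random_crop(image, mask, size, rng):
    """Cut the same size x size window out of the image and its mask."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)

    def crop(array):
        return [row[left:left + size] for row in array[top:top + size]]

    return crop(image), crop(mask)


def hflip(array):
    return [row[::-1] for row in array]


def vflip(array):
    return array[::-1]


def augment(image, mask, size, rng):
    """Random crop followed by random horizontal and vertical flips."""
    image, mask = random_crop(image, mask, size, rng)
    if rng.random() < 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < 0.5:
        image, mask = vflip(image), vflip(mask)
    return image, mask


# Toy data: each "pixel" stores its own coordinates so alignment can be checked.
rng = random.Random(0)
image = [[(r, c) for c in range(8)] for r in range(6)]
mask = [[(r + c) % 4 for c in range(8)] for r in range(6)]
patch, patch_mask = augment(image, mask, 4, rng)
```

Noise, in contrast, would be added to the image only, since it does not change which class a pixel belongs to.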
3.2. Evaluation Metrics
We use pixel accuracy (PA) and mean intersection over union (mIoU) to evaluate the segmentation performance of the model. PA is the proportion of correctly segmented pixels among all pixels in the image, as defined by Equation (11). mIoU is the ratio of the intersection to the union of the labeled and predicted regions, averaged over all classes, as defined by Equation (12):
$$\mathrm{PA} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}} \tag{11}$$
$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}} \tag{12}$$
where $k$ is the number of classes, $p_{ii}$ is the number of pixels correctly segmented, $p_{ij}$ is the number of pixels belonging to class $i$ but predicted as class $j$, and $p_{ji}$ is the number of pixels belonging to class $j$ but predicted as class $i$.
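Both metrics can be computed from a confusion matrix. The sketch below is a minimal illustration on eight made-up pixels with four classes; the function names are ours, and classes that appear in neither the labels nor the predictions are skipped when averaging, which is a design choice of this example.

```python
def confusion_matrix(labels, preds, k):
    """cm[i][j] counts pixels of true class i predicted as class j."""
    cm = [[0] * k for _ in range(k)]
    for true_class, pred_class in zip(labels, preds):
        cm[true_class][pred_class] += 1
    return cm


def pixel_accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_iou(cm):
    k = len(cm)
    ious = []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # class i pixels predicted as another class
        fp = sum(cm[j][i] for j in range(k)) - tp   # other pixels predicted as class i
        union = tp + fn + fp
        if union:                                   # skip classes absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Hypothetical flattened label and prediction maps (4 classes).
labels = [0, 0, 1, 1, 2, 2, 3, 3]
preds = [0, 1, 1, 1, 2, 2, 3, 0]
cm = confusion_matrix(labels, preds, 4)
pa = pixel_accuracy(cm)   # 6 of 8 pixels correct
miou = mean_iou(cm)       # mean of 1/3, 2/3, 1, and 1/2
```

mIoU is lower than PA here because every misclassified pixel is penalized twice, once in the class it was taken from and once in the class it was assigned to.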
3.3. Comparison Experiments
To validate the effectiveness of the proposed model, we conducted experiments on images captured from a real power plant and evaluated our model using a real dataset. We adopted two main evaluation metrics: Mean Intersection over Union (mIoU) and Pixel Accuracy (PA). As shown in
Table 3, our proposed model performed excellently in both PA and mIoU. The large version of our model outperformed other models in both metrics, while the small version had a PA slightly lower than U-Net but still surpassed other comparative methods, such as SegNet, DDRNet [
37], and SmaAt-UNet [
38].
The small version of the proposed model, which incorporates Depthwise Separable Convolution (DSC), had a parameter count of only 26% of U-Net’s, and its inference speed was 79% of U-Net’s speed. In contrast, the large version, which did not incorporate DSC, had a parameter count of 109% of U-Net’s. Moreover, the PA of the small version was only 0.21% lower than that of U-Net, and its mIoU was reduced by just 0.2%. Considering both inference accuracy and computational complexity, the segmentation model proposed in this paper is more suitable for real-world applications in estimating the heating value of fuel in cogeneration.
3.4. Ablation Studies
To further validate the effectiveness and improvements of the proposed model, we designed ablation experiments and used pixel accuracy (PA) and mean intersection over union (mIoU) to assess the impact of different model configurations on overall performance. The experimental results are shown in Table 4. We analyzed the following configurations: the baseline U-Net model, U-Net with the self-attention mechanism (UNet-Atten), U-Net combining self-attention and Transformer (UNet-Atten-Trans), and models that incorporate depthwise separable convolutions (DSC).
From the results, it can be observed that after introducing DSC, the model’s parameter count was significantly reduced to 19% of the original U-Net, and the inference speed improved by approximately 23%, but the accuracy decreased by 2.2%. After introducing the self-attention mechanism in the final layer’s skip connection, although the model’s parameter count increased by 1 MB, the model’s PA improved by 0.94%, mIoU increased by 2.1%, and the inference speed remained largely unchanged. When the bottleneck layer was replaced with a Transformer encoder, the model’s accuracy further increased by 0.95%, and the mIoU improved by 2.4%, but this also led to a significant increase in parameter count and computational complexity.
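To make the source of the parameter reduction concrete, the following sketch counts the weights of a single standard convolution and of its depthwise separable replacement. The layer size (64 input channels, 128 output channels, 3 × 3 kernel) is an illustrative assumption and is not taken from the model in this paper; biases are omitted.

```python
def conv_params(c_in, c_out, kernel=3):
    """Weights in a standard kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out

def dsc_params(c_in, c_out, kernel=3):
    """Weights in a depthwise separable convolution (bias omitted).

    Depthwise step: one kernel x kernel filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    return kernel * kernel * c_in + c_in * c_out

def dsc_ratio(c_in, c_out, kernel=3):
    """Fraction of the standard layer's weights that the DSC layer keeps."""
    return dsc_params(c_in, c_out, kernel) / conv_params(c_in, c_out, kernel)

# Hypothetical layer: 64 -> 128 channels with a 3 x 3 kernel.
standard = conv_params(64, 128)   # 73728 weights
separable = dsc_params(64, 128)   # 8768 weights
ratio = dsc_ratio(64, 128)        # equals 1/128 + 1/9, about 0.119
```

The per-layer ratio simplifies to 1/c_out + 1/kernel², so for 3 × 3 kernels it approaches 1/9 as the channel count grows. A whole-model ratio also depends on which layers are replaced, which this sketch does not attempt to reproduce.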
To validate the effect of introducing the self-attention mechanism in the skip connection and using the Transformer encoder as the bottleneck, we performed tests based on the original U-Net. The results showed that by introducing the self-attention mechanism and replacing the bottleneck layer, both PA and mIoU of the model achieved significant improvements, but the accompanying computational load and parameter count also increased substantially.
Overall, incorporating the self-attention mechanism in the skip connection and replacing the bottleneck layer with a Transformer encoder indeed enhances the model’s segmentation accuracy, but it also increases the computational load and parameter count. By combining DSC, the self-attention mechanism, and the Transformer encoder, the model achieves a good balance between accuracy and computational complexity, making it more suitable for real-time segmentation applications.
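The accuracy and cost trade-off of self-attention can be illustrated with a generic scaled dot-product attention applied to a feature map. This is a sketch of the standard mechanism only, not the attention module of the proposed model: the projection matrices `w_q`, `w_k`, and `w_v` and the feature map size are hypothetical.

```python
import numpy as np

def self_attention(feat, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the pixels of a C x H x W map.

    Returns the attended map (d x H x W) and the N x N attention weights,
    where N = H * W. The projection matrices are illustrative assumptions.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T             # (N, C): one token per pixel
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # each (N, d)
    scores = q @ k.T / np.sqrt(q.shape[1])   # (N, N) pairwise similarities
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    out = weights @ v                        # (N, d): weighted sum of values
    return out.T.reshape(-1, h, w), weights

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))            # hypothetical skip-connection features
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
attended, weights = self_attention(feat, w_q, w_k, w_v)
```

Every pixel attends to every other pixel, so the weight matrix has (H·W)² entries. This quadratic growth is one reason attention adds noticeable computational load as feature maps become larger.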
3.5. Discussion
The modified U-Net-based model proposed in this paper demonstrates strong performance and clear potential for estimating the calorific value of straw fuel. The comparative analysis reveals clear differences between models in pixel accuracy (PA) and mean intersection over union (mIoU). The large version of the model, which does not incorporate depthwise separable convolutions (DSC), outperforms the small version with DSC on both metrics, and also shows significant advantages over other mainstream models such as U-Net, SegNet, DDRNet, and SmaAt-UNet. Although the small version sacrifices some accuracy compared to the large version, it offers faster inference and lower computational complexity, making it more suitable for real-time deployment.
In the ablation study, introducing DSC significantly reduced the model’s parameter count and computational complexity, while improving inference speed, though it led to a slight decrease in accuracy. This indicates that when aiming for efficiency, a trade-off between accuracy and complexity is necessary. The combination of self-attention mechanisms and Transformer encoders provided strong support for performance improvement, though it also increased computational complexity. Through in-depth analysis of different model configurations, we found that appropriate architecture design can achieve a good balance between accuracy and computational complexity.
Additionally, the visual results in Figure 5 show that the segmentation output of our model is very close to that of U-Net, indicating that our approach captures detail and boundary information effectively while keeping the parameter count and computational complexity lower. The figure also shows the results of the other models; although some perform well under specific conditions, they are outperformed overall by the model proposed in this paper. In summary, our model combines good segmentation accuracy with reduced computational complexity and improved real-time performance in practical applications.
Future research could further explore ways to reduce computational complexity while maintaining or improving model accuracy. These investigations will provide more solutions for improving the estimation of calorific value in combined heat and power generation (CHPG) systems using straw combustion, promoting the development and utilization of renewable energy.