1. Introduction
According to the 2024 annual statistics released by the International Renewable Energy Agency (IRENA), the cumulative installed capacity of renewable energy in global power systems reached 3870 GW by the end of 2023, and renewables accounted for 86% of the new power generation capacity installed that year. Photovoltaic (PV) power generation technology, which converts solar radiation directly into electrical energy by using semiconductor materials that generate a voltage under illumination, has become a crucial component of the clean energy conversion system [
1,
2]. With the continuous improvement in PV module conversion efficiency and large-scale application, the demand for intelligent operation and maintenance as well as real-time monitoring systems for PV power stations has become increasingly prominent [
3,
4]. The current wildfire early warning mechanism for power transmission systems relies primarily on satellite infrared remote sensing. However, PV arrays can reach surface temperatures of up to 85 °C under normal operating conditions, and their thermal radiation characteristics closely resemble the infrared spectral response of incipient wildfires, leading to false alarm rates of 23–35% in satellite monitoring systems (IRENA, 2024). Notably, the deployment of large-scale PV power stations in China exhibits significant geographical heterogeneity [
5,
6], with differentiated application scenarios such as floating PV systems on water bodies, building-integrated PV, agriculture–PV complementation, and distributed PV in mountainous regions. These scenarios result in complex spatial distribution patterns of PV modules in remote sensing imagery, posing multidimensional technical challenges for target detection algorithm design. Especially in low-resolution remote sensing imagery (<5 m), PV modules closely resemble artificial structures such as vehicles and greenhouses in geometric shape and texture. Traditional feature extraction methods based on HSV color space and edge detection show significant limitations. When processing panchromatic satellite imagery, the absence of multispectral information and the abundance of rectangular artificial buildings in urban environments make it difficult to distinguish PV modules based solely on grayscale gradient features, leading to high missed-detection and false-detection rates in target recognition systems. Therefore, there is an urgent need to develop efficient and accurate PV panel recognition technology for satellite imagery to improve the accuracy of automatic PV module identification in complex environments. This is of great value for optimizing energy management systems and improving ecological security monitoring systems.
Current research on the automatic detection of PV arrays mainly relies on constructing machine learning models using time-series imagery datasets collected by aerial platforms or unmanned aerial vehicles (UAVs) [
7,
8,
9,
10,
11,
12]. Although these methods can achieve a detection accuracy of over 80%, their technical framework has two limitations: on the one hand, they depend on very high-resolution (VHR) imagery with a pixel size of 0.3–0.8 m as the input data source; on the other hand, they require a manually annotated PV array template library for supervised training, which significantly constrains the generalization ability and engineering applicability of the models. Notably, dynamically monitoring PV facilities from aerial platforms across mainland China, which spans approximately 9.6 million square kilometers, poses challenges in both economic cost and timeliness. Breakthroughs in satellite remote sensing technology provide a solution to this need: high-spatial-resolution remote sensing (HSRRS) systems (such as WorldView-4 and GF-7) have achieved sub-meter (0.3–0.5 m) spatial resolution [
13,
14,
15], and their multispectral channels (400–1040 nm) can effectively capture the characteristic reflection bands of PV modules [
16,
17], laying a data foundation for constructing high-precision PV array extraction algorithms. This validates the necessity of remote sensing imagery analysis technology from a techno-economic perspective.
In recent years, digital image processing methods based on feature engineering [
18,
19,
20,
21,
22,
23,
24] and data-driven deep learning segmentation techniques have constituted the two major technological pathways for PV array extraction. In traditional methods, researchers primarily rely on prior knowledge of the physical characteristics of PV modules to construct extraction models. For example, Grimaccia et al. established an RGB-HSV color space conversion framework, analyzing the chromaticity distribution characteristics of PV panels in the HSV space to construct a dynamic threshold segmentation model for array identification [
19]. Carletti et al. proposed a multi-stage geometric feature extraction algorithm, using the Canny edge detection operator and Hough transform algorithm to identify linear structures in the imagery, followed by morphological screening based on the hexagonal geometric priors of PV panels [
20]. However, traditional methods inherently lack generalization ability in complex ground object scenarios. When processing remote sensing imagery containing interfering targets such as industrial buildings and agricultural greenhouses, extraction models based on dual constraints of color and shape can produce high false detection rates. Semantic segmentation methods based on deep learning demonstrate stronger feature representation capabilities [
25,
26,
27,
28,
29]. The Sizkouhi team integrated the feature extraction layers of the VGG16 Deep Convolutional Neural Network (DCNN) [
30] with the instance segmentation module of Mask R-CNN [
31] to construct an end-to-end PV array detection framework, featuring a multi-scale anchor design in the Region Proposal Network (RPN) [
26]. Jie et al. proposed a lightweight solution [
27], utilizing an efficient network DCNN [
22] and U-Net architecture, which improved the extraction accuracy of small-scale PV arrays.
Current work relies largely on pixel-level image segmentation. While effective for precision tasks such as detecting defects on photovoltaic panels, these methods entail high computational costs and slow inference, making real-time processing of large-scale remote sensing data impractical for timely inspections. YOLO models are single-stage detectors: they regress target locations and classes directly, without the proposal generation and refinement of two-stage detectors, which enables faster inference. They are compatible with sliding-window, block-wise detection of large images, which reduces VRAM usage and suits large-scale image processing. Because they can be exported to ONNX and TensorRT, they can run in real time on edge devices such as Jetson, enabling lightweight deployment and lower hardware costs. This makes it feasible to process TB-scale remote sensing data, support extensive inspections, and keep response latency low. Newer versions such as YOLOv8 and YOLOv10 promise performance improvements but also impose higher computational requirements, which may be unsuitable for embedded devices where resource efficiency is crucial. More critically for industrial applications, these newer versions have not yet been widely adopted, and they can be less stable or require more fine-tuning because the techniques and optimizations they incorporate may not yet be fully validated in all industrial settings. YOLOv5’s performance and lower resource demands make it a more stable choice for certain industrial applications. In large-scale inspections of photovoltaic panels, stability is more important than marginal accuracy gains.
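The block-wise, sliding-window processing mentioned above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the function names (`tile_origins`, `tiles`, `to_global`), the 640-pixel tile size, and the 64-pixel overlap are assumptions chosen for the example.

```python
from typing import Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def tile_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles that cover [0, length) along one axis."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the image border
    return starts


def tiles(width: int, height: int, tile: int = 640,
          overlap: int = 64) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x0, y0, x1, y1) crop windows in row-major order."""
    for y0 in tile_origins(height, tile, overlap):
        for x0 in tile_origins(width, tile, overlap):
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))


def to_global(box: Box, x0: int, y0: int) -> Box:
    """Shift a box predicted inside a tile back to full-image coordinates."""
    xmin, ymin, xmax, ymax = box
    return (xmin + x0, ymin + y0, xmax + x0, ymax + y0)
```

Each window would be passed to the detector independently, so peak memory depends on the tile size rather than on the full scene. Detections duplicated in the overlap regions would still need to be merged, for example with non-maximum suppression on the shifted boxes.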
The primary motivation of this study is to explore structural factors conducive to PV array feature extraction. To this end, we optimized both the local and overall network architecture of the C3 structure, which is the basic feature extraction module of the YOLOv5 model. The core objective of this improvement was to enhance the model’s ability to extract the features of PV panels in different scenarios by adding variable weight functionality to the inputs of different parts of the model. Evaluation metrics for the experimental results included recall, precision, F1 score, and mean Average Precision (mAP), which served as the basis for assessing the effectiveness of the experiments. The optimized model surpassed the original YOLOv5 model in all evaluation metrics, with a 6.13% increase in recall, a 3.06% increase in precision, a 5% increase in F1 score, and a 4.6% increase in mAP. To accurately assess the model’s performance in PV panel detection tasks under different background conditions, we compared the detection results of the model on black-and-white and color images and examined the differences in performance between natural and residential environments. The improved model significantly reduced the high error rates of small target detection in black-and-white images and complex environments.
2. Models and Improvements
The model presented in this study is an improvement based on YOLOv5. YOLOv5 is renowned for its high detection accuracy and rapid detection speed, capable of performing object detection without generating candidate regions. The model is able to output feature maps of various sizes, enabling the prediction of targets of different sizes. It not only predicts the object category within each bounding box but also accurately outputs its location information. Subsequently, the bounding boxes are filtered by calculating the Intersection over Union (IoU) values between the predicted and ground truth bounding boxes. Ultimately, the model outputs the category, confidence score, and precise location information of the bounding boxes.
By leveraging the strengths of YOLOv5, our improved model introduces additional functionalities and optimizations tailored for PV array feature extraction. The C3 structure is a key component of the model’s backbone, and improving the C3 structure can enhance accuracy while maintaining real-time performance. The modifications to the C3 structure and the introduction of variable weight functionality enhance the model’s ability to adapt to different scenarios and improve its performance in detecting PV panels, especially in complex environments and with small targets. The evaluation results demonstrate that our optimized model outperforms the original YOLOv5 in terms of recall, precision, F1 score, and mAP, highlighting the effectiveness of our improvements.
2.1. Brief Overview of the YOLOv5 Model
The architecture of the original YOLOv5 model (
Figure 1) processes feature maps of varying sizes through its backbone and neck structures, ultimately generating multiple output layers to obtain the final predictions. The YOLOv5 model employs convolutional layers, akin to those in Convolutional Neural Networks (CNNs), to construct its backbone for feature extraction, utilizing skip connections to fuse feature information from different layers. Furthermore, the neck structure is responsible for integrating feature maps of different scales to enable multi-scale object detection: lower-level feature maps focus on capturing small objects, while higher-level feature maps are more suited for detecting large objects. During the detection phase, the model maps each pixel in the feature maps to the corresponding class label, confidence score, and anchor box coordinates. Through a series of refinement steps, the model then adjusts the location coordinates of the predicted bounding boxes to ensure the accuracy of the detection results.
The backbone network primarily employs a combination of convolutional layers and C3 modules to progressively reduce the size of the feature maps while enhancing the depth of the features. To enhance the spatial and positional invariance of the model to the input data, thereby improving recognition capabilities, the SPPF (Spatial Pyramid Pooling Fast) operation is introduced in this section. The core of the SPP (Spatial Pyramid Pooling) module lies in pooling technology, which applies different sizes of receptive fields to the same image, effectively capturing multi-scale feature information, as illustrated by the SPP in
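Figure 1. This module first performs max-pooling operations with different kernel sizes on the input feature maps, generating a series of feature maps that summarize different receptive fields. In the original SPP design, these are concatenated and passed through a fully connected layer for dimensionality reduction, ultimately outputting a fixed-size feature vector. In the YOLOv5 implementation, the pooling uses a stride of 1 with padding, so the pooled maps keep the spatial size of the input; they are concatenated along the channel dimension and fused by a 1 × 1 convolution instead of a fully connected layer.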
The Neck section adopts a Feature Pyramid Network (FPN) structure (
Figure 1, Neck), which is a multi-scale feature fusion technique designed to efficiently handle targets of different scales and sizes. This structure fulfills its function by fusing feature maps from different stages of the backbone network. Utilizing upsampling and downsampling techniques, it integrates feature maps from various levels to construct a multi-scale feature pyramid, capturing richer feature information. The fusion process employs a bidirectional design: the path from right to left combines fine-grained features from higher layers with coarse-grained features from lower layers through upsampling operations, achieving effective fusion of features at different levels. Meanwhile, the path from left to right integrates feature maps from different levels through the application of convolutional layers. This bidirectional integration strategy enables the model to synthesize feature information from various levels, generating feature maps that contain multi-scale details, thereby enhancing the accuracy of object detection.
The Head is responsible for the final integration and output of detection results, including the location, size, confidence score, and category information of bounding boxes. This section employs a coupled head design to share features and enhance computational efficiency. This structure not only ensures the accuracy of detection but also optimizes the utilization of computational resources, enabling the model to accurately predict the relevant attributes of targets while maintaining high performance efficiency.
2.2. YOLOv5 Model Improvements
The C3 module is a crucial component of the YOLOv5 network and appears throughout the model. Its primary role is to increase the depth and receptive field of the network, thereby enhancing its feature extraction capabilities. To improve the model’s stability and generalization performance, and to ensure that the activation function has a smooth curve near zero while preventing gradient explosion or vanishing gradients during training, each convolutional block (consisting of Conv1 × 1 and Conv3 × 3) includes a Batch Normalization (BN) layer and a SiLU (Sigmoid Linear Unit) activation function:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
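As a small numerical illustration, SiLU multiplies its input by the logistic sigmoid of that input. The sketch below uses only the standard library; the helper names `sigmoid` and `silu` are chosen for this example and are not taken from the YOLOv5 code base.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x * sigmoid(x)
```

Unlike ReLU, the function is smooth around zero and lets small negative values pass through with a reduced magnitude, which is the property the paragraph above refers to.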
The C3 module serves distinct roles in different components of the network. In the Backbone, the C3 module halves the size of the feature maps to enhance the network’s receptive field while simultaneously reducing computational load. This enables the network to focus more on global object information, thereby improving the efficacy of feature extraction. Conversely, in the Neck component, the C3 module maintains the spatial resolution of the feature maps, preserving the local information of objects more effectively. Additionally, it facilitates further feature extraction, augmenting both the depth and receptive field of the network. Regarding the optimization of the C3 module, our enhancement strategies encompass two dimensions: firstly, adjusting the internal structure of the module to a C3_k configuration, as illustrated in
Figure 2, and secondly, optimizing the module’s role within the entire network from a holistic perspective.
The improved C3_k module introduces a dynamic feature weight allocation mechanism. Its formulation involves three feature terms: the feature after the direct single-layer convolution before concatenation, the feature before the residual connection, and the feature after the convolution layers that run in parallel with the residual connection. Two learnable weight parameters scale these terms, one for the raw features and one for the processed features (both are fixed at 1 in the original C3). By allowing the raw-feature weight to be smaller than the processed-feature weight, the model can reduce the direct transmission of original features (suppressing redundant information) and enhance the expression of higher-order features, which conforms to the principle of “neurons strengthen connections through synchronous activation” in the Hebbian learning rule. From the perspective of the neuroscience basis of optimized residual connections, residual connections are essentially “highways” for cross-layer information transmission. However, the fixed weights in the original C3 may lead to gradient confusion. The C3_k module achieves a dynamic balance between short-term memory (current layer features) and long-term memory (cross-layer features) by changing the weights, alleviating feature collapse in deep networks. The two weight coefficients can be regarded as a parameter-free attention, and their adjustment process is similar to the channel attention in SENet but has a lower computational cost.
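One possible way to write this weighted fusion is sketched below. The notation is assumed for illustration only: $F_d$ is the feature after the direct single-layer convolution before concatenation, $F_r$ is the feature before the residual connection, $F_c$ is the feature after the convolution layers in parallel with the residual connection, $\alpha$ is the raw-feature weight, and $\beta$ is the processed-feature weight. Applying $\alpha$ to both $F_d$ and $F_r$ is also an assumption, based on the description of down-weighting raw features in both the skip connection and the concatenation.

```latex
% Hypothetical notation; not the authors' original symbols.
\begin{aligned}
  B &= \alpha\, F_r + \beta\, F_c
      && \text{(weighted residual connection)} \\
  Y &= \mathrm{Conv}\big(\mathrm{Concat}(\alpha\, F_d,\; B)\big)
      && \text{(weighted concatenation and fusion)} \\
  \text{original C3:}\quad & \alpha = \beta = 1,
  \qquad \text{C3\_k:}\quad \alpha < \beta
\end{aligned}
```

Under this reading, setting $\alpha = 0.75$ and $\beta = 1$ scales every raw-feature path by the same factor while leaving the processed features unchanged.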
The C3_k structure primarily enhances the feature extraction mechanism of the C3 module by emphasizing the reduction in participation weights for raw features, thereby augmenting the prominence of processed features. The core of this improvement lies in modulating the weight allocation of features from different sources during the feature extraction process to optimize the handling of information flow. This involves the following: (1) Optimizing the weight distribution of residual connections to ensure a more effective fusion of features from previous layers with those of the current layer, thereby enhancing feature integration efficiency and model representation capability. (2) Adjusting the feature weights before and after convolutional operations, aiming to strengthen the model’s information capture ability at various stages of feature extraction through weight optimization of features both pre- and post-convolution.
In the process of skip connections and convolutional fusion, we introduce a weight parameter k to regulate the degree of participation of raw features, enabling the model to focus more on features that have undergone complex processing. This further improves feature utilization and reduces the model’s sensitivity to noise and details in training data, thereby mitigating the potential for overfitting. Through this weight adjustment, we aim to enhance the model’s generalization ability, enabling it to maintain good performance on unseen data. If the model’s performance remains relatively unchanged with reduced weights, it may indicate that the model is less sensitive to weight adjustments or can compensate for such reductions through other means. Conversely, if the model’s performance improves as a result, it suggests that reducing the weight of raw features in subsequent feature extraction processes helps to discard certain redundant features, thereby enhancing the model’s focus on extracting effective features. Additionally, this result may imply that by reducing reliance on raw features, the model can more effectively screen out features crucial to the task, ultimately improving overall detection performance.
From a global perspective, as shown in
Figure 3, in the optimization of our model, we apply weight adjustments to each section where the C3 modules are involved. Considering the potential issue of feature reduction due to weight adjustments, particularly the reduction in effective features after numerous training epochs, we adopted the following three distinct weight adjustment strategies to explore their differentiated impacts on photovoltaic panel detection performance:
- (1) Equal Weights in Backbone Only: Special attention is given to the Backbone, where an equal-weight strategy is applied. Based on empirical selection, we set kb1 = kb2 = kb3 = 0.75 in the C3_k modules of the Backbone, so that the raw-feature weight is 0.75 and the processed-feature weight is 1. This strategy aims to uniformly enhance the basic feature extraction capability, ensuring that the model treats feature information from different layers evenly at the initial stage of feature extraction. Meanwhile, we keep the weights in the Neck unchanged, i.e., kn = 1 in the C3_k modules of the Neck, so that both weights equal 1. This focuses the change on the basic feature extraction efficiency of the Backbone without interfering with the feature fusion and transmission processes.
- (2) Globally Increasing Weights: A global weight increment strategy is adopted, aiming to gradually increase the proportion of raw features in the overall feature processing as the model depth increases. Starting from the initial weight kb1 = 0.75 in the Backbone, we incrementally adjust the subsequent weight parameters kb2 and kb3 and the weight parameters kn1, kn2, kn3, and kn4 in the Neck, with each stage’s weight increased by 0.025 (2.5 percentage points) relative to the previous stage. That is, for the l-th C3_k module in the model, the raw-feature weight is set to 0.75 + 0.025(l − 1), while the processed-feature weight remains 1. As the model layers deepen, the importance of raw features gradually increases, thereby comprehensively enhancing the feature processing capabilities at all stages of the model. Through gradual increases in weights across the global scope, potential losses of feature information are reduced, and we assess whether the model can continue to learn rich and hierarchical features throughout the training process. In this way, we expect the model to maintain higher sensitivity and stronger feature expression ability in complex data environments, promoting more refined and in-depth feature processing at different levels, and ultimately achieving superior performance in the final detection task.
- (3) Global Equal Weights: A strategy of global weight uniformization is adopted, where all C3_k modules in the model have equal weights, i.e., every kb and kn equals kb1 = 0.75, so that the raw-feature weight is 0.75 and the processed-feature weight is 1 throughout the entire model. No specific C3_k module is given higher priority, ensuring that each module has the same weight during the feature extraction and fusion processes. We evaluate the detection performance of the model without the influence of weight bias. At the same time, this is also the approach with the smallest total global weight, used to verify whether the model’s feature extraction capability is more efficient when the raw feature weights are relatively small.
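The three strategies above can be summarized as a per-stage schedule of the raw-feature weight. The sketch below is illustrative only: the stage names follow the text (kb1 to kb3 in the Backbone, kn1 to kn4 in the Neck), while the function name `weight_schedule` and the strategy labels are assumptions introduced for this example.

```python
from typing import Dict

BACKBONE = ["kb1", "kb2", "kb3"]
NECK = ["kn1", "kn2", "kn3", "kn4"]


def weight_schedule(strategy: str, k0: float = 0.75,
                    step: float = 0.025) -> Dict[str, float]:
    """Raw-feature weight for every C3_k stage under one strategy."""
    stages = BACKBONE + NECK
    if strategy == "backbone_equal":      # strategy (1)
        return {s: (k0 if s in BACKBONE else 1.0) for s in stages}
    if strategy == "global_increasing":   # strategy (2): k0 + step * (l - 1)
        return {s: round(k0 + step * i, 4) for i, s in enumerate(stages)}
    if strategy == "global_equal":        # strategy (3)
        return {s: k0 for s in stages}
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for name in ("backbone_equal", "global_increasing", "global_equal"):
        schedule = weight_schedule(name)
        print(name, schedule, "total =", round(sum(schedule.values()), 4))
```

Summing each schedule makes the remark in strategy (3) concrete: with these values, the global equal configuration has the smallest total raw-feature weight of the three.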
3. Experiments and Methods
3.1. Data Preprocessing
The data were sourced from Google Satellite Maps, captured in July 2022, and uniformly scaled to 640 × 640 pixels using bilinear interpolation to maintain geometric consistency. All images were full-color RGB images. A total of 2100 raw images were used as the dataset. The photovoltaic panels in these images were manually annotated using the LabelImg tool to create dataset labels. Of these, 1900 images were allocated for training, while 200 were set aside for testing the model. The experiment was conducted on a platform built with a Kunpeng 920 CPU and equipped with four NVIDIA Tesla A100 GPUs, each possessing 40 GB of graphics memory.
Photovoltaic panels typically occur in clusters, yet their shapes and sizes vary considerably due to the diversity of their environments and the clarity of remote sensing imagery, encompassing landscapes such as urban areas, forests, the Loess Plateau, and farmland. Furthermore, some photovoltaic panels resemble other objects detectable by remote sensing, such as vehicles, rooftops, and greenhouses, making it difficult to distinguish them based on features like color and shape, especially in low-resolution imagery or when dealing with small targets. This situation not only significantly elevates the complexity of feature extraction but also poses substantial challenges for data annotation tasks. To address this, we segment the clusters of photovoltaic panels within an image into multiple detection bounding boxes, treating those that are relatively far apart as independent detection targets. This approach minimizes the proportion of background within the detection bounding boxes while also reducing the morphological discrepancies among objects within each box, thereby enhancing the degree of compatibility.
Regarding the unification of images and labels, the input images are uniformly resized to dimensions of [640, 640]. Concurrently, the coordinates of the label detection bounding boxes (xmin, ymin, xmax, ymax) are adjusted according to the resized image to obtain the corresponding new coordinates (xmin_new, ymin_new, xmax_new, ymax_new):

$$x_{min\_new} = x_{min}\cdot\frac{nw}{iw},\quad y_{min\_new} = y_{min}\cdot\frac{nh}{ih},\quad x_{max\_new} = x_{max}\cdot\frac{nw}{iw},\quad y_{max\_new} = y_{max}\cdot\frac{nh}{ih}$$

where iw and ih represent the width and height of the original image, respectively, while nw and nh denote the width and height of the adjusted image. Subsequently, the coordinates, initially expressed in the form of minimum and maximum values (xmin, ymin, xmax, ymax), are transformed into a format indicating the center point, width, and height, namely (x, y, w, h):

$$x = \frac{x_{min} + x_{max}}{2},\quad y = \frac{y_{min} + y_{max}}{2},\quad w = x_{max} - x_{min},\quad h = y_{max} - y_{min}$$
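The rescaling and format conversion described above can be sketched in a few lines. This assumes a direct resize without padding, with boxes given in pixels; the function names `rescale_box` and `corners_to_center` are chosen for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def rescale_box(box: Box, iw: int, ih: int,
                nw: int = 640, nh: int = 640) -> Box:
    """Map (xmin, ymin, xmax, ymax) from an iw x ih image to an nw x nh image."""
    xmin, ymin, xmax, ymax = box
    sx, sy = nw / iw, nh / ih  # horizontal and vertical scale factors
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)


def corners_to_center(box: Box) -> Box:
    """Convert (xmin, ymin, xmax, ymax) to (x, y, w, h) with a center point."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)


if __name__ == "__main__":
    resized = rescale_box((100, 50, 300, 150), iw=1280, ih=1280)
    print(resized, corners_to_center(resized))
```

Whether the resulting values are additionally normalized by the image size is not stated here, so the sketch keeps them in pixels.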
The experiment employs a single-stage detection method, which leverages the automatic adjustment of bounding boxes during the training process to better accommodate targets of varying sizes and aspect ratios. To achieve this, an adaptive anchor box approach is adopted, where the sizes and shapes of anchor boxes are automatically adjusted according to the target dimensions and aspect ratios. Prior to inputting images into the model, an initialization calculation for adaptive anchor boxes is conducted on the detection boxes, involving the initialization of prior boxes tailored to targets of different sizes, as illustrated in
Figure 4. This ensures that the model can generate appropriate anchor boxes when dealing with targets of varying sizes, thereby enhancing detection accuracy and efficiency.
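A common way to initialize anchor boxes from labeled data is to cluster the (w, h) pairs of the annotated boxes. The sketch below uses k-means with an IoU-based similarity, which is a stand-in technique shown for illustration; it is not claimed to be the exact anchor routine used in this study, and the function names and the deterministic initialization are assumptions.

```python
from typing import List, Tuple

WH = Tuple[float, float]  # (width, height)


def wh_iou(a: WH, b: WH) -> float:
    """IoU of two boxes that share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes: List[WH], k: int, iters: int = 50) -> List[WH]:
    """Cluster (w, h) pairs into k anchors, returned sorted by area."""
    by_area = sorted(boxes, key=lambda b: b[0] * b[1])
    n = len(by_area)
    # Deterministic start: k boxes spread evenly over the area ranking.
    centers = [by_area[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        clusters: List[List[WH]] = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, centers[i]))
            clusters[best].append(b)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:  # the mean width and height become the new anchor
                new_centers.append((sum(w for w, _ in members) / len(members),
                                    sum(h for _, h in members) / len(members)))
            else:        # keep the old anchor if a cluster is empty
                new_centers.append(centers[i])
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters when small PV arrays and large PV stations appear in the same dataset.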
3.2. Loss Function
Given that the model is specifically tailored for identifying solar photovoltaic panels, the loss contribution related to classification is not pivotal for model optimization. Consequently, in the design of the loss function, the weight assigned to the classification loss is correspondingly decreased, thereby enabling the bounding box loss and confidence loss to assume a more prominent role in the overall loss.
For the bounding box loss, the GIoU loss (Lossgiou) is employed to quantify the discrepancy between the predicted bounding box and the ground truth bounding box. This method comprehensively evaluates the positioning accuracy of the predicted bounding box by considering both the intersection and union of the predicted and ground truth bounding boxes, as well as incorporating the area of the smallest enclosing box that encompasses both the predicted and ground truth bounding boxes:

$$\mathrm{IoU} = \frac{I}{A^{p} + A^{g} - I},\qquad \mathrm{Loss}_{giou} = 1 - \mathrm{IoU} + \frac{A^{c} - \left(A^{p} + A^{g} - I\right)}{A^{c}}$$

where $A^{p}$ represents the area of the predicted bounding box, $A^{g}$ denotes the area of the ground truth bounding box, $I$ is the area of their intersection, and $A^{c}$ signifies the area of the smallest enclosing box.
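A direct computation of the GIoU loss for two axis-aligned boxes is sketched below, assuming boxes in (xmin, ymin, xmax, ymax) form with positive area. The function name `giou_loss` is chosen for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def giou_loss(pred: Box, gt: Box) -> float:
    """GIoU loss, 1 - GIoU, for two boxes with positive area."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    # Intersection is zero when the boxes do not overlap.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = area_p + area_g - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou
```

The loss is 0 for identical boxes and grows beyond 1 as disjoint boxes move apart, so it still provides a useful signal when the predicted and ground truth boxes do not overlap at all.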
For the evaluation of confidence and classification, we utilized the Binary Cross-Entropy Loss (BCELoss) method to measure the discrepancy between the true values and the predicted values. The accuracy of the predictions was assessed by computing the cross-entropy between the true values and the predicted values. Specifically, for the confidence loss, BCELoss was employed to evaluate the probability that a predicted bounding box belongs to the category of solar photovoltaic panels, i.e., the confidence level of the model’s detection of solar photovoltaic panels. For the classification loss, BCELoss was used to assess the accuracy of the model’s predictions regarding the category of solar photovoltaic panels:

$$\mathrm{BCELoss} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log p(y_i) + (1 - y_i)\log\big(1 - p(y_i)\big)\Big]$$

where y represents the binary label, which can be either 0 or 1, p(y) denotes the predicted probability that the output belongs to the label, and N signifies the number of predictions over which the loss is averaged.
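The binary cross-entropy can be evaluated with a few lines of standard-library code. In the sketch below, the function name `bce_loss` and the clipping constant `eps` are assumptions added for numerical safety.

```python
import math
from typing import Sequence


def bce_loss(labels: Sequence[int], probs: Sequence[float],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and equally long")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

Confident correct predictions drive the loss toward zero, while a prediction of 0.5 for every sample gives a loss of ln 2, roughly 0.693.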
3.3. Detection Box Adjustment
During the model’s computation process, a series of predicted bounding boxes are obtained based on feature maps of varying sizes. These predicted bounding boxes encapsulate the location and size information of the targets predicted by the model, defined by the x and y coordinates of the target center, as well as its width w and height h. In the task of solar photovoltaic panel detection, multiple candidate bounding boxes may be generated for the same target, potentially overlapping and leading to redundant bounding boxes. Therefore, to enhance detection accuracy and efficiency, we employ the Non-Maximum Suppression (NMS) algorithm to determine the optimal target bounding boxes and minimize redundant bounding boxes. For the predicted bounding boxes at each layer, the NMS process is illustrated in
Figure 5, which details how the final bounding boxes are determined by comparing the scores and overlapping areas of different predicted bounding boxes.
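A minimal greedy NMS is sketched below for boxes already converted to (xmin, ymin, xmax, ymax) form. The IoU threshold of 0.45 is a hypothetical value chosen for the example, not a setting reported in this study.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_thr: float = 0.45) -> List[int]:
    """Return indices of the boxes kept, in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep
```

For example, with two heavily overlapping boxes scored 0.9 and 0.8 and a distant box scored 0.7, the first and third boxes are kept and the second is suppressed.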
4. Experimental Results
Systematic experiments were conducted in this study to evaluate the performance of the original YOLOv5 model, the globally progressive weight-modified model, the locally equal weight-modified model, and the globally equal weight-modified model in the task of solar photovoltaic panel detection. The evaluation metrics employed to assess model performance ultimately included recall, precision, F1 score, and mean Average Precision (mAP), with an Intersection over Union (IoU) threshold of 0.5, as presented in
Table 1. Recall reflects the proportion of actual solar photovoltaic panel samples that are correctly detected by the model among all actual samples. Precision, on the other hand, measures the proportion of samples predicted as solar photovoltaic panels that are actually solar photovoltaic panels. The F1 score, being the harmonic mean of recall and precision, provides a comprehensive indication of the model’s overall performance in the task of solar photovoltaic panel detection.
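These definitions translate directly into counts of true positives (TP), false positives (FP), and false negatives (FN), where a prediction counts as a true positive when its IoU with a ground truth box reaches the 0.5 threshold. The function name and the example counts below are illustrative, not experimental values.

```python
from typing import Tuple


def detection_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
    print(detection_metrics(tp=90, fp=10, fn=30))
```

Because F1 is a harmonic mean, it stays close to the lower of the two values, so a model cannot reach a high F1 by trading away recall for precision or the reverse.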
To evaluate the performance of the proposed model, we selected the C2f structure of YOLOv8, Mask R-CNN, and U-Net as comparison models. C2f (cross-stage partial networks with feature fusion) is an efficient network design in YOLOv8. It uses cross-stage partial networks and feature fusion to maintain high detection accuracy while reducing computational complexity. Comparing the proposed model with the C2f structure of YOLOv8 makes its advantages and disadvantages in the target detection task directly visible. Although the proposed model is not as accurate as YOLOv8, it shows advantages in recall rate. The experimental results of Mask R-CNN and U-Net are inferior to those of the proposed model. This result shows that the proposed model can relatively reduce missed detections in the photovoltaic panel detection task, helping to ensure that photovoltaic panels are fully identified.
The experimental results demonstrate that regardless of whether it is the globally progressive weight model, the locally equal weight model, or the globally equal weight model, reducing the weights of the original features leads to varying degrees of improvement in the overall performance of the models. This suggests that by decreasing the participation weights of the original features, the models can focus more on extracting effective features, thereby exhibiting enhanced performance in the task of solar photovoltaic panel detection. Among these models, the globally equal weight model demonstrates the best overall performance across various indicators, indicating that this weight configuration strategy not only aids the model in learning more critical feature information during training but also helps maintain good performance stability when processing complex data. The improvement in performance validates the importance of appropriately reducing the weights of original features during the training process.
Overall, reducing the weights of the original features had a positive effect across the experiments. Whether applied locally or globally, the relative equality of the weights had the greater influence on the results. With smaller weights in the early stages and an equal weight distribution in the later stages, the accuracy of photovoltaic panel detection improved significantly. This may reflect a dynamic compensation effect: equal weights balance the feature transmission path, allowing the original features to participate continuously in fusion at different levels and avoiding the information loss caused by excessively high weights at certain stages. Smaller weights at deeper layers further enhanced accuracy. A possible reason is that excessively large weights at deeper layers suppress details from shallower layers, while most photovoltaic panel targets are small and depend heavily on high-resolution shallow features. When shallow features are not suppressed by deep features during fusion, details continue to reach the detection head. The locally equal weight model increased the weights in the neck, where excessive reinforcement of deep features might amplify misjudgments caused by background disturbances (such as shadows and regular textures), increasing the risk of false detections.
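The idea of down-weighting the original features during fusion can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the modified C3 module itself: the fusion rule `weight * original + transformed`, the `toy_transform` stand-in for the convolutional branch, and all weight values are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def weighted_fusion(original: Vector, transformed: Vector, weight: float) -> Vector:
    """Fuse features element-wise as weight * original + transformed."""
    if len(original) != len(transformed):
        raise ValueError("feature vectors must have the same length")
    return [weight * o + t for o, t in zip(original, transformed)]


def run_stages(x: Vector,
               transform: Callable[[Vector], Vector],
               weights: Sequence[float]) -> Vector:
    """Apply one weighted fusion per stage, using that stage's weight."""
    for w in weights:
        x = weighted_fusion(x, transform(x), w)
    return x


def toy_transform(x: Vector) -> Vector:
    """Stand-in for the learned branch; here it simply halves each value."""
    return [0.5 * v for v in x]


# Hypothetical schedules for a three-stage network (illustrative values only).
identity_weights = [1.0, 1.0, 1.0]       # ordinary shortcut, original features at full weight
equal_weights = [0.5, 0.5, 0.5]          # same reduced weight at every stage
progressive_weights = [0.75, 0.5, 0.25]  # weight shrinks with depth

features = [1.0, 2.0]
print(run_stages(features, toy_transform, identity_weights))     # [3.375, 6.75]
print(run_stages(features, toy_transform, equal_weights))        # [1.0, 2.0]
print(run_stages(features, toy_transform, progressive_weights))  # [0.9375, 1.875]
```

With full-weight shortcuts the original features dominate and grow at every stage, whereas the reduced schedules keep their contribution bounded. In a real network the transformed branch is learned, so the effect on accuracy has to be measured experimentally, as done above.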
Regarding recall rate, when the weights are small, global equality may lower the weights of shallow features, suppressing edges, textures, and other details, which increases false negatives and lowers the recall rate. In the locally equal weight model, by contrast, shallow details were preserved: a shallow weight of 0.75 retained some detailed features and avoided excessive noise interference while enhancing deep semantics. This strengthened the transmission of high-level semantic features (such as the overall shape and arrangement pattern of photovoltaic panels), improved the detection of challenging samples (such as occluded and small targets), and reduced false negatives.
4.1. Color and Black-and-White Image Recognition
Solar photovoltaic panel identification on color and black-and-white images was compared on the validation set after 500 and 1000 training iterations. The aim was to analyze how well the features of solar photovoltaic panels are recognized in different image formats and to verify whether the model can accurately identify solar photovoltaic panels in black-and-white images.
During the mid-stage of training, as shown in Figure 6, the model recognized solar photovoltaic panels well in color images. On black-and-white images, however, although the model was able to identify small target solar photovoltaic panels, the error rate was relatively high. Misidentifications often occurred in regions containing intermediate lines, which were falsely identified as solar photovoltaic panels. The improved models exhibited relatively lower error rates on black-and-white images. The globally equal weight model was more accurate on similar objects in black-and-white images, indicating that it no longer relies solely on gap features to decide whether a target is a solar photovoltaic panel. Instead, it analyzes features more comprehensively and thereby generates fewer redundant bounding boxes.
After 1000 iterations, the number of false positives and redundant bounding boxes is significantly reduced across the models, as illustrated in Figure 7. Although the models still retain some false positives on black-and-white images, these are primarily targets that are highly similar to solar photovoltaic panels. The YOLOv5 model, in contrast, not only retains such false positives but also generates individual redundant bounding boxes and a relatively larger number of detection boxes on black-and-white images. This may be due to the limited ability of YOLOv5 to extract features that distinguish solar photovoltaic panels from similar objects in black-and-white images. Both the locally equal weight model and the globally progressive weight model retain relatively few false positives on black-and-white images, all of them highly similar to solar photovoltaic panels. The globally equal weight model retains the fewest false positives on black-and-white images, indicating that it is the most effective at reducing redundant and false bounding boxes and therefore performs best in solar photovoltaic panel detection.
This also confirms that by reducing weights, the model can effectively discard non-critical redundant features and focus more on those that are crucial for the task of solar photovoltaic panel detection. Especially when processing black-and-white images, the model can more accurately identify solar photovoltaic panels while reducing misidentifications of other similar objects.
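The distinction drawn above between false positives and redundant bounding boxes can be illustrated with an intersection-over-union (IoU) matching step. The sketch below is an assumption about how such boxes could be counted, not the evaluation procedure used in the experiments; the 0.5 threshold, the three-way labels, and the example boxes are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def classify_detections(detections: Sequence[Box],
                        ground_truth: Sequence[Box],
                        threshold: float = 0.5) -> List[str]:
    """Label each detection; detections are assumed sorted by descending confidence."""
    matched = set()  # indices of ground-truth panels already claimed
    labels = []
    for det in detections:
        best_idx, best_iou = -1, 0.0
        for idx, gt in enumerate(ground_truth):
            score = iou(det, gt)
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_iou < threshold:
            labels.append("false_positive")   # no real panel underneath
        elif best_idx in matched:
            labels.append("redundant")        # duplicate box on a found panel
        else:
            matched.add(best_idx)
            labels.append("true_positive")
    return labels


# Hypothetical example: one real panel, three predicted boxes.
panels = [(0.0, 0.0, 10.0, 10.0)]
predictions = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
print(classify_detections(predictions, panels))
# ['true_positive', 'redundant', 'false_positive']
```

Standard detection metrics usually count redundant boxes as false positives as well; they are separated here only to match the two error types discussed in this section.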
4.2. Identification of Photovoltaic Panels in Natural and Residential Environments
The performance of the model under different environmental conditions was analyzed by comparing the recognition of solar photovoltaic panels in residential and natural environments on the validation set after 500 and 1000 training iterations. In the 500-iteration experiment, as shown in Figure 8, the model was able to identify small target solar photovoltaic panels in residential environments but with a relatively high error rate. This may be due to the model’s tendency to rely on shape features for discrimination, which leads to many gray–black or rectangular objects being falsely identified as solar photovoltaic panels. Such errors may stem from the blurriness of remote sensing images in residential environments and the loss of the target’s three-dimensional features. In natural environments, the model tends to falsely identify shadowed gaps in darker areas as solar photovoltaic panels. Here the model tends to prioritize color features, so large black areas with light and shadow gaps may be falsely recognized as solar photovoltaic panels.
In the 1000-iteration experiment, as shown in Figure 9, the number of false positives for small target detection in residential environments decreased significantly, with only individual errors remaining. This indicates that with further training, the model’s ability to recognize small target solar photovoltaic panels in residential environments improved significantly. An analysis of the remaining errors shows that in cluttered residential backgrounds, the model still fails when colors or shapes are extremely similar and difficult to distinguish. In natural environments, YOLOv5 still produces a relatively high number of false detections, while the improved models recognize solar photovoltaic panels more reliably. They distinguish more accurately between shadows and solar photovoltaic panels, thereby reducing false positives. This improvement may be attributed to a better representation of light and shadow variations in natural environments, as well as to more accurate extraction of the color features of solar photovoltaic panels.
This observation further confirms that by reducing weights, the model can focus more on the features that are crucial for the task of solar photovoltaic panel detection, thereby extracting more effective features in both natural and residential environments. This weight adjustment strategy not only optimizes the feature extraction process of the model but also helps improve its detection performance in complex environments.
5. Conclusions
To address the key issues of high false alarm rates and poor adaptability to complex scenes in the remote sensing monitoring of photovoltaic arrays, this study proposes an intelligent recognition method for solar photovoltaic components in satellite imagery based on an improved YOLOv5 model. By optimizing the structural design of the model’s C3 feature extraction module and introducing a weight-adaptive adjustment mechanism, the model’s ability to represent features of multi-scale and multi-type solar photovoltaic components is significantly enhanced. The experimental results show that, compared to the original model, the improved model achieves improvements of 6.13%, 3.06%, 5%, and 4.6% in recall, precision, F1 score, and mean Average Precision (mAP), respectively. This effectively addresses the confusion between solar photovoltaic components and artificial structures in low-resolution panchromatic imagery and reduces the error rate for small target detection in black-and-white images and complex scenes by 19.8%. Compared to traditional methods that rely on very high-resolution aerial imagery and manual annotation templates, the proposed technical solution demonstrates significant advantages in data acquisition cost, model generalization ability, and engineering applicability. It provides a feasible technical path for the dynamic monitoring of large-scale photovoltaic facilities.