The Impact of Data Augmentation on AI-Driven Predictive Algorithms for Enhanced Solar Panel Cleaning Efficiency

Al-Humairi, Ali; Khalis, Enmar; Al Hemyari, Zuhair A.; Jung, Peter

doi:10.3390/pr13041195

¹ Department of Communication Technology, Duisburg-Essen University, 47057 Duisburg, Germany

² Computer Science Department, Faculty of Engineering and Computer Science, German University of Technology in Oman, Muscat 130, Oman

³ Department of Mathematical and Physical Science, College of Science, University of Nizwa, Nizwa 616, Oman

* Author to whom correspondence should be addressed.

Processes 2025, 13(4), 1195; https://doi.org/10.3390/pr13041195

Submission received: 2 March 2025 / Revised: 7 April 2025 / Accepted: 10 April 2025 / Published: 15 April 2025

(This article belongs to the Special Issue Solar Technologies and Photovoltaic Systems)

Abstract:

This study investigates the impact of data augmentation on machine learning models for predictive maintenance in robotic solar panel cleaning. Data augmentation techniques, including synthetic data generation, time-series transformation (shifting, interpolation, and resampling), and extreme condition simulation, were used to enhance data diversity and model generalization. Machine learning algorithms, including logistic regression, support vector machines, deep learning, and ensemble learning, were compared to assess their sensitivity to these techniques. Our experimental findings indicate that ensemble models (stacking and boosting) gain the most in predictive accuracy, with the added benefit of higher diversity and strength in features. Deep learning models show moderate gains, primarily in feature extraction, and simple models such as logistic regression show little change, indicating that the effectiveness of data augmentation is model-dependent. Despite better generalization, ensemble methods come at the expense of increased computational cost, indicating a trade-off between accuracy and efficiency. The study employs widely used machine learning frameworks and libraries for data preprocessing, augmentation, model training, and evaluation, ensuring robust and scalable implementation.

Keywords:

data augmentation; machine learning; predictive maintenance; solar panel cleaning; ensemble learning; deep learning; logistic regression; time-series transformation

1. Introduction

Renewable Energy Sources (RESs), particularly photovoltaic (PV) systems, have emerged as a leading source of energy today due to their cost-effectiveness and environmental advantages [1]. PV systems provide a viable solution for electrification, particularly in rural and remote areas where access to conventional electricity is typically restricted. These under-served communities tend to rely on traditional, non-sustainable energy sources for lighting and cooking, which results in environmental destruction and health risks [2,3,4,5]. PV energy systems supply a sustainable, economically feasible electrification solution in both urban and rural areas. In developing countries, the integration of solar energy can become the centerpiece for enhancing education, healthcare facilities, and economic growth. In particular, the agricultural sector, a major contributor to economic output in most of the developing world, stands to benefit significantly from the integration of solar energy. Farming in rural areas suffers from the lack of reliable electricity, which makes it hard to use modern irrigation and other helpful farming tools. Through the use of solar PV power, farmers can adopt more sophisticated irrigation techniques, increase their productivity, and strengthen national food security and economic resilience [3].

In addition to its economic and social benefits, solar PV power plays a crucial role in promoting environmental sustainability. As opposed to the use of fossil fuels for power generation, solar power does not produce any greenhouse gases or noise or result in the excess use of water. Furthermore, where deforestation is driven mainly by fuel demand in rural regions, solar energy access can play a key role in conserving forests and wildlife [2]. These advantages concur with world sustainability targets, promoting the application of PV in large quantities. While promising, the installation of solar PV systems in large quantities is marred by several challenges, particularly in developing countries. One of the largest challenges involves the deposition of dust (soiling) and other impurities in the atmosphere on solar photovoltaic panels. This effect causes a significant decrease in electricity generation. Temperature variations, humidity, air quality, and solar panel tilt angles also influence the efficiency of solar panels. Of these factors, dust accumulation is a source of significant concern, especially in dry and desert regions where the scope for solar power is higher. The problem is intensified by seasonal changes, leading to the loss of efficiency, which calls for periodic cleaning and maintenance [4].

One study proposes Mean Gaussian Noise (MGN), a new data augmentation method that enhances model performance by generating global synthetic time series data. Experimental results on the SWAN-SF dataset indicate that MGN can enhance data diversity, increase classification accuracy, and preserve competitive computational efficiency [5]. A second study proposes an integrated generative data augmentation framework that improves solar panel soiling localization by addressing unbalanced datasets. The resulting Translucent dataset improves segmentation accuracy by 14.59%, providing a more realistic and denser dataset that supports solar energy efficiency and predictive maintenance applications [6].

Traditional cleaning methods, which are resource-intensive and consume large amounts of water, are not feasible for large solar farms. The solution lies in developing intelligent, automated cleaning systems that conserve resources and offer high energy efficiency. The current research proposes a new smart robotic cleaning system based on artificial intelligence (AI) that maximizes the efficiency of solar panels by integrating machine learning (ML) algorithms with robotic automation. This research aims to analyze the environmental and operational parameters that determine the effectiveness of solar panels, design energy-efficient robotic cleaning systems that account for various environmental parameters, and evaluate the proposed system’s performance through AI-based optimization techniques for lowering operational costs and increasing energy yield. The rest of this paper is structured as follows: Section 2 overviews the previously developed automated solar panel cleaning systems, their importance, and their limitations. Section 3 outlines the research approach involving problem formulation, data collection, preprocessing, and investigating important environmental factors such as dust and temperature. It also discusses data augmentation techniques and how they help improve the performance of predictive models in cleaning optimization.

Section 4 describes the data preparation process, such as anomaly detection, feature engineering, and correlation analysis for analyzing the dynamic use of machine learning models in a way that optimizes cleaning schedules to maximize energy efficiency and reduce operation cost. Section 5 provides a comparative evaluation of the cleaning models in terms of their accuracy, precision, recall, and training time, as well as an analysis of their strengths and weaknesses. Section 6 summarizes the paper with the key recommendations, a critical assessment of the proposed methodology, and an analysis of implementing continuous learning methods in subsequent solar panel cleaning systems. Section 7 identifies the current study limitations, for example, data sparsity, sensor unreliability, and environment variability, and highlights avenues for future research, for example, expanding the dataset over different locations, integration with real-time weather forecasts, and deployment of more adaptive and self-improving AI models to further enhance system performance and scalability.

By addressing the major problems of solar panel maintenance through AI-driven automation, this study contributes to the cost-effective and sustainable development of solar photovoltaic energy. The proposed approach is designed to provide solar energy to developing countries on a large scale, ultimately contributing to world sustainability and the transition to clean energy [7].

2. Related Work

Technical and environmental factors that affect PV energy production have been extensively reviewed. To keep the review current and avoid duplication, this section focuses on research directly related to the topic of this study: the impact of soiling on solar PV efficiency and the application of machine learning to the optimization of automated cleaning systems.

Numerous research projects have examined the effects of environmental parameters and technical factors on solar PV panel efficiency. The ambient temperature [8,9], humidity [10,11], and solar irradiation [7,12] are significant environmental parameters, while technical parameters such as panel orientation, spectral losses [13,14], and solar spectrum variations have significant effects on photovoltaic output. In addition, the inherent manufacturing variability in solar modules also creates performance variability, with manufacturers generally guaranteeing power variation within 3–10%, though actual variation exceeds these values. These constraints need to be overcome to provide consistent and effective energy generation.

Among these challenges, soiling, or dust deposition, is a severe issue affecting solar photovoltaic performance [15,16,17,18]. Atmospheric dust deposition on solar panels can decrease energy conversion efficiency by blocking sunlight, inducing two forms of shading: hard shading, in which discrete patches of dust lower the module’s open-circuit voltage, and soft shading, in which a uniform dust layer lowers the short-circuit current. Dust accumulation also raises cell temperatures, indirectly lowering efficiency and raising long-term degradation rates. Several studies have demonstrated that soiling-related losses are a prevalent driver of photovoltaic performance degradation, particularly in desert and heavy-dust environments [19,20].

Several cleaning methods have been established to counteract these impacts, including manual cleaning, dry and wet cleaning, electrostatic dusting, self-cleaning surfaces, and robot cleaning systems. The choice of technique depends on geographical location, climate, plant size, and economic viability. However, determining the optimal cleaning intensity and frequency remains complex [21]. Automated robot cleaning has also been identified as a viable alternative to manual cleaning, providing a scalable and low-cost method for large-scale PV installations.

Data augmentation is a technique that improves the training set by generating new samples through a series of transformations and alterations. This increases the size and heterogeneity of the dataset and ultimately enhances the model’s generalization ability [22]. One study shows that noise injection is a simple yet effective data augmentation technique that enhances the generalization and robustness of machine learning models by adding controlled random noise (e.g., Gaussian noise) to the training data. It compels models to learn significant features rather than overfitting to noise-free, idealized data representations [23]. Another approach uses real-time data infusion to improve signal processing, drawing on different sources of data such as sensors and satellite images. Such systems show better performance, faster response times, and enhanced reliability [24].

When applied to solar parameter datasets, usually composed of solar radiation, temperature, humidity, and wind speed, noise injection and real-time infusion are particularly advantageous. Solar datasets tend to be plagued with measurement inaccuracies and variability owing to environmental contamination, sensor inefficiencies, and changing atmospheric properties. Including noise injection when training solar energy forecasting models or photovoltaic performance assessment models makes them more robust against real-world defects. In addition, real-time infusion of sensor data increases model adaptability by continuously updating predictions with new sensor values, increasing responsiveness to sudden changes in atmospheric conditions. Together, these approaches enable models to achieve high prediction accuracy for solar power generation, offering better adjustment to atmospheric fluctuations and enhancing the reliability of solar power forecasting and decision-making in energy management systems.

3. Techniques and Approaches Developed

Because of the impact of dirt and dust on photovoltaic energy production, this paper reviews three major methodologies for measuring the cleaning efficiency of solar panels, involving both traditional and advanced methods. The first approach quantifies power loss due to the soiling effect, thus providing a direct measure of efficiency degradation caused by dust accumulation on the panels [25]. The second approach involves the determination of the soiling ratio by comparing the current output of soiled PV cells to that of unsoiled cells, thus providing a straightforward indicator of the impact of soiling on performance [26]. The third approach simulates the soiling process and all of the losses involved, thus providing a predictive framework that illustrates how factors such as weather conditions and panel orientation affect the accumulation of dust and dirt [17,27,28]. However, accurately estimating soiling loss remains a challenging task due to the involvement of various complex factors such as local weather patterns, particle composition and distribution, surface quality, and panel tilt angle, all of which have an impact on cleaning needs [18,29,30].
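The second approach reduces to a simple ratio, which the following minimal sketch illustrates. The function names and the example currents are hypothetical and are not taken from the study; the loss is taken here as one minus the ratio, which is a common convention rather than something the text states.

```python
def soiling_ratio(soiled_current_a: float, clean_current_a: float) -> float:
    """Ratio of the current from a soiled cell to that of a clean reference cell."""
    if clean_current_a <= 0:
        raise ValueError("clean reference current must be positive")
    return soiled_current_a / clean_current_a


def soiling_loss(soiled_current_a: float, clean_current_a: float) -> float:
    """Fraction of output lost to soiling (assumed here to be 1 - soiling ratio)."""
    return 1.0 - soiling_ratio(soiled_current_a, clean_current_a)


# Hypothetical readings in amperes, for illustration only.
ratio = soiling_ratio(8.1, 9.0)
loss = soiling_loss(8.1, 9.0)
print(f"soiling ratio: {ratio:.2f}, soiling loss: {loss:.1%}")
```

A ratio close to 1 indicates a nearly clean panel, while lower values indicate heavier soiling.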

ML techniques have emerged as a promising solution to enhance energy efficiency and cost-effectiveness in cleaning solar PV panels. ML techniques can allow for faster and more cost-effective cleaning processes while improving the long-term performance of solar energy systems [30,31]. Combining optimization and data mining techniques with ML techniques can significantly improve decision-making about cleaning schedules and methods [32]. ML algorithms have already been implemented in robotic cleaning systems that automatically detect and remove dust from solar PV panels. These systems typically use cameras and sensors to identify areas where dust accumulation is significant, and cleaning actions, such as spraying nozzles or using brushes, are then deployed. These ML-driven systems have demonstrated improved performance over traditional cleaning methods [33].

Various ML algorithms have been explored for the modeling of solar panel cleaning tasks, including decision trees [34], random forest classifiers (RFCs) [35], neural networks [36], logistic regression (LR) [37], Naïve Bayes (NB) [37], and K-nearest neighbors (KNN) algorithms. Moreover, several works have applied Artificial Neural Networks (ANNs) to predict PV soiling based on environmental factors, for example, wind speed, relative humidity, and air pollution [28]. More significantly, the ANN models suggested in [38,39] calculate soiling losses and solar power generated efficiently using irradiance signals and pollutant characteristics. Another innovative technique includes the usage of Deep Residual Neural Networks to detect abnormal dust concentrations on PV panels. Pre-processing raw training images would clear the distortions due to perspective translation, including those caused by silver grid lines and segmentation, hence locating cleaning needs with much more accuracy. A stacking ensemble classifier was also proposed to classify pollution sources on the PV surface. It obtained an accuracy of 97.37% [40]. Furthermore, some researchers have found that data augmentation, which makes new training samples from limited available data, can improve the model’s performance, particularly in handling rare or extreme environmental conditions. While effective, this technique requires continuous tuning and incurs higher costs, which may limit its practical applicability unless optimized for specific tasks [41,42].

This paper explores several ML models, such as LR, support vector machines (SVMs), Naïve Bayes, KNN, hybrid, and ensemble methods, to determine their respective strengths and trade-offs regarding predictive power and computational efficiency in solar panel cleaning. All of these methods generally share the same aim: to improve the model’s performance, thus improving its predictive accuracy. Of the models discussed, logistic regression still offers the best overall trade-off between simplicity and suitability for real-time application and is thus ideal for resource-constrained environments. In contrast, while offering high precision and robustness, ensemble methods and SVM models involve higher computational resource requirements; hence, they are unsuitable for real-time applications. The variety of these trade-offs, in turn, underlines the importance of choosing an optimum model given the specific requirements and constraints imposed on a robotic cleaning system.

This study aims to enhance the energy efficiency of solar panels by exploring the potential and impact of data augmentation in improving machine learning models for optimizing robot cleaning operations. It discusses soiling problems and explores state-of-the-art modeling techniques for enhancing cleaning effectiveness. The proposed methods combine convolutional neural networks (CNNs) with traditional machine learning models and ensemble techniques such as Bagging (for example, Random Forest), Boosting (for example, AdaBoost and Gradient Boosting), and Stacking.

4. Methodology

This section discusses the dataset’s sources, characteristics, and preprocessing steps, followed by data augmentation procedures such as synthetic data creation and the introduction of noise. ML or deep learning (DL) model choice and training are presented with well-known architecture and optimization techniques. Finally, the experimental setup includes a comparison of baseline and augmented data and defining the hyperparameters and evaluation metrics used to compare model performance.

4.1. Dataset Description

The data for this study was collected from the Shams Solar Outdoor Facility at the German University of Technology in Oman (GUtech). The installed PV system comprises 18 solar panels, operating as a single unit, with inbuilt temperature and wind speed sensors interfaced with MetaControl for real-time data extraction and analysis. Twenty-four-hour sensor readings log temperature, humidity, irradiance, dust concentration, and wind speed, providing full environmental data relevant to the performance of solar panels. Maintenance logs were also collected to log historical cleaning data and panel conditions. Figure 1 is a satellite image of the PV system location.

The solar robotic cleaner features a rugged 2040 aluminum extrusion frame (6063-T5 alloy) with a modular upper, middle, and lower assembly design. Accurate cleaning is facilitated by four motorized drive wheels and a counter-rotating brush. Independent operation is furnished by an onboard battery replenished by an onboard solar panel. Autonomous operation and navigation are implemented by a control box with microcontrollers and sensors, as illustrated in Figure 2.

4.2. Data Processing and Feature Engineering

Integrating diverse data from different sources is one of the most important foundational steps for improving solar PV panel cleaning robots through AI-driven solutions. Such data would provide critical insights into environmental conditions, panel performance, and external factors affecting energy output: weather station data such as temperature and irradiance, sensor data such as dust and track sensors, and performance metrics such as inverter data and power output. Data were extracted with MetaControl.

The statistical overviews in the tables provide useful insights into the impact of environmental and electrical conditions on system performance. Table 1 presents air pressure and humidity readings, while Table 2 provides DC, voltage, and power values. Table 3 addresses irradiance and temperature fluctuations according to solar panel configurations, and Table 4 presents irradiance values in W/m². Table 5 summarizes temperature, wind direction, and wind speed, and Table 6 shows soiling levels as per the Dust-IQ sensor. Table 7 summarizes the data in one row, and Table 8 lists the feature importance of the data.

These findings are also elaborated upon by the accompanying figures. Figure 3 shows the two-year statistical summary of air pressure and humidity, while Figure 4 compares DC, voltage, and power trends over the same period. Figure 5 shows irradiance and temperature changes, and Figure 6 specifically presents irradiance measurements in W/m². Figure 7 shows temperature, wind speed, and wind direction changes, and Figure 8 specifies the soiling levels reported by the Dust-IQ sensor.

These figures and tables together provide an overall assessment of the system’s operating efficiency and environmental performance over the study period.

The extracted datasets were merged, as shown in Table 7, to serve as a comprehensive and consistent data source for analysis and modeling. With the datasets integrated, factors such as temperature, wind speed, humidity, air pressure, solar irradiance, and power output can be considered simultaneously. This integration allows for pattern identification, improved forecasts, and rational decision-making. Merging was needed to align data from multiple sources over time and make them consistent, reducing gaps and errors. A correctly merged dataset enhances machine learning models by providing a richer set of features, leading to more accurate predictions in areas like solar energy performance, maintenance planning, and environmental monitoring. The data show notable seasonality, with fluctuating levels of humidity and temperature. Power output also fluctuates greatly, most likely due to variations in irradiance and climate conditions. The relatively stable air pressure suggests that it has practically no effect on the observed behavior. In contrast, the substantial variability in solar irradiance and temperature significantly affects panel behavior and energy yield.

Missing values are treated using imputation techniques, including linear interpolation and predictive modeling. Outliers are detected by applying methods such as the interquartile range (IQR), which identifies extreme values, or z-score analysis, which measures how many standard deviations a value lies from the mean. These methods were chosen because of their robustness and adaptability to the diverse nature of the dataset.
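The three techniques named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names, the 1.5 × IQR fence, and the z-score threshold are common defaults assumed here, not values reported by the study.

```python
from statistics import mean, pstdev, quantiles


def interpolate_gaps(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # leading or trailing gaps stay None


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical dust-sensor readings with a gap and one spike.
readings = [10, 11, 12, 13, 14, 15, 100]
print(interpolate_gaps([1.0, None, None, 4.0]))
print(iqr_outliers(readings), zscore_outliers(readings, threshold=2.0))
```

On short series the z-score of a single spike is bounded, so a lower threshold (2.0 in the usage line) is needed for it to be flagged; the IQR rule does not have this limitation.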

It should be mentioned that alignment and synchronization are essential for merging data from different sources. All data are aggregated to a common time frame to ensure the consistency of the data points for analysis. Sensor and weather station malfunctions or downtime can cause missing values or outliers, which are also detected and handled. Missing values were imputed or removed, and the remaining records were aligned. Once preprocessing is complete, the dataset goes through several additional steps to prepare it for machine learning algorithms:

Standardization: Numerical features are scaled into comparable magnitudes.

Encoding: Categorical features are encoded and processed so that they can be used by predictive models.

Correlation Analysis: The relationships between features and target variables are studied, aiding in identifying features that most strongly impact the results, whether those results pertain to panel cleanliness or efficiency.
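As a rough illustration of the first two steps in the list above, the sketch below standardizes a numerical column to zero mean and unit variance and one-hot encodes a categorical column. The column contents are hypothetical, and z-score scaling and one-hot encoding are assumed choices; the study does not state which scaler or encoder was used.

```python
from statistics import mean, pstdev


def standardize(column):
    """Scale a numerical column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(v - mu) / sigma for v in column]


def one_hot(column):
    """Map a categorical column to one 0/1 indicator list per category."""
    categories = sorted(set(column))
    return {c: [1 if v == c else 0 for v in column] for c in categories}


# Hypothetical values: panel temperature in degrees Celsius and wind direction.
temperature = [31.0, 35.5, 42.0, 38.5]
wind_direction = ["N", "S", "N", "E"]

print(standardize(temperature))
print(one_hot(wind_direction))
```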

4.3. Data Augmentation Techniques

To achieve improved modeling accuracy, the solar panel dataset was put through a rigorous data augmentation process to make the model more robust and better simulate real-world variations. The augmentations introduced greater diversity into the dataset, enabling the model to generalize well across changing environmental conditions and optimize the cleaning schedules. Among the key augmentation techniques employed was time shifting, where data points along the time dimension were shifted to cater to seasonal trends and day-night variations. This process taught the model to recognize the impact of irradiance and temperature changes on the cleanliness of panels so that it would be flexible for different times of the year. Gap-filling interpolation was yet another crucial augmentation process that addressed incomplete or missing sensor data. Linear interpolation and nearest-neighbor methods were utilized to impute missing values, preserve data integrity, and prevent any compromise in predictive accuracy.
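A minimal sketch of time shifting and nearest-neighbor gap filling is given below. The helper names and the padding behaviour are assumptions made for illustration; the study does not specify how shifted edges were handled.

```python
def time_shift(series, steps, fill=None):
    """Shift a time-ordered series by `steps` samples (positive moves values later)."""
    n = len(series)
    if steps == 0:
        return list(series)
    if abs(steps) >= n:
        return [fill] * n
    if steps > 0:
        return [fill] * steps + list(series[: n - steps])
    return list(series[-steps:]) + [fill] * (-steps)


def nearest_fill(values):
    """Replace each None with the closest known value (ties go to the earlier sample)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)
    return [
        v if v is not None else values[min(known, key=lambda j: abs(j - i))]
        for i, v in enumerate(values)
    ]


# Hypothetical hourly irradiance readings in W/m² with two missing samples.
irradiance = [0.0, 120.0, None, None, 640.0]
print(time_shift(irradiance, 1))
print(nearest_fill(irradiance))
```

Shifted copies leave undefined samples at the series edge; here they are padded with `fill` so that they can be dropped or imputed in a later step.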

To add real-world sensor variability, random noise injection was employed to introduce natural variation in environmental conditions. This step added robustness to the model by rendering it insensitive to small, inherent variations in sensor data so that it would not overfit noise-free data. Synthetic data generation was also employed to introduce extreme weather conditions, such as dust storms and heavy rainfall. By exposing the model to low-frequency yet high-impact events, this approach reinforced its resilience to unexpected environmental challenges with a significant effect on the cleanliness of solar panels and the requirement for maintenance.

Solar parameter data (solar radiation, temperature, humidity, wind speed, and other sensor readings) frequently contain errors resulting from environmental disturbances and hardware limitations. Noise injection addresses these errors by adding controlled randomness to the training process to avoid overfitting and enhance generalization. By adding noise, models learn to ignore minor variations in solar data, improving stability and reducing sensitivity to small fluctuations. Sensor errors are modeled as Gaussian noise with zero mean and variance $\sigma^{2}$. The noisy data point is represented as

$$x' = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^{2})$$

where

$x$ is the original sensor reading (e.g., solar radiation, temperature, humidity, or wind speed);

$x'$ is the noisy version of the original data point after noise injection;

$\epsilon$ is the injected noise, which follows a Gaussian (normal) distribution;

$\mathcal{N}(0, \sigma^{2})$ is a normal distribution with mean 0 (ensuring that the noise has no bias and does not systematically increase or decrease values) and variance $\sigma^{2}$ (controlling the spread of the noise, where higher values introduce more variation in the data).
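The noise model above translates directly into code. The sketch below adds zero-mean Gaussian noise to a list of readings; the function name, the seed argument, and the example value of the standard deviation are illustrative assumptions, since the study does not report the noise level it used.

```python
import random


def inject_gaussian_noise(readings, sigma, seed=None):
    """Return x' = x + eps for each reading, with eps drawn from N(0, sigma^2)."""
    rng = random.Random(seed)  # a local generator keeps the augmentation reproducible
    return [x + rng.gauss(0.0, sigma) for x in readings]


# Hypothetical irradiance readings in W/m²; sigma = 5.0 is an assumed noise level.
irradiance = [812.0, 815.5, 790.2, 640.8]
print(inject_gaussian_noise(irradiance, sigma=5.0, seed=42))
```

In practice, the standard deviation would be chosen per feature, for example as a small fraction of that feature's spread, so that the noise stays within plausible sensor error.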

The real-time infusion approach is particularly useful for addressing concept drift, wherein the distributions of solar parameters shift due to factors such as cloud cover, seasonality, and changes in the atmosphere. It keeps predictions reliable by incorporating new measurements as they become available. Mathematically, this is typically implemented with adaptive filtering methods such as the Kalman filter:

$$\hat{x}_{t} = \hat{x}_{t-1} + K_{t} \left( Z_{t} - H \hat{x}_{t-1} \right)$$

where

$\hat{x}_{t}$ is the updated estimate at time t;

$\hat{x}_{t-1}$ is the previous estimate at time t − 1;

$K_{t}$ is the Kalman gain at time t;

$Z_{t}$ is the latest sensor measurement at time t;

$H$ is the observation matrix, which maps the previous estimate to the predicted measurement.

In words, the updated estimate of the solar parameter at time t, $\hat{x}_{t}$, equals the previous estimate $\hat{x}_{t-1}$ plus the Kalman gain $K_{t}$ multiplied by the difference between the latest sensor measurement $Z_{t}$ and the value predicted from the previous estimate, $H \hat{x}_{t-1}$.
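For a single scalar parameter, the update can be sketched as follows. Only the state update line comes from the equation above; the gain computation, the random-walk state model, and the process and measurement noise variances `q` and `r` follow the standard textbook scalar Kalman filter and are assumptions, as the study does not give them.

```python
def kalman_update(x_prev, p_prev, z, h=1.0, q=1e-3, r=0.5):
    """One scalar Kalman step; returns (new estimate, new error variance, gain)."""
    # Predict: with an assumed random-walk model the estimate carries over
    # and its uncertainty grows by the process noise variance q.
    p_pred = p_prev + q
    # Gain: balances the predicted uncertainty against measurement noise variance r.
    k = p_pred * h / (h * p_pred * h + r)
    # Update: x_t = x_{t-1} + K_t * (Z_t - H * x_{t-1}).
    x_new = x_prev + k * (z - h * x_prev)
    p_new = (1.0 - k * h) * p_pred
    return x_new, p_new, k


# Hypothetical stream of irradiance measurements in W/m².
estimate, variance = 800.0, 1.0
for measurement in [805.0, 798.0, 650.0, 655.0, 648.0]:
    estimate, variance, gain = kalman_update(estimate, variance, measurement)
    print(f"z={measurement:6.1f}  estimate={estimate:7.2f}  gain={gain:.3f}")
```

A larger `q` relative to `r` makes the filter follow sudden changes, such as passing clouds, more quickly, at the cost of smoothing sensor noise less.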

Cumulatively, these augmentation methods increased dataset variability, allowing the robotic cleaning system to adjust dynamically and perform well in diverse environmental conditions. The dataset grew from 235,584 to 374,260 samples with 24 features each, increasing diversity without compromising key data characteristics and at a computationally viable cost. The augmented dataset was found to be well suited for training machine learning models as well as for detailed analysis.

Correcting imbalances in the augmented features is critical to maintaining dataset integrity and preventing biased or erroneous model predictions. Resampling techniques were employed for this purpose, modifying the sample distribution to balance feature representation. Specifically, upsampling was used to increase the number of underrepresented samples so that rare but crucial environmental conditions were well represented, and downsampling was used to reduce the overrepresentation of usual conditions so that the model would not be overly biased toward everyday conditions. Validation steps were taken after resampling to analyze its impact on dataset distribution and feature consistency. This involved comparing descriptive statistics before and after resampling, plotting feature distributions with histograms or density plots, and conducting hypothesis tests to determine the statistical significance of the changes.
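The upsampling and downsampling described above can be sketched as a single rebalancing helper. The condition labels, the per-class target, and random sampling with replacement for the minority class are assumptions made for illustration; the study does not describe its exact resampling scheme.

```python
import random
from collections import defaultdict


def rebalance(samples, labels, target_per_class, seed=0):
    """Resample so that every class has exactly `target_per_class` samples."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        if len(group) >= target_per_class:
            # Downsample: draw without replacement from the frequent class.
            chosen = rng.sample(group, target_per_class)
        else:
            # Upsample: keep every original and add draws with replacement.
            chosen = group + rng.choices(group, k=target_per_class - len(group))
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels


# Hypothetical dataset: ten ordinary samples and two dust-storm samples.
samples = list(range(12))
labels = ["normal"] * 10 + ["dust_storm"] * 2
balanced_samples, balanced_labels = rebalance(samples, labels, target_per_class=5)
print(balanced_labels.count("normal"), balanced_labels.count("dust_storm"))
```

Duplicating rare samples does not add new information, which is one reason the distribution checks described in the paragraph above are needed after resampling.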

Since resampling could add unwanted distortions, iterative refinement was necessary to adjust the parameters and maintain the optimal balance between variability and data representativeness. This iterative treatment maintained the soundness and appropriateness of the augmented dataset for subsequent modeling and analysis. By carefully applying augmentation, validating its effect, and iteratively refining the dataset, we ensured that the machine learning model that was trained on this data was robust, adaptive, and efficient in handling diverse real-world conditions.

A key consideration in data augmentation is how it alters feature distributions. This examination determines whether the expanded dataset has maintained the underlying structure of the original raw data or introduced extreme changes that could affect model accuracy. For instance, as shown in Figure 9, the Humidity_Relative feature remains largely consistent following augmentation, with only small distributional changes. Conversely, Figure 10 and Figure 11 indicate a pronounced change in Air_Pressure_Relative, indicative of a substantial change in structure. Similarly, Figure 12 and Figure 13 show that the distribution of Si-South_BM_Temperature_T1 becomes normalized as a result of some of the augmentation methods.
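One way to put a number on such distributional changes is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical cumulative distributions of a feature before and after augmentation. The sketch below is illustrative only: the choice of this statistic is an assumption, as the study does not name the test it applied, and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(before, after):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(before), sorted(after)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical relative-humidity samples (%) before and after augmentation.
humidity_before = [41.0, 44.5, 47.0, 52.5, 58.0]
humidity_after = [40.5, 44.0, 47.5, 52.0, 58.5, 61.0]
print(f"KS statistic: {ks_statistic(humidity_before, humidity_after):.3f}")
```

Values near 0 suggest that augmentation left the feature's distribution largely intact, while values near 1 point to a substantial shift that may warrant revisiting the augmentation parameters.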

4.4. DataFrame Correlation

To investigate these changes further, a heatmap correlation analysis was conducted, which compares pre- and post-augmentation feature relationships. This step identifies whether feature interactions are consistent or whether augmentation has added unexpected dependencies. A preserved correlation structure is a good sign that data integrity has been maintained in the augmentation process, whereas drastic changes may require further revision to ensure that the dataset remains sufficient for modeling.

Comparing the original dataset (Figure 14) with the augmented dataset (Figure 15) provides a general visual impression of the effect of augmentation on data structure and feature correlation. This is crucial for establishing whether the augmented dataset functions well to enhance model performance in preserving representativeness. By ensuring that the augmented data aligns with the original dataset’s fundamental features, this test validates its suitability for application in decision-making and predictive modeling.
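The visual heatmap comparison can be complemented by a numerical check of how far any pairwise correlation moved. The following sketch computes Pearson correlation matrices for both datasets and reports the largest absolute change; the feature names and values are hypothetical, and the use of a maximum-shift summary is an assumption rather than the authors' procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping feature name to its values."""
    return {(a, b): pearson(columns[a], columns[b]) for a in columns for b in columns}


def max_correlation_shift(original, augmented):
    """Largest absolute change in any pairwise correlation after augmentation."""
    before, after = correlation_matrix(original), correlation_matrix(augmented)
    return max(abs(before[pair] - after[pair]) for pair in before)


# Hypothetical irradiance (W/m²) and panel temperature (°C) columns.
original = {"irradiance": [200, 400, 600, 800], "temperature": [25, 31, 36, 44]}
augmented = {"irradiance": [200, 400, 600, 800, 1000], "temperature": [25, 31, 36, 44, 49]}
print(f"max shift: {max_correlation_shift(original, augmented):.4f}")
```

A small maximum shift is consistent with a preserved correlation structure, whereas a large one flags feature pairs whose relationship the augmentation has altered.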

4.5. Feature Importance

Following the examination of feature relationships using the correlation heatmap, the next step is to determine feature importance, that is, which variables most strongly affect the performance of the Solar PV Panel Cleaning Intelligent Robot. This is necessary to ensure that the system is configured for the most effective cleaning strategy. Feature importance can be determined using various methods. Correlation analysis helps establish linear relationships between features and target variables, such as solar panel efficiency or cleanliness level, so the robot can plan cleaning according to panel location, orientation, and weather conditions. Furthermore, machine learning-based importance scores quantify each feature’s contribution to predictive accuracy, so the robot can examine historical data and real-time sensor input to determine which parameters (dust accumulation rate, humidity, or irradiance) contribute most significantly to cleaning needs. Moreover, incorporating domain knowledge about solar panel technology, weather, and maintenance requirements refines the feature importance rankings so that the robot can dynamically adjust cleaning schedules. Together, these methodologies enable the robot to make data-driven decisions, continually optimize its cleaning techniques, and maintain peak solar panel performance while maximizing energy production and minimizing operational costs. The results are illustrated in Figure 16 and Table 8; they build on the correlation heatmap presented in Figure 15 and show the features that have the most significant impact in the DataFrame.
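As an illustration of a machine learning-based importance score, the sketch below uses permutation importance: each feature is shuffled in turn, and the resulting increase in prediction error is taken as its importance. This is a swapped-in, generic technique and not necessarily the one the authors used; the feature names, the synthetic target, and the least-squares stand-in model are all assumptions.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=5):
    """Mean increase in squared error when each column of X is shuffled."""
    baseline = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to the target
            scores[j] += np.mean((predict(X_perm) - y) ** 2) - baseline
    return scores / n_repeats

rng = np.random.default_rng(2)
names = ["dust_rate", "humidity", "irradiance"]          # hypothetical feature names
X = rng.normal(size=(2000, 3))
# Synthetic target: driven mostly by dust, weakly by humidity, not at all by irradiance.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, 2000)

# Ordinary least squares with an intercept column serves as the stand-in model.
design = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(design, y, rcond=None)
predict = lambda data: np.column_stack([data, np.ones(len(data))]) @ weights

scores = permutation_importance(predict, X, y, rng)
ranking = [names[i] for i in np.argsort(scores)[::-1]]
print(ranking)
```

The same function works with any fitted model that exposes a prediction callable, so a ranking of this kind can be compared against the correlation-based view.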

4.6. Dimensionality Reduction for Optimized Cleaning Strategies

After identifying the most influential features for the Solar Panel Cleaning Intelligent Robot, the next step is to improve data handling by reducing dimensionality. An approach such as Principal Component Analysis (PCA) helps streamline the sensor data gathered during cleaning.

PCA transforms the high-dimensional feature space to a lower one without discarding the most critical information. Data simplification makes it more convenient to understand and visualize solar panel cleanliness and efficiency trends through this compression. PCA enhances the robot’s decision-making process by applying the most critical features and discarding redundant or less informative data, leading to more efficient and responsive cleaning operations.

In addition, PCA improves the performance of machine learning algorithms by reducing overfitting and computation overhead. Thus, the system is improved to manage data more efficiently and responsively to changes in the external environment. Ultimately, through the leveraging of PCA’s advantages, the Solar Panel Cleaning Intelligent Robot can plan optimized cleaning schedules, enhance energy returns, and extend solar panel lifespan, all contributing to a greener, more efficient solar power system.

In PCA, the first principal component (abbreviated FPCA here) captures the maximum variance in the data, while the second principal component (SPCA) captures the maximum remaining variance in a direction orthogonal to the first. Together, these two components often provide a good approximation of the dataset’s overall structure and relationships. A scatter plot displays the values of two variables in Cartesian coordinates; here, it visualizes the relationship between FPCA and SPCA.
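The two components described above can be obtained from a singular value decomposition of the standardized data. The following is a minimal sketch on synthetic data; the function name `pca_two_components` and the five synthetic sensor channels are assumptions made for illustration.

```python
import numpy as np

def pca_two_components(X):
    """Project standardized data onto its first two principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize each feature
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)    # rows of Vt are the principal axes
    explained = S**2 / np.sum(S**2)                     # share of variance per component
    scores = Z @ Vt[:2].T                               # coordinates on the first two axes
    return scores, explained[:2], Vt[:2]

rng = np.random.default_rng(3)
latent = rng.normal(size=(1000, 1))
# Five synthetic sensor channels; the first three share one latent driver.
X = np.hstack([latent + 0.1 * rng.normal(size=(1000, 3)), rng.normal(size=(1000, 2))])

scores, ratio, axes = pca_two_components(X)
print("variance explained by the first two components:", np.round(ratio, 3))
```

Plotting the two columns of `scores` against each other, colored by a target such as soiling loss difference, gives a scatter plot of the kind discussed here.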

In this scatter plot, the “soiling loss difference” indicates how much energy production is affected by dirt accumulation on solar PV panels. Different levels of soiling loss difference are represented by different colors in the scatter plot, as shown in Figure 17.

From the distribution of the points, we notice that lower soiling losses (light pink) are more concentrated in the lower-left region, and higher soiling losses (dark purple) are clustered in the top region of the plot. This suggests that certain patterns in the dataset, which are picked up using PCA, are related to soiling loss variability.

This PCA plot helps us understand what affects soiling loss. It shows trends within the data and groups similar points together, which can help improve decisions to reduce soiling and enhance performance.

5. Results and Discussion

In this study, we analyze the impact of data augmentation on predictive modeling through a comparison between models trained on original and augmented data. We aim to check whether data augmentation improves the prediction accuracy and stability of the model. We use augmentation techniques such as synthetic data generation and transformations based on a real-world dataset to expand the training set. We then train and evaluate models on both the original and augmented data. Our experiments show that models trained on augmented data are more accurate and have better generalization, which confirms the effectiveness of data augmentation for improving predictive accuracy in various applications.

Models were developed using ML algorithms, DL algorithms, hybrid modeling, and ensemble methods. Models trained on augmented data generally show marginal gains in accuracy, precision, recall, and AUC, with LR and SVM performing the best overall. However, training time increases significantly for SVM and ensemble models with augmentation, indicating a trade-off between performance and computational cost, as presented in Table 9. These results are presented in detail in the following paragraphs.

A confusion matrix is used as a performance evaluation tool. The confusion matrix shows the number of correct and incorrect predictions a classification model makes. The following analysis compares the confusion matrices for all models that were proposed for this case study.

The confusion matrices presented in Figure 18 compare how well the KNN model performed on original and augmented data. The model correctly classified 19,164 positive examples in the augmented dataset, compared to 9615 in the original dataset. The number of true negatives also rose substantially, from 37,147 in the original dataset to 74,469 in the augmented dataset. The number of false positives remained almost the same, with 181 in the augmented and 182 in the original, so the effect of data augmentation on this error type is indistinguishable. The false negatives, on the other hand, rose slightly, from 173 in the original data to 186 in the augmented data; because the augmented evaluation set is roughly twice as large, this is a small increase in absolute count rather than a clear loss in sensitivity. Overall, the KNN model performed better with augmented data, correctly identifying more positive and negative instances. Despite the slight increase in false negatives, the overall improvement suggests that data augmentation enhances the model’s predictive capabilities.
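Summing the four cells reported for Figure 18 gives 47,117 evaluated instances for the original data and 94,000 for the augmented data, so rates are easier to compare than raw counts. The sketch below converts the reported KNN counts into accuracy, precision, and recall using the standard definitions; the function name `classification_rates` is hypothetical.

```python
def classification_rates(tp, tn, fp, fn):
    """Accuracy, precision, and recall from raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# KNN counts as reported for Figure 18.
knn_original = classification_rates(tp=9615, tn=37147, fp=182, fn=173)
knn_augmented = classification_rates(tp=19164, tn=74469, fp=181, fn=186)

for name, rates in [("original", knn_original), ("augmented", knn_augmented)]:
    print(name, {key: round(value, 4) for key, value in rates.items()})
```

The same helper can be applied to the counts reported for the other models to compare them on a common scale.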

The confusion matrices in Figure 19 show the impact of data augmentation on SVM performance. With augmentation, the number of true negatives increased from 37,254 to 74,528, and the number of true positives rose from 9722 to 19,243, demonstrating improved generalization. However, the number of false positives and false negatives increased slightly, from 75 to 122 and 66 to 107, respectively. Despite this slight compromise, the overall increase in accurately classified instances demonstrates the power of data augmentation in improving SVM’s predictive accuracy. However, further fine-tuning might be required for the best results.

Confusion matrices are used to compare logistic regression performance on original and augmented data, as shown in Figure 20. On the original data, the model achieves high accuracy with few misclassifications (15 false positives and 12 false negatives). On the augmented data, it classifies more instances correctly (74,631 negatives and 19,310 positives) with slightly more misclassifications (19 false positives and 40 false negatives). Augmentation improves model robustness, especially in identifying positive cases.

The confusion matrices in Figure 21 show how well NB performs on the original and augmented datasets. With the original data, the model correctly identifies 34,702 negative and 9334 positive cases but produces 2627 false positives and 454 false negatives. After data augmentation, the number of correct classifications increased to 69,217 negatives and 18,539 positives. However, the number of false positives (5433) and false negatives (811) also rose. This means that the model becomes better at finding positive cases but also makes more mistakes.

The confusion matrices, presented in Figure 22, compare the performance of a hybrid CNN-based model on original and augmented datasets. With the original data, the model correctly classifies 37,132 negative and 9572 positive cases, with 197 false positives and 216 false negatives. After augmentation, the number of correct classifications increased significantly to 74,604 negatives and 19,302 positives, while the number of false positives (46) and false negatives (48) decreased. This indicates that data augmentation improves the model’s accuracy, reducing errors and making it more reliable in distinguishing between classes.

The confusion matrices, as presented in Figure 23, compare the performance of the Ensemble model on original and augmented datasets. In the original data confusion matrix, the ensemble achieves 37,059 true negatives, 38 false positives, 16 false negatives, and 37 true positives. After augmentation, it records 74,011 true negatives, 33 false positives, 19 false negatives, and 18,132 true positives. In general, both matrices show strong performance with minimal misclassifications.

As shown in the confusion matrices, the ensemble model demonstrated high accuracy and minimal misclassifications on both the original and augmented data. The training efficiency was examined for all of the proposed models for further investigation. While the confusion matrices highlight the predictive effectiveness, they do not reveal the computational effort required by each model. Figure 24 provides that information, comparing the training times of multiple classifiers (KNN, SVM, LR, NB, hybrid, and ensemble) for both datasets.

The chart shows that KNN experiences a significant increase in training time when moving from the original to the augmented dataset, suggesting sensitivity to larger sample sizes. In contrast, SVM shows a notable reduction in training time after augmentation, possibly due to the nature of the new data or optimization techniques. Logistic regression and Naïve Bayes exhibit relatively modest changes, indicating consistent computational demand. Both the hybrid and ensemble approaches see higher training times on augmented data, but their accuracy benefits (as evidenced in the confusion matrices) may justify the extra overhead. Overall, this comparison underscores the trade-off between performance gains from data augmentation and the associated computational costs, highlighting the importance of balancing accuracy with training efficiency.
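Training times such as those in Figure 24 can be collected with a small wall-clock helper. The sketch below is a generic illustration, not the authors' benchmarking code: `time_training` and `toy_fit` are hypothetical names, and the toy training routine (per-class feature means) only stands in for a real model's fit step.

```python
import time

def time_training(fit, X, y, repeats=3):
    """Best wall-clock time in seconds over several calls to fit(X, y)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best

def toy_fit(X, y):
    """Hypothetical stand-in for a training routine: per-class feature means."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

X_small = [[float(i % 7), float(i % 3)] for i in range(20_000)]
y_small = [i % 2 for i in range(20_000)]
X_large, y_large = X_small * 2, y_small * 2      # twice the rows, mimicking an augmented set

t_small = time_training(toy_fit, X_small, y_small)
t_large = time_training(toy_fit, X_large, y_large)
print(f"original: {t_small:.4f} s  augmented: {t_large:.4f} s")
```

Taking the best of several repeats reduces the influence of background load on the measurement; how training time scales with sample count still depends on the algorithm.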

Figure 25 displays the accuracy scores of various machine learning models applied to the original and augmented datasets. The LR model achieved the highest accuracy on both datasets, with scores of 0.999427 on the original and 0.999372 on the augmented dataset. In contrast, the Naïve Bayes model recorded the lowest accuracy, scoring 0.934610 on the original dataset and slightly lower at 0.933574 on the augmented dataset. Other models, including KNN, SVM, hybrid, and ensemble, showed accuracy scores between these extremes. Notably, the accuracy scores for most models were slightly higher on the augmented dataset than the original, suggesting that data augmentation may contribute to improved model performance.

Figure 26 illustrates the precision scores of the various machine learning models applied to both original and augmented datasets. The logistic regression model achieved the highest precision scores, with 0.998468 on the original dataset and a slightly higher 0.999017 on the augmented dataset. Conversely, the Naïve Bayes model recorded the lowest precision scores, with 0.780370 on the original dataset and a slightly lower 0.773361 on the augmented dataset. Other models, such as KNN, SVM, hybrid, and ensemble, demonstrated precision scores between these extremes. A key observation is that precision scores for most models were higher on the augmented dataset than on the original, suggesting that data augmentation may have positively influenced the models’ precision performance.

Figure 27 presents the recall scores of the various machine learning models, with recall measuring the proportion of actual positive instances that the model correctly identifies. The ensemble model achieved the highest recall score of 0.999070, indicating its superior ability to identify positive instances, likely due to its combination of multiple models to enhance performance. In contrast, the NB model had the lowest recall scores, with 0.953617 on the original dataset and 0.958088 on the augmented dataset. This lower performance may stem from its assumption of feature independence, which might not align with the dataset’s characteristics, causing it to miss more positive instances compared to other models.
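As a worked check, the Naïve Bayes scores can be recomputed from the original-data counts reported for Figure 21 (9334 true positives, 2627 false positives, and 454 false negatives), using the standard definitions of precision and recall:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{9334}{9334 + 2627} = \frac{9334}{11961} \approx 0.7804, \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{9334}{9334 + 454} = \frac{9334}{9788} \approx 0.9536.
\end{align*}
```

Both values agree with the Naïve Bayes precision of 0.780370 and recall of 0.953617 on the original dataset.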

6. Conclusions and Recommendations

6.1. Conclusions

This study extensively tested the impact of data augmentation on machine learning models used in robotic solar panel cleaning predictive maintenance. The experiments were carried out on a variety of models, including logistic regression, support vector machines (SVMs), Naïve Bayes (NB), K-nearest neighbors (KNN), hybrid CNN-based models, and ensemble methods. The results suggest that data augmentation significantly improves model generalization and performance, though its effectiveness varies depending on the specific algorithm used.

The dataset was expanded from 235,584 to 374,260 samples, which increased data diversity while keeping computation practical. The experiments showed that models trained on augmented data achieved improved accuracy, recall, and precision, particularly for complex models. For instance, the accuracy of SVM increased from 99.70% to 99.75%, while logistic regression maintained a high accuracy of 99.94% both before and after augmentation. The ensemble model had the best recall score of 99.91%, further establishing its superior ability to classify positive instances correctly. Precision scores also improved for most models, with logistic regression achieving 99.90% on the augmented dataset as opposed to 99.85% on the original dataset.

These performance gains came with some downsides. Computational cost increased, with SVM training time rising from 1100 s to 3584 s, which reflects the added burden of handling a larger dataset. Similarly, the ensemble model showed a significant rise in training time, which means that although augmentation enhances predictive accuracy, it requires more computational power. In addition, Naïve Bayes, despite some improvement in recall (95.36% to 95.81%), experienced a slight reduction in accuracy (93.46% to 93.36%), implying that augmentation does not benefit all models equally.

These findings confirm the necessity of selecting augmentation methods based on model type and computing resources. While ensemble learning and deep learning models benefit a great deal from data augmentation, simpler models like logistic regression and Naïve Bayes see marginal gains. Future work must balance augmentation strategies to optimize model performance while maintaining computational efficiency, particularly for real-time use cases. In addition, research into parallel computing approaches and model compression could offset the longer training times of ensemble models and make them more viable for real-world solar energy deployment.

6.2. Recommendations

Logistic regression is the best fit when computational resources are restricted, as it offers a good level of precision, accuracy, and training efficiency. When predictive power and stability are of the utmost importance and computational resources are available, ensemble models are recommended because they offer higher accuracy, recall, and AUC scores.

Data augmentation is very useful for improving model generalization, especially with richer datasets. However, its computational burden should be managed carefully, with strategies chosen according to specific needs and available capacity.

Several machine learning methods (KNN, SVM, LR, Naïve Bayes, and hybrid CNN-based approaches) exhibited stable performance for solar energy generation forecasting. Owing to their strengths, these approaches are recommended for other renewable power systems.

7. Limitations and Future Work

7.1. Limitations

While data augmentation improves the accuracy of predictions, it raises computational costs significantly and renders ensemble and hybrid models unsuitable for real-time application. Hybrid deep learning-based models are computationally intensive and cannot be used on systems with low capacities.

The model’s performance depends heavily on the data. Augmentation can enhance generalization but may introduce noise or bias if misused. Additionally, the renewable energy sector faces low government backing and high investment costs, especially in developing countries, which hinder the large-scale adoption of photovoltaic projects.

7.2. Future Work

Future research will focus on ensemble model optimization for real-time application and transfer learning with diverse datasets to improve model robustness. Techniques of data augmentation beyond conventional data manipulation can also mitigate dataset quality problems and improve model reliability.

Increased transparency and interpretability of AI models will foster trust and usability for predictive solutions. Continuous learning approaches, such as reinforcement learning, can enable adaptive solar panel cleaning and maintenance systems for extended periods.

Lastly, incorporating geospatial analysis can help in strategic planning for the installation of solar PV panels, supporting the integration of sustainable energy in urban and rural regions.

Author Contributions

A.A.-H., principal investigator, methodology, resources, conceptualization, analysis, and writing—review and editing. E.K., methodology, analysis, and writing—draft preparation. Z.A.A.H., analysis, writing—review and editing, and investigation. P.J., supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. This research was supported by Duisburg-Essen University and the German University of Technology in Oman.

Data Availability Statement

The data presented in this study are available from the corresponding author upon request.

Acknowledgments

We acknowledge Francis Andrew for his English review.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

Saleem, S.; Farhan, M.; Raza, S.; Awan, F.G.; Butt, A.D.; Safdar, N. Power Factor Improvement and MPPT of the Grid-Connected Solar Photovoltaic System Using Nonlinear Integral Backstepping Controller. Arab. J. Sci. Eng. 2022, 48, 6453–6470.
Rivera, N.M.; Ruiz-Tagle, J.C.; Spiller, E. The health benefits of solar power generation: Evidence from Chile. J. Environ. Econ. Manag. 2024, 126, 102999.
Al Humairi, A.; El Asri, H.; Al Hemyari, Z.A.; Jung, P. Assessing the features of PV system’s data and the soiling effects on PV system’s performance based on the Field Data. Energies 2024, 17, 4419.
Pourasl, H.H.; Barenji, R.V.; Khojastehnezhad, V.M. Solar energy status in the world: A comprehensive review. Energy Rep. 2023, 10, 3474–3493.
Mumuni, A.; Mumuni, F. Data augmentation: A comprehensive survey of modern approaches. Array 2022, 16, 100258.
Wen, J.; Angryk, R.A. Class-Based Time Series Data Augmentation to Mitigate Extreme Class Imbalance for Solar Flare Prediction. arXiv 2024, arXiv:2405.20590.
Go, S.-E.; Kim, J.-H.; Chuluunsaikhan, T.; Choi, W.-S.; Choi, S.-H.; Nasridinov, A. Unified Generative Data Augmentation for Efficient Solar Panel Soiling Localization. Electronics 2024, 13, 4859.
Yadav, P.; Davies, P.J.; Sarkodie, S.A. The prospects of decentralised solar energy home systems in rural communities: User experience, determinants, and impact of free solar power on the energy poverty cycle. Energy Strategy Rev. 2019, 26, 100424.
Tobnaghi, D.M.; Naderi, D.J. The effect of solar radiation and temperature on solar cells performance. Extensive J. Appl. Sci. 2015, 3, 39–43.
Sargunanathan, S.; Elanngo, A.; Mohideen, S.T. Performance enhancement of solar photovoltaic cells using effective cooling methods. Renew. Sustain. Energy Rev. 2016, 64, 382–393.
Kazem, H.A.; Chaichan, M.T. Effect of humidity on photovoltaic performance based on experimental study. Int. J. Appl. Eng. Res. 2015, 10, 43572–43577.
Touati, F.A.; Al-Hitmi, M.A.; Bouchech, H.J. Study of the effects of dust, relative humidity, and temperature on solar PV performance in Doha: Comparison between monocrystalline and amorphous PVS. Int. J. Green Energy 2013, 10, 680–689.
Islam, M.N.; Rahman, M.Z.; Mominuzzaman, S.M. The effect of irradiation on different parameters of monocrystalline photovoltaic solar cell. In Proceedings of the 2014 3rd International Conference on the Developments in Renewable Energy Technology (ICDRET), Dhaka, Bangladesh, 29–31 May 2014.
Sharma, M.K.; Bhattacharya, J. Dependence of spectral factor on angle of incidence for monocrystalline silicon-based photovoltaic solar panel. Renew. Energy 2022, 184, 820–829.
Dos Santos, Í.P.; Rüther, R.J. Limitations in solar module azimuth and tilt angles in building integrated photovoltaics at low latitude tropical sites in Brazil. Renew. Energy 2014, 63, 116–124.
Dirnberger, D.; Blackburn, G.; Müller, B.; Reise, C. On the impact of solar spectral irradiance on the yield of different PV technologies. Sol. Energy Mater. Sol. Cells 2015, 132, 431–442.
Santoni, F.; Piergentili, F.; Candini, G.P.; Perelli, M.; Negri, A.; Marino, M. An orientable solar panel system for nano spacecraft. Acta Astronaut. 2014, 101, 120–128.
Saidan, M.; Albaali, A.G.; Alasis, E.; Kaldellis, J.K. Experimental study on the effect of dust deposition on solar photovoltaic panels in desert environment. Renew. Energy 2016, 92, 499–505.
Zaihidee, F.M.; Mekhilef, S.; Seyedmahmoudian, M.; Horan, B. Dust as an unalterable deteriorative factor affecting PV panel’s efficiency: Why and how. Renew. Sustain. Energy Rev. 2016, 65, 1267–1278.
Darwish, Z.A.; Kazem, H.A.; Sopian, K.; Al-Goul, M.; Alawadhi, H. Effect of dust pollutant type on photovoltaic performance. Renew. Sustain. Energy Rev. 2015, 41, 735–744.
Al Humairi, A.; El Asri, H.; Al Hemyari, Z.A.; Jung, P. Modelling the Performance of Photovoltaic Systems and Studying the Soiling Effects: Insights Based on Field Data of Environmental Factors of Solar Panel Systems. ASI 2025, 8, 25.
Zhang, J. Classification and Comparison of Data Augmentation Techniques. Trans. Comput. Sci. Intell. Syst. Res. 2024, 6, 180–187.
Khadka, N.; Bista, A.; Adhikari, B.; Shrestha, A.; Bista, D.; Adhikary, B. Current practices of solar photovoltaic panel cleaning system and future prospects of machine learning implementation. IEEE Access 2020, 8, 135948–135962.
Al Hashmi, H.S.; Al Jassasi, I.S.; Al Humairi, A.; Bulale, Y.; Al Salmi, M.; Yazdi, P.G.; Husain, A.; Al Azzawi, M.; Jung, P. Design and building of a solar robotic cleaner chassis. Int. J. Des. Eng. 2024, 13, 50–91.
Maghami, M.R.; Hizam, H.; Gomes, C.; Radzi, M.A.; Rezadad, M.I.; Hajighorbani, S. Power loss due to soiling on solar panel: A review. Renew. Sustain. Energy Rev. 2016, 59, 1307–1316.
Paudyal, B.R.; Shakya, S.R. Dust accumulation effects on efficiency of solar PV modules for off-grid purpose: A case study of Kathmandu. Sol. Energy 2016, 135, 103–110.
Pranav, S.; Kumar, S.; Biju, S.; Monachan, L.; Joy, J.; Boby, B. An Integrated System for Monitoring & Control of Solar Panel Using IoT & Machine Learning; AIJR: Balrampur, India, 2023; pp. 411–424.
Chouder, A.; Silvestre, S. Automatic supervision and fault detection of PV systems based on power losses analysis. Energy Convers. Manag. 2010, 51, 1929–1937.
Gostein, M.; Duster, T.; Thuman, C. Accurately measuring PV soiling losses with soiling station employing module power measurements. In Proceedings of the 2015 IEEE 42nd Photovoltaic Specialist Conference (PVSC), Hyatt Regency, New Orleans, 14–19 June 2015.
Goossens, D.; Van Kerschaever, E.J. Aeolian dust deposition on photovoltaic solar cells: The effects of wind velocity and airborne dust concentration on cell performance. Sol. Energy 1999, 6, 277–289.
Pavan, A.M.; Mellit, A.; De Pieri, D. The effect of soiling on energy production for large-scale photovoltaic plants. Sol. Energy 2011, 85, 1128–1136.
Javed, W.; Guo, B.; Figgis, B. Modeling of photovoltaic soiling loss as a function of environmental variables. Sol. Energy 2017, 157, 397–407.
Piliougine, M.; Cañete, C.; Moreno, R.; Carretero, J.; Hirose, J.; Ogawa, S.; Sidrach-De-Cardona, M. Comparative analysis of energy produced by photovoltaic modules with anti-soiling coated surface in arid climates. Appl. Energy 2013, 112, 626–634.
Cano, J.; John, J.J.; Tatapudi, S.; TamizhMani, G. Effect of tilt angle on soiling of photovoltaic modules. In Proceedings of the 2014 IEEE 40th Photovoltaic Specialist Conference (PVSC), Denver, CO, USA, 8–13 June 2014.
García, M.; Marroyo, L.; Lorenzo, E.; Pérez, M. Soiling and other optical losses in solar-tracking PV plants in Navarra. Prog. Photovolt. Res. Appl. 2011, 19, 211–217.
Venkatesh, K.; SankaraNayanan, S.; Arjun, P.; Kannan, K. AI Based Solar Panel Cleaning Robot. Int. J. Eng. Technol. 2023, 7, 313–318.
Derakhshandeh, J.F.; AlLuqman, R.; Mohammad, S.; AlHussain, H.; AlHendi, G.; AlEid, D.; Ahmad, Z. A comprehensive review of automatic cleaning systems of solar panels. Sustain. Energy Technol. Assess. 2021, 47, 101518.
Aqel, D.; Al-Zubi, S.; Mughaid, A.; Jararweh, Y. Extreme learning machine for plant diseases classification: A sustainable approach for smart agriculture. Clust. Comput. 2022, 25, 2007–2020.
Ganthia, B.P.; Hanumanthakari, S.; Gudimindla, H.; Anandaram, H.; Ramkumar, M.S.; Mohanty, M.; Gopal, S.R.; Sarojwal, A.; Hadish, K.M. Machine Learning Strategy to Achieve Maximum Energy Harvesting and Monitoring Method for Solar Photovoltaic Panel Applications. Int. J. Photoenergy 2022, 2022, 4493116.
Heinrich, M.; Meunier, S.; Samé, A.; Quéval, L.; Darga, A.; Oukhellou, L.; Multon, B. Detection of cleaning interventions on photovoltaic modules with machine learning. Appl. Energy 2020, 263, 114642.
Rodriguez-Galiano, V.; Sanchez-Castillo, M.; Chica-Olmo, M.; Chica-Rivas, M. Machine learning predictive models for mineral prospectivity: An evaluation of neural networks, random forest, regression trees, and support vector machines. Ore Geol. Rev. 2015, 71, 804–818.
Pavan, A.M.; Mellit, A.; De Pieri, D.; Kalogirou, S. A comparison between BNN and regression polynomial methods for the evaluation of the effect of soiling in large scale photovoltaic plants. Appl. Energy 2013, 108, 392–401.
Bahel, V.; Pillai, S.; Malhotra, M. A comparative study on various binary classification algorithms and their improved variant for optimal performance. In Proceedings of the 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020.

Figure 1. A satellite image of PV panels in GUtech University in Oman.

Figure 2. Smart Robot GUtech [43].

Figure 3. Data distribution for the features in Table 1.

Figure 4. Data distribution for the features in Table 2.

Figure 5. Data distribution for the features in Table 3.

Figure 6. Data distribution for the features in Table 4.

Figure 7. Data distribution for the features in Table 5.

Figure 8. Data distribution for the features in Table 6.

Figure 9. Similar characteristics of the data after augmentation of the feature Humidity_Relative.

Figure 10. The impact of data augmentation on the characteristics of the feature Air_Pressure_Relative before adjustment.

Figure 11. Adjusted characteristics of the data after augmentation of the feature Air_Pressure_Relative.

Figure 12. The impact of data augmentation on the characteristics of the feature Si-South_BM_T1 before adjustment.

Figure 13. Adjusted characteristics of the data after augmentation of the feature Si-South_BM_T1.

Figure 14. Correlation heatmap of the original DataFrame.

Figure 15. Correlation heatmap of the augmented DataFrame.

Figure 16. Correlation heatmap of the DataFrame.

Figure 17. Principal Component Analysis (PCA) of soiling loss difference.

Figure 18. Confusion matrices for KNN on original and augmented data.

Figure 19. Confusion matrices for SVM on original and augmented data.

Figure 20. Confusion matrices for logistic regression on original and augmented data.

Figure 21. Confusion matrices for Naïve Bayes on original and augmented data.

Figure 22. Confusion matrices for the hybrid CNN-based model on original and augmented data.

Figure 23. Confusion matrices for the Ensemble on original and augmented data.

Figure 24. Comparison of model training times on original and augmented data.

Figure 25. Comparison between original and augmented data accuracy using different models.

Figure 26. Comparison between original and augmented data precision using different models.

Figure 27. Comparison between original and augmented data recall using different models.

Table 1. Statistical summary of humidity and air pressure.

Statistic	Humidity Relative [%]	Air Pressure Relative [hPa]	Humidity Absolute [g/m³]	Air Pressure Absolute [hPa]
Count	199,034	199,034	199,034	199,034
Mean	52.887217	1005.109456	16.651684	1005.609456
Std	18.382446	6.703837	5.947081	6.703837
Min	3.182000	989.104000	1.838000	989.604000
25%	41.370000	999.660000	12.278000	1000.160000
50%	54.684000	1006.340000	15.820000	1006.840000
75%	66.358000	1010.656000	20.566000	1011.156000
Max	95.290000	1021.820000	33.642000	1022.320000

Table 2. Statistical summary of DC current, voltage, and power measurements.

Statistic	DC Current (A)	DC Voltage (V)	DC Power (W)
Count	180,817	180,817	180,814
Mean	1.858420	328.489983	1125.249204
Std	2.681206	307.896021	1606.051573
Min	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000
50%	0.000000	557.370000	0.000000
75%	3.764000	607.882000	2336.565000
Max	9.526000	772.888000	5925.558000

Table 3. Statistical summary of irradiance and temperature measurements across different solar panel setups.

Statistic	Irradiance [W/m²] (330 W 13 Tilt 17)	Temperature [°C] (330 W 13 Tilt 17)	Irradiance [W/m²] (330 W 14 Tilt 17)	Temperature [°C] (330 W 14 Tilt 17)	Irradiance [W/m²] (Si South BM-2-44)	Temperature [°C] (Si South BM-2-44)
Count	199,219	199,219	199,219	199,219	198,998	198,998
Mean	244.018341	35.293188	248.412665	35.811234	235.497894	36.368744
Std	329.356771	13.604029	334.914771	14.043997	319.270785	15.099531
Min	0.000000	12.528000	0.000000	12.138000	0.000000	12.070000
25%	0.000000	24.189000	0.000000	24.300000	0.000000	24.154000
50%	3.550000	31.694000	3.494000	31.674000	3.508000	31.838000
75%	508.690000	45.720000	520.320000	47.230000	488.063500	47.870000
Max	1153.666000	77.936000	1169.780000	76.800000	1128.428000	79.764000

Table 4. Statistical summary of irradiance [W/m²].

Statistic	Irradiance [W/m²]
Count	55,407
Mean	248.76
Std	334.15
Min	−24.51
25%	−4.91
50%	−0.154
75%	530.17
Max	1070.36

Table 5. Statistical summary of temperature, wind speed, and wind direction.

Statistic	Temperature [°C]	Wind Speed [m/s]	Wind Direction [°]
Count	199,034	199,034	199,034
Mean	28.81	1.46	167.46
Std Dev	6.05	0.90	91.01
Min	13.57	0.00	0.00
25%	24.00	0.77	80.60
50%	28.69	1.33	161.98
75%	33.31	1.99	254.49
Max	46.13	10.81	343.53

Table 6. Statistical summary of the Soiling Dust-IQ Sensor.

Statistic	Temperature 1 [°C]	Soiling Ratio 1 [%]	Soiling Loss 1 [%]	Soiling Ratio 2 [%]	Soiling Loss 2 [%]
Count	71,632	71,632	71,632	71,632	71,632
Mean	33.68	92.33	7.67	93.03	6.97
Std Dev	9.98	7.63	7.63	8.89	8.89
Min	12.03	74.56	−0.84	71.90	−2.00
25%	28.20	87.82	1.20	86.86	0.00
50%	32.62	94.50	5.50	97.80	2.20
75%	38.86	98.80	12.18	100.00	13.14
Max	69.27	100.84	25.44	102.00	28.10
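
The soiling columns in Table 6 appear to be complementary: for each sensor, the reported soiling loss equals 100 minus the soiling ratio (for example, the means 92.33 and 7.67 sum to 100). The following is a minimal sketch of that relationship, assuming it holds row by row. The function names are hypothetical, and the "soiling loss difference" is taken here as sensor 1 minus sensor 2, which the tables do not specify.

```python
def soiling_loss(soiling_ratio_pct: float) -> float:
    """Soiling loss [%] as the complement of the soiling ratio [%]."""
    return 100.0 - soiling_ratio_pct


def soiling_loss_difference(ratio_1_pct: float, ratio_2_pct: float) -> float:
    """Assumed definition: loss of sensor 1 minus loss of sensor 2."""
    return soiling_loss(ratio_1_pct) - soiling_loss(ratio_2_pct)


# Sensor 1 soiling ratios from Table 6: min, mean, and max.
# They map to the max, mean, and min soiling loss respectively.
for ratio in (74.56, 92.33, 100.84):
    print(f"ratio={ratio:6.2f}%  loss={soiling_loss(ratio):6.2f}%")
```

Under this reading, a soiling ratio above 100% (as in the Max row) produces a small negative loss, which matches the negative minima reported for both sensors.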

Table 7. Statistical Summary for the Merged Data Row.

Statistic	Humidity Relative [%]	Air Pressure Relative [hPa]	Humidity Absolute [g/m³]	Air Pressure Absolute [hPa]	Month(s)
Count	235,296	235,296	235,296	235,296	235,296
Mean	52.887217	1005.109456	16.651684	1005.609456	5.277846
Std	16.906725	6.165661	5.469657	6.165661	3.350947
Min	3.182000	989.104000	1.838000	989.604000	0.000000
25%	44.330000	1000.780000	12.920000	1001.280000	2.000000
50% (Median)	52.887217	1005.109456	16.651684	1005.609456	5.000000
75%	64.098000	1010.014000	19.370000	1010.514000	8.000000
Max	95.290000	1021.820000	33.642000	1022.320000	11.000000
Statistic	DC Current (A)	DC Voltage (V)	DC Power (W)	SMP11_BM_Irradiance_SRAD (W/m²)	Trina_330W_Irradiance_SRAD_13 (W/m²)
Count	235,296	235,296	235,296	235,296	235,296
Mean	1.858420	328.489983	1125.249204	248.760655	244.018341
Std	2.350403	269.908248	1407.887535	162.151008	303.057150
Min	0.000000	0.000000	0.000000	−24.510000	0.000000
25%	0.000000	0.000000	0.000000	248.760655	0.000000
50% (Median)	1.468000	328.489983	923.665000	248.760655	113.027000
75%	2.004000	596.750000	1260.441000	248.760655	392.349500
Max	9.526000	772.888000	5925.558000	1070.356000	1153.666000
Statistic	Trina_330W_Temperature_T1_13	Trina_330W_Irradiance_SRAD_14	Trina_330W_Temperature_T1_14	Si_South_BM_Irradiance_SRAD	Si_South_BM_Temperature_T1
Count	235,296	235,296	235,296	235,296	235,296
Mean	35.293188	248.412665	35.811234	235.497894	36.368744
Std	12.517728	308.171335	12.922563	293.613551	13.886103
Min	12.528000	0.000000	12.138000	0.000000	12.070000
25%	25.596000	0.000000	25.618000	0.000000	25.588000
50% (Median)	34.754000	114.582000	34.990000	108.028000	35.124000
75%	42.178000	400.456000	43.330500	375.482000	43.798500
Max	77.936000	1169.780000	76.800000	1128.428000	79.764000
Statistic	Temperature	Wind_Speed	Wind_Direction	Cleaning_Needed
Count	235,296	235,296	235,296	235,296
Mean	28.806195	1.460570	167.456839	0.205928
Std	5.561065	0.830588	83.705962	0.404379
Min	13.568000	0.000000	0.000000	0.000000
25%	24.844000	0.858000	91.585500	0.000000
50% (Median)	28.806195	1.460570	167.456839	0.000000
75%	32.446000	1.854000	242.000500	0.000000
Max	46.130000	10.814000	343.526000	1.000000
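
The merged statistics in Table 7 are consistent with missing values having been filled with the column mean: the means match Tables 1 and 2, several medians equal the mean, and the standard deviations shrink. This is an inference from the numbers, not a method stated in the tables. As a rough check under that assumption, the sketch below predicts the post-fill standard deviation from the original count and standard deviation; the function name is hypothetical.

```python
import math


def std_after_mean_fill(std_orig: float, n_orig: int, n_total: int) -> float:
    """Sample standard deviation after filling gaps with the column mean.

    Filled values add nothing to the sum of squared deviations and leave the
    mean unchanged, so only the (n - 1) denominator grows.
    """
    return std_orig * math.sqrt((n_orig - 1) / (n_total - 1))


N_MERGED = 235_296  # row count reported in Table 7

# (column, original std, original count, std reported in Table 7)
checks = [
    ("Humidity Relative [%]", 18.382446, 199_034, 16.906725),
    ("DC Current (A)", 2.681206, 180_817, 2.350403),
    ("DC Voltage (V)", 307.896021, 180_817, 269.908248),
]

for name, std_orig, n_orig, std_table7 in checks:
    predicted = std_after_mean_fill(std_orig, n_orig, N_MERGED)
    print(f"{name}: predicted {predicted:.6f}, Table 7 reports {std_table7:.6f}")
```

If the predictions agree with Table 7, the reduced spread in the merged data reflects the fill strategy and not a change in the underlying measurements.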

Table 8. Feature importance ranking for features in the DataFrame.

Feature	Importance
Si (South)—Irradiance [Wh/m²]	0.903870
SMP11 BM-1-51—Irradiance [W/m²]	0.067434
Soiling ratio 2 [%]	0.009855
Soiling ratio 1 [%]	0.007951
Soiling loss 2 [%]	0.005414
Soiling loss 1 [%]	0.004487
Soiling loss difference	0.000478
Air pressure relative 1 [hPa]	0.000195
Air pressure absolute 1 [hPa]	0.000181
Si (South)—Temperature 1 [°C]	0.000136
DC Current String 2	0.000000
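
The importances in Table 8 sum to approximately 1, so they can be read as shares of the total. The sketch below, which is illustrative only, ranks the features and counts how many are needed to reach a chosen cumulative share. The 0.99 threshold and the function name are assumptions, and the feature names are shortened from the table.

```python
# Importances copied from Table 8 (feature names shortened for readability).
importances = {
    "Si (South) irradiance": 0.903870,
    "SMP11 BM-1-51 irradiance": 0.067434,
    "Soiling ratio 2": 0.009855,
    "Soiling ratio 1": 0.007951,
    "Soiling loss 2": 0.005414,
    "Soiling loss 1": 0.004487,
    "Soiling loss difference": 0.000478,
    "Air pressure relative 1": 0.000195,
    "Air pressure absolute 1": 0.000181,
    "Si (South) temperature 1": 0.000136,
    "DC Current String 2": 0.000000,
}


def features_for_coverage(scores, threshold):
    """Return the top-ranked features whose cumulative importance reaches threshold."""
    selected, total = [], 0.0
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for name, score in ranked:
        selected.append(name)
        total += score
        if total >= threshold:
            break
    return selected, total


top, covered = features_for_coverage(importances, 0.99)
print(f"{len(top)} features cover {covered:.4f} of the total importance")
for name in top:
    print(f"  {name}: {importances[name]:.6f}")
```

Read this way, the two irradiance features alone account for about 97% of the total importance, with the soiling measurements contributing most of the remainder.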

Table 9. Performance comparison of machine learning models on original and augmented datasets.

Model	Dataset Type	Accuracy	Precision	Recall	AUC	Training Time
KNN	Original	0.992466	0.981423	0.982325	0.988725	0.014758
SVM	Original	0.997007	0.992345	0.993257	0.995624	1100.642630
LR	Original	0.999427	0.998468	0.998774	0.999186	2.770148
NB	Original	0.934610	0.780370	0.953617	0.941621	0.087232
Hybrid	Original	0.991235	0.979834	0.977932	0.986327	607.544244
Ensemble	Original	0.999024	0.996939	0.998365	0.998781	1046.400224
KNN	Augmented	0.996096	0.990644	0.990388	0.993981	0.028432
SVM	Augmented	0.997564	0.993700	0.994470	0.996418	3584.190991
LR	Augmented	0.999372	0.999017	0.997933	0.998839	4.768173
NB	Augmented	0.933574	0.773361	0.958088	0.942654	0.168433
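
To make the original-versus-augmented comparison in Table 9 easier to read, the sketch below computes the change in accuracy and the training-time ratio for the models that have both rows in the table as shown here (KNN, SVM, LR, NB). The helper name is hypothetical. The training-time unit is not given in the table, so only the unitless ratio is reported.

```python
# (accuracy, training time) per model, copied from Table 9.
original = {
    "KNN": (0.992466, 0.014758),
    "SVM": (0.997007, 1100.642630),
    "LR": (0.999427, 2.770148),
    "NB": (0.934610, 0.087232),
}
augmented = {
    "KNN": (0.996096, 0.028432),
    "SVM": (0.997564, 3584.190991),
    "LR": (0.999372, 4.768173),
    "NB": (0.933574, 0.168433),
}


def compare(orig, aug):
    """Map model -> (accuracy change, training-time ratio) for shared models."""
    rows = {}
    for model, (acc_o, time_o) in orig.items():
        if model in aug:
            acc_a, time_a = aug[model]
            rows[model] = (acc_a - acc_o, time_a / time_o)
    return rows


result = compare(original, augmented)
for model, (d_acc, t_ratio) in result.items():
    print(f"{model:4s} accuracy change {d_acc:+.6f}, training time x{t_ratio:.2f}")
```

On these rows, augmentation raises accuracy for KNN and SVM, slightly lowers it for LR and NB, and increases training time for every model, most sharply for SVM.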

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
