1. Introduction
Peppers are fruits that belong to the nightshade family. They are excellent low-calorie sources of vitamins A and C, antioxidants, potassium, folic acid, and fibre, and their consumption is associated with a reduced risk of certain chronic illnesses, such as cancer and heart disease. A popular ingredient in many dishes, peppers can also be enjoyed raw, making them a highly versatile component of a well-balanced diet. In addition, they contain various bioactive compounds that are of interest for pharmaceutical and cosmetic applications [1,2].
Peppers are warm-season crops, harvested under glass for an average of two months, that require delicately balanced growing conditions. Hence, their cultivation can be viewed from two interrelated perspectives: maintaining high productivity and nutritional value while minimising pesticide use. Although dangerous to human health, pesticides are still indispensable in the fight against viruses, bacteria, fungi, and other harmful organisms known to cause various diseases in pepper plants, affecting multiple parts of the plant [3]. Some of those diseases particularly target the leaves, leading to distinct and noticeable visual signs and symptoms [4]. Common diseases and pests affecting the leaves of greenhouse peppers include mildew, blight, mites, worms, and aphids. Among these, aphids and worms are particularly detrimental.
An early diagnosis is essential for timely treatment to prevent the spread of the disease. Another vital advantage of early diagnosis is the prevention of unnecessary over-spraying. Conventional methods such as visual observation, laboratory kits, and biochemical tests have traditionally been used in agriculture to detect greenhouse pepper diseases. However, diagnosis using these methods is, for various reasons, time-consuming and costly. At this stage, AI, which enables machines to mimic human-like behaviour, appears to be a potential problem-solver for this and similar multi-parametric tasks [5].
In recent years, various DL models have become very popular due to their high effectiveness and practicality. With the development of science and technology, they have outperformed many traditional models in various fields of industry and medicine [6,7,8,9]. In addition, DL algorithms, known for their high performance in object detection and classification, have been widely tested in agricultural diagnostics [10,11,12], and some of these studies focus specifically on healthy pepper leaves and leaves with bacterial spots, with accuracy up to 83.89% [12,13]. In previous studies, DL algorithms such as AlexNet, SqueezeNet, and a modified ResNet50 have been tested for disease diagnosis in various plants with good results; still, few of them address pepper diseases [14]. Both the limited number of datasets addressing this problem and the limited number of images in existing sets are important obstacles. Dataset classification using the VGGNet architecture has also been investigated in several studies, with conspicuous accuracy results achieved [15]. One of those was obtained with an extensive open dataset comprising 48,331 images of healthy and diseased plant leaves [16]. Notable results have been reported using pre-trained AlexNet and VGGNet models. The most widely tested DL algorithms typically consist of input, hidden, and output layers and are primarily applied to images from the PlantVillage public dataset [17].
Unlike the studies mentioned above, this study focuses on diagnosing pepper leaf diseases using a modified VGG16Net algorithm trained on images from a private dataset. VGG16Net was chosen for its high accuracy on complex datasets. To optimise the training process, the hidden layers of VGG16Net were frozen, and three additional layers were introduced for fine-tuning, effectively reducing training time. Unlike previous studies, this study incorporated user-controlled adjustments to evaluate the algorithm’s performance. Several contributions of this study should be highlighted. Firstly, it proposes a modified VGGNet-based model that diagnoses six of the most common pepper diseases with characteristic spots on the leaves. Although spot detection is a more comprehensive approach, it is largely avoided because of hard-to-distinguish colour transitions; for more successful feature extraction, colour enhancement is therefore set to emphasise the characteristic green and yellow tones through an algorithm designed for this purpose. Secondly, this is the first work to diagnose the aphid and caterpillar classes via DL models. Finally, the results are a promising step towards more conscious and effective pesticide use.
The rest of the paper is organised as follows: Section 2 briefly reviews previous studies on the datasets, algorithms, and methodologies used to detect leaf diseases. Section 3 is devoted to the method on which the analysis is based, while Section 4 summarises the results. Section 5 presents the discussion. Finally, Section 6 presents the conclusion and future work.
2. Related Works
The literature survey focuses on prior studies on detecting plant diseases published in the last decade that employ DL algorithms such as AlexNet, DenseNet, VGGNet, MobileNet, and ResNet. Ten studies, selected for their similarity to our work, are summarised below in chronological order.
Since the first CNN was proposed for image recognition by LeCun et al. in 1998, DL has been significantly developed and enriched through various studies and research [18]. However, paving the way for agricultural tasks was a challenge, mainly due to the lack of an adequate database. Until recently, DL studies in agriculture, especially in plant disease diagnosis [19], were almost non-existent except for sporadic practical applications. In the last couple of years, researchers have focused on improving the accuracy of plant disease detection by employing various algorithms and applying different databases to a wide range of plant species.
A crucial milestone in AI applications for agricultural tasks was the release of the PlantVillage database, which comprises 38 classes of 14 plant species and remains one of the most extensive datasets in the field. Sardogan et al. utilised a CNN model to identify tomato leaf diseases using the PlantVillage database [20,21,22]. In their study, tomato leaves were classified into five classes: four disease classes and one healthy class. To further enhance the classification process, they implemented the learning vector quantisation (LVQ) algorithm, which achieved an accuracy rate of 88%. Demonstrating such effectiveness paved the way for the broader use of CNN-based classification techniques for detecting leaf diseases.
The same year, Rangarajan and his team employed pre-trained AlexNet and VGGNet models to categorise tomato leaf images from the same dataset into seven distinct classes [23]. Their experiments demonstrated a notable accuracy rate of 97.4%, primarily attributed to the AlexNet architecture. Despite the methodological similarities, their work achieved a significantly higher accuracy than comparable studies. Another VGGNet-based study was conducted by Ferentinos and colleagues, who evaluated and compared the performance of various architectures trained on an open database comprising approximately 87,000 photographs of healthy and diseased plant leaves from 25 plant species, including peppers [24]. Among the architectures compared, VGGNet achieved an accuracy rate of 99.53%, surpassing the other models on this dataset.
The following year, Kaya et al. proposed a combined VGGNet–AlexNet methodology tested on four databases, including PlantVillage [25]. Different types of plants, including peppers, were classified as healthy or diseased. Rather than per-plant accuracy, accuracy rates were reported separately for each dataset; the best score obtained for binary classification on PlantVillage was 99.80%. Pepper leaves from the same dataset were also used by Wu and colleagues, who worked on detecting one of the most common bacterial diseases, caused by Xanthomonas campestris [13]. Their study utilised a triple-classifier VGG16-based model to classify leaf images into healthy, mildly infected, and strongly infected groups.
Das proposed an alternative classification method for diagnosing infected pepper leaves by detecting bacterial spot patterns [26]. Two VGGNet models, VGG16 and VGG19, were employed separately as binary classifiers on 2475 images to distinguish healthy and infected leaves, with accuracies of around 96% and 97%, respectively. In general, advances in DL in the last decade have led to tremendous progress in classification tasks, and the fertile field of recognising pepper leaf disease is no exception. Over time, basic binary classification gave way to more complex multiclass approaches; several five-class models encompassing one healthy and four diseased categories have recently been reported [27].
Rababa et al. also identified bacteria-infected pepper leaves by evaluating four DL models, including VGGNet [28]. The results were compared according to standard criteria, and the best accuracy found was 58%. Begum et al. obtained much better results for accuracy and other parameters [29]. In their study, featuring a novel parameter-tuning technique, the proposed model was compared with four DL models, including VGG16. All models were applied as binary classifiers on 1855 pepper leaves to distinguish infected from healthy leaves; the AUC values achieved by the proposed model and VGG16 were 0.99 and 0.92, respectively.
It can be inferred from this literature review that most VGGNet-based studies in this field have focused on binary healthy/diseased classification, regardless of plant type. Among the algorithms employed, however, the VGGNet image processing architecture has demonstrated exceptional efficacy on complex and large datasets, consistently achieving superior accuracy rates. This study aims to assess the accuracy of the VGG16 model on a dataset containing seven different classes. The primary objective of the research is to optimise performance and minimise time loss by leveraging the pre-trained VGG16 architecture.
3. Methodology
This section details the proposed CNN model’s training and testing processes, with the schematic diagram shown in Figure 1. The raw dataset consists of 1679 original digital images that underwent a comprehensive pre-processing pipeline before being used for training with a pre-trained VGG16 architecture. The convolutional and max-pooling layers were frozen, and dense and dropout layers were added to expedite the training process and improve the model’s performance. Further details about the dataset, pre-processing, and classification process are given below.
3.1. Dataset
The custom dataset consists of 1679 pepper leaf images captured with an iPhone digital camera at original dimensions of 1200 × 1600 pixels. Sample leaf images for all categories presented in the study are shown in Figure 2. The dataset includes 300 healthy and 1379 infected leaf images, isolated against a uniform background to ensure consistency in visual analysis. The “infected” group comprises 123 aphid-infested, 300 burnt, 145 caterpillar-infested, 300 mildew-infected, 211 mite-infested, and 300 leafworm-infected leaves. Thus, the dataset comprises images categorised into seven labelled classes representing healthy and various diseased conditions. In total, 70% of the images were used for training, while the remaining portion was utilised for testing and validation in a 2:1 ratio.
The images in each class were split according to the recorded ratios using the data adjustment function split_data, which performs a class-by-class (stratified) split across the training, validation, and test sets, so class proportions are preserved. Accordingly, the training, test, and validation sets contain 1174, 335, and 164 images, respectively.
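A minimal sketch of how such a stratified split_data routine could be implemented is shown below; the function name and ratios come from the text, while the folder layout, shuffling seed, and copy logic are illustrative assumptions.

```python
import os
import random
import shutil

def split_data(source_dir, dest_dir, ratios=(0.70, 0.20, 0.10), seed=42):
    """Stratified split: each class folder is divided into train/test/
    validation subsets with the same ratios (70% training, with the
    remainder split 2:1 between test and validation)."""
    random.seed(seed)
    for cls in os.listdir(source_dir):
        images = os.listdir(os.path.join(source_dir, cls))
        random.shuffle(images)
        n = len(images)
        n_train = int(n * ratios[0])
        n_test = int(n * ratios[1])
        subsets = {
            "train": images[:n_train],
            "test": images[n_train:n_train + n_test],
            "validation": images[n_train + n_test:],
        }
        for subset, files in subsets.items():
            target = os.path.join(dest_dir, subset, cls)
            os.makedirs(target, exist_ok=True)
            for f in files:
                shutil.copy(os.path.join(source_dir, cls, f), target)
```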
3.2. Pre-Processing Strategy
Data pre-processing is a set of steps to improve the image quality of the raw dataset for more efficient analysis, improving performance in image recognition and generalisation across different image variations. The pre-processing model proposed in the study consists of four steps: resizing, cropping, colour enhancement, and data augmentation. More details about the pre-processing steps are given below.
3.2.1. Cropping and Resizing
The leaf images were originally captured at a resolution of 1200 × 1600 pixels. To enable more effective analysis and training, the non-leaf parts of the images were manually cropped, reducing the resolution. The VGG16 model works with a standard input size of 224 × 224 pixels, which is suitable for efficient computation and helps shorten training time. To ensure compatibility with the VGG16 architecture, all images were therefore resized to 224 × 224 pixels by setting Img_Width, Img_Height = 224, 224 in the pre-processing code [30].
Figure 3 illustrates two before-and-after examples of cropping and resizing.
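For illustration, the resizing step could be implemented as follows; the Pillow-based routine and the interpolation filter are assumptions, while the 224 × 224 target size and the width/height settings come from the text.

```python
from PIL import Image

# VGG16 standard input size, as set in the pre-processing step.
img_width, img_height = 224, 224

def resize_image(src_path, dst_path):
    """Resize a (manually cropped) leaf image to the VGG16 input size."""
    img = Image.open(src_path).convert("RGB")
    img = img.resize((img_width, img_height), Image.LANCZOS)
    img.save(dst_path)
```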
3.2.2. Augmentation
Data augmentation is a set of techniques for artificially creating new data samples to increase the variety of training data for better model generalisation. In the study, data augmentation was performed through eight techniques with details listed in
Table 1. With nine samples per image, the number of samples in the training dataset increased to 10,080.
Augmentation may leave some empty pixels around the transformed area. These empty pixels are filled with the values of neighbouring pixels using the fill_mode='nearest' option. Augmentation is followed by highlighting disease features through colour enhancement, applied via preprocessing_function=enhance_colors, to streamline the training process.
Figure 4 illustrates various techniques employed in the data augmentation process, along with their corresponding visual examples. These techniques have enhanced data diversity, contributing to improved training performance.
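As a sketch, the augmentation pipeline described above might be configured in Keras as follows; fill_mode='nearest' and preprocessing_function=enhance_colors are taken from the text, while the specific transforms and their ranges are assumptions standing in for the eight techniques listed in Table 1.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# enhance_colors is the colour-enhancement routine of Section 3.2.3.
train_datagen = ImageDataGenerator(
    rotation_range=30,            # random rotations
    width_shift_range=0.1,        # horizontal shifts
    height_shift_range=0.1,       # vertical shifts
    shear_range=0.2,              # shearing
    zoom_range=0.2,               # zooming
    horizontal_flip=True,         # horizontal flips
    vertical_flip=True,           # vertical flips
    brightness_range=(0.8, 1.2),  # brightness jitter
    fill_mode="nearest",          # fill empty pixels from neighbours
    preprocessing_function=enhance_colors,
)
```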
3.2.3. Colour Enhancement
The algorithm, whose flowchart is shown in Figure 5, is designed to intensify the green and yellow colours in the image. It enhances the images’ vibrancy, contrast, and other colour properties, making the colour tones more prominent and helping the model recognise colour variations better. The images are converted from RGB to the HSV (Hue, Saturation, Value) colour space, and lower and upper bounds are defined for green and yellow. The cv2.inRange() function is then used to detect and mask the green and yellow areas in the image. After the saturation of the masked areas is increased, the image is converted back to the RGB colour space.
This process is typically employed for images featuring green and yellow tones, such as those of plants, to enhance the vividness and prominence of these colours. By increasing the saturation of the green and yellow hues, the objective is to assist the model in better distinguishing and recognising these colours. In the present study, various masking techniques were utilised. Given that our research is centred on plant diseases, we specifically implemented masking operations on the green and yellow colour regions.
Figure 6 presents some sample images where the saturation of green and yellow colours has been enhanced.
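A minimal OpenCV sketch of this procedure, consistent with Figure 5, is given below; the cv2.inRange masking of green and yellow regions follows the text, while the HSV bounds and the saturation gain are assumptions.

```python
import cv2
import numpy as np

def enhance_colors(image):
    """Boost the saturation of green and yellow regions via HSV masking."""
    img = np.clip(image, 0, 255).astype(np.uint8)
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)

    # Illustrative HSV bounds (OpenCV hue range is 0-179).
    green_mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))
    yellow_mask = cv2.inRange(hsv, (20, 40, 40), (35, 255, 255))
    mask = cv2.bitwise_or(green_mask, yellow_mask)

    # Increase saturation only where the mask is active, then convert back.
    h, s, v = cv2.split(hsv)
    s = np.where(mask > 0, np.clip(s * 1.5, 0, 255), s).astype(np.uint8)
    rgb = cv2.cvtColor(cv2.merge([h, s, v]), cv2.COLOR_HSV2RGB)
    return rgb.astype(image.dtype)
```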
3.3. Classification with Modified VGG16
Proposed by the Visual Geometry Group at the University of Oxford [31], the VGG16 model is a CNN architecture characterised by a depth of 16 layers, comprising 13 convolutional layers and three fully connected layers. VGG16 is renowned for its simplicity, versatility, and strong performance on a range of computer vision tasks, including image classification and object recognition. As a result, it remains a popular choice for many DL applications.
In this study, the VGG16 model was initialised with weights pre-trained on ImageNet, excluding the original fully connected layers by setting the include_top parameter to False. After initialisation, the convolutional and MaxPooling layers of the model were frozen (set to non-trainable), which avoids relearning already-captured generic features. Additionally, to enhance the learning process and accelerate classification, several new layers were introduced into the architecture, ordered as follows:
A dense layer with 256 units and ReLU activation, followed by a 50% dropout layer;
A second dense layer with 128 units and ReLU activation, followed by a 50% dropout layer;
Finally, an output dense layer with softmax activation, whose size equals the number of classes.
The added layers can be characterised as flatten, dense, and dropout layers. The flatten layer is necessary to feed the convolutional output into the fully connected layers, and each dense layer is a fully connected layer using the ReLU activation function. The final dense layer is sized according to the number of classes in the training data, as it is the classification layer. Dropout is used to prevent overfitting. Adding the flatten, dense, and dropout layers achieved the desired level of learning, which in turn shortened the classification time.
The model is customised to tackle the specific task of classifying pepper leaf diseases by leveraging the strengths of the large, complex VGG16 model combined with the newly added layers, as sketched below. This approach also ensures faster training and more efficient performance.
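The modifications described above translate into Keras roughly as follows; this is a sketch consistent with the description, with num_classes = 7 for the seven labelled classes.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model

num_classes = 7  # one healthy class plus six disease/pest classes

# Pre-trained convolutional base without the original classifier head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze all convolutional and MaxPooling layers

x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(128, activation="relu")(x)
x = Dropout(0.5)(x)
outputs = Dense(num_classes, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
```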
The training batch size was 32. During training, the Adam optimiser was used due to its strong performance on large datasets and complex models, stemming from its efficient memory usage and automatic per-parameter adaptation of learning rates. The optimiser was run with an initial learning rate that was subsequently reduced when the validation loss plateaued. Categorical cross-entropy was used as the loss function, while accuracy was used as the performance metric. ReduceLROnPlateau and EarlyStopping callbacks were implemented to increase training stability and prevent overfitting.
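The corresponding training setup might look like the following; the loss, metric, batch size, and callbacks are taken from the text, while the learning rate (left at the Keras default, since no value is reported) and the callback patience settings are assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

model.compile(
    optimizer=Adam(),  # initial learning rate not reported; default assumed
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # Patience values below are illustrative.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]

history = model.fit(
    train_generator,  # e.g. train_datagen.flow_from_directory(..., batch_size=32)
    validation_data=val_generator,
    epochs=50,
    callbacks=callbacks,
)
```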
3.4. Theoretical Background Behind the Performance Analysis of the DL Model
A confusion matrix is one of the most prominent methods for summarising a classifier’s quality. It shows the absolute number of correct and false predictions on a set of test data. To provide a clearer picture of their performance, the DL models are also tested using the following criteria: accuracy, loss, precision, recall, and F1 score. For better organisation of the section, all symbols used during the calculations are listed in
Table 2.
The confusion matrix describes the performance of a classification model by showing the true vs. predicted classifications [32]. For M classes, the confusion matrix is represented in Equation (1):

$$\mathbf{C} = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1M} \\ c_{21} & c_{22} & \cdots & c_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ c_{M1} & c_{M2} & \cdots & c_{MM} \end{bmatrix}, \tag{1}$$

where $c_{ij}$ denotes the number of test samples whose true class is $i$ and whose predicted class is $j$.

For a predicted class $k$, the parameters $TP_k$, $FP_k$, $FN_k$, and $TN_k$ are calculated through Equations (2)–(5):

$$TP_k = c_{kk}, \tag{2}$$

$$FP_k = \sum_{\substack{i=1 \\ i \neq k}}^{M} c_{ik}, \tag{3}$$

$$FN_k = \sum_{\substack{j=1 \\ j \neq k}}^{M} c_{kj}, \tag{4}$$

$$TN_k = \sum_{\substack{i=1 \\ i \neq k}}^{M} \sum_{\substack{j=1 \\ j \neq k}}^{M} c_{ij}. \tag{5}$$

Accuracy is the ratio of total correct predictions over the total number of predictions, as given in Equation (6):

$$\mathrm{Accuracy} = \frac{\sum_{k=1}^{M} TP_k}{\sum_{i=1}^{M}\sum_{j=1}^{M} c_{ij}}. \tag{6}$$

The loss function measures the error between the model’s predictions and the actual values; the categorical cross-entropy used in this study is represented by Equation (7):

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{M} y_{n,k}\,\log \hat{y}_{n,k}, \tag{7}$$

where $N$ is the number of samples, $y_{n,k}$ the one-hot ground-truth label, and $\hat{y}_{n,k}$ the predicted probability.

Precision can be defined as the ratio of true positive predictions to the total predicted positive samples. It indicates how many of the predicted positive instances are correct [33,34,35]. The precision of the $k$th class is given by Equation (8):

$$\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}. \tag{8}$$

Recall, also known as sensitivity, is the ratio of true positive predictions $TP_k$ to the total actual positive samples ($TP_k + FN_k$). This metric shows how well the model identifies the positive class; in other words, it indicates how effectively the model captures the actual positives. The recall of the $k$th class is represented by Equation (9):

$$\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}. \tag{9}$$

Finally, the F1 score is the harmonic mean of precision and recall. It balances the two metrics and better reflects overall model performance, especially in cases of class imbalance; a higher F1 score indicates better performance [29,34]. The F1 score of the $k$th class is represented by Equation (10):

$$F1_k = \frac{2 \cdot \mathrm{Precision}_k \cdot \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}. \tag{10}$$
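In practice, all of these metrics can be computed directly from the model’s predictions on the test set; a brief sketch using scikit-learn (an assumed tool, not named in the paper) is shown below, where y_true, y_prob, and class_names are placeholders.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# y_true: integer test labels; y_prob: softmax outputs from model.predict(...).
y_pred = np.argmax(y_prob, axis=1)

print(confusion_matrix(y_true, y_pred))  # the counts c_ij of Equation (1)

# Per-class precision, recall, and F1 (Equations (8)-(10)), plus overall
# accuracy (Equation (6)) and macro/weighted averages.
print(classification_report(y_true, y_pred, target_names=class_names, digits=2))
```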
4. Numerical Results
The study performed the classification task on a PC equipped with an Intel Core i5-9400 processor, an NVIDIA 360 GPU with 4 GB of VRAM, and 8 GB of RAM. Before training, the data is automatically partitioned, and randomly selected images are placed into training, validation, and testing folders based on the specified ratios. After training and validation, the model is tested on images from the test folder. The model’s performance parameters are evaluated using Equations (6)–(10). All results are presented in Table 3.
The model achieved a rather high overall accuracy of 0.92, indicating that it generally performs well across all classes. The macro averages indicate a precision of 0.94, a recall of 0.87, and an F1 score of 0.87. It can be inferred from these figures that precision is generally high, although the model may struggle with recall in certain classes. The weighted average, which weights each class according to its support, indicates that the model’s overall performance is slightly better in terms of precision (0.93), recall (0.92), and F1 score (0.90).
The classification model performs well overall, especially in identifying the “CATERPILLAR”, “HEALTHY”, “MITE”, and “WORM” classes. The “APHID” class shows significant room for improvement, as indicated by its low precision, recall, and F1 score. While the model’s high accuracy is promising, specific classes need targeted adjustments or more training data to enhance their detection capabilities.
Figure 7a,b show the model’s training and validation performance progress over 50 epochs. The model’s training process is represented through (a) accuracy and (b) loss. The blue dotted line represents the model’s training accuracy, and the orange dotted line indicates validation accuracy. Training accuracy starts at 0.2 and increases as the epochs progress, reaching approximately 0.6. The fact that validation accuracy is higher than training accuracy indicates that the model has a strong generalisation ability. However, after the 20th epoch, there is no further improvement in validation accuracy, suggesting that the model has reached the maximum level of learning it can achieve from the dataset and will not show any more progress.
In the second graph, the dotted blue line shows the model’s training loss, while the dotted orange line represents validation loss. Training loss decreases continuously as the epochs progress, and validation loss is stabilised around the 20th epoch. After this point, further training may result in overfitting. Overfitting occurs when the model becomes too closely adapted to the training data, reducing its generalisation ability. Therefore, training should be stopped around the 20th epoch, or early stopping techniques should be applied to prevent the model from overtraining.
The test results are summarised in the confusion matrix presented in Figure 8. Analysis of the images in the test folder revealed that the learning process yielded near-perfect outcomes for some specific classes, such as HEALTHY, CATERPILLAR, MILDEW, and MITE. Minor classification errors are obtained for the BURNT and WORM classes, while misclassifications in the APHID class are more pronounced: APHID predictions are mostly confused with MILDEW, with some minor confusion with the BURNT and HEALTHY classes.
5. Discussion
The classification of the various diseases affecting pepper leaves is based on their distinct visual and colour differences, which serve as an important guide for accurate differentiation. Healthy pepper leaves are predominantly uniformly bright green. Unlike healthy ones, burnt leaves are characterised by brown, ring-shaped spots. Leaves affected by aphids display black dots as their defining feature. Leaves affected by caterpillars can be distinguished by characteristic leaf deficiencies and deep perforations, their most prominent characteristic. Mite-affected peppers exhibit wrinkled leaves with dark green layers. In the worm-infested group, the formation of white lines on the leaf surface is the key distinguishing detail. Finally, leaves infected with mildew are primarily characterised by yellow decay and spots. Comprehensive pre-processing based on this guidance, combined with high-quality photo lighting, built up the distinguishing characteristics of these classes, which, in turn, positively impacted the learning performance.
To the best of our knowledge, this is the first time the aphid and caterpillar classes have been investigated. Although the private dataset specially collected for this study is slightly imbalanced, it contains an average of more than 220 images per class, which is reasonable compared with other datasets in this field and sufficient to draw meaningful conclusions.
The proposed model is compared with recent counterparts in Table 4 to provide a more comprehensive judgment of its accuracy; the studies are examined according to several criteria. The comparison clearly shows that higher accuracies are reported for tasks with fewer classes. In addition, in those studies the number of patterns the models need to learn is significantly smaller, resulting in higher accuracy scores. We obtained similar accuracy for the modified and original VGG16 models, so these comparable results should also be viewed from the aspect of processing speed under the same conditions. It should be highlighted that the modification shortened the processing time from the originally required 7618 s to 6750 s.
Despite the strong performance of the proposed model, several limitations must be acknowledged. The dataset used in this study is private and relatively limited in size, particularly for certain disease classes such as aphids and caterpillars, which may affect the robustness and generalizability of the model. Predictions for the aphid class are frequently confused with images from the burnt and mildew classes. A deeper analysis of the results indicates that variations in the conditions under which the photographs were taken contributed to classification errors during the training process. That is, the performance could be sensitive to background noise and cool and warm colour tones due to variations in lighting conditions, which are unavoidable in real-world environments. Finally, it should also be pointed out that although colour enhancement techniques improved class separability, they may introduce artefacts or distortions if not properly calibrated across diverse image sources.
6. Conclusions
The study presented a modified DL-based model for classifying six pepper leaf diseases and healthy leaves. This is the first time two classes, aphids and caterpillars, have been included in the diagnosis process. The private dataset, specially collected for this study, contains 123 aphid-infested and 145 caterpillar-infested leaves. Contrary to previous research based on enhancing contrast settings, this study mainly focuses on a novel pre-processing model that enhances the distinguishing characteristic colours among classes. For this purpose, the colour settings were designed to be sensitive to yellows and greens. This more comprehensive pre-processing built up distinguishing characteristics for each class to be diagnosed and positively impacted the learning performance. The VGG16-based model, chosen primarily for its strong feature extraction performance, was initialised with weights pre-trained on the ImageNet dataset. These weights were preserved by freezing all convolutional and MaxPooling layers of VGG16, i.e. setting them to non-trainable. Adding new layers to the pre-trained VGG16 model and training only those layers increases training speed and facilitates pattern recognition, which is important for diagnostic performance. Despite thorough results for most classes, the accuracy results show minor classification errors for the burnt and worm classes, while misclassifications in the aphid class are more challenging. The relatively small number of aphid-infested samples is believed to have hindered accuracy. Additionally, lighting conditions during sample collection and the model’s sensitivity to cool and warm colour tones contributed to classification errors during training. These avenues remain open for further exploration. Hence, in addition to expanding the dataset, particularly for underrepresented classes such as aphids, to improve model generalisability and reduce misclassification, our future work will mainly focus on increasing robustness to cool and warm colour tones. Additionally, exploring alternative deep learning architectures and ensemble methods may further enhance diagnostic accuracy.
We hope our study will be a valuable step toward a quick diagnosis from snapshots obtained by real-time greenhouse monitoring systems. This could be followed by appropriate agricultural spraying by drones or unmanned aerial vehicles.