Article

Optimizing Plant Disease Classification with Hybrid Convolutional Neural Network–Recurrent Neural Network and Liquid Time-Constant Network

1 School of Computing, Electrical and Applied Technology, Unitec Institute of Technology, Auckland 1025, New Zealand
2 Department of Mathematics and Data Analytics, School of Arts and Sciences, The University of Notre Dame Australia, Broadway Campus, Chippendale, NSW 2007, Australia
3 Department of Electrical, Computer and Software Engineering, The University of Auckland, 20 Symonds Street, Auckland 1010, New Zealand
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9118; https://doi.org/10.3390/app14199118
Submission received: 16 July 2024 / Revised: 8 September 2024 / Accepted: 18 September 2024 / Published: 9 October 2024
(This article belongs to the Special Issue Multimedia Signal Processing: Theory, Methods, and Applications)

Abstract

This paper addresses the practical challenge of detecting tomato plant diseases using a hybrid lightweight model that combines a Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). Traditional image classification models demand substantial computational resources, limiting their practicality. This study aimed to develop a model that can be easily implemented on low-cost IoT devices while maintaining high accuracy with real-world images. The methodology leverages a CNN for extracting high-level image features and an RNN for capturing temporal relationships, thereby enhancing model performance. The proposed model incorporates a Closed-form Continuous-time Neural Network, a lightweight variant of liquid time-constant networks, and integrates Neural Circuit Policy to capture long-term dependencies in image patterns, reducing overfitting. Augmentation techniques such as random rotation and brightness adjustments were applied to the training data to improve generalization. The results demonstrate that the hybrid models outperform their single pre-trained CNN counterparts in both accuracy and computational cost, achieving a 97.15% accuracy on the test set with the proposed model, compared to around 94% for state-of-the-art pre-trained models. This study provides evidence of the effectiveness of hybrid CNN-RNN models in improving accuracy without increasing computational cost and highlights the potential of liquid neural networks in such applications.

Graphical Abstract

1. Introduction

Tomatoes are essential for human nutrition and global food security, as they are rich in nutrients and antioxidants. Environmentally, tomato farming is beneficial due to sustainable practices like integrated pest management, crop rotation, and organic farming, which help maintain soil health and biodiversity. Urban tomato farming also reduces the carbon footprint associated with long-distance food transportation [1]. The market size of this product is estimated at USD 207.17 billion in 2024 and is expected to reach USD 261.41 billion by 2029, growing at a CAGR of 4.76% during the forecast period (2024–2029) [2], highlighting its economic importance. However, tomato farming faces significant challenges, such as pests, climate change, labor demands, and, particularly, disease detection [3]. Effective disease detection is crucial to ensure sustainable production and maintain the economic viability of tomato farming. This research aimed to develop advanced disease detection methods to support and enhance the resilience of tomato cultivation.
Current research explores various innovative strategies for detecting tomato leaf diseases, such as molecular biology and nanotechnology methods that leverage nanobodies and nano-sensors to detect pathogens at the molecular level with high specificity and sensitivity [4,5] or predictive modeling techniques that use environmental data, such as weather conditions and soil properties, to forecast disease outbreaks. However, these methods face challenges, including scalability, high-maintenance infrastructure, and the need for specialized personnel, making them less feasible for large agricultural fields [6].
Above all, computer vision techniques, particularly image classification, have gained attention for their cost-efficiency and scalability [7]. Farmers or automated systems use digital cameras or smartphones to capture images of plant leaves, which are then preprocessed (resizing, normalization, augmentation) to create a robust dataset. These images, labeled as healthy or diseased, are used to train a machine learning model, usually a CNN, to differentiate between healthy and diseased leaves. Once trained, the model can classify new images and determine the presence and type of disease. This technology can be integrated into mobile apps or IoT devices, allowing farmers to receive instant feedback on plant health and take timely action. Traditional image recognition technologies have been widely used in applications such as facial recognition, object detection, and medical image analysis. However, they struggle with complex and diverse visual data, especially under varying conditions like lighting, scale, and occlusion. These methods also rely on manually designed features, which is a subjective, time-consuming, and inefficient process.
Recent advancements in deep learning have revolutionized image recognition by automatically learning hierarchical features from raw data, improving accuracy and scalability. However, these techniques suffer from computational complexity due to their architectures and the need for high-performance hardware [8]. Additionally, processing high-resolution images can be challenging, requiring larger models to handle the complexity.
To address these challenges, various strategies have been developed to reduce model complexity. One novel approach was to use hybrid CNN-RNN models that combine a CNN and RNN to leverage the strengths of both architectures to improve classification accuracy and efficiency. One of the main approaches to hybrid CNN-RNN models was to use image classification with hierarchical models [9,10,11], where a CNN is used to generate discriminative features, and an RNN is used to generate sequential labels from those features. However, these methods mostly apply to classification problems that have multiple hierarchical labels, so they are not suitable for single-label classification.
Other approaches used for single-label classification include different variations of combinations. One approach [12] is to model the spatial dependency to improve classification accuracy by scanning the image features output by the CNN model region by region from different directions (left to right, top to bottom, etc.) and inputting them to an RNN to capture the sequential relationships. Another approach [13] works in a different direction by splitting the image into different patches and feeding them into the CNN model; the output features are then concatenated and input to an RNN for better classification. In [14], researchers combined a CNN and an RNN in parallel to extract image features and then used the perceptron attention mechanism to weigh the features extracted from the two models. In a simpler approach [15], the authors concatenated the output features from a CNN and fed them directly to the RNN as sequential information to achieve better classification results.
Deep learning models reported in [9] and [16] can identify and categorize fruit images by combining CNN, RNN, and Long Short-Term Memory (LSTM) models to create a multi-model fruit image identification system. The CNN is used to develop discriminative features from the images, while the RNN handles sequential labeling tasks. The LSTM is employed to encode learning at each classification interval, enhancing the model’s understanding of the data. When it comes to disease detection using leaf images, Ref. [17] proposes an approach that leverages pre-trained CNN networks like Xception, VGG16, and InceptionV3 to extract deep features from various fully connected layers. Additionally, an LSTM is employed to capture relationships among these image features. Transfer learning is used to facilitate feature extraction from pre-trained models. The deep features extracted from both the CNN and RNN layers are concatenated and inputted into a fully connected layer.
However, most of the current studies in this direction face one of the three main gaps below:
  • Much of the research focuses primarily on improving accuracy without adequately considering the limitations of IoT devices or edge computing [18]. This oversight highlights the need to address the efficiency of models, including factors like time complexity and computational cost. Bridging this gap is crucial to ensure that disease detection models can effectively operate within resource-limited environments. This is particularly important for enabling real-time monitoring and intervention in agriculture, where timely detection and response are vital. Therefore, it is essential to take a balanced approach that aims for both accuracy and efficient use of computational resources for practical applications;
  • Another significant aspect is that many of the existing studies utilize an LSTM as the primary RNN model for temporal dependency detection. However, LSTMs are computationally intensive, especially when dealing with high-dimensional datasets, such as those used in image classification. Ref. [19] demonstrates that LSTM models struggle to retain information when the input time steps T > 100, limiting their effectiveness with high-resolution images. Hence, there is a pressing need to explore alternative RNN models, such as the recently developed LTC model, to assess the impact of combining CNN and RNN architectures;
  • Finally, while most studies focus on improving model architecture and testing on images captured in perfect conditions, it is necessary to validate lightweight models on real-life datasets that reflect the conditions in which images are captured in real-world scenarios [20]. These datasets should include images taken from different angles, under various lighting conditions, or with low-resolution cameras. Such validation is crucial to ensure that the models are feasible and perform well when applied in real-life applications.
Addressing these research gaps, this paper proposes a hybrid lightweight model combining a CNN and an RNN to detect tomato plant diseases from leaf images, focusing on reducing computational costs while maintaining high accuracy. The dataset is enhanced using augmentation methods like rotation and adjustments to brightness and contrast to mimic real-life conditions. The model employs a liquid neural network with Neural Circuit Policy (NCP) [21], improving performance over traditional LSTM models. This approach results in better generalization while remaining sensitive to rare disease cases.
The technological innovation of this study for real-life agricultural applications lies in the deployment of a lightweight CNN-RNN model directly on IoT devices. This approach enables real-time image processing on-site, eliminating the need to send images to remote servers for analysis and wait for results.
As illustrated in Figure 1, the system captures images of tomato plants using cameras installed on the field. These images are then processed locally by an optimized CNN-RNN model deployed on an IoT device, such as a Raspberry Pi. The lightweight nature of the model ensures that it can run efficiently on these resource-constrained devices, allowing for rapid image analysis directly at the point of capture.
This on-site processing capability offers several significant advantages:
  • Due to the lightweight nature of the model, the system can handle multiple camera devices with only a few IoT devices, making it easy to scale. The elimination of the need for powerful central servers reduces infrastructure costs and simplifies deployment in large agricultural fields;
  • By processing images locally, the system can quickly send results to user devices, enabling timely interventions. This rapid feedback loop is crucial for effective disease management, allowing farmers to address issues as soon as they are detected;
  • Since the image processing is performed on-site, the system is less dependent on stable and high-speed internet connections, which can be a limitation in remote farming areas. This independence enhances the reliability of the system;
  • Processing data locally on IoT devices reduces the risk associated with transmitting sensitive data over networks. This approach ensures that data remain secure and private, addressing potential concerns related to data breaches and unauthorized access.
The same approach used in this study has been previously published in [22], where we discuss the related methodologies and initial findings. However, the current research goes beyond the previous work by exploring and validating the performance of the model in greater depth, providing a more comprehensive analysis and testing under a wider range of conditions. This extended investigation offers greater insight into the model’s capabilities and its potential applications, thereby enhancing the novelty and contribution of this work.

2. Materials and Methods

2.1. Dataset

The dataset used for this research includes the Tomato Leaves dataset, which is publicly available in [23]. This dataset comprises over 20,000 images categorized into ten different diseases and a healthy class, as shown in Figure 2. The images were collected from two distinct environments: controlled lab settings and real-world, in-the-wild scenes. This diverse dataset provides a comprehensive representation of tomato plant conditions, making it suitable for training and evaluating plant disease detection models.
The dataset, which includes 11 distinct classes, is split into training, validation, and test sets in a specific ratio of 70%, 20%, and 10%, respectively. This particular division ensures that each class is adequately represented across the three subsets, maintaining a balanced distribution of data for model training, validation, and evaluation. The training set, encompassing 70% of the dataset, serves as the foundation for model learning, allowing it to discern intricate patterns and relationships within each class. Subsequently, the validation set, constituting 20% of the dataset, is instrumental in fine-tuning model hyperparameters and monitoring performance during training iterations, thereby mitigating the risk of overfitting and ensuring robust generalization capabilities across all classes. Finally, the test set, comprising 10% of the dataset, provides an impartial benchmark for objectively evaluating the model’s performance on unseen data from all eleven classes, thus validating its efficacy and reliability across diverse real-world scenarios.
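As an illustration of this split, the following sketch produces a stratified 70/20/10 partition with scikit-learn; the directory layout, helper function, and random seed are assumptions made for illustration, not the authors' released code.

import os
from sklearn.model_selection import train_test_split

# Hypothetical helper: collect (image_path, class_label) pairs from a folder-per-class layout.
def list_images(root):
    samples = []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue
        for name in os.listdir(class_dir):
            samples.append((os.path.join(class_dir, name), label))
    return samples

samples = list_images("tomato_leaves")                  # assumed dataset location
paths, labels = map(list, zip(*samples))

# First carve off 70% for training, stratified so every class keeps its proportion.
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, train_size=0.70, stratify=labels, random_state=42)

# Split the remaining 30% into validation (20% overall) and test (10% overall).
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, train_size=2/3, stratify=rest_y, random_state=42)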
The primary approach in this research involves a hybrid model combining a CNN and an RNN. The process begins with image preprocessing, which includes resizing images into a unified size and augmentation techniques such as rotation, brightness adjustment, and contrast enhancement to simulate real-life conditions and augment the dataset. Following preprocessing, the images are fed into the hybrid CNN-RNN model for classification. This model leverages a CNN for extracting high-level features from the images and an RNN for capturing temporal dependencies, ensuring accurate and efficient disease detection.

2.2. Image Preprocessing Techniques

Resizing images is a crucial preprocessing step in image-based machine-learning tasks, including image classification, object detection, and segmentation. Ensuring that all input images are of a consistent size is essential for effectively training neural network models.

2.2.1. Image Resizing

There are several resizing techniques that are commonly used in image processing and machine learning applications. One method is bilinear interpolation [24], offering a balance between computational efficiency and the preservation of image quality. Bilinear interpolation works by estimating new pixel values based on the intensity values of neighboring pixels in the original image.
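To illustrate this step, the snippet below loads an image and resizes it to 64 × 64 with bilinear interpolation in TensorFlow; the target size matches the resolution used later in this paper, but the exact implementation is a sketch of ours rather than the authors' code.

import tensorflow as tf

# Decode an image file and resize it to a unified 64 x 64 resolution using bilinear
# interpolation (each new pixel is a weighted average of the four nearest neighbours
# in the original image).
def load_and_resize(path, target_size=(64, 64)):
    raw = tf.io.read_file(path)
    image = tf.io.decode_image(raw, channels=3, expand_animations=False)
    image = tf.image.resize(image, target_size, method="bilinear")
    return image / 255.0   # scale pixel values to [0, 1]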

2.2.2. Data Augmentation

To enhance the performance and generalization ability of the model, data augmentation is a critical technique, especially when dealing with an imbalanced dataset such as the one used in this research, in which some classes have far fewer images than others, as shown in Figure 3.
This imbalance can cause the model to become biased towards the majority class, leading to poor generalization and accuracy for the minority classes [25]. Consequently, the model may struggle to properly classify rare but important cases, such as identifying diseased leaves in a dataset where healthy leaves are predominant. Additionally, this imbalance can lead to overfitting to the majority class, reducing the model’s ability to make accurate predictions on new, unseen data.
This method can artificially increase the size of the minority classes by applying transformations such as rotation, scaling, and flipping to existing instances. This introduces diversity into the dataset and mitigates the effects of class imbalance. In this research, data augmentation techniques—specifically rotation and adjustments to brightness and contrast—were used to enhance model performance and robustness in real-life scenarios. These techniques help the model learn from varied image conditions, improving its ability to generalize and perform accurately on new data.
Rotation augmentation involves rotating images by a certain angle, introducing variations in orientation. By randomly rotating images during training, different viewpoints and perspectives are simulated, thereby enhancing the model’s ability to recognize objects from various angles. This augmentation technique is particularly useful in scenarios where objects may appear in different orientations in real-world images [26]. For instance, in the case of detecting tomato leaf diseases, rotating images allows the model to learn to identify diseased leaves regardless of their orientation, mimicking the variability encountered in natural environments.
The adjustment of brightness and contrast involves modifying the intensity and distribution of pixel values in an image. Increasing or decreasing the brightness simulates changes in lighting conditions while adjusting the contrast alters the difference in intensity between pixels. This augmentation technique helps the model become invariant to changes in illumination levels, enhancing its ability to generalize across diverse lighting conditions [27]. In the context of detecting tomato leaf diseases, adjusting brightness and contrast ensures robust performance in real-world settings where lighting may vary significantly.
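As a concrete sketch of this augmentation stage, the pipeline below applies random rotation, brightness, and contrast jitter with Keras preprocessing layers; the jitter factors are illustrative assumptions, and RandomBrightness assumes TensorFlow 2.9 or later.

import tensorflow as tf

# Random rotation and photometric jitter applied only at training time.
# Inputs are assumed to be images already scaled to [0, 1]; the factors are illustrative.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.15),                          # up to +/- 15% of a full turn
    tf.keras.layers.RandomBrightness(0.2, value_range=(0.0, 1.0)), # shift brightness by up to +/- 20%
    tf.keras.layers.RandomContrast(0.2),                           # scale contrast by up to +/- 20%
])

# Apply augmentation on the fly while building the training pipeline.
def augment_batch(images, labels):
    return augment(images, training=True), labels

# train_ds is assumed to be a tf.data.Dataset of (image, label) batches:
# train_ds = train_ds.map(augment_batch)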

2.3. Image Classification Methods

2.3.1. Convolutional Neural Networks

The CNN plays a pivotal role in many deep learning models in computer vision. A CNN typically consists of several key layers, as shown in Figure 4, including convolutional, pooling, and fully connected layers [28]. Convolutional layers use filters to create feature maps by detecting patterns like edges and textures. An activation function like ReLU (Rectified Linear Unit) [29] adds non-linearity. Pooling layers reduce the spatial dimensions of feature maps to improve efficiency and prevent overfitting, with common methods being max and average pooling.
High-level reasoning is performed in fully connected layers, which flatten the input and produce the final output. Dropout layers can be added to prevent overfitting by ignoring random activations during training, and batch normalization helps the network learn faster and more stably.
Techniques like data augmentation, transfer learning, and advanced architectures like ResNet, VGG, and Inception have further enhanced the CNN’s capabilities and applications.

2.3.2. Recurrent Neural Networks

The RNN is a class of artificial neural networks designed to process sequential data, such as time series, text, speech, and video [30]. Unlike feedforward neural networks, which process inputs independently, RNNs maintain an internal memory, allowing them to capture temporal dependencies and patterns within the data. At each time step $t$, an RNN receives an input $x_t$ and the hidden state $h_{t-1}$ from the previous time step. It then computes a new hidden state $h_t$ using a set of learnable parameters (weights and biases) and an activation function. This hidden state $h_t$ serves as a representation of the input sequence up to time step $t$. The process can be mathematically represented as follows:
$h_t = f(W_{hx} x_t + W_{hh} h_{t-1} + b_h),$
where
  • $x_t$ is the input at time step $t$;
  • $h_{t-1}$ is the hidden state from the previous time step;
  • $W_{hx}$ and $W_{hh}$ are weight matrices;
  • $b_h$ is the bias vector;
  • $f$ is the activation function (e.g., sigmoid, tanh, ReLU).
With the hidden state $h_t$ computed at time step $t$, the output of an RNN unit can be calculated as follows:
$y_t = f(W h_t + b_o),$
where
  • $y_t$ is the output at time step $t$;
  • $h_t$ is the hidden state at time step $t$;
  • $W$ is the weight matrix connecting the hidden layer to the output layer;
  • $b_o$ is the bias vector for the output layer;
  • $f$ is the activation function applied element-wise to the weighted sum $(W h_t + b_o)$.
RNNs are optimized using the backpropagation through time (BPTT) [31] algorithm, which extends the standard backpropagation algorithm to account for the temporal dependencies in sequential data. This makes RNNs particularly effective for tasks involving sequential or time-varying data, such as natural language processing, speech recognition, and time series prediction.
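To make the recurrence concrete, the following NumPy sketch implements a single step of the two equations above with tanh activations; the dimensions and random weights are purely illustrative.

import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_out, b_o):
    """One step of a vanilla RNN: update the hidden state, then emit an output."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)   # hidden-state update
    y_t = np.tanh(W_out @ h_t + b_o)                  # output at time step t
    return h_t, y_t

# Toy dimensions: 8-dimensional inputs, 16 hidden units, 4 outputs.
rng = np.random.default_rng(0)
W_hx, W_hh, b_h = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
W_out, b_o = rng.normal(size=(4, 16)), np.zeros(4)

h = np.zeros(16)
for x in rng.normal(size=(5, 8)):                     # a sequence of 5 inputs
    h, y = rnn_step(x, h, W_hx, W_hh, b_h, W_out, b_o)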
However, training RNNs can be challenging due to the issue of vanishing or exploding gradients, where gradients become too small or too large as they propagate backward through time. This problem can hinder the learning process, especially in long sequences. To mitigate this issue, techniques like gradient clipping and specialized architectures such as LSTM networks and Gated Recurrent Units (GRUs) have been developed.
LSTM networks are a type of RNN architecture designed to address the vanishing gradient problem of traditional RNNs and effectively capture long-term dependencies in sequential data [32]. They achieve this by introducing specialized memory cells and gating mechanisms that regulate the flow of information through the network over time.
This research involves extensive training and comparison of the designed model against a hybrid method that uses LSTM networks. The goal is to validate the effectiveness of Liquid Time-Constant (LTC) networks in comparison to LSTM networks. The same LSTM-based approach was utilized in the studies referenced in [15] and [17]. This comparison aims to assess the benefits of using the LTC in terms of accuracy, computational efficiency, and robustness, as opposed to previous studies that focused on the LSTM.

2.3.3. Liquid Time-Constant Networks

Although an LSTM can effectively address the vanishing and exploding gradient problems, RNN models face another limitation: they operate in discrete time steps, which can be inefficient for modeling continuous-time processes or data with irregular time intervals. Another limitation of traditional RNNs lies in their memory efficiency. Traditional RNNs suffer from a fixed-size memory bottleneck, where they need to store the entire history of past inputs and activations to compute the current hidden state. This can lead to memory constraints, especially when dealing with long sequences of data.
To address these limitations, researchers have introduced a novel framework called Neural Ordinary Differential Equations (Neural ODEs) [33], which seamlessly integrates ordinary differential equations (ODEs) with neural networks.
In a Neural ODE-based RNN, instead of explicitly updating the hidden state at each time step, the hidden state evolves continuously over time according to an ODE. The ODE is parameterized by a neural network that takes the current hidden state, the current time, and possibly other variables as input. The solution to this ODE provides the hidden state trajectory over time.
Mathematically, the dynamics of the hidden state in a Neural ODE-based RNN can be described as follows:
$\frac{dx(t)}{dt} = f(x(t), t; \theta),$
where
  • $x(t)$ is the hidden state at time $t$;
  • $f$ is a neural network function with parameters $\theta$ that determines the rate of change of $x(t)$ with respect to time;
  • $t$ represents time.
During training, the parameters of the neural network θ are learned by optimizing a loss function using gradient-based optimization techniques. This is typically performed by solving the ODE using numerical methods such as Euler’s method or Runge–Kutta methods and then applying backpropagation through the ODE solver to compute gradients with respect to the parameters and optimize the model based on those gradients.
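The sketch below illustrates how such a hidden-state trajectory can be obtained with the explicit Euler method; the single-layer tanh parameterization of $f$ is an assumption made only to keep the example self-contained.

import numpy as np

def f(x, t, W, b):
    """Toy parameterization of dx/dt = f(x(t), t; theta): a single tanh layer."""
    return np.tanh(W @ x + b)

def euler_solve(x0, t0, t1, steps, W, b):
    """Integrate the hidden-state ODE from t0 to t1 with the explicit Euler method."""
    x, t = x0.copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * f(x, t, W, b)   # x(t + dt) = x(t) + dt * dx/dt
        t += dt
    return x

rng = np.random.default_rng(1)
W, b = rng.normal(size=(16, 16)) * 0.1, np.zeros(16)
x_final = euler_solve(np.zeros(16), 0.0, 1.0, steps=20, W=W, b=b)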
Applying Neural ODEs to an RNN offers several significant benefits. Firstly, Neural ODEs provide a more flexible and expressive framework for modeling sequential data, allowing for the effective capture of complex temporal dynamics and long-range dependencies. By seamlessly integrating ordinary differential equations with neural networks, a Neural ODE-based RNN can adaptively adjust the rate of change of the hidden state over time, enabling more efficient and memory-effective processing of sequential data. Additionally, Neural ODEs offer a continuous-time formulation, which is particularly advantageous for modeling continuous-time processes or data with irregular time intervals. This continuous-time approach not only improves the scalability and performance of RNNs but also enhances their ability to handle diverse types of sequential data in various applications such as time series prediction, sequence generation, and dynamic system modeling.
To enhance the stability of the model, Ref. [34] introduced an additional term to the differential equation, $-\frac{x(t)}{\tau}$, forming a new equation to calculate the hidden state:
$\frac{dx(t)}{dt} = -\frac{x(t)}{\tau} + f(x(t), I(t), t; \theta),$
where
  • The term $-\frac{x(t)}{\tau}$ represents a decay term, where $\tau$ is a time constant;
  • $x(t)$ is the hidden state at time $t$;
  • $I(t)$ is the input;
  • $f$ is a neural network function with parameters $\theta$ that determines the rate of change of $x(t)$ with respect to time;
  • $t$ represents time.
This modification facilitates the creation of a continuous-time recurrent neural network (CT-RNN) that exhibits increased stability [34]. The decay term $-\frac{x(t)}{\tau}$, characterized by the time constant $\tau$, plays a crucial role in guiding the system towards equilibrium, thus enhancing its ability to effectively capture and represent temporal dynamics inherent in sequential data. Furthermore, the inclusion of the function $f(x(t), I(t), t; \theta)$ allows for a more comprehensive modeling approach, incorporating the influence of both the current hidden state and external inputs on the dynamics of the system over time.
Based on the above modification, the researchers of [35] proposed Liquid Time-Constant recurrent neural networks using an alternative formulation inspired by synaptic transmission mechanisms, where they replace the neural network function $f(x(t), I(t), t; \theta)$ with the nonlinearity $S(t) = f(x(t), I(t), t; \theta)\,(A - x(t))$, which is used to approximately calculate the synaptic currents to the cell in the ready state, forming a new equation to calculate the hidden state as below:
$\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f(x(t), I(t), t; \theta)\right] x(t) + f(x(t), I(t), t; \theta)\, A,$
With this approach, the neural network function not only governs the rate of change of the hidden state but also acts as an input-dependent time constant for the learning system. This dynamic time constant enables each element of the hidden state to adapt to the specific input features present at each time point, allowing the model to identify specialized dynamical systems tailored to the input data.
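A minimal NumPy sketch of one explicit Euler step of the LTC state equation above follows; the sigmoid parameterization of $f$ and the toy dimensions are our assumptions (the original work relies on a fused ODE solver), so this is an illustration rather than a faithful reimplementation of [35].

import numpy as np

def ltc_step(x, I, dt, tau, A, W, U, b):
    """One explicit Euler step of dx/dt = -(1/tau + f) * x + f * A,
    with f = sigmoid(W x + U I + b) as an assumed parameterization."""
    f_val = 1.0 / (1.0 + np.exp(-(W @ x + U @ I + b)))
    dxdt = -(1.0 / tau + f_val) * x + f_val * A
    return x + dt * dxdt

rng = np.random.default_rng(2)
n, m = 8, 4                                   # hidden units, input size (toy values)
W, U, b = rng.normal(size=(n, n)) * 0.1, rng.normal(size=(n, m)) * 0.1, np.zeros(n)
tau, A = np.ones(n), rng.normal(size=n)       # per-neuron time constants and bias vector A

x = np.zeros(n)
for I in rng.normal(size=(10, m)):            # a short input sequence
    x = ltc_step(x, I, dt=0.1, tau=tau, A=A, W=W, U=U, b=b)

Note how the coefficient of $x$ in the update depends on the input through $f$, which is exactly the input-dependent time constant described above.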
The LTC model offers several key benefits. Firstly, LTCs enable differentiable computational graphs, facilitating their training with various gradient-based optimization algorithms similar to ODEs. This enhances their adaptability and ease of optimization. Moreover, LTCs ensure stability in output dynamics by bounding the state and time constant within a finite range, which is essential for maintaining reliable performance, particularly when faced with continuously increasing inputs. Additionally, LTCs exhibit superior expressivity compared to other time-continuous models, as they offer a universal modeling framework capable of capturing complex temporal patterns effectively. Finally, LTCs have demonstrated improved performance in time-series modeling tasks compared to modern RNNs [35], showcasing their practical utility and efficacy in real-world applications.

2.3.4. Neural Circuit Policies

Based on the foundation of LTC networks, the authors of [35] introduced a novel end-to-end learning system incorporating a unique architecture termed NCP [21]. This development stems from the need to balance the long-term dependencies and short-term causality of specific tasks, such as driving an autonomous car, as studied in that work. The challenge lies in building a system that can learn long-term relationships as a sequence of actions when driving the car without reducing its sensitivity in adapting to short-term situations.
Inspired by the wiring diagram of the C. elegans nematode, the structure of NCPs follows a four-layer hierarchical topology resembling the organization of neural circuits in the nematode’s nervous system. This architecture efficiently processes environmental inputs, propagating them through sensory neurons, interneurons, command neurons, and, ultimately, motor neurons for action execution. NCPs leverage continuous-time ODEs to capture neural dynamics, enhancing their expressiveness in modeling time series data. NCP networks are constructed to be compact and sparse, facilitating efficient learning and decision-making. This novel approach enhances the performance of RNN agents in real-world control tasks by limiting the temporal attention span of the network to the most recent observations, which aligns more closely with the short-term causality inherent in driving tasks.
In this research, the same approach was applied to disease detection on tomato leaf images. Although the input domain differs from the time-series task for which the NCP network was originally used, the image classification task was converted into a similar problem by breaking down each image into a series of feature layers and adding a time dimension based on those layers. This approach allows for the integration of the NCP network into the model to help learn disease patterns through a series of image features as long-term dependencies while focusing on each feature layer of each image as the short-term causality of the task. This method aims to limit the overfitting of the model by capturing both long-term dependencies and short-term causality.

2.3.5. Closed-Form Continuous-Time Neural Network

The Closed-form Continuous-time Neural Network (CfC) model extends the capabilities of LTC networks by providing a closed-form solution to the differential equations that define their dynamics [36]. This advancement addresses the computational challenges posed by traditional numerical solvers used in the LTC, enabling more efficient real-time processing and scalability in applications such as time-series prediction and sequential decision-making.
The CfC model builds on the foundational architecture of LTC networks, where neurons and synapses interact over continuous time. A significant improvement is the ability to translate these continuous interactions into a closed-form expression, which eliminates the need for iterative numerical solvers and, thus, significantly enhances computational efficiency.
The mathematical formulation of LTCs is based on the differential equation:
$\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f(x(t), I(t), t; \theta)\right] x(t) + f(x(t), I(t), t; \theta)\, A,$
where
  • $x(t)$ is the hidden state at time $t$;
  • $I(t)$ is the input;
  • $f(x(t), I(t), t; \theta)$ is a neural network function with parameters $\theta$ that determines the rate of change of $x(t)$ with respect to time;
  • $\tau$ represents a time constant;
  • $A$ is an adjacency matrix representing connections.
For piecewise constant inputs, the differential equation can be solved in closed form. Given a sequence of time points $t_i$ and assuming $I(t)$ is constant between $t_i$ and $t_{i+1}$, the hidden state $x(t)$ can be approximated as follows:
$x(t_{i+1}) = e^{-\frac{1}{\tau}(t_{i+1} - t_i)}\, x(t_i) + \left( \int_{t_i}^{t_{i+1}} e^{-\frac{1}{\tau}(t_{i+1} - s)}\, ds \right) f(x(t_i), I(t_i), t_i; \theta)\, A,$
This expression can be further simplified to
$x(t_{i+1}) = e^{-\frac{\Delta t}{\tau}}\, x(t_i) + \frac{1}{\tau}\left(1 - e^{-\frac{\Delta t}{\tau}}\right) f(x(t_i), I(t_i), t_i; \theta)\, A,$
where $\Delta t = t_{i+1} - t_i$.
For continuous inputs, the integral can be approximated by
$x(t) \approx e^{-\frac{t}{\tau}}\, x(0) + \frac{1}{\tau}\left(1 - e^{-\frac{t}{\tau}}\right) f(x(0), I(0), 0; \theta)\, A.$
This closed-form approximation ensures a tight bound on the approximation error, providing high accuracy in modeling temporal dynamics.
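Following the simplified expression above, a single CfC update can be written as one closed-form assignment, as in the NumPy sketch below; the sigmoid parameterization of $f$ and the toy dimensions are again illustrative assumptions.

import numpy as np

def cfc_step(x, I, dt, tau, A, W, U, b):
    """Closed-form update following the simplified expression above:
    x(t + dt) = e^(-dt/tau) * x(t) + (1/tau) * (1 - e^(-dt/tau)) * f(x, I) * A."""
    f_val = 1.0 / (1.0 + np.exp(-(W @ x + U @ I + b)))   # assumed form of f
    decay = np.exp(-dt / tau)
    return decay * x + (1.0 / tau) * (1.0 - decay) * f_val * A

rng = np.random.default_rng(3)
n, m = 8, 4
W, U, b = rng.normal(size=(n, n)) * 0.1, rng.normal(size=(n, m)) * 0.1, np.zeros(n)
tau, A = np.ones(n), rng.normal(size=n)
x = np.zeros(n)
for I in rng.normal(size=(10, m)):
    x = cfc_step(x, I, dt=0.1, tau=tau, A=A, W=W, U=U, b=b)

Because each step is a single expression, no iterative ODE solver is invoked, which is where the speed-up over the LTC described below originates.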
By transforming the integral solution of LTCs into a closed form, the CfC model significantly reduces the computational complexity associated with numerical solvers, resulting in up to a 100-times speed increase in both training and inference times. The CfC model scales efficiently compared to traditional ODE-based networks, allowing for larger network architectures without a proportional increase in computational resources. Despite the simplification, the CfC model retains the expressive power of the LTC, performing exceptionally well in time-series prediction tasks. Empirical results demonstrate that the CfC model closely approximates the dynamics of ODE-based LTC networks with minimal error.
This model is particularly useful in real-time applications, such as autonomous driving, where rapid and accurate processing of sequential data is critical. The closed-form solution enables real-time decision-making and control without sacrificing accuracy. By leveraging the strengths of LTC and eliminating their computational drawbacks, the CfC model represents a significant advancement in continuous-time neural network design, offering a practical solution for real-time applications that require both speed and accuracy.

2.3.6. Transfer Learning

To enhance the efficiency of the disease detection models, transfer learning [37] is implemented. This technique is particularly beneficial in scenarios with limited datasets or constrained computational resources. Transfer learning involves leveraging knowledge from a source domain with abundant labeled data and applying it to a target domain with scarce labeled data. Unlike traditional methods that train models from scratch, transfer learning reuses pre-trained models and adapts them to new tasks, inheriting valuable features and representations learned from the source domain. Research has shown that models using transfer learning achieve higher accuracy and better performance with less training time and computational resources compared to models trained from scratch.
In this research, transfer learning was applied by further training the entire pre-trained model on the target dataset with a small learning rate—as shown in Figure 5—allowing the model to adapt its representations to the new task while retaining knowledge from the source domain. This approach was used to ensure a flexible adaptation for detecting tomato leaf diseases.
To improve efficiency and reduce computational costs for adaptation to low-cost IoT devices, MobileNetV2 was chosen as the core pre-trained model for this study. Developed by Google researchers, MobileNetV2 is an evolution of the original MobileNet architecture, designed to enhance performance while maintaining low computational complexity. This model employs techniques like depth-wise separable convolutions [38], linear bottlenecks [39], and inverted residuals [39] to make the model lightweight and efficient, ideal for deployment on IoT devices with limited computational resources. Its architecture allows for fast inference speeds without compromising performance, making it a suitable choice for this research in detecting tomato leaf diseases.
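The sketch below shows one way to set up this transfer-learning configuration in Keras: an ImageNet-pretrained MobileNetV2 backbone without its classification head, fine-tuned end-to-end with a small learning rate. The input resolution, learning rate, and head layers anticipate later sections and are assumptions of this sketch, not the authors' exact code.

import tensorflow as tf

# Load MobileNetV2 pre-trained on ImageNet, without its classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(64, 64, 3), include_top=False, weights="imagenet")
base.trainable = True   # fine-tune the whole backbone, relying on a small learning rate

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(11, activation="softmax"),   # 11 tomato leaf classes
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # small LR preserves pre-trained features
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])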

2.4. Model Development

The proposed model in Figure 6 combines a pre-trained MobileNetV2 architecture with a CfC-NCP-implemented layer to classify tomato leaf diseases. Initially, MobileNetV2, a widely used CNN architecture, extracts rich hierarchical features from input images. Pre-trained on the ImageNet dataset, MobileNetV2 efficiently learns diverse visual features, enhancing the model’s capability while conserving computational resources.
After feature extraction by the MobileNetV2 model as the CNN component, the 2 × 2 × 1280 features are reshaped, merging the first two dimensions into sequential data of size 4 × 1280. These data are then fed into the CfC model, an RNN designed to capture temporal dependencies. In this configuration, the CfC layer has an input size of 32 with 128 neurons implementing NCP wiring, and the output size is 36. These configurations represent the optimized settings achieved in this research after extensive testing. This allows the model to efficiently capture disease patterns without compromising its ability to recognize unique patterns in rare cases. Finally, the output of the CfC-NCP layers is passed through a dense layer with SoftMax activation, generating class probabilities for disease classification. This structure leverages the strengths of CNNs for spatial feature extraction and RNNs for temporal sequence modeling, enabling the model to make accurate predictions regarding the presence of various tomato leaf diseases.
The layered structure of the model is shown in Table 1; it begins with the MobileNetV2, which is a pre-trained model used as a feature extractor. This group of layers processes the input images and extracts high-level features, resulting in an output shape of (None, 4, 1280) and comprising 2,257,984 parameters.
Next, the CFC layer, identified as cf_c_4, processes the reshaped feature maps. This layer, which includes NCP, captures temporal dependencies and transforms the input data from a shape of (4, 1280) to (36). This reduction in dimensionality is achieved with 217,568 parameters.
Finally, a dense layer, referred to as dense_3, is used for classification. This fully connected layer classifies the processed features into different disease classes, with an output shape of (11) and containing 407 parameters.
In summary, the model architecture utilizes MobileNetV2 for initial feature extraction, reshapes the features, processes them through the CFC layer to capture temporal dependencies, and classifies them into 11 disease classes with a dense layer.
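A possible Keras reconstruction of this architecture is sketched below, using the open-source ncps package for the CfC cell with NCP wiring; the AutoNCP configuration (128 neurons, 36 outputs) follows Table 1, while the remaining details are assumptions of ours rather than the released implementation in [43].

import tensorflow as tf
from ncps.wirings import AutoNCP
from ncps.tf import CfC

# Feature extractor: MobileNetV2 without its head maps a 64x64x3 input to 2x2x1280 features.
base = tf.keras.applications.MobileNetV2(
    input_shape=(64, 64, 3), include_top=False, weights="imagenet")
base.trainable = True

# NCP wiring with 128 neurons and 36 output (motor) neurons, as described in the text.
wiring = AutoNCP(128, 36)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Reshape((4, 1280)),                # merge spatial dims into a 4-step sequence
    CfC(wiring),                                       # CfC-NCP layer; returns the last-step output
    tf.keras.layers.Dense(11, activation="softmax"),   # 11 disease/healthy classes
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

With 36 CfC outputs feeding a dense layer of 11 units, the classification head has 36 × 11 + 11 = 407 parameters, matching Table 1.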

2.5. Hyperparameter Optimization

In developing an effective tomato leaf disease classification model, several key hyperparameters must be carefully selected and optimized. The main hyperparameters include the training epochs, loss function, optimizer, learning rate, and batch size.
The selection of the learning rate and batch size is critical, as these hyperparameters have a significant impact on model performance. Given the multitude of possible combinations, this research combined Grid Search [40] with 5-fold cross-validation [41] to systematically identify the best learning rate and batch size for the model. Grid Search thoroughly examines all possible combinations of these hyperparameters to ensure that no potentially beneficial configurations are missed. To save time, a shortened dataset consisting of randomly selected images from the full dataset was used instead of the entire dataset. This focused approach helps efficiently manage computational resources.
The heatmap in Figure 7 shows the results of testing different combinations of learning rates and batch sizes for the model based on 5-fold cross-validation. Each value in the heatmap represents the median performance across the five validation runs. The heatmap indicates that the best configuration for the model is a learning rate of 0.0001 and a batch size of 16, which was implemented in the proposed model.
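The search can be expressed as a simple nested loop over the candidate grid with 5-fold cross-validation, as sketched below; build_model, x_small, and y_small are placeholders for the model constructor and the reduced dataset (NumPy arrays), and the candidate values other than the reported optimum (learning rate 0.0001, batch size 16) are assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold

learning_rates = [1e-3, 1e-4, 1e-5]     # candidate values (illustrative)
batch_sizes = [16, 32, 64]

results = {}
for lr in learning_rates:
    for bs in batch_sizes:
        scores = []
        folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        for train_idx, val_idx in folds.split(x_small, y_small):
            model = build_model(learning_rate=lr)              # placeholder constructor
            model.fit(x_small[train_idx], y_small[train_idx],
                      epochs=10, batch_size=bs, verbose=0)
            _, acc = model.evaluate(x_small[val_idx], y_small[val_idx], verbose=0)
            scores.append(acc)
        results[(lr, bs)] = np.median(scores)                  # median across the 5 folds

best = max(results, key=results.get)
print("Best (learning rate, batch size):", best)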
To achieve a balance between sufficient learning and the prevention of overfitting, 50 epochs were selected for training. This number of epochs allows the model to adequately learn from the training data while minimizing the risk of overfitting, which could reduce its ability to generalize to unseen data.
The loss function measures how well the model’s predictions match the actual labels. For this research, the sparse categorical cross-entropy loss function was chosen. This loss function is particularly suitable for multi-class classification problems like the one in this research.
The Adam optimizer [42] was selected for training the model. Adam, which stands for Adaptive Moment Estimation, combines the advantages of two other popular optimizers: AdaGrad and RMSProp. It computes adaptive learning rates for each parameter, improving training efficiency and convergence speed.

2.6. Performance Metrics

Evaluating a machine learning model’s performance requires more than just accuracy and loss. While these metrics give an initial idea of the model’s learning and generalization capabilities, they do not provide a complete picture, especially when dealing with class imbalances or when different types of errors have varying consequences. To gain a comprehensive understanding of the model’s effectiveness, it is essential to validate it using a broader set of performance metrics.
In the context of this research, which involves detecting tomato leaf diseases across 11 different classes, the criteria used for calculating performance metrics are extended to accommodate the diverse range of classes. These criteria include True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN):
  • True Positive (TP): Instances where the model correctly predicts images belonging to a specific disease category or as healthy;
  • True Negative (TN): Instances where the model correctly predicts that a sample does not belong to a given class;
  • False Positive (FP): Instances where the model incorrectly predicts that an image belongs to a specific class when it actually belongs to another class;
  • False Negative (FN): Instances where the model incorrectly predicts that a sample does not belong to a particular class when it actually does.
Using these criteria, the following metrics are calculated: accuracy, precision, recall, F1-score, and ROC-AUC. Accuracy measures the proportion of correctly classified instances, while precision and recall provide insights into the model’s performance in distinguishing between true positives and false positives/negatives. The F1-score offers a balance between precision and recall, especially useful in imbalanced datasets. ROC-AUC quantifies the model’s ability to differentiate between classes, with values closer to 1 indicating better performance. These metrics collectively provide a comprehensive evaluation of the model’s effectiveness in tomato plant disease classification.
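Given the model's predicted class probabilities on the test set, these metrics can be computed with scikit-learn as sketched below; macro averaging across the 11 classes and the one-vs-rest treatment for ROC-AUC are our assumptions about the aggregation.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

# y_true: integer class labels of the test set; y_prob: predicted class probabilities
# with shape (num_samples, 11); y_pred: the arg-max class per sample.
y_pred = np.argmax(y_prob, axis=1)

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, average="macro"),
    "recall":    recall_score(y_true, y_pred, average="macro"),
    "f1":        f1_score(y_true, y_pred, average="macro"),
    "roc_auc":   roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
}
cm = confusion_matrix(y_true, y_pred)   # per-class TP/FP/TN/FN can be derived from this matrix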

3. Results and Discussion

The overall implementation of the tomato leaf disease detection system is composed of several stages, as shown in Figure 8, integrating a CNN and an RNN to effectively process and classify the images:
  • Stage 1—Dataset Splitting: The process begins with the augmented dataset of tomato leaf images, which includes a training dataset, a validation dataset, and a test dataset with ratios of 70%, 20%, and 10%, respectively. This division ensures that the model can be trained on one portion of the data while the other is reserved for validating and benchmarking its performance.
  • Stage 2—Feature Extraction: The images are then resized to 64 × 64 using bilinear interpolation and fed into a pre-trained MobileNetV2 model, which has been fine-tuned to adjust its parameters specifically for the task of tomato leaf disease detection. The classification layers of MobileNetV2 are removed, retaining only the feature extractor layers. These layers output the image feature layers, of size 2 × 2 × 1280, capturing the essential characteristics of the input images.
  • Stage 3—Dimensions Reshape: The image feature layers obtained from the feature extraction process are reshaped to the size of 4 × 1280, suitable for sequential data processing. This step is crucial for preparing the data for the subsequent RNN, ensuring that the sequential nature of the features is maintained.
  • Stage 4—Temporal Feature Processing: The reshaped features are then fed into a CfC layer with an NCP implemented. This layer has 128 neural units, with an input size of 32 and an output size of 36, designed to process the sequential data. The output from the CfC layer is then passed through a dense layer of 11 units with the SoftMax activation function, which classifies it into 11 different classes.
  • Stage 5—Model Validation: The validation dataset is used to run through the whole trained model again, evaluating the model’s predictions to ensure accuracy and reliability. This process is critical to ensure the model does not overfit the training dataset.
This comprehensive architecture leverages the strengths of a CNN for feature extraction and an RNN for sequence learning, creating a robust and efficient model for detecting tomato leaf diseases. By integrating these techniques, the model is well-equipped to handle the complexities of real-world agricultural scenarios.
The code and data used in this study are available to ensure transparency and reproducibility. All scripts developed for data preprocessing, model training, and evaluation are accessible through the GitHub repository available at [43]. This repository includes detailed instructions on how to replicate the experiments and results presented in this research. Additionally, the original dataset can be accessed via [23], and the augmented dataset has been uploaded to the same repository with the source code. By providing open access to the code and data, we aim to facilitate further studies and advancements in the field of tomato plant disease detection using hybrid CNN-RNN models.
After training using the defined hyperparameter configurations, the confusion matrix can be found. From this matrix, True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) for each class are determined. These values are then used to calculate the five performance metrics, which are summarized in Table 2.
The metrics table showcases the excellent performance of the proposed model when tested on a separate dataset on which the model had not been trained. The model consistently achieves high scores in accuracy, precision, recall, F1-score, and ROC-AUC across all classes. This indicates the model’s effectiveness in correctly identifying true positive cases while minimizing false positives and false negatives.
The ROC-AUC values in Figure 9 are close to 1 for all classes, highlighting the model’s strong ability to distinguish between different types of tomato leaf diseases. This comprehensive evaluation using multiple metrics confirms the model’s robustness and reliability, making it well-suited for real-world agricultural disease detection applications. Testing on a separate dataset further validates these results, demonstrating the model’s strong generalization capabilities to new, unseen data.

3.1. Comparative Analysis of Model Performance

In this section, we present a thorough comparison between the proposed hybrid model and several well-established baseline models, including ResNet50, VGG16, MobileNetV2, and MobileNetV2-LSTM. ResNet50 is a deep residual network known for its high accuracy in image classification tasks, utilizing skip connections to mitigate the vanishing gradient problem in deep networks [44]. VGG16 is a convolutional network that employs a deep architecture with small convolutional filters, known for its simplicity and effectiveness in image recognition [45]. MobileNetV2 is a lightweight model optimized for mobile and embedded vision applications, using depth-wise separable convolutions to reduce the number of parameters and computational costs. MobileNetV2-LSTM is another model validated in this research. This method was successfully applied in research [15] to detect the Coronavirus using X-ray images, showing promising results compared to other methods. Similarly, it was used in [17] in the agricultural field, combining CNN with LSTM to detect foliar disease in apple trees. This method was chosen for comparison to the proposed model because of its promising results in these papers and its low complexity, aligning with this research’s goal of creating a lightweight and effective model. MobileNetV2-LSTM extends MobileNetV2 by integrating LSTM units, which enables the model to capture temporal dependencies in sequential data. By evaluating the MobileNetV2-LSTM model, the relative strengths and weaknesses of using LSTM versus CfC in enhancing the classification capabilities for tomato leaf disease detection can be determined. This comparison highlights the potential benefits of the CfC approach, providing a comprehensive assessment of the new method’s effectiveness against established techniques.
Each of these baseline models was configured and trained with the same hyperparameters as the proposed model to ensure a fair and consistent evaluation. This comparative analysis aims to highlight the strengths and weaknesses of the proposed model in relation to these well-established architectures, thereby validating its performance and robustness in detecting tomato leaf diseases. Figure 10 shows the training and validation results’ comparison of the five models.
VGG-16 and ResNet-50 both show strong performance, with VGG-16 reaching about 99% training accuracy and its validation accuracy fluctuating but generally remaining high, demonstrating reasonable generalization with occasional drops. ResNet-50 achieves around 99.3% training accuracy but has occasional sharp declines in validation accuracy, suggesting possible overfitting. MobileNetV2 starts lower but reaches an approximately 98.9% training accuracy with a higher and more stable validation accuracy compared to the two above.
The hybrid models show notable improvements. MobileNetV2-LSTM reaches about a 99.1% training accuracy with a high validation accuracy and some sharp increases between 93% and 94%, indicating better generalization due to the LSTM layers, as shown in previous papers [15,17]. MobileNetV2-CfC-NCP, despite a lower initial baseline, achieves around a 99.2% training accuracy and a consistently high validation accuracy of 93%-94% with minimal fluctuations, demonstrating superior generalization and robustness.
In Figure 10, the fluctuation observed in the validation accuracy can be primarily attributed to the VGG-16 and ResNet50 models. Both of these models are highly complex with a large number of parameters, which makes them prone to overfitting. This is evident in the sharp peaks and troughs in the validation performance of VGG-16 and ResNet50.
Addressing this issue is one of the main reasons why MobileNetV2 was chosen as the baseline CNN model in this research. MobileNetV2 is a simpler and more efficient model, which inherently reduces the risk of overfitting. Consequently, it exhibits much less fluctuation in validation accuracy compared to VGG-16 and ResNet50, as shown in Figure 10. However, this simplicity also leads to a slightly lower accuracy overall.
To further enhance the performance, MobileNetV2 was combined with Recurrent Neural Network (RNN) models such as the LSTM and CfC-NCP. These hybrid models leverage the strengths of both CNN and RNN architectures, capturing spatial features with the CNN and temporal dependencies with the RNN. This combination not only improves the training accuracy but also stabilizes and enhances the validation accuracy, as seen in the smoother curves and higher overall performance in the graph.
In summary, the fluctuations in the original models are mitigated by the use of MobileNetV2 due to its simpler architecture, and the integration of RNN components compensates for any loss in accuracy, leading to a more robust and reliable performance in both training and validation phases.
Table 3 presents the performance metrics for the five models, evaluated using a test dataset that was not seen during training or validation. This evaluation helps to validate each model’s performance on unseen data, highlighting their generalization power and potential value in real-life applications.
From the results, it is evident that the baseline models VGG-16, ResNet-50, and MobileNetV2 performed well but did not match the performance of the hybrid model. This confirms the benefit of combining the CNN and RNN in image recognition tasks.
Furthermore, the proposed model, MobileNetV2-CfC-NCP, outperformed all other models across all metrics. It achieved the highest accuracy (97.15%), precision (97.16%), recall (97.09%), and F1-score (97.11%).
This result shows that the hybrid models outperformed the traditional single CNN models in all performance metrics. The proposed MobileNetV2-CfC-NCP model achieved the highest performance, surpassing the LSTM-based hybrid model that was applied in previous studies [15,17], demonstrating the superior effectiveness of liquid neural architectures over the LSTM in this context.

3.2. Computational Efficiency

The main target of this research was to create a new lightweight model to enhance its implementation potential in real-life scenarios. This section focuses on assessing the computational cost and efficiency of the proposed model using key metrics, including number of parameters [46], computational complexity [47], memory footprint [48], and inference speed [49]. Analyzing these aspects helps determine the model’s suitability for deployment in resource-constrained environments and its overall performance efficiency compared to other models. This evaluation identifies the trade-offs between accuracy and computational requirements, ensuring that the model not only performs well but is also computationally feasible for practical use.
The result in Table 4 shows the comparison of computational efficiency across different models, which reveals notable differences in parameters, memory usage, inference speed, and FLOPs, highlighting the strengths and weaknesses of each.
VGG-16 and ResNet-50, though accurate, are computationally intensive and slow. VGG-16 has the longest inference time (6.33 ms) and the highest FLOPs (2507.7 Mega). ResNet-50 has the most parameters (24.64 million) and the highest memory usage (93.99 MB), making it memory-expensive and less ideal for efficient applications.
In contrast, MobileNetV2 is highly efficient with 2.91 million parameters, 11.12 MB memory usage, the fastest inference speed (0.77 ms), and low FLOPs (50.22 Mega). This balance makes it suitable for real-time applications on limited hardware.
The MobileNetV2-LSTM hybrid adds sequence learning while maintaining efficiency. It has a slightly slower inference speed (0.89 ms) but fewer parameters (2.60 million), lower memory usage (9.93 MB), and the lowest FLOPs (49.60 Mega), enhanced by its capability to handle sequential data efficiently, showing its great potential as applied in previous studies [15,17].
The primary goal of this project was to reduce the number of parameters and memory usage of the model, addressing the key challenges faced by complex models when deployed in IoT applications [50]. The MobileNetV2-CfC-NCP model successfully meets this target, making it particularly well-suited for deployment on low-cost IoT devices. With the fewest parameters (2.48 million) and the lowest memory usage (9.45 MB) among the models compared, it is ideal for environments with limited computational resources. Although its inference speed (0.80 ms) is slightly higher than that of the basic MobileNetV2 model, this minor increase in inference time is not a significant drawback. The ability to process images directly on the device, rather than sending data to a cloud server, compensates for the slight speed difference. Additionally, the model achieves a low FLOPs value of 49.71 Mega, well within the operational requirements for most current IoT devices.
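For reference, the sketch below shows one way to obtain the parameter count, an approximate float32 memory footprint, and the average inference latency of a Keras model; FLOPs estimation usually relies on a profiler and is omitted here, and the warm-up and run counts are arbitrary choices of ours.

import time
import numpy as np
import tensorflow as tf

def profile(model, input_shape=(1, 64, 64, 3), runs=100):
    """Report parameter count, a rough float32 memory estimate, and mean latency in ms."""
    params = model.count_params()
    memory_mb = params * 4 / (1024 ** 2)            # 4 bytes per float32 weight
    x = tf.constant(np.random.rand(*input_shape).astype("float32"))
    model(x)                                        # warm-up call before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = (time.perf_counter() - start) / runs * 1000
    return params, memory_mb, latency_ms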
In summary, the proposed model is the most computationally efficient of those compared while also delivering the best classification performance. This is achieved through the integration of advanced neural network components, the CfC and NCP, which enhance the effectiveness of agricultural image classification tasks.

3.3. Assessment of Data Augmentation

This section validates the effect of the data augmentation methods on model performance. It analyzes how these methods influence model development and assesses their impact on the models’ generalization and robustness. The proposed model was retrained on the non-augmented dataset with the same hyperparameter configuration as in the augmented training, and the resulting model was evaluated with the same key metrics to determine whether data augmentation enhances the learning process and contributes to better overall accuracy and efficiency.
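For reference, the random rotation and brightness/contrast adjustments used for augmentation can be expressed with Keras preprocessing layers as in the sketch below. The transformation ranges are assumed values, not the exact settings used in this study, and layer availability may vary slightly across TensorFlow versions.

```python
# Minimal sketch of an augmentation pipeline with random rotation plus
# brightness/contrast adjustments; the ranges are illustrative assumptions.
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.1),     # rotate by up to +/-10% of a full turn
    tf.keras.layers.RandomBrightness(0.2),   # shift brightness by up to +/-20%
    tf.keras.layers.RandomContrast(0.2),     # scale contrast by up to +/-20%
])

# Applied only to the training split, e.g., inside a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```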
Figure 11 compares the non-augmented and augmented models and reveals several key insights. The augmented model starts with a higher initial accuracy and lower initial loss, indicating that data augmentation provides an early advantage in the learning process. Specifically, the augmented model begins at an accuracy of 0.59 and a loss of 1.20, compared with the non-augmented model’s starting accuracy of 0.53 and loss of 1.35. This initial boost helps the augmented model learn more effectively from the outset.
As training progresses, both models exhibit similar trends of increasing accuracy and decreasing loss. Data augmentation initially helps the model generalize better by providing varied examples, which is evident in the early epochs, where the augmented model holds a slight advantage. However, as training continues and the epoch count increases, the non-augmented model catches up in performance.
By the 50th epoch, both models achieve comparably high accuracy, just above 0.99, and low loss, around 0.03. This convergence suggests that the designed model effectively learns to fit the training data, whether augmented or not. However, similar training performance does not necessarily translate into similar effectiveness in real-world applications.
The real difference in effectiveness between the models must therefore be validated on a separate test dataset that neither model encountered during training. This test dataset should reflect real-world conditions, including variations in lighting, angles, and image quality. Only by evaluating both models on such a dataset can we truly assess their generalization capabilities and effectiveness in practical scenarios.
The results in Figure 12 consistently show that the model trained with augmented data outperformed its non-augmented counterpart in accuracy, precision, recall, and F1-score on the test dataset. This reinforces the conclusion that data augmentation significantly enhances the generalization ability of machine learning models, making them more effective at handling new, unseen data.

3.4. Ablation Experiments

To validate the contribution of each implemented method, a series of ablation experiments was conducted. Each method was incorporated progressively, and its impact on model performance was observed:
  • Baseline CNN Model (VGG-16): Served as the foundation for measuring the impact of additional techniques;
  • MobileNetV2: Applied transfer learning with MobileNetV2 to assess performance changes using a lightweight model;
  • MobileNetV2-LSTM: Evaluated the hybrid CNN-RNN method commonly used in current studies;
  • MobileNetV2-CfC: Replaced the LSTM with Liquid Time-Constant networks to compare performance improvements;
  • MobileNetV2-CfC-NCP: Assessed the proposed model’s performance with the implementation of Neural Circuit Policy (a construction sketch follows this list);
  • MobileNetV2-CfC-NCP (Augmentation): Validated the effect of augmentation techniques on the proposed model’s performance.
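As referenced above, the MobileNetV2-CfC-NCP variant can be assembled with the open-source ncps library, which provides the CfC layer and NCP wirings. The sketch below is a minimal illustration: the input size, neuron counts, and sequence reshaping are assumptions and do not reproduce the exact configuration reported in Table 1.

```python
# Minimal sketch (assumed configuration) of a MobileNetV2-CfC-NCP classifier
# using the ncps library for the CfC cell and NCP wiring.
import tensorflow as tf
from ncps.wirings import AutoNCP
from ncps.tf import CfC

NUM_CLASSES = 11
inputs = tf.keras.Input(shape=(128, 128, 3))            # input size is assumed

# Frozen MobileNetV2 backbone used as a feature extractor (transfer learning).
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(128, 128, 3)
)
backbone.trainable = False
features = backbone(inputs)                              # (None, 4, 4, 1280)

# Treat the spatial grid of CNN features as a short sequence for the RNN.
seq = tf.keras.layers.Reshape((-1, 1280))(features)      # (None, 16, 1280)

# CfC cell wired with an NCP; total and output neuron counts are assumed.
wiring = AutoNCP(units=64, output_size=36)
x = CfC(wiring)(seq)                                      # last hidden output, (None, 36)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

The LSTM and plain-CfC ablation variants differ only in the recurrent layer, e.g., replacing the CfC line with a `tf.keras.layers.LSTM(...)` or a `CfC(units)` call with an integer unit count.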
Figure 13 shows the results of the above models in terms of test accuracy and model complexity. Starting with the baseline CNN model using VGG-16, switching to MobileNetV2 results in a slight reduction in accuracy. However, MobileNetV2 significantly improves model efficiency by greatly reducing complexity. Combining MobileNetV2 with an LSTM enhances accuracy even with reduced complexity, demonstrating the effectiveness of the hybrid method. Replacing the LSTM with the CfC does not improve accuracy but slightly reduces model complexity, indicating that the liquid neural network is more efficient in this context. Adding NCP to the model further increases accuracy with only a small increase in complexity. Finally, training the model with augmented data significantly boosts its accuracy on the test dataset.
This section highlights the successful implementation of the various methods applied in this research, starting from the base CNN model, progressing through transfer learning combined with an RNN, and culminating in a hybrid model. We also introduced advanced neural network components, the CfC and NCP, and then applied augmentation techniques to further enhance model generalization. As these methods were implemented, model accuracy improved and computational cost decreased, yielding the most optimized model among those tested in this study. The optimized model increases the practicality and value of IoT-based image classification systems in real-world scenarios, making them more precise, easier to deploy, and more cost-effective. These advancements contribute to the broader adoption of IoT-based image recognition solutions, especially in resource-constrained environments.
Future work should focus on optimizing the model for deployment on real IoT devices and extending the research to other datasets and applications, such as healthcare and manufacturing, to validate and improve the model’s versatility and generalizability. Although random rotation and adjustments of brightness and contrast were used to create an augmented dataset that mimics real-life conditions and improves model generalization, these methods offer only a useful approximation and remain limited in scope. Consequently, further studies are necessary to thoroughly evaluate the model’s effectiveness in actual real-world scenarios, where environmental variables may introduce additional challenges.

Author Contributions

Conceptualization, A.T.L., M.S., I.A. and W.H.A.; methodology, A.T.L., M.S. and I.A.; software, A.T.L. and M.S.; validation, A.T.L., M.S., I.A. and W.H.A.; formal analysis, A.T.L., M.S., I.A. and W.H.A.; investigation, A.T.L., M.S. and I.A.; resources, A.T.L., M.S. and I.A.; data curation, A.T.L.; writing—original draft preparation, A.T.L., M.S., I.A. and W.H.A.; writing—review and editing, A.T.L., M.S., I.A. and W.H.A.; visualization, A.T.L. and I.A.; supervision, M.S. and I.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in Zenodo at DOI: 10.5281/zenodo.12729278, and can also be accessed via GitHub at MShakiba-Research/Lightweight-CNN-RNN-Model (version v1.0.0).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Torres Pineda, I.; Cho, J.H.; Lee, D.; Lee, S.M.; Yu, S.; Lee, Y.D. Environmental Impact of Fresh Tomato Production in an Urban Rooftop Greenhouse in a Humid Continental Climate in South Korea. Sustainability 2020, 12, 9029. [Google Scholar] [CrossRef]
  2. Mordor Intelligence. Tomato Market Size. Available online: https://www.mordorintelligence.com/industry-reports/tomato-market (accessed on 14 June 2024).
  3. Gatahi, D.M. Challenges and Opportunities in Tomato Production Chain and Sustainable Standards. Int. J. Hortic. Sci. Technol. 2020, 7, 235–262. [Google Scholar] [CrossRef]
  4. Bachman, J. Chapter Two—Reverse-Transcription PCR (RT-PCR). Methods Enzymol. 2020, 530, 67–74. [Google Scholar] [CrossRef]
  5. Bayani, J.; Squire, J.A. Fluorescence In Situ Hybridization (FISH). Curr. Protoc. Cell Biol. 2004, 23, 22–24. [Google Scholar] [CrossRef]
  6. Khan, F.A.; Ibrahim, A.A.; Zeki, A.M. Environmental Monitoring and Disease Detection of Plants in Smart Greenhouse Using Internet of Things. J. Phys. Commun. 2020, 4, 055008. [Google Scholar] [CrossRef]
  7. Petrellis, N. A Review of Image Processing Techniques Common in Human and Plant Disease Diagnosis. Symmetry 2018, 10, 270. [Google Scholar] [CrossRef]
  8. Ragusa, E.; Cambria, E.; Zunino, R.; Gastaldo, P. A Survey on Deep Learning in Image Polarity Detection: Balancing Generalization Performances and Computational Costs. Electronics 2019, 8, 783. [Google Scholar] [CrossRef]
  9. Gill, H.S.; Khalaf, O.I.; Alotaibi, Y.; Alghamdi, S.; Alassery, F. Multi-Model CNN-RNN-LSTM Based Fruit Recognition and Classification. Intell. Autom. Soft Comput. 2022, 33, 637–650. [Google Scholar] [CrossRef]
  10. Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; Xu, W. Cnn-Rnn: A Unified Framework for Multi-Label Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2285–2294. [Google Scholar]
  11. Guo, Y.; Liu, Y.; Bakker, E.M.; Guo, Y.; Lew, M.S. CNN-RNN: A Large-Scale Hierarchical Image Classification Framework. Multimed. Tools Appl. 2018, 77, 10251–10271. [Google Scholar] [CrossRef]
  12. Zuo, Z.; Shuai, B.; Wang, G.; Liu, X.; Wang, X.; Wang, B.; Chen, Y. Convolutional Recurrent Neural Networks: Learning Spatial Dependencies for Image Representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 18–26. [Google Scholar]
  13. Yan, R.; Ren, F.; Wang, Z.; Wang, L.; Zhang, T.; Liu, Y.; Rao, X.; Zheng, C.; Zhang, F. Breast Cancer Histopathological Image Classification Using a Hybrid Deep Neural Network. Methods 2020, 173, 52–60. [Google Scholar] [CrossRef]
  14. Yao, H.; Zhang, X.; Zhou, X.; Liu, S. Parallel Structure Deep Neural Network Using CNN and RNN with an Attention Mechanism for Breast Cancer Histology Image Classification. Cancers 2019, 11, 1901. [Google Scholar] [CrossRef] [PubMed]
  15. Islam, M.Z.; Islam, M.M.; Asraf, A. A Combined Deep CNN-LSTM Network for the Detection of Novel Coronavirus (COVID-19) Using X-ray Images. Informatics Med. Unlocked 2020, 20, 100412. [Google Scholar] [CrossRef] [PubMed]
  16. Gill, H.S.; Khehra, B.S. An Integrated Approach Using CNN-RNN-LSTM for Classification of Fruit Images. Mater. Today Proc. 2022, 51, 591–595. [Google Scholar] [CrossRef]
  17. Garg, D.; Alam, M. Integration of Convolutional Neural Networks and Recurrent Neural Networks for Foliar Disease Classification in Apple Trees. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 4. [Google Scholar] [CrossRef]
  18. Agarwal, P.; Alam, M. A lightweight deep learning model for human activity recognition on edge devices. Procedia Comput. Sci. 2020, 167, 2364–2373. [Google Scholar] [CrossRef]
  19. Arjovsky, M.; Shah, A.; Bengio, Y. Unitary evolution recurrent neural networks. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1120–1128. [Google Scholar]
  20. Gonzalez-Huitron, V.; León-Borges, J.A.; Rodriguez-Mata, A.E.; Amabilis-Sosa, L.E.; Ramírez-Pereda, B.; Rodriguez, H. Disease detection in tomato leaves via CNN with lightweight architectures implemented in Raspberry Pi 4. Comput. Electron. Agric. 2021, 181, 105951. [Google Scholar] [CrossRef]
  21. Lechner, M.; Hasani, R.; Amini, A.; Henzinger, T.A.; Rus, D.; Grosu, R. Neural Circuit Policies Enabling Auditable Autonomy. Nat. Mach. Intell. 2020, 2, 642–652. [Google Scholar] [CrossRef]
  22. Le, A.T.; Shakiba, M.; Ardekani, I. Tomato Disease Detection with Lightweight Recurrent and Convolutional Deep Learning Models for Sustainable and Smart Agriculture. Front. Sustain. 2024, 5, 1383182. [Google Scholar] [CrossRef]
  23. Tomato Leaves Dataset, Kaggle. Available online: https://www.kaggle.com/datasets/ashishmotwani/tomato (accessed on 31 May 2024).
  24. Parsania, P.; Virparia, P.V. A Review: Image Interpolation Techniques for Image Scaling. Int. J. Innov. Res. Comput. Commun. Eng. 2014, 2, 7409–7414. [Google Scholar] [CrossRef]
  25. Kraiem, M.S.; Sánchez-Hernández, F.; Moreno-García, M.N. Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models. Appl. Sci. 2021, 11, 8546. [Google Scholar] [CrossRef]
  26. Marée, R.; Geurts, P.; Piater, J.; Wehenkel, L. Random Subwindows for Robust Image Classification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 34–40. [Google Scholar] [CrossRef]
  27. Tan, X.; Triggs, B. Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions. IEEE Trans. Image Process. 2010, 19, 1635–1650. [Google Scholar] [CrossRef] [PubMed]
  28. Taye, M.M. Theoretical Understanding of Convolutional Neural Network: Concepts, Architectures, Applications, Future Directions. Computation 2023, 11, 52. [Google Scholar] [CrossRef]
  29. Agarap, A.F. Deep Learning Using Rectified Linear Units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar] [CrossRef]
  30. Medsker, L.; Jain, L.C. (Eds.) Recurrent Neural Networks: Design and Applications; CRC Press: Boca Raton, FL, USA, 1999. [Google Scholar]
  31. Werbos, P.J. Backpropagation Through Time: What It Does and How to Do It. Proc. IEEE 1990, 78, 1550–1560. [Google Scholar] [CrossRef]
  32. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar] [CrossRef]
  33. Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural Ordinary Differential Equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; p. 31. [Google Scholar]
  34. Funahashi, K.I.; Nakamura, Y. Approximation of Dynamical Systems by Continuous Time Recurrent Neural Networks. Neural Netw. 1993, 6, 801–806. [Google Scholar] [CrossRef]
  35. Hasani, R.; Lechner, M.; Amini, A.; Rus, D.; Grosu, R. Liquid Time-Constant Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 7657–7666. [Google Scholar] [CrossRef]
  36. Hasani, R.; Lechner, M.; Amini, A.; Liebenwein, L.; Ray, A.; Tschaikowski, M.; Teschl, G.; Rus, D. Closed-Form Continuous-Time Neural Networks. Nat. Mach. Intell. 2022, 4, 992–1003. [Google Scholar] [CrossRef]
  37. Torrey, L.; Shavlik, J. Transfer Learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010; pp. 242–264. [Google Scholar] [CrossRef]
  38. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  40. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  41. Battineni, G.; Sagaro, G.G.; Nalini, C.; Amenta, F.; Tayebati, S.K. Comparative Machine-Learning Approach: A Follow-Up Study on Type 2 Diabetes Predictions by Cross-Validation Methods. Machines 2019, 7, 74. [Google Scholar] [CrossRef]
  42. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  43. Shakiba, M. Lightweight-CNN-RNN-Model; GitHub. Available online: https://github.com/MShakiba-Research/Lightweight-CNN-RNN-Model/tree/v1.0.0 (accessed on 12 July 2024).
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  45. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  46. Perrin, C.; Michel, C.; Andréassian, V. Does a Large Number of Parameters Enhance Model Performance? Comparative Assessment of Common Catchment Model Structures on 429 Catchments. J. Hydrol. 2001, 242, 275–301. [Google Scholar] [CrossRef]
  47. Justus, D.; Brennan, J.; Bonner, S.; McGough, A.S. Predicting the Computational Cost of Deep Learning Models. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 3873–3882. [Google Scholar] [CrossRef]
  48. Gao, Y.; Liu, Y.; Zhang, H.; Li, Z.; Zhu, Y.; Lin, H.; Yang, M. Estimating GPU Memory Consumption of Deep Learning Models. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual, 8–13 November 2020; pp. 1342–1352. [Google Scholar] [CrossRef]
  49. Marco, V.S.; Taylor, B.; Wang, Z.; Elkhatib, Y. Optimizing Deep Learning Inference on Embedded Systems Through Adaptive Model Selection. ACM Trans. Embed. Comput. Syst. 2020, 19, 1–28. [Google Scholar] [CrossRef]
  50. Zikria, Y.B.; Afzal, M.K.; Kim, S.W.; Marin, A.; Guizani, M. Deep learning for intelligent IoT: Opportunities, challenges and solutions. Comput. Commun. 2020, 164, 50–53. [Google Scholar] [CrossRef]
Figure 1. On-site image processing for real-time tomato plant disease detection: the optimized CNN-RNN model deployed on IoT devices processes images locally, eliminating the need for cloud servers and enabling rapid feedback to user devices for timely intervention.
Figure 2. Images of tomato leaves with the ten different diseases.
Figure 3. Class distribution of tomato leaf diseases dataset from Kaggle, highlighting data imbalance. All the data classes are listed as (a) Bacterial spot; (b) Early blight; (c) Late blight; (d) Leaf Mold; (e) Septoria leaf spot; (f) Spider mites; (g) Target spot; (h) Tomato yellow leaf; (i) Tomato mosaic virus; (j) Healthy, and (k) Powdery mildew.
Figure 4. Structure of a typical convolutional neural network.
Figure 5. Application of transfer learning on ImageNet.
Figure 6. Proposed architecture for a hybrid CNN-RNN model that efficiently captures temporal and spatial relationships in images using pre-trained CNN and advanced neural network components CfC and NCP.
Figure 7. Model performance heatmap for learning rate and batch size combinations.
Figure 8. Model implementation process from image input processing to final validation for each training epoch.
Figure 9. ROC curves of the proposed model show the performance of the model across various disease classes with corresponding AUC values.
Figure 10. Training and validation accuracy comparison between 5 models over 50 epochs.
Figure 11. Comparison of training accuracy and loss between augmented and non-augmented datasets.
Figure 12. Comparison of metrics between non-augmented and augmented models on the test dataset.
Figure 13. Comparison of test accuracy and model complexity across different models: (a) VGG-16; (b) MobileNetV2; (c) MobileNetV2-LSTM; (d) MobileNetV2-CfC; (e) MobileNetV2-CfC-NCP; (f) MobileNetV2-CfC-NCP (Augmentation).
Table 1. The proposed model implements a layered structure for the tomato disease classification task.

Layer (Type)              | Output Shape     | Param #
MobileNetV2 (Functional)  | (None, 4, 1280)  | 2,257,984
reshape_3 (Reshape)       | (None, 4, 1280)  | 0
cf_c_4 (CfC)              | (None, 36)       | 217,568
dense_3 (Dense)           | (None, 11)       | 407
Table 2. Performance metrics summarization for the proposed model.

Class                | Accuracy | Precision | Recall | F1-Score | ROC-AUC
Bacterial spot       | 0.9937   | 0.9689    | 0.9635 | 0.9662   | 0.9801
Early blight         | 0.9924   | 0.9577    | 0.9484 | 0.9530   | 0.9723
Late blight          | 0.9958   | 0.9896    | 0.9693 | 0.9793   | 0.9841
Leaf Mold            | 0.9989   | 0.9915    | 0.9971 | 0.9943   | 0.9981
Septoria leaf spot   | 0.9913   | 0.9452    | 0.9637 | 0.9544   | 0.9789
Spider mites         | 0.9968   | 0.9860    | 0.9591 | 0.9724   | 0.9791
Target spot          | 0.9955   | 0.9649    | 0.9607 | 0.9628   | 0.9792
Tomato yellow leaf   | 0.9989   | 0.9922    | 0.9922 | 0.9922   | 0.9958
Tomato mosaic virus  | 0.9979   | 0.9853    | 0.9853 | 0.9853   | 0.9921
Healthy              | 0.9984   | 0.9922    | 0.9922 | 0.9922   | 0.9956
Powdery mildew       | 0.9971   | 0.9603    | 0.9528 | 0.9565   | 0.9757
Table 3. Comparative performance metrics of the baseline and proposed models.

Model                | Accuracy | Precision | Recall | F1-Score
VGG-16               | 94.41    | 94.62     | 94.19  | 94.37
ResNet-50            | 94.66    | 94.94     | 94.41  | 94.61
MobileNetV2          | 94.66    | 94.5      | 94.31  | 94.33
MobileNetV2-LSTM     | 96.38    | 96.39     | 96.20  | 96.27
MobileNetV2-CfC-NCP  | 97.15    | 97.16     | 97.09  | 97.11
Table 4. Computational efficiency comparison of the baseline and proposed models.

Model                | Parameters | Memory (MB) | Inference Speed (ms) | FLOPs (Mega)
VGG-16               | 14,978,379 | 57.14       | 6.33                 | 2507.7
ResNet-50            | 24,637,835 | 93.99       | 3.14                 | 633.1
MobileNetV2          | 2,914,891  | 11.12       | 0.77                 | 50.22
MobileNetV2-LSTM     | 2,603,019  | 9.93        | 0.89                 | 49.60
MobileNetV2-CfC-NCP  | 2,475,959  | 9.45        | 0.80                 | 49.71
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
