1. Introduction
1.1. Problem Background
Convolutional neural networks (CNNs) have garnered significant attention in recent years within the field of artificial intelligence due to their exceptional performance in addressing diverse real-world tasks such as object recognition [1], image classification [2], defect identification [3], denoising [4], etc. Typically, CNNs consist of convolutional layers, pooling layers, and fully-connected layers. This network architecture exhibits remarkable versatility when dealing with a wide range of data types, including images, audio, and video. One fundamental challenge in these applications is the development of efficient signal preprocessing and feature extraction systems to generate appropriate data structures for specific tasks such as classification. CNNs, however, have alleviated the need for preprocessing in many domains by automatically extracting valuable features. In several cases, given a meticulously designed network architecture, the convolutional layers of CNNs can perform this automatic feature extraction as effectively as human experts.
Nevertheless, crafting meaningful CNN architectures remains a meticulous and labor-intensive process, often requiring the expertise of specialists [5]. This network design process involves identifying the most suitable CNN components, both in terms of architecture and hyperparameters. The performance of CNNs is primarily influenced by two key factors: architectures and trainable parameters (i.e., weights and biases) [6]. While gradient descent algorithms have been effective for optimizing trainable parameters, there are no explicit functions available for directly determining the optimal CNN architecture needed to achieve competitive results on specific datasets [6]. Notable CNN architectures, including AlexNet [7], ResNet [8], VGGNet [9], MobileNet [10], and GoogLeNet [11], have demonstrated remarkable accuracy enhancements in computer vision tasks due to their distinctive network architectures characterized by layer count, interlayer connections, and basic unit designs. However, the manual design of these network architectures remains challenging due to the large number of parameters [12]. Notably, these CNN architectures are hand-crafted and cannot autonomously learn optimal configurations, necessitating designers with extensive expert knowledge in CNN architecture design [13]. Furthermore, the optimal design of CNN architecture is problem-specific and determined by varying data distributions. These manually crafted architectures lack flexibility, often necessitating time-consuming trial-and-error approaches. Additionally, these manually designed networks may have limited adaptability to diverse datasets, potentially hindering network generalization [6].
To address these challenges and limitations of manually crafted network architectures, there is an increasing demand for automated CNN architecture design methods that reduce reliance on domain-specific human expertise. The goal of these automated methods is to efficiently search for the optimal CNN architectures that can surpass the performance of manually crafted counterparts. The development of automated CNN architecture design approaches, capable of accommodating diverse datasets, represents a critical step in enhancing CNN efficiency and effectiveness. Such automated strategies yield network architectures tailored to specific tasks and dataset requirements, ultimately enhancing CNN generalization capabilities.
1.2. Recent Progress in Network Architecture Design Techniques
Designing optimal CNN architectures for specific datasets has traditionally demanded substantial manual effort. Recent advancements in network architecture design have introduced three main approaches to automate this process: reinforcement learning-based [14,15,16], gradient-based [17,18], and metaheuristic search algorithm (MSA)-based [19,20] methods.
Reinforcement learning-based approaches, such as efficient architecture search (EAS) [15] and BlockQNN [16], have exhibited impressive performance in discovering competitive architectures. EAS employs network transformation techniques to evolve existing model architectures, while BlockQNN employs a reinforcement technique based on Q-Learning, with an epsilon-greedy strategy to balance exploitation and exploration. However, these approaches demand significant computational resources. For instance, EAS and BlockQNN require 5 and 32 graphics processing units (GPUs), respectively, for tasks such as solving CIFAR-10 and ImageNet datasets. Gradient-based methods, such as differentiable architecture search (DARTS) [17], offer higher efficiency compared to reinforcement learning-based strategies, but may yield inconsistent results. Nonetheless, they still require substantial human knowledge and computational resources during the network construction phase.
In contrast, MSA-based approaches present a promising solution by integrating nature-inspired search operators, facilitating the discovery of optimal network architectures without the need for specialized domain expertise. These methods, including particle swarm optimization (PSO), grey wolf optimization (GWO), teaching-learning-based optimization (TLBO), and differential evolution (DE), exhibit robust global search capabilities and find extensive application across various domains [21,22,23,24]. Due to their appealing features, MSA-based techniques have emerged as popular alternatives to conventional design methods, offering researchers a versatile tool to effectively address a wide array of deep learning challenges.
1.3. General Challenges of MSA-Based Network Design Techniques
MSA-based approaches hold great promise for robustly searching for optimal CNN architecture designs tailored to given datasets. However, despite their potential, several fundamental challenges persist. One major challenge is that optimal CNN architectures for different datasets are generally unknown in advance, and can encompass a wide range of architectures and parameters, including the number and type of layers, number of filters, kernel size, pool size, pool stride, and number of neurons. Addressing this challenge requires the adoption of an appropriate encoding strategy that can represent individual solutions as potential CNN architectures with varying network lengths, while maintaining manageable search complexity.
Additionally, integrating effective network constraints into MSA-based approaches during the optimization of CNN architectures is crucial. These constraints prevent the construction of invalid networks, while preserving flexibility in the discovery of novel network architectures. Another challenge associated with population-based MSAs is the significant computational time and resources required to evaluate the effectiveness of each candidate solution representing potential CNN architectures for a given dataset. Therefore, it is imperative to implement a fitness evaluation process with enhanced computational efficiency to make MSAs more practical for optimizing CNN architectures.
While numerous state-of-the-art MSAs inspired by various natural phenomena have emerged due to the no free lunch theorem, their applicability to complex optimization tasks remains relatively unexplored. Although established MSAs such as PSO, genetic algorithms (GAs), and DE have been used to address CNN architecture optimization, there is a critical need to advance the field by exploring the potential of other MSAs in tackling these intricate real-world challenges.
1.4. Drawbacks of Original TLBO in CNN Architecture Optimization
TLBO has recently emerged as a promising approach for optimizing the search for optimal CNN architecture designs based on given datasets [25]. However, this automatic network design method primarily relies on search operators from the original TLBO version, which has certain limitations. One notable concern is that in the original TLBO, all learners are guided by the same exemplars (population mean and teacher solution) during the teacher phase, overlooking potentially valuable information contributed by other population members. While the original TLBO shows promising convergence speed, it is highly susceptible to premature convergence, especially when both the teacher and population mean become trapped in suboptimal regions early in the optimization process.
Moreover, the learner phase of the original TLBO does not emulate a realistic classroom learning scenario, as it confines each learner’s interaction to a randomly selected peer for acquiring new knowledge. Therefore, considering alternative strategies such as self-learning and adaptive interaction with multiple peers during the learner phase could enhance TLBO’s learning capability. Lastly, TLBO uses a greedy selection scheme based solely on the fitness criterion to determine the survival of learners in the subsequent generation. While this scheme is straightforward to implement, it may neglect potentially superior learners that exhibit temporarily poor fitness values but could substantially improve the overall population quality in the long term. These limitations can hinder the effectiveness of the original TLBO in solving complex tasks, including CNN architecture optimization, due to the imbalance between exploration and exploitation strengths.
1.5. Research Significances and Contributions
This paper introduces a new variant, i.e., modified TLBO with refined knowledge sharing (MTLBORKS), designed to autonomously discover optimal CNN architectures tailored to specific datasets without human intervention. This process involves searching for the optimal combination of network hyperparameters across three vital CNN building blocks: the convolutional block, pooling block, and fully-connected block. Determining the best combinations of these hyperparameters, encompassing layer numbers and types, filter numbers, kernel sizes, pool size and stride, and neuron numbers, contributes to the optimal design of CNN architecture. MTLBORKS incorporates several enhancements throughout its teacher phase, learner phase, and selection scheme, collectively promoting a more balanced exploration and exploitation search.
The key highlights of this study include:
Introduction of MTLBORKS-CNN, an automatic network design method demonstrating outstanding accuracy in image classification tasks. This approach leverages the robust global search capabilities of MTLBORKS to autonomously identify the optimal CNN architectures tailored to specific datasets without human intervention.
The proposed MTLBORKS-CNN method incorporates a comprehensive solution encoding strategy that facilitates the search for optimal network hyperparameters. This encoding strategy enables the construction of robust and innovative CNN architectures with varying network lengths for diverse datasets, while avoiding the generation of infeasible models. To ensure practicality, a fitness evaluation process with reduced computational intensity is implemented.
In the modified teacher phase of MTLBORKS-CNN, the unique population mean and social exemplar are calculated for each learner by harnessing the valuable information from better-performing learners through a social learning concept. This individualized approach enhances the potential of learners for discovering novel CNN architectures, while preserving population diversity.
The modified learner phase of MTLBORKS-CNN integrates two innovative strategies: self-learning and adaptive peer learning, aiming to enhance the knowledge of learners effectively. The self-learning mechanism promotes exploration, empowering each learner to independently search for new CNN architectures through personal efforts. Conversely, the adaptive peer learning encourages exploitation by facilitating knowledge sharing among learners through adaptive interactions with multiple peers based on their fitness values during the search for optimal CNN architectures.
MTLBORKS-CNN integrates a dual-criterion selection scheme that comprehensively evaluates learners for their survival in the next generation. This scheme considers the fitness and diversity values of learners, reducing the risk of premature convergence. It ensures that learners with the relatively lower fitness but promising diversity values are not prematurely excluded, allowing them to contribute to the long-term enhancement of the overall population’s quality.
This study conducts thorough simulation studies on nine image datasets derived from MNIST variations to meticulously evaluate the efficacy and feasibility of MTLBORKS-CNN in automatically discovering optimal CNN architectures. The results highlight the considerable merit of MTLBORKS-CNN, as the produced optimal CNN architectures consistently outperform state-of-the-art methods across most of the datasets.
This paper is structured as follows: Section 2 provides an overview of prior research efforts. Section 3 explains the search mechanisms used by MTLBORKS-CNN to autonomously discover optimal CNN architectures. Section 4 presents detailed performance assessments conducted on nine distinct image datasets derived from MNIST variations. Lastly, Section 5 concludes the paper and outlines avenues for future research.
3. Proposed MTLBORKS-CNN
This study introduces MTLBORKS-CNN as an automatic technique for designing optimized CNN architectures for image classification tasks. The proposed approach eliminates the need for domain-specific expertise from humans. Figure 1 provides an overview of the MTLBORKS-CNN framework, and specific modifications within MTLBORKS-CNN will be detailed in subsequent subsections.
3.1. Overview of CNN Architecture
Figure 2 depicts a typical sequential CNN architecture, comprising a feature extraction module with two convolutional layers and two pooling layers, along with a trainable classification module consisting of three fully-connected layers. Each CNN layer is defined by specific network hyperparameters, vital for network construction and training, as detailed in the following subsections.
3.1.1. Convolutional Layer
CNNs utilize two types of convolution processes: SAME convolution and VALID convolution. SAME convolution produces feature maps with the same dimensions as the input data by using zero padding. In contrast, VALID convolution generates smaller feature maps without any padding. Convolutional blocks employ filters with predetermined height and width to generate feature maps from input data.
In the convolution process, a filter horizontally slides from left to right with a specified stride width, then vertically moves downward with a stride height, repeating the left-to-right process to form a complete feature map. Feature map elements are computed by summing the products of the filter elements and the corresponding input elements covered by the filter. The key network hyperparameters of the convolutional layer considered for CNN architecture optimization in this study include the number of convolutional layers, the number of filters in each l-th convolutional layer, and the kernel size of the filters in each l-th convolutional layer.
3.1.2. Pooling Layer
Pooling is critical in CNNs for achieving local translation invariance and results in down-sampled feature maps that exhibit increased robustness to variations in feature locations within input data. There are two common pooling types: average pooling and maximum pooling. Average pooling computes the mean values of elements within a kernel to create down-sampled feature maps. Meanwhile, maximum pooling identifies the largest values among the captured elements.
Pooling involves applying a kernel with a predefined type, height, and width to the input data. Down-sampled feature maps are generated by sliding the kernel from the top-left to the bottom-right, guided by predetermined stride height and width. The network hyperparameters of the pooling layer considered for CNN architecture optimization in this study include the selection probability of the pooling type connected to each l-th convolutional layer, the kernel size of the pooling layer connected to each l-th convolutional layer, and the stride size of the pooling layer connected to each l-th convolutional layer.
3.1.3. Fully-Connected Layer
The feature extraction module, comprising convolutional and pooling layers, extracts relevant features from raw data. Once extracted, these features can be fed into a classifier, often a fully-connected layer used for classification. To introduce the output feature maps to the fully-connected layer, the feature maps must be flattened and reshaped into a vector.
The CNN’s classification module can consist of one or multiple fully-connected layers, comprising neurons that process input to generate an output. These layers receive inputs from neurons in the preceding layers via connections with assigned weights. Generally, the fully-connected layers are used alongside backpropagation to learn network weights. CNN training aims to minimize errors between predicted and actual dataset outputs via the reduction of cross-entropy loss. The network hyperparameters of the fully-connected layers considered for CNN architecture optimization in this study include the number of fully-connected layers and the number of neurons in each q-th fully-connected layer.
3.2. Solution Encoding Scheme of MTLBORKS-CNN
Once the network hyperparameters crucial for optimizing the CNN architecture are identified, an efficient solution encoding scheme is designed for MTLBORKS-CNN, as depicted in Figure 1. This scheme allows each learner to encode vital network hyperparameters necessary for generating optimal CNN architectures. These encoded hyperparameters encompass network length, layer types, filter numbers, kernel sizes, pool sizes, pool strides, and neuron numbers. The solution encoding approach in MTLBORKS-CNN is meticulously designed to prevent the generation of invalid network architectures, while maintaining the flexibility required to discover effective CNN architectures for diverse image classification tasks.
Figure 3 illustrates the D-dimensional position vector of the n-th MTLBORKS-CNN learner. Each d-th dimension of this vector contains essential information for constructing a unique CNN architecture, including layer types and their corresponding hyperparameters. The encoded information is categorized into three main sections: convolutional, pooling, and fully-connected. Table 1 summarizes the attributes of the network hyperparameters to be optimized in each section, such as data type, encoded dimension index, index number (if applicable), lower limit, and upper limit. Notably, the pooling hyperparameters encoded in the pooling section indicate the types of pooling layers connected to each l-th convolutional layer, namely (a) no pooling layer, (b) maximum pooling, or (c) average pooling, depending on the range within which the encoded value falls. Given the predefined maximum numbers of convolutional and fully-connected layers, the total dimensional size of the position vector of each n-th learner is fixed.
Referring to the solution encoding scheme depicted in Figure 3 and detailed in Table 1, Algorithm 1 is devised to decode the network hyperparameters contained within each learner and transform them into a valid CNN architecture. It is worth noting that, although all position vectors share the same D-dimensional size, each n-th MTLBORKS-CNN learner can generate CNNs of varying network lengths based on the encoded number of convolutional layers, the pooling-type selection probabilities, and the encoded number of fully-connected layers, as indicated in Table 1. For instance, only the first pieces of information corresponding to the encoded number of convolutional layers (i.e., the filter numbers, kernel sizes, pooling-type probabilities, pool sizes, and pool strides of those layers) are utilized to construct the convolutional and pooling sections of the CNN; the information pertaining to the remaining convolutional and pooling sections is disregarded during network construction. The pool size and pool stride associated with the l-th convolutional layer are likewise omitted when the corresponding pooling-type probability falls within the range indicating that no pooling layer is linked to that convolutional layer. Similarly, only the first pieces of information corresponding to the encoded number of fully-connected layers (i.e., the neuron numbers of those layers) are used to construct the fully-connected section of the CNN, and any remaining fully-connected information is omitted.
Algorithm 1: Decoding Learner to CNN Architecture
Input: the position vector of the n-th learner
01: Initialize an empty CNN architecture;
02: Extract the number of convolutional layers and the number of fully-connected layers encoded in the corresponding dimensions of the learner;
03: Initialize the indices of the convolutional layer and the fully-connected layer, respectively;
04: while the convolutional layer index does not exceed the encoded number of convolutional layers do /*only the first pieces of information corresponding to the encoded number of convolutional layers are used to construct the convolutional and pooling sections*/
05: Extract the filter number and kernel size encoded in the corresponding dimensions of the learner;
06: Append the l-th convolutional layer with the extracted filter number and kernel size to the CNN architecture;
07: Extract the pooling-type selection probability, pool size, and pool stride encoded in the corresponding dimensions of the learner;
08: if the pooling-type selection probability indicates no pooling then
09: No pooling layer is appended to the l-th convolutional layer of the CNN architecture;
10: elseif the pooling-type selection probability indicates maximum pooling then
11: Append a maximum pooling layer with the extracted pool size and stride to the l-th convolutional layer of the CNN architecture;
12: else /*the pooling-type selection probability indicates average pooling*/
13: Append an average pooling layer with the extracted pool size and stride to the l-th convolutional layer of the CNN architecture;
14: end if
15: Increment the convolutional layer index;
16: end while
17: while the fully-connected layer index does not exceed the encoded number of fully-connected layers do /*only the first pieces of information corresponding to the encoded number of fully-connected layers are used to construct the fully-connected section*/
18: Extract the neuron number encoded in the corresponding dimension of the learner;
19: Append the q-th fully-connected layer with the extracted neuron number to the CNN architecture;
20: Increment the fully-connected layer index;
21: end while
Output: A valid CNN architecture corresponding to the learner
Table 2 provides an overview of the feasible search ranges of the network hyperparameters subject to CNN architecture optimization, from which the total dimension size of each learner is determined. As depicted in Figure 4, Algorithm 1 can construct different valid CNN architectures based on the unique combination of network hyperparameters encoded within each MTLBORKS-CNN learner, provided that these hyperparameters fall within their predefined boundary limits.
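For illustration, the following Python sketch mirrors this encoding and decoding idea. The layer-count limits, search ranges, and pooling-type thresholds used here are assumed values for demonstration only, not the settings defined in Table 1 or Table 2.

```python
import random

# Illustrative upper bounds on layer counts (assumed, not the paper's values).
MAX_CONV, MAX_FC = 3, 2

def random_learner():
    """Randomly generate a flat position vector that encodes one CNN candidate."""
    x = [random.randint(1, MAX_CONV)]                  # number of active conv layers
    for _ in range(MAX_CONV):                          # one block of genes per conv layer
        x += [random.randint(8, 64),                   # filter number
              random.choice([3, 5, 7]),                # kernel size
              random.random(),                         # pooling-type selection probability
              random.choice([2, 3]),                   # pool size
              random.choice([1, 2])]                   # pool stride
    x.append(random.randint(1, MAX_FC))                # number of active FC layers
    x += [random.choice([64, 128, 256]) for _ in range(MAX_FC)]  # neurons per FC layer
    return x

def decode(x):
    """Decode a position vector into an ordered layer list (cf. Algorithm 1)."""
    layers, idx, n_conv = [], 1, x[0]
    for l in range(MAX_CONV):
        filters, kernel, p_pool, pool, stride = x[idx:idx + 5]
        idx += 5
        if l >= n_conv:                                # genes beyond n_conv are ignored
            continue
        layers.append(("conv", filters, kernel))
        # Assumed thresholds: <1/3 no pooling, <2/3 max pooling, otherwise average pooling.
        if p_pool >= 2 / 3:
            layers.append(("avg_pool", pool, stride))
        elif p_pool >= 1 / 3:
            layers.append(("max_pool", pool, stride))
    n_fc = x[idx]
    for q in range(MAX_FC):
        if q < n_fc:                                   # genes beyond n_fc are ignored
            layers.append(("fc", x[idx + 1 + q]))
    return layers

if __name__ == "__main__":
    print(decode(random_learner()))
```

Because the vector always has the same length, unused dimensions are simply skipped during decoding, which is how learners of fixed dimensionality can still represent networks of different depths.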
3.3. Population Initialization of MTLBORKS-CNN
Algorithm 2 outlines the initialization process of the MTLBORKS-CNN population. During this stage, diverse candidate CNN architectures are created by randomly generating the position vectors of all N learners. The dimension size of each position vector is fixed, considering the predefined maximum numbers of convolutional and fully-connected layers.
For each potential CNN architecture represented by the n-th learner, the network hyperparameters of the convolutional, pooling, and fully-connected sections are randomly generated within their feasible search ranges. For example, the convolutional section’s hyperparameters, such as the number of convolutional layers, the filter numbers, and the kernel sizes, are initialized in their corresponding dimensions. The remaining network hyperparameters are initialized based on the solution encoding scheme and the attributes of the network hyperparameters described in Figure 3 and Table 1, respectively.
Algorithm 2: Population Initialization of MTLBORKS-CNN
Inputs: the population size N, the search ranges of all network hyperparameters, and the predefined maximum numbers of convolutional and fully-connected layers
01: Calculate the total dimension size D of each position vector;
02: Initialize the teacher solution;
03: for n = 1 to N do
04: Initialize the position vector of the n-th learner;
05: for d = 1 to D do /*Initialize each dimension of the learner*/
06: Identify the type of network hyperparameter encoded in the d-th dimension;
07: Randomly initialize this network hyperparameter based on its corresponding attributes described in Table 1;
08: end for
09: Evaluate the fitness of the n-th learner based on Algorithm 3;
10: if the n-th learner outperforms the current teacher solution then
11: Update the teacher solution with the n-th learner;
12: end if
13: end for
Output: The initial population and the teacher solution
Once the position vector is generated for each n-th learner, its fitness value, expressed as a classification error, is determined through the fitness evaluation process detailed in the next section. This initialization is performed for all N learners, forming an initial population. The learner with the lowest classification error becomes the teacher solution, represented by its position vector and fitness value.
3.4. Fitness Evaluation of MTLBORKS-CNN
Each MTLBORKS-CNN learner possesses a position vector representing a potential CNN architecture for solving a given problem. The learner’s fitness is evaluated based on the corresponding classification error of this CNN. Learners with lower error rates in classifying datasets are considered to have superior fitness, and vice versa. In this study, the automatic network architecture design problem is formulated as a minimization problem. The primary goal of MTLBORKS-CNN is to identify the optimal CNN model that solves the assigned tasks with the fewest classification errors. Algorithm 3 outlines the two major stages of fitness evaluation for each MTLBORKS-CNN learner, involving (i) constructing and training a potential CNN architecture with the training set, and (ii) evaluating the trained CNN architecture using the validation set.
Algorithm 3: Fitness Evaluation of MTLBORKS-CNN
Inputs: the position vector of the n-th learner, the training and validation datasets, the batch size, the number of training epochs, and the number of output classes
01: Compile a full-fledged CNN architecture based on the network information extracted from the learner using Algorithm 1, with a fully-connected layer containing as many output neurons as there are classes added as the last layer of the CNN;
02: Compute the numbers of training and validation batches using Equations (4) and (6), respectively;
03: Initialize the weights of the compiled CNN model using the He Normal initializer;
04: for each training epoch do /*First stage of fitness evaluation as explained in Section 3.4.1*/
05: for each training step do
06: Calculate the cross-entropy loss of the CNN model based on the current weights and the i-th batch of training data;
07: Update the weights of the CNN model based on Equation (5);
08: end for
09: end for
10: for each validation step do /*Second stage of fitness evaluation as explained in Section 3.4.2*/
11: Use the trained CNN model to classify the j-th batch of the validation dataset;
12: Calculate the classification error of the trained CNN model on the j-th batch of validation data;
13: end for
14: Calculate the mean classification error of the constructed CNN using Equation (7) to obtain the fitness value of the n-th learner;
Output: The fitness value of the n-th learner
3.4.1. Stage 1: Construction and Training of Potential CNN Architecture
When evaluating the fitness of each n-th learner, a CNN architecture is constructed using Algorithm 1 based on the network hyperparameters decoded from the corresponding position vector. These hyperparameters include the numbers of convolutional and fully-connected layers, the filter numbers, kernel sizes, pooling types, pool sizes, pool strides, and neuron numbers. The compiled CNN architecture is also augmented with a fully-connected layer having the same number of output neurons as the required classification classes.
All convolutional and fully-connected layer weights are initialized using the He Normal weight initializer [40], and these trainable parameters form the weight vector of the compiled CNN model. The training dataset, with a predefined size, is employed to train each potential CNN architecture represented by the MTLBORKS-CNN learners. Each CNN architecture undergoes training in a number of steps determined by the training set size and a predefined batch size, as given in Equation (4). The compiled CNN architecture is trained using the Adam optimizer [41] for a predetermined number of epochs, performed over the data batches extracted from the training dataset. At each i-th training step, the cross-entropy loss function of the CNN model is evaluated using the current weight parameters and the i-th batch of data. Given the learning rate and the gradient of the cross-entropy loss with respect to the current weights, the updated weight parameters of the CNN model are obtained through the gradient-based update defined in Equation (5).
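For illustration, a simplified first-order form consistent with this description is shown below in generic notation; the Adam optimizer additionally maintains first- and second-moment estimates of the gradient when forming the actual update:

$$\theta_{i+1} = \theta_i - \alpha \, \nabla_{\theta}\mathcal{L}\!\left(\theta_i, B_i\right),$$

where $\theta_i$ denotes the trainable weights at the $i$-th training step, $\alpha$ the learning rate, and $\mathcal{L}(\theta_i, B_i)$ the cross-entropy loss evaluated on the $i$-th data batch $B_i$.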
3.4.2. Stage 2: Evaluation of the Trained CNN Architecture
Next, the trained CNN model is evaluated using a validation dataset with a given number of samples, and the resulting classification error is assigned as the fitness value of the corresponding MTLBORKS-CNN learner. This evaluation is performed over multiple steps, whose number is determined by the validation set size and the batch size, as given in Equation (6). In each j-th evaluation step, a different batch of validation data is used to evaluate the trained CNN model, resulting in a distinct classification error. The mean classification error of the trained CNN model, computed across all batches of the validation dataset, determines the fitness value of the n-th learner, as given in Equation (7).
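Expressed in generic notation (an illustrative form consistent with this description), the fitness of the n-th learner is the mean of the per-batch classification errors:

$$F(X_n) = \frac{1}{J}\sum_{j=1}^{J} e_j,$$

where $J$ is the number of validation batches and $e_j$ is the classification error measured on the $j$-th validation batch.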
3.4.3. Design Consideration of Epoch Numbers during Fitness Evaluation Process
CNNs are often characterized by deep architectures, which makes full training on a dataset for minimal classification error computationally expensive and time-consuming. This is primarily due to the requirement for a large number of training epochs, typically exceeding 100, for full training of CNNs. However, when employing the population-based MTLBORKS-CNN to identify the optimal CNN architectures tailored to a specific dataset, each learner would have to undergo full training in every generation. This approach becomes impractical because it requires evaluating numerous candidate CNN architectures represented by all learners, resulting in a significant number of fitness evaluations. In the context of optimizing CNN architectures, each fitness evaluation involves the full training of a potential CNN architecture represented by an MTLBORKS-CNN learner. These excessive computational demands pose challenges to the feasibility of MTLBORKS-CNN in handling a large number of alternatives for achieving substantial improvements in the optimization process of CNN architectures.
To tackle this issue, a fitness approximation method is implemented. It involves training the potential CNN architecture represented by each MTLBORKS-CNN learner with a reduced number of training epochs during the fitness evaluation process. While this may result in less precise evaluations, it effectively mitigates the computational load. In the selection process for the next generation of the population, it is more vital to ensure a fair comparison among learners to establish their relative superiority than to achieve precise fitness evaluations for each individual.
Moreover, a candidate CNN architecture is more likely to demonstrate a promising final classification error if it displays superior performance in the initial training epochs. Upon completing the MTLBORKS-CNN search process, the optimal CNN architecture formulated based on the network hyperparameters derived from the teacher can undergo full training with a larger epoch size to obtain its final classification error. To address potential network overfitting, the dropout and batch normalization techniques are incorporated between different layers of candidate CNN architectures [13].
3.5. Modified Teacher Phase of MTLBORKS-CNN
In the teacher phase of original TLBO, all learners are guided by the same exemplars, specifically the teacher solution and population mean, as defined by Equation (1). However, this conventional approach disregards valuable directional information contributed by other nonfittest population members. While both teacher solution and population mean can expedite learners’ convergence towards promising solution regions in the early stages of optimization, their influence diminishes as population diversity declines over time. This limitation becomes especially evident when addressing real-world optimization problems characterized by complex fitness landscapes, such as CNN architecture optimization in this study. These complex optimization problems often feature numerous local optima, which can misguide TLBO towards suboptimal regions. This undesirable phenomenon impedes the effectiveness of TLBO in searching for satisfactory CNN architectures due to premature convergence. To address this inherent limitation of teacher phase in the original TLBO, it is vital to incorporate a robust diversity maintenance mechanism. In this study, the modified teacher phase in MTLBORKS-CNN integrates a social learning concept. This mechanism considers the valuable direction information provided by other nonfittest learners, enabling more diversified and tailored guidance to each learner during the teacher phase, ultimately resulting in more effective CNN architecture searching.
3.5.1. Construction of Unique Mean Positions
In the modified teacher phase of MTLBORKS-CNN, a social learning concept is employed to calculate unique mean positions for all learners. This process begins by arranging all learners in descending order based on their fitness values, denoted as
for
. It is assumed that any learner outperforming the
n-th learner falls within the population indices
. For each
n-th learner, a unique mean position
is defined to represent the mean CNN architecture information for
. Particularly, each
d-th dimension of unique mean position (i.e.,
) is calculated as follows:
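One plausible form of Equation (8), under the above indexing convention and with generic notation, is:

$$\bar{X}_{n,d} = \operatorname{round}\!\left(\frac{1}{N-n}\sum_{k=n+1}^{N} X_{k,d}\right), \quad n = 1, \ldots, N-1,$$

where the rounding is skipped for the dimensions that store pooling-type selection probabilities, as explained below.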
The rounding operator transforms the dimensional components of the unique mean position into integer values. This conversion excludes the network hyperparameters that signify the selection probabilities of the pooling layers connected to each l-th convolutional layer. The mean CNN architecture represented by the unique mean position varies across learners because its computation relies on a distinct group of learners with superior fitness. However, the best-performing learner, with n = N, lacks better-performing peers to mimic; therefore, it does not receive a unique mean position from Equation (8). Figure 5 illustrates the calculation of the unique mean position of the worst-performing learner using Equation (8), considering the directional information provided by the other superior learners. Algorithm 4 outlines the pseudocode used to compute the mean CNN architecture of every n-th learner.
Algorithm 4: Computation of Unique Mean Positions Based on Social Learning Concept
Inputs: the current population, the fitness values of all learners, and the dimension size D
01: Sort all learners in descending order based on their fitness values (i.e., from worst to best);
02: for n = 1 to N − 1 do
03: Initialize the unique mean position of the n-th learner;
04: for d = 1 to D do
05: Calculate the d-th dimension of the unique mean position using Equation (8);
06: if the d-th dimension does not store a pooling-type selection probability do
07: Round the d-th dimension of the unique mean position to an integer value;
08: end if
09: end for
10: end for
Output: The unique mean positions of all learners except the best-performing one
3.5.2. Construction of Unique Social Exemplar
To enhance knowledge exchange within a classroom, learners often benefit not only from their teachers, but also from peers who excel in various subjects. A similar approach is employed in the modified teacher phase of MTLBORKS-CNN, where each
n-th learner is assigned a unique social exemplar,
. This exemplar is designed to provide more effective guidance, drawing from valuable insights offered by other population members. Specifically, any randomly selected learner who outperforms the
n-th learner (with a population index
) contributes to the
d-th dimension of the unique social exemplar allocated to each
n-th learner, denoted as
. This contribution is achieved via the corresponding dimension of the
-th learner, as shown below:
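Under the same indexing convention, Equation (9) can be read as assigning, independently for every dimension, the value of a randomly chosen better-performing learner (generic notation, illustrative form):

$$X^{SE}_{n,d} = X_{k_d,\,d}, \qquad k_d \sim \mathcal{U}\{n+1, \ldots, N\}, \quad n = 1, \ldots, N-1,$$

where $k_d$ is the index of the learner randomly selected for the $d$-th dimension.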
Note that the unique social exemplar differs for each n-th learner, as it is computed from a distinct group of learners with superior fitness compared to the n-th learner. Similar to Equation (8), Equation (9) does not apply to the top-performing learner, as it surpasses all other learners. Figure 6 illustrates the process of constructing a unique social exemplar for the worst-performing learner by considering the position vectors of four other learners with superior fitness. Algorithm 5 presents the pseudocode detailing the creation of the unique social exemplar of every n-th learner except the best-performing one.
Algorithm 5: Computation of Unique Social Exemplars Based on Social Learning Concept
Inputs: the sorted population, the fitness values of all learners, and the dimension size D
01: for n = 1 to N − 1 do
02: Initialize the unique social exemplar of the n-th learner;
03: for d = 1 to D do
04: Randomly select a fitter learner with a population index larger than n;
05: Assign the d-th dimension of the unique social exemplar based on Equation (9);
06: end for
07: end for
Output: The unique social exemplars of all learners except the best-performing one
3.5.3. Construction of New CNN Architecture
In the modified teacher phase of MTLBORKS-CNN, a new CNN architecture is determined for each n-th offspring learner through a new position vector. This vector is derived from Equation (10) by incorporating both the unique mean position and the unique social exemplar computed for each learner other than the best-performing one, together with two coefficients randomly generated from the uniform distribution. As shown in Equation (10), the position vector of the n-th offspring learner, crucial for generating the new CNN architecture, is determined by the differences between the CNN architectures represented by the unique social exemplar and the unique mean position.
After obtaining the new position vector from Equation (10), boundary checking is performed to ensure that all network hyperparameters lie within their search boundaries, as specified in Table 1. A rounding operator is then applied to convert the network hyperparameters into integer values, excluding those that signify the selection probabilities of the pooling layers connected to each l-th convolutional layer. Algorithm 3 measures the fitness of the new position vector, and the teacher solution is replaced by the n-th offspring learner if the latter has superior fitness. All generated offspring solutions, along with the best-performing N-th learner, are stored in an offspring population set. These solutions are used in the subsequent modified learner phase alongside the original population. The search mechanisms of the modified teacher phase of MTLBORKS-CNN are described in Algorithm 6.
Algorithm 6: Modified Teacher Phase of MTLBORKS-CNN
Inputs: the current population, the fitness values of all learners, the teacher solution, the dimension size D, and the search boundaries of all network hyperparameters
01: Calculate the unique mean position of each learner (excluding the best-performing one) using Algorithm 4;
02: Calculate the unique social exemplar of each learner (excluding the best-performing one) using Algorithm 5;
03: Initialize the offspring population as an empty set;
04: for n = 1 to N do
05: if the n-th learner is not the best-performing learner then
06: Randomly generate the two uniform random coefficients;
07: Calculate the new position vector of the n-th offspring learner using Equation (10) and perform boundary checking;
08: for d = 1 to D do
09: if the d-th dimension does not store a pooling-type selection probability do
10: Round the d-th dimension of the new position vector to an integer value;
11: end if
12: end for
13: Perform fitness evaluation on the new position vector to obtain its fitness using Algorithm 3;
14: else /*i.e., the best-performing learner*/
15: Retain the best-performing learner as its own offspring;
16: end if
17: if the n-th offspring learner outperforms the teacher then
18: Update the teacher solution with the n-th offspring learner; /*Update the teacher*/
19: end if
20: Store the new learner into the offspring population;
21: end for
Output: The offspring population and the updated teacher solution
3.6. Modified Learner Phase of MTLBORKS-CNN
In the learner phase of original TLBO, each learner performs searching within the solution space by interacting with a randomly selected peer learner from the population. This interaction, as described in Equation (3), involves attracting all dimensional components of the learner toward the peer learner if the peer has superior fitness, or repelling the learner away from the peer if the peer has inferior fitness. However, as the number of iterations increases, the likelihood of triggering the exploration-focused repelling mechanism diminishes. This is primarily because most learners converge toward specific solution regions, resulting in reduced population diversity. When dealing with complex optimization problems, such as CNN architecture optimization in the current study, the decreasing exploration strength of the original TLBO in later optimization stages can hinder its ability to discover new CNN architectures, as it becomes highly prone to becoming trapped in local optima. Additionally, the learning mechanisms of the original TLBO, as defined in Equation (3), fail to accurately emulate real classroom learning dynamics, as they overlook individual efforts and adaptive interactions among peer learners for more effective knowledge improvement. This inaccurate emulation of real classroom learning dynamics limits the balance between exploration and exploitation searches in the original TLBO, thereby constraining its effectiveness in searching for promising CNN architectures during the learner phase. To address these limitations, the modified learner phase of MTLBORKS-CNN introduces self-learning and adaptive peer learning schemes to achieve a more accurate emulation of real classroom learning dynamics. The incorporation of both self-learning and adaptive peer learning into the modified learner phase aims to strike a delicate balance between exploration and exploitation searches in MTLBORKS-CNN, ultimately enhancing its efficiency in optimizing CNN architectures during the modified learner phase.
3.6.1. Self-Learning Scheme
The self-learning scheme introduced in the modified learner phase of MTLBORKS-CNN aims to simulate learners’ preference for improving their knowledge in specific subjects through individual efforts using a probabilistic mutation operator. This scheme provides learners with an additional momentum through random perturbations, helping them break away from local optima.
After completing the modified teacher phase, each n-th learner is assigned a probability of engaging in the self-learning scheme during the modified learner phase. To facilitate the n-th learner’s self-learning process, a dimension index is randomly generated and used to apply a random perturbation, defined in Equation (11), to the corresponding dimension of the offspring solution generated by the n-th learner. In Equation (11), a random number is drawn from a uniform distribution, and the perturbation is bounded by the upper and lower limits of the selected dimension of the network hyperparameter arrays, as defined in Table 1.
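Given this description, Equation (11) plausibly takes the form of a uniform random reinitialization of the selected dimension (generic notation, illustrative form):

$$X^{new}_{n,d_{rand}} = LB_{d_{rand}} + r\left(UB_{d_{rand}} - LB_{d_{rand}}\right), \qquad r \sim \mathcal{U}(0,1),$$

where $LB_{d_{rand}}$ and $UB_{d_{rand}}$ denote the lower and upper limits of the randomly selected dimension from Table 1.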
Subsequently, a rounding operator is applied to convert the network hyperparameters into integer values, excluding the values that pertain to the selection probabilities of the pooling layers connected to each l-th convolutional layer. The fitness value of the perturbed offspring solution is then evaluated using Algorithm 3. If the perturbed offspring solution constructs a CNN architecture with a lower classification error than the teacher solution, the latter is replaced. Algorithm 7 outlines the pseudocode for the self-learning scheme.
Algorithm 7: Self-Learning Scheme of MTLBORKS-CNN
Inputs: the n-th offspring solution, its fitness value, the teacher solution, the dimension size D, and the search boundaries of all network hyperparameters
01: Generate a random dimension index;
02: Retrieve the values associated with the randomly selected dimension;
03: for d = 1 to D do
04: if the d-th dimension matches the randomly selected dimension then
05: Update this dimension of the offspring solution using Equation (11);
06: if this dimension does not store a pooling-type selection probability do
07: Round this dimension of the offspring solution to an integer value;
08: end if
09: end if
10: end for
11: Perform fitness evaluation on the updated offspring solution using Algorithm 3;
12: if the updated offspring solution outperforms the teacher then
13: Update the teacher solution and its fitness value;
14: end if
Output: The updated offspring solution, its fitness value, and the teacher solution
3.6.2. Adaptive Peer Learning Scheme
In the modified learner phase of MTLBORKS-CNN, the offspring solutions of learners who do not opt for the self-learning scheme are updated using the proposed adaptive peer learning approach. To initiate this phase, the fitness values of all offspring learners stored in the offspring population are used to arrange them in descending order, from worst to best. The ranking of each n-th offspring learner is then calculated using Equation (12).
Based on Equation (12), the ranking value assigned to each offspring learner reflects its relative fitness within the offspring population. With these assigned ranking values, the probability of interaction with peers for each n-th offspring learner is calculated using Equation (13), such that offspring learners with worse fitness receive higher interaction probabilities.
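One illustrative instantiation consistent with this description (the exact Equations (12) and (13) may differ) assigns larger ranks, and hence larger interaction probabilities, to offspring learners with worse fitness:

$$R_n = N - n + 1, \qquad P_n = \frac{R_n}{N},$$

where the offspring learners are indexed from worst (n = 1) to best (n = N) after sorting, so the worst learner interacts with peers in every dimension while the best learner rarely does.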
For each d-th dimension of the n-th offspring learner, a random number is generated from a uniform distribution and compared with the learner’s interaction probability. If the random number is smaller than the interaction probability, three mutually distinct peer offspring learners are randomly selected from the offspring population, and the network hyperparameters stored in their respective d-th dimensions are used to update the corresponding dimension of the n-th offspring learner. Otherwise, the original value stored in that dimension is retained. A peer learning factor is also generated randomly from a uniform distribution for each n-th offspring learner. The adaptive peer learning scheme then updates each dimension of the n-th offspring learner according to Equation (14).
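Based on this description, Equation (14) resembles a differential-evolution-style recombination of the three selected peers, applied dimension-wise with probability $P_n$ (generic notation, illustrative form):

$$X^{new}_{n,d} = \begin{cases} X^{new}_{p_1,d} + \lambda_n\left(X^{new}_{p_2,d} - X^{new}_{p_3,d}\right), & \text{if } r_d < P_n,\\ X^{new}_{n,d}, & \text{otherwise}, \end{cases}$$

where $p_1 \neq p_2 \neq p_3 \neq n$ are the indices of the selected peers, $\lambda_n$ is the peer learning factor drawn from a uniform distribution, and $r_d \sim \mathcal{U}(0,1)$.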
After obtaining the updated offspring solution from Equation (14), boundary checking is performed to ensure that all network hyperparameters lie within their search boundaries, as specified in Table 1. A rounding operator is applied to all network hyperparameters encoded in the offspring solution, except for those that signify the pooling-type selection probabilities. Algorithm 3 is then used to evaluate the fitness value of the n-th offspring learner based on its updated position vector. If the CNN architecture represented by the updated offspring solution results in a lower classification error than that of the teacher solution, the teacher is replaced accordingly. Equations (12)–(14) illustrate that an n-th offspring learner with worse fitness is more inclined to learn from multiple peers and update most of its dimensional components, as indicated by its larger interaction probability, and vice versa.
This adaptive peer learning scheme effectively regulates the explorative and exploitative behaviors of MTLBORKS-CNN. It enables learners to adjust their CNN architectures through interactions with peers, or retain their original architectures based on their fitness levels. A detailed description of the adaptive peer learning scheme is presented in Algorithm 8. Algorithm 9 outlines the comprehensive search mechanisms for the modified learner phase of MTLBORKS-CNN, combining both self-learning and adaptive peer learning schemes detailed in Algorithms 7 and 8, respectively.
Algorithm 8: Adaptive Peer Learning of MTLBORKS-CNN
Inputs: the n-th offspring solution, its fitness value, the offspring population, the teacher solution, and the dimension size D
01: Calculate the ranking value and interaction probability of the n-th offspring learner using Equations (12) and (13), respectively;
02: for d = 1 to D do
03: Randomly select three mutually distinct peer offspring learners (all different from the n-th learner) from the offspring population;
04: Update the d-th dimension of the n-th offspring solution using Equation (14) and perform boundary checking;
05: if the d-th dimension does not store a pooling-type selection probability do
06: Round the d-th dimension of the offspring solution to an integer value;
07: end if
end for
08: Perform fitness evaluation on the updated offspring solution using Algorithm 3;
09: if the updated offspring solution outperforms the teacher then
10: Update the teacher solution and its fitness value;
11: end if
Output: The updated offspring solution, its fitness value, and the updated teacher solution
Algorithm 9: Modified Learner Phase of MTLBORKS-CNN
Inputs: the offspring population, the fitness values of all offspring learners, and the teacher solution
01: Sort the offspring learners stored in the offspring population in descending order (i.e., from worst to best);
02: for n = 1 to N do
03: if the n-th offspring learner engages in self-learning according to its assigned probability do
04: Perform the self-learning scheme to update the n-th offspring learner and the teacher solution using Algorithm 7;
05: else
06: Perform the adaptive peer learning scheme to update the n-th offspring learner and the teacher solution using Algorithm 8;
07: end if
08: end for
Output: The updated offspring population and teacher solution
3.7. Dual-Criterion Selection Scheme of MTLBORKS-CNN
The selection scheme plays a crucial role in forming the next generation population in MSAs, using predefined criteria during the optimization process. Traditional selection schemes, such as greedy selection and tournament selection, determine the survival of individual solutions solely based on their fitness values. For example, in the original TLBO, the greedy selection scheme compares the fitness values of current learners with those of new learners generated through teacher and learner phases. Despite their straightforward implementation, these fitness-based selection schemes have drawbacks when dealing with complex optimization problems, such as CNN architecture optimization in this study. In the context of CNN architecture optimization, the excessive reliance on fitness-based selection schemes may hinder the ability of original TLBO to discover novel CNN architectures that might initially show slightly lower performance, but have the potential for long-term usefulness, such as maintaining population diversity. To overcome this limitation, MTLBORKS-CNN introduces a dual-criterion selection approach. This approach constructs the next generation population by considering both the fitness and diversity of learners. It allows learners with slightly lower fitness but greater diversity to move forward to the next iteration, thereby preserving population diversity.
Upon completion of the modified learner phase in MTLBORKS-CNN, a merged population is formed by combining the current population with the updated offspring population, giving a population size of 2N, as represented in Equation (15). Each solution member stored in the merged population may originate from either the current learners or the offspring learners. These solution members are sorted in ascending order based on the classification error associated with the CNN architecture represented by each of them. The Euclidean distance between the n-th solution member of the merged population and the best solution member is then determined using a Euclidean distance operator, as given in Equation (16).
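Equation (16) is therefore a standard Euclidean norm (generic notation assumed):

$$ED_n = \left\lVert X^{merged}_n - X^{merged}_1 \right\rVert_2 = \sqrt{\sum_{d=1}^{D}\left(X^{merged}_{n,d} - X^{merged}_{1,d}\right)^2},$$

where $X^{merged}_1$ denotes the best solution member of the merged population after sorting.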
The dual-criterion selection scheme proposed in MTLBORKS-CNN utilizes the calculated fitness and Euclidean distance values of each of the 2N solutions within the merged population. To construct the next-generation population with a population size of N, the K solution members with the best fitness values are directly selected from the merged population in each iteration of MTLBORKS-CNN, where K is a randomly generated integer. For each candidate for the remaining N − K solution members of the next-generation population, a weighted fitness value is calculated, taking into account both its classification error and its diversity, as defined in Equation (17).
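A plausible form of this weighted fitness, assuming min–max normalization of both criteria (an illustrative form rather than a definitive one), is:

$$F^{W}_n = w\,\frac{F_n - F^{best}}{F^{worst} - F^{best}} + (1 - w)\,\frac{ED^{max} - ED_n}{ED^{max} - ED^{min}},$$

so that smaller values of $F^{W}_n$ favor solutions that combine a low classification error $F_n$ with a large distance $ED_n$ from the best solution member.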
In Equation (17), the weighting factor is randomly generated from a normal distribution and constrained within the range [0.8, 1.0] to ensure that diversity does not dominate the selection process for the remaining population members in the subsequent iteration. Equation (17) also involves the worst and best fitness values observed within the merged population, as well as the largest and smallest Euclidean distances measured from the best solution member.
After the weighted fitness values are calculated, a binary tournament strategy randomly selects two distinct solution members from the merged population. The solution member with the smaller weighted fitness value is then chosen as one of the remaining solution members of the next-generation population, as defined in Equation (18).
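In other words, Equation (18) performs a binary tournament on the weighted fitness values (generic notation assumed):

$$X^{t+1}_n = \begin{cases} X^{merged}_a, & \text{if } F^{W}_a \le F^{W}_b,\\ X^{merged}_b, & \text{otherwise}, \end{cases}$$

where $a \neq b$ are the indices of the two randomly selected solution members of the merged population.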
The remaining solution members for the next iteration are selected from the merged population by repeating the binary tournament strategy described in Equation (18). Algorithm 10 provides the pseudocode for the proposed dual-criterion selection scheme. In contrast to purely fitness-based selection schemes, such as greedy selection and tournament selection, this dual-criterion selection scheme not only retains the K elite solution members in the subsequent iteration, but also effectively maintains population diversity by considering both the fitness and diversity levels of solutions when choosing the remaining solution members of the next-generation population.
Algorithm 10: Dual-Criterion Selection Scheme of MTLBORKS-CNN
Inputs: the current population, the updated offspring population, and the population size N
01: Initialize the next-generation population as an empty set;
02: Construct the merged population using Equation (15);
03: Rearrange the solutions of the merged population in ascending order by referring to their fitness values;
04: for n = 1 to 2N do
05: Calculate the Euclidean distance of the n-th solution from the best solution using Equation (16);
06: end for
07: Randomly generate an integer K;
08: for n = 1 to K do /*Select the first K learners by only considering their fitness*/
09: Add the n-th solution member of the merged population to the next-generation population;
10: end for
11: for n = K + 1 to 2N do
12: Randomly generate the weighting factor from a normal distribution;
13: if the weighting factor is smaller than 0.8 then
14: Set the weighting factor to 0.8;
15: else if the weighting factor is larger than 1.0 then
16: Set the weighting factor to 1.0;
17: end if
18: Calculate the weighted fitness value of the n-th solution member of the merged population using Equation (17);
19: end for
20: for n = K + 1 to N do /*Select the remaining learners by considering their fitness and diversity*/
21: Randomly select two distinct solution members from the merged population;
22: Determine the winning solution member using Equation (18);
23: Add the winning solution member to the next-generation population;
24: end for
Output: The next-generation population
3.8. Complete Mechanisms of MTLBORKS-CNN
Algorithm 11 delineates the complete mechanisms of MTLBORKS-CNN, designed to search for an optimized CNN architecture tailored to a specific dataset. Throughout this algorithm, a counter variable t keeps track of the current iteration, and a predefined maximum iteration number acts as the termination criterion for MTLBORKS-CNN.
The procedure commences by loading the training and validation datasets from the designated directory. Following this, the MTLBORKS-CNN population is initialized using Algorithm 2. Subsequently, Algorithms 6 and 9, which describe the modified teacher phase and modified learner phase of MTLBORKS-CNN, respectively, are iteratively employed to generate an offspring population comprising diverse new CNN architectures. The next-generation population is then derived from the merged population using the dual-criterion selection scheme proposed in Algorithm 10. The optimization process for the CNN architecture, facilitated by the proposed MTLBORKS-CNN method, concludes when the termination condition is met.
As explained in the previous subsection, the fitness evaluation process delineated in Algorithm 3 employs a reduced epoch number for training each CNN architecture generated by every MTLBORKS-CNN learner. While this approach effectively reduces computational demands, the resulting evaluations are less precise. Therefore, after the termination of MTLBORKS-CNN, the CNN architecture constructed from the teacher solution undergoes a full training process. This full training process shares the same mechanisms as Algorithm 3, but with a larger number of epochs. Once the full training is completed, all relevant details of the fully trained CNN model, including its architecture, classification error, and number of trainable parameters, are returned.
Algorithm 11: Proposed MTLBORKS-CNN
Inputs: the training and validation datasets, the population size N, the maximum iteration number, the search ranges of all network hyperparameters, and the remaining control parameters
01: Load the training and validation datasets from the directory;
02: Initialize the population using Algorithm 2;
03: Initialize the iteration counter t;
04: while the termination condition is not met do
05: Generate the offspring population and update the teacher solution using the modified teacher phase (Algorithm 6);
06: Update the offspring population and the teacher solution using the modified learner phase (Algorithm 9);
07: Determine the next-generation population using the dual-criterion selection scheme (Algorithm 10);
08: Replace the current population with the next-generation population;
09: Increment the iteration counter t;
10: end while
11: Fully train the CNN architecture constructed from the teacher solution with a larger epoch number (Algorithm 3);
Output: An optimal CNN architecture constructed from the teacher solution with all related network information
3.9. Complexity Analysis of MTLBORKS
The time complexity of the proposed MTLBORKS, compared to the original TLBO, is evaluated using Big O analysis. Since both TLBO and MTLBORKS are used to solve the CNN architecture optimization problem in the current study, the time complexity of fitness evaluation is the same for both algorithms. The original TLBO incurs a time complexity of O(ND) in each of its population initialization, teacher phase, and learner phase, where N is the population size and D is the problem dimensionality. Thus, the time complexity of the original TLBO in each iteration is O(ND) in the worst-case scenario. Similar to the original TLBO, the time complexity of initializing the MTLBORKS population is O(ND). The computational time complexity of MTLBORKS is affected by three key modifications: the modified teacher phase, the modified learner phase, and the dual-criterion selection scheme.
In the modified teacher phase, learners are first rearranged in descending order based on their fitness values, incurring a time complexity of O(N log N) per iteration. Additionally, the unique mean positions, unique social exemplars, and new position vectors are calculated for all N learners using Equations (8), (9), and (10), respectively, and these D-dimensional computations dominate the per-iteration cost of the modified teacher phase, since their growth rate is higher than that of the sorting operation.
During the modified learner phase, learners are again rearranged in descending order based on their fitness values, incurring a time complexity of O(N log N) per iteration. Furthermore, the ranking values and interaction probabilities of all N learners are calculated in each iteration using Equations (12) and (13), respectively, with a time complexity of O(N). Updating each learner with the self-learning scheme or the adaptive peer learning scheme, as described in Equations (11) and (14), respectively, incurs a computational time complexity of O(D) per learner. Thus, the total time complexity of the modified learner phase of MTLBORKS remains O(ND) per iteration.
For the proposed dual-criterion selection scheme, a time complexity of O(N) per iteration is incurred when merging the current and offspring populations using Equation (15). A further O(N log N) per iteration is incurred to sort all 2N learners of the merged population based on their fitness values. Additional time complexities of O(ND) and O(N) are incurred in each iteration to calculate the Euclidean distances and weighted fitness values using Equations (16) and (17), respectively, and constructing the next-generation population through elitism and binary tournaments incurs a further O(N). Thus, the total time complexity of the dual-criterion selection scheme of MTLBORKS is O(ND + N log N) per iteration.
In conclusion, based on the time complexity analyses presented above, it is evident that the overall time complexity of each iteration of MTLBORKS, encompassing the modified teacher phase, modified learner phase, and dual-criterion selection scheme, remains polynomial in the population size N and the problem dimensionality D under the worst-case scenario.