1.1. Background
Recently, deep learning with convolutional neural networks (CNNs) has proven to be highly effective for computer-aided detection (CAD) in medical image analysis. The recent trend in deep network architectures has been towards ever deeper and wider networks with dense connectivity. For example, ViT-G/14 [1] and ViT-MoE-15B [2], both vision transformer architectures, were the top two performers on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) benchmark in 2021 [3]. The ViT-G/14 and ViT-MoE-15B architectures contain 1.843 billion and 14.70 billion parameters, respectively. Furthermore, ViT-G/14 requires 965.3 G floating point operations (FLOPs) per image, making it a computationally costly and power-hungry solution. Perhaps even more significantly, larger networks require more training data to be able to generalize to new data [4]. In many medical image analysis applications, access to properly labeled truth imagery is limited, especially for rare diseases [4]. Data collection and truthing in medical imaging can be cost-intensive and time-consuming, and requires expert analysis. Transfer learning can help to reduce the amount of application-specific data required for training. However, large amounts of data may still be needed to obtain the desired reproducibility and generalizability, even with transfer learning [5,6]. Data augmentation is another approach to dealing with limited training data. However, data augmentation can be very challenging in some medical imaging modalities such as chest radiographs [7].
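To make the transfer learning idea concrete, the following is a minimal NumPy sketch of the common fine-tuning pattern: only a new task-specific head is updated on the small application dataset while the pre-trained feature weights stay frozen. The shapes, the squared-error loss, and the single-layer "backbone" are hypothetical stand-ins for a real medical imaging pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-trained feature extractor weights (kept frozen during fine-tuning);
# the shape is a hypothetical stand-in for a real backbone.
W_feat = rng.normal(size=(16, 8)) * 0.1

# New task-specific head: the only parameters updated on the small dataset.
W_head = np.zeros((8, 2))

x = rng.normal(size=(4, 16))   # 4 toy "images" as flat vectors
y = rng.normal(size=(4, 2))    # toy targets (squared-error loss for brevity)

# One gradient step on the head only.
feats = np.maximum(x @ W_feat, 0)                 # frozen ReLU features
grad_head = feats.T @ (feats @ W_head - y) / len(x)
W_head -= 0.1 * grad_head                          # W_feat is never updated
```

Because gradients are only computed for the head, the number of parameters that must be estimated from the scarce application-specific data is drastically reduced.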
Modular CNN architectures are a promising approach for complex problem-solving that may help address the challenges described above. Some modular methods are inspired by the structure and function of the human brain. Recent findings in neuroscience reveal a high level of modularity and hierarchy in the neural structure of the human brain [8]. In the early 1980s, neuroscientific research characterized the central nervous system (CNS) as a massively parallel and self-organizing modular system [9,10,11]. The CNS consists of distinctive regions, each of which develops as a functional module. The modules are densely connected and interact with one another to accomplish complex perception and cognition tasks efficiently [9]. Traditional CNN architectures often use repeating structures such as layers or groups of layers. However, these networks are generally trained as one monolithic entity, with all learnable parameters updated simultaneously.
1.2. Applications
Malaria is a deadly disease that is considered endemic in many countries around the world [12]. In 2020, the World Health Organization (WHO) reported an estimated 229 million cases of malaria worldwide, causing an estimated 409,000 deaths [13]. Malaria in humans is caused by protozoan parasites of the genus Plasmodium that infect blood cells. These parasites are transmitted by the bite of a female Anopheles mosquito [14]. The mosquito bite injects the Plasmodium into the affected person's blood, and the parasites then pass quickly to the liver to mature and replicate [15]. The most common imaging modality for detecting parasites in a thin blood smear sample is light microscopy [16]. While microscopy is relatively low-cost and widely accessible, diagnostic efficiency depends on the experience of parasitologists [17]. False-positive or false-negative diagnoses can lead to inappropriate or unnecessary prescriptions that can cause side effects in patients. Due to the global shortage of parasitologists in impoverished urban areas, accurately processing the large number of specimens encountered is not always possible [18,19]. Thus, CAD systems can be highly beneficial in this application.
Another disease for which CAD systems can help by providing accurate early detection is diabetic retinopathy (DR). This condition is a common complication of diabetes that affects the retina's small blood vessels, leading to vision deterioration [20]. The research described in [21] studies offloading footwear to prevent complications and lower mortality rates in patients with high-risk diabetic feet. A recent study reported that DR affects the vision of 2.6 million people worldwide [20,22]. Several retinal imaging systems can be utilized to detect indications of diabetic retinopathy, including color fundus photography, fluorescein angiography, B-scan ultrasonography, and optical coherence tomography [23]. The retina images that we use in our study were captured using fundus photography under a variety of imaging conditions. Early-stage diagnosis and grading of DR are integral to preventing blindness. Hence, CAD systems could help save millions of people from potentially preventable vision loss and blindness by improving early detection.
The third and final application we consider here is pulmonary tuberculosis (TB). This disease is a significant public health issue, causing more than 9 million estimated new cases and roughly 1.4 million deaths every year [24]. The detection of TB on chest radiographs (CRs) is essential for diagnosing TB. Chest imaging (e.g., X-ray radiography or computed tomography (CT)) is easy to perform, provides a fast diagnosis, and has a high sensitivity for diagnosing TB infection. Among these modalities, CRs are the fastest and most affordable form of imaging and require significantly less radiation, data storage, and processing time than CT scans [25]. The WHO recommends using CRs to screen and triage people for TB [26]. Note that CRs are among the first examination procedures for suspected lung disease. They are low-cost and widely accessible for health care providers. The use of CAD systems for TB detection can improve radiologist workflow, allowing more cases to be processed with greater accuracy.
1.3. Related Works
Various approaches have been proposed in the literature for medical imaging CAD systems, including those for malaria, DR, and TB [27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43]. The work described in [28] aims to improve malaria parasite detection using tiny red blood smear patches; it utilizes several existing deep convolutional neural networks in place of handcrafted feature extraction. The study claims that preprocessing techniques such as standardization, normalization, and stain normalization do not improve the overall model performance. An effective multi-magnification deep residual neural network (MM-ResNet) has been trained on microscopic phone-image datasets of malaria blood smears obtained from the AI research group at Makerere University. The MM-ResNet end-to-end framework takes three images of different sizes as inputs. It concatenates each ResNet at the second-to-last layer, followed by a final fully connected layer [29]. VGG-16 and VGG-19 have been trained on the National Institutes of Health (NIH) malaria dataset using the hyperparameter tuning techniques described in [30]; the CNN used to automate the screening of malaria in low-resource countries achieves an accuracy of 0.9600. A survey article on image analysis of microscopic blood slides covers many machine learning techniques for malaria detection [31]. In [32], patient information such as nationality, age, gender, body region, and symptomatology was considered as part of feature engineering for malaria detection. Furthermore, the authors examined six machine learning algorithms, including support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), AdaBoost, gradient boosting (GB), and CatBoost, to classify infected and non-infected cells. A fast CNN architecture presented in [33] is used to classify thin blood smear images. This paper studied the performance of transfer learning approaches for various pre-trained CNN architectures, including AlexNet, ResNet, VGG-16, and DenseNet. Furthermore, it studied the performance of a traditional machine learning algorithm using a bag-of-features model with SVM.
Many deep learning methods originally proposed for the ILSVRC [44] have been adapted to medical image applications. Among these are meta-algorithms for DR detection, which combine five CNN architectures into one predictive model [34]. Zhang et al. [35] fine-tuned a ResNet that was pre-trained on the ImageNet dataset. The work described in [36] developed a real-time smartphone app to detect and classify DR using a pre-trained Inception v3 model with a transfer learning technique. A hybrid machine learning technique is introduced in [37] to detect and grade DR severity levels. The study compares simple transfer learning-based approaches using seven pre-trained networks. Another fine-tuned, pre-trained approach for DR detection is presented in [38], which uses a cosine annealing strategy to decay the learning rate. The transfer learning method for TB described in [39] was applied to neutrophil cluster detection. An automatic TB screening system presented in [40] is based on transfer learning from the lower convolutional layers of pre-trained networks. The method in [42] uses a simple segmentation approach to separate the images' foreground and background. The segmented objects are then fed to a trained CNN to classify them into bacilli and non-bacilli. In [43], four state-of-the-art 3D CNN models are used to detect the spatial locations of lesions and classify the candidates into miliary, infiltrative, caseous, tuberculoma, and cavitary types. A multi-strategy fast non-dominated solution ranking algorithm with high robustness is described in [45].
Of particular relevance to our work is the Net2Net method introduced by Chen et al. [46]. Net2Net is an effective technique for transferring the prior knowledge of a trained neural network (the teacher network) to a new, deeper or wider network (the student network). The approach implemented in [46] expands the teacher network into a larger student network by increasing either its width or its depth. The method replicates the teacher network's weights to expand the student network, and the newly added portion is initialized to form an identity mapping. This guarantees that the student model performs just as well as the teacher network at the start of training, and the student model obtains good accuracy much faster than training the larger network from scratch. While Net2Net is a practical and innovative approach that works very well for knowledge transfer, it has a few limitations. For example, the current implementation of Net2Net in [46] uses only two networks. Furthermore, there are restrictions on the networks in terms of kernel sizes, activation functions, and initialization in order to achieve the stated network properties.
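As an illustration of the width-expansion idea, the following NumPy sketch (our own simplified rendering for fully connected layers, not the reference implementation from [46]) widens a hidden layer by duplicating randomly chosen units and dividing their outgoing weights by the replication counts, so the network's function is preserved exactly:

```python
import numpy as np

def net2wider(W1, b1, W2, new_width):
    """Widen a hidden layer from h to new_width units while preserving
    the network's input-output function (Net2WiderNet-style expansion)."""
    h = W1.shape[1]
    assert new_width > h
    # Mapping: the first h units copy themselves; extras duplicate random units.
    mapping = np.concatenate([np.arange(h),
                              np.random.randint(0, h, new_width - h)])
    # Replicate incoming weights and biases for the widened layer.
    W1_new = W1[:, mapping]
    b1_new = b1[mapping]
    # Divide outgoing weights by replication counts so outputs are unchanged.
    counts = np.bincount(mapping, minlength=h)
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, b1_new, W2_new

def forward(x, W1, b1, W2):
    # Toy two-layer network: ReLU hidden layer, linear output.
    return np.maximum(x @ W1 + b1, 0) @ W2

# Verify function preservation on random teacher weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2 = rng.normal(size=(3, 2))
x = rng.normal(size=(5, 4))
W1w, b1w, W2w = net2wider(W1, b1, W2, new_width=6)
assert np.allclose(forward(x, W1, b1, W2), forward(x, W1w, b1w, W2w))
```

Because the widened student computes exactly the same function as the teacher, training can continue from the teacher's accuracy rather than from scratch.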
Another related modular technique that is designed to work with small amounts of training data is presented in [47]. The approach uses entire CNN networks as modules, combining pre-trained modules with untrained modules and allowing the new network to learn discriminative features. The pre-trained models VGG-16 and ResNet were used. The method fine-tunes the VGG-16 model on the Stanford Cars dataset by replacing the last three layers with two consecutive fully connected layers, a softmax, and a loss function. Then, the fixed VGG-16 features are merged with those of a ResNet. The output of both models is fed to two fully connected layers, a softmax, and a loss function.
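The merging scheme described above can be sketched as follows. The two stand-in extractor functions are hypothetical placeholders for the fixed VGG-16 branch and the trainable ResNet branch, and all dimensions are illustrative only; the point is the concatenation of frozen and trainable features ahead of a shared classification head:

```python
import numpy as np

def frozen_module(x):
    # Stands in for the fixed, pre-trained VGG-16 feature branch.
    return np.tanh(x @ np.ones((8, 4)) * 0.1)

def trainable_module(x):
    # Stands in for the ResNet branch whose weights are still being trained.
    return np.maximum(x @ np.full((8, 6), 0.05), 0)

def merged_head(x, W_fc, b_fc):
    # Concatenate both modules' features, then apply a fully connected
    # layer followed by a softmax over the classes.
    feats = np.concatenate([frozen_module(x), trainable_module(x)], axis=1)
    logits = feats @ W_fc + b_fc
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.random.default_rng(1).normal(size=(2, 8))   # 2 toy inputs
W_fc = np.zeros((10, 3))                            # 4 + 6 merged features, 3 classes
b_fc = np.zeros(3)
probs = merged_head(x, W_fc, b_fc)                  # each row sums to 1
```

Only the trainable branch and the head need to learn from the small dataset; the frozen branch contributes generic discriminative features from its original pre-training.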