1. Introduction
Medical microwave radiometry (MWR) is used to obtain the internal tissue temperature of the body [1] by measuring the thermal radiation naturally emitted by the tissues. It is a noninvasive, nonionizing, and cost-effective approach. Owing to the device's accuracy, multiple clinical applications already use the temperature readings and patterns to identify various conditions [1,2,3,4,5,6,7,8,9,10,11]. In this paper, we focus on using MWR to detect breast cancer. This is viable because the growth rate of tumors is correlated with tissue temperature [12,13]. In addition to the thermal information of the tissue, MWR allows us to derive the cancer cells' reproduction rate and mutagenesis risk levels [14].
MWR is a relatively new clinical imaging technique. For it to be adopted successfully, an artificial intelligence (AI) diagnostic tool needs to be developed in parallel. Such a tool alleviates the need to train clinicians to interpret the data and prevents an increase in their workload, while also providing a more accurate prediction. Our first objective is therefore to further improve the diagnostic accuracy of the model. Furthermore, while this research focuses on breast cancer, MWR has clinical applications at different anatomical locations [1,14] and in various conditions. To reduce the development time of models for each of these cases, we explore adapting automated machine learning (AutoML) techniques for MWR data.
There has been previous work with AutoML for MWR using a cascade correlation neural network (CCNN) [15]. Subsequent improvements were made by expanding the pool of layers and activation functions the model could explore [15,16]. Despite these improvements, it was not able to outperform predefined architectures [16]. However, it resulted in a small network, which is desirable when considering edge computing and hardware limitations. Various classification models have been explored in the past, such as deep neural networks, convolutional neural networks, support vector machines, and random forests [15,16]. Additionally, a rule-based classification model was introduced that improved the interpretability of the results [16].
In summary, our contributions to the field of MWR for breast cancer detection are two-fold. First, we evaluate the application of the weight agnostic neural network (WANN) [17] on MWR data and compare it against the CCNN used in previous research [15]. Second, we improve the WANN model for MWR classification: once the topology of the network has been found using WANN, we use the bi-population covariance matrix adaptation evolution strategy (BIPOP-CMA-ES) [18] to find the optimal weight candidates. Combining the WANN and BIPOP-CMA-ES strategies, we obtain state-of-the-art classification performance on MWR breast cancer data. Furthermore, we conclude that, for architectures generated by WANNs on MWR data, a randomized search strategy for optimizing the weights yields better results than a gradient descent method.
2. Methods
2.1. Cascade Correlation Neural Network
The cascade correlation neural network (CCNN) is an early neural architecture search (NAS) technique for supervised tasks [19]. The idea of a CCNN is to start with a minimum-sized network, consisting only of the input and output layers, and add one node at a time until convergence. The steps of the algorithm are as follows [19]:
1. Initialize the network topology with input and output nodes.
2. Create a pool of candidate hidden-layer nodes initialized with different starting weights. Each candidate node takes input from all previous layers, and its output is connected to the output-layer nodes. Each candidate node is trained until convergence.
3. From the pool of candidates, select and add to the network the candidate node that maximizes the magnitude of the correlation between the output and target on the validation set. The input weights of the added hidden-layer node are frozen.
4. If the correlation does not improve, or improves only by a small margin, terminate the algorithm. Otherwise, return to step 2.
An example of the connections created after adding the third hidden layer can be seen in Figure 1.
We used the cross-correlation loss and optimized it using stochastic gradient descent. The optimizer's learning rate was set to . Each node used a sigmoid activation function, with weights sampled from a Gaussian distribution with a mean of 0 and a standard deviation of 0.5, and the bias set to 0. A combination of batch normalization and dropout layers (rate of 0.5) was added after each node. The candidate-node pool size was set to 30. Additionally, we reinitialized the weights of the output layers after each iteration to avoid becoming stuck in a bad local minimum.
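The candidate-training loop at the heart of the CCNN can be sketched as follows. This is a simplified, self-contained illustration, not the implementation used in this work: it grows a single hidden unit via plain covariance-based gradient ascent on a toy problem, and `grow_one_unit` with its hyperparameters is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def candidate_correlation(h, residual):
    """Magnitude of the correlation between a candidate unit's output
    and the network's residual error (the quantity CCNN maximizes)."""
    hc = h - h.mean()
    rc = residual - residual.mean()
    denom = np.sqrt((hc ** 2).sum() * (rc ** 2).sum()) + 1e-12
    return abs((hc * rc).sum()) / denom

def grow_one_unit(X, residual, pool_size=30, steps=200, lr=0.1):
    """Train a pool of candidate hidden units (inputs = all existing
    features) and return the output of the best-correlated one."""
    n, d = X.shape
    best_h, best_c = None, -1.0
    for _ in range(pool_size):
        w = rng.normal(0.0, 0.5, size=d)  # Gaussian init, as in the text
        for _ in range(steps):
            h = sigmoid(X @ w)
            # gradient ascent on the (signed) covariance with the residual
            rc = residual - residual.mean()
            sign = np.sign(((h - h.mean()) * rc).sum())
            grad = X.T @ (sign * rc * h * (1 - h)) / n
            w += lr * grad
        c = candidate_correlation(sigmoid(X @ w), residual)
        if c > best_c:
            best_c, best_h = c, sigmoid(X @ w)
    return best_h, best_c

# toy problem: residual correlated with a nonlinear feature of X
X = rng.normal(size=(100, 4))
residual = np.tanh(X[:, 0] - X[:, 1])
h, corr = grow_one_unit(X, residual, pool_size=5, steps=50)
```

In the full algorithm, the winning unit's input weights are then frozen and its output is appended as a new feature column for subsequent candidates.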
2.2. Weight Agnostic Neural Network
Another NAS method is the weight agnostic neural network (WANN) approach [17]. The main difference between WANN and CCNN is that, during the architecture search, the weights of the model are not trained. Instead, a set of fixed shared weights is used to evaluate the average performance. According to the authors, the idea is to automatically find architectures that have inductive biases and can perform well on their given task without training.
Inspired by genetic evolution, WANN starts with a population of small initial networks. The initial networks consist of the input and output layers; however, the nodes between the layers are sparsely connected. Then, once the population of networks is established, a series of fixed shared-value weights, which we set to [−2, −1.5, −1, −0.5, +0.5, +1, +1.5, +2], is used to evaluate the performance of each topology. The evaluation metric we used is the geometric mean between the predicted and actual values.
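A minimal sketch of the shared-weight evaluation step, assuming a binary classification task and using the geometric mean of per-class recalls as the score. The genome encoding (`connections` as a topologically sorted edge list, tanh activations, last node as output) is an illustrative assumption, not the exact representation used here.

```python
import numpy as np

SHARED_WEIGHTS = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]

def forward(x, connections, n_nodes, n_in, shared_w):
    """Evaluate a WANN topology: every enabled connection carries the
    same shared weight value. `connections` is a list of (src, dst)."""
    act = np.zeros(n_nodes)
    act[:n_in] = x
    for src, dst in connections:          # assumed topologically sorted
        act[dst] += shared_w * np.tanh(act[src])
    return act[-1]                        # last node is the output

def evaluate_topology(connections, n_nodes, n_in, X, y):
    """Score one topology over all shared weights; return the mean and
    best geometric mean of per-class recalls across the weight set."""
    scores = []
    for w in SHARED_WEIGHTS:
        pred = np.array([forward(x, connections, n_nodes, n_in, w) > 0
                         for x in X])
        rec_pos = (pred & (y == 1)).sum() / max((y == 1).sum(), 1)
        rec_neg = (~pred & (y == 0)).sum() / max((y == 0).sum(), 1)
        scores.append(np.sqrt(rec_pos * rec_neg))
    return float(np.mean(scores)), float(np.max(scores))

# toy topology: 2 inputs -> 1 output, sparsely connected (only input 0)
X = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 0, 1, 0])
mean_s, best_s = evaluate_topology([(0, 2)], n_nodes=3, n_in=2, X=X, y=y)
```

Note that a topology solving this toy task scores perfectly only for shared weights of the right sign, which is exactly why both the mean and the best score across the weight set are tracked as ranking criteria.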
The topologies were ranked on the basis of three criteria: the mean performance across all fixed shared weights, the best performance among the fixed shared weights, and the number of connections. Like the authors of WANN, we used the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [20] to sort the network topologies based on these criteria. NSGA-II is a multi-objective sorting genetic algorithm that incorporates elitism and does not require a priori selection of shared parameters. The highest-ranking topologies were selected for the next step using the tournament algorithm [21].
Once the new population was selected, it was varied to generate the next generation. Three mutation operations were used to increase the complexity of the model. First, a node can be inserted between two connected nodes. Second, a new connection can be added between two existing nodes. Finally, the activation function of a node can be changed according to the list in Table 1. This process of evaluating, ranking, and generating a new population was repeated until there was no further improvement.
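The three mutation operations can be sketched on a simple graph genome. The encoding (a `nodes` dict mapping node id to activation, a `conns` edge list) and the activation list are illustrative assumptions; a real implementation would also guard against cycles and duplicate connections.

```python
import random

random.seed(0)

# Illustrative activation pool; the actual list used is given in Table 1.
ACTIVATIONS = ["linear", "sigmoid", "tanh", "relu", "sin", "gauss"]

def insert_node(genome):
    """Split a random existing connection (a, b) into (a, new), (new, b)."""
    a, b = random.choice(genome["conns"])
    new_id = max(genome["nodes"]) + 1
    genome["nodes"][new_id] = random.choice(ACTIVATIONS)
    genome["conns"].remove((a, b))
    genome["conns"] += [(a, new_id), (new_id, b)]

def add_connection(genome):
    """Connect two currently unconnected nodes (no cycle check here)."""
    nodes = list(genome["nodes"])
    pairs = [(a, b) for a in nodes for b in nodes
             if a < b and (a, b) not in genome["conns"]]
    if pairs:
        genome["conns"].append(random.choice(pairs))

def change_activation(genome):
    """Reassign a random node's activation function."""
    n = random.choice(list(genome["nodes"]))
    genome["nodes"][n] = random.choice(ACTIVATIONS)

# minimal genome: two inputs (0, 1) feeding one output node (2)
genome = {"nodes": {0: "linear", 1: "linear", 2: "sigmoid"},
          "conns": [(0, 2), (1, 2)]}
insert_node(genome)
add_connection(genome)
change_activation(genome)
```

Each call strictly grows (or relabels) the genome, which mirrors the complexity-only-increasing nature of the operations described above.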
The hyperparameters for the WANN model are summarized in Table 2. These hyperparameters were defined through extensive experimentation. Specifically, we searched 200 generations, each with a population size of 200. The probability of an initial connection existing between the input and output layers was set to 0.2. For topology variation, we set the likelihood of changing an activation function to 0.5, of adding a node to 0.25, and of creating a new connection to 0.25. Finally, the tournament size was set to 4.
2.3. Weight Agnostic Neural Network BIPOP-CMA-ES
For MWR breast cancer data, we can determine patterns and relationships between the points to identify high-risk patients. With a WANN model, the resulting architecture relies on creating node connections that identify such properties, in addition to new ones; thus, the architecture acts as a prior. However, there are subtler cases in which distinguishing between low- and high-risk patients is harder, specifically when the tumor's growth rate slows down. Handling these cases can be achieved through weight optimization once the optimal architecture has been found.
Based on the research results of WANN, while it performs better than chance in most cases, it is not able to outperform fixed topologies whose parameters have been tuned [17]. A way to circumvent this is to take the best topology found by the WANN model and optimize its parameters via a gradient descent algorithm. However, a network with various activation functions results in a difficult gradient traversal [17].
Thus, a better way of optimizing the weights is through a black-box optimization method such as the CMA-ES algorithm [17,22]. With its randomized search, CMA-ES is well suited to a rugged landscape with many bad local minima, discontinuities, and noise. The steps of the CMA-ES algorithm are summarized in Figure 2.
We used a variant of the CMA-ES algorithm, the bi-population covariance matrix adaptation evolution strategy (BIPOP-CMA-ES) [18]. In our experiments, it performed better than CMA-ES. BIPOP-CMA-ES uses a variable population size: it starts with a small population, which we set to 50, and doubles it after each restart. Additionally, to speed up convergence, we fine-tuned the initial single shared weight of the model by linearly evaluating values between −2 and 2. Finally, the cross-entropy loss was used to find the best fit.
4. Discussion
Similar to the observations of the authors of WANN [17], the generated model we obtained achieves better-than-chance performance from the generated topology of the network alone. We can further improve performance by optimizing the weights. However, despite the improvement obtained with a gradient descent optimizer, the performance remains subpar compared with the other models. We gained a much larger improvement when utilizing a randomized-search evolution strategy.
Our proposed model obtains the best performance on all metrics evaluated and has the fewest connections. The WANN and CCNN models have a small number of connections because they start from a minimum-sized network and gradually expand. In contrast, FC-Evolution, FC-TPE, and FC-DARTS search a predefined architecture space that allows them to start from a large or small network. The trend of these approaches was to opt for larger network sizes early in the architecture search, as larger networks have a higher learning capacity. Additionally, we showed that the predicted results of our network are statistically significant when paired against all other evaluated models. However, while WANN's performance is the worst, it has the highest p-value. This is probably an indication of the importance of the architecture and that the inductive biases are maintained to some degree despite weight training.
A general summary of the advantages and disadvantages of all models is shown in Table 6. Furthermore, there are domain-specific advantages of WANN BIPOP-CMA-ES and an extension of NAS for healthcare applications. First, the generated topology of the network is optimized to have a small number of parameters and sparse connections due to the inclusion of the model size as a minimization objective [17]. This allows the model to be deployed on low-end devices and on existing clinical hardware, which is particularly important for accessibility in low- and middle-income countries. Second, development time decreases, as architecture tuning through manual trial-and-error is reduced. Additionally, the model becomes more accessible to nontechnical experts, such as clinicians, as they do not require vast knowledge of machine learning to develop a model. Without this barrier, they can contribute more effectively to improving the diagnostic tool. Finally, these benefits reduce the complexity and time of adapting MWR to different anatomical locations and pathologies.
The mutation operations of WANN only increase the complexity of the network topology. In future work, we will expand them to include operations such as deleting nodes and deleting connections, allowing more flexibility in defining the architecture. In addition, there is no crossover operation in the WANN model, which may reduce population diversity. Hence, we will explore different crossover operations [29,30,31] and restart techniques to increase model performance [32,33]. Finally, we will investigate cross-trial information sharing, such as including in the mutation pool more complicated building blocks generated from previous trials.
Furthermore, we will look at improving our model by searching for the best loss function [34], utilizing a one-shot learning search [35,36], and conducting a hyperparameter search [37]. We will also compare against additional NAS methods, such as reinforcement learning-based search methods [38,39], ensemble methods [40], and transfer learning [41]. Our main aim is to determine whether a performance improvement can be made without significantly increasing computational complexity.