4.1. Hyperparameter Tuning
One of the objectives of this paper is to identify architectures and training parameters that are best suited to the OAM pattern classification task. A variety of hyperparameters affect both how quickly an architecture trains and how well it ultimately performs. To keep the tuning space constrained, the following process is used.
We first select four optimizers and perform a brief case study to determine whether any one outperforms the others. One of the more challenging data sets (TB5) is selected, and the ResNeXt 50 architecture is used. ResNeXt 50 has over 23 million trainable parameters, so the architecture itself is sufficiently complex to pose some challenge during training. After evaluating the results, the best-performing optimizer is used for the remainder of the training.
Once the optimizer is selected, we proceed to find learning rates for each architecture. To do this, we use the previously selected optimizer together with middle-complexity data sets from the underwater (AL4) and free-space (TB5) collections.
Batch size is another tunable hyperparameter. It is kept the same for all architectures except DenseNet, whose memory requirements during training are significantly higher than those of the other architectures and exceed the memory available on our systems.
With the optimizer, learning rates, and batch sizes selected for each architecture and data source, we are ready to compare the architectures against each other. It is important to highlight that finding comparable hyperparameter settings is critical for an objective comparison of architectures. If, for example, the learning rate is set too high or too low on one architecture, it is likely to underperform relative to its peers, not because the architecture is any worse than another, but because the hyperparameters were poorly selected. For this reason, time and effort are expended on identifying settings that allow each architecture to perform at its best.
A variety of parameters influence the training of a CNN. Training a CNN generally happens over many epochs, where an epoch consists of using all of the training data once. Within an epoch, the data set is generally divided into smaller groups called batches. Each time a batch is passed through the CNN during training, the difference between its predictions and the actual classes generates an error. This error is backpropagated through the CNN and used to update the weights; the rate at which updates are made is controlled by the learning rate. Accuracy refers to the percentage of images to which the CNN assigns the correct class.
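To make these terms concrete, the following sketch shows how they fit together in a minimal training loop. PyTorch is assumed here (the paper does not name its framework), and `model` and `loader` are placeholders for a CNN and a batched data set:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr):
    """Minimal training loop: one epoch is one full pass over the data."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for images, labels in loader:               # one batch at a time
            optimizer.zero_grad()
            error = loss_fn(model(images), labels)  # predictions vs. actual classes
            error.backward()                        # backpropagate the error
            optimizer.step()                        # weight update scaled by the learning rate
```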
Architectures are initialized with pre-trained weights from ImageNet training, which serve as the starting point for training on the OAM patterns. As the ImageNet competition has 1000 classes and a fixed input size of 224 × 224, the CNN input and output layers were modified for 128 × 128 input images and output dimensions of 16 classes (underwater) and 32 classes (free-space).
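A minimal sketch of this adaptation, assuming PyTorch/torchvision (the paper does not name its framework):

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet pre-trained weights rather than random initialization.
model = models.resnext50_32x4d(weights=models.ResNeXt50_32X4D_Weights.IMAGENET1K_V1)

# The convolutional layers accept 128 x 128 inputs directly, since the
# ResNet family ends in global average pooling; only the 1000-class
# ImageNet head needs replacing.
num_classes = 16  # underwater; use 32 for the free-space sets
model.fc = nn.Linear(model.fc.in_features, num_classes)
```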
As hyperparameter searches can quickly become computationally expensive, a two-tier approach is taken to narrow the field. The first tier is to identify the optimizer that provides the best results. The second tier is to identify the best learning rate for each architecture using the selected optimizer. For the hyperparameter study, the ResNeXt 50 architecture is used, training is limited to 5 epochs, and the TB5 data set is used. Batch size is set to 32 for DenseNet because of its memory requirements during training; all other architectures use a batch size of 128.
Optimizers are selected from the set of Adam, RMSProp, AdaMax, and Nadam. The ResNeXt 50 architecture is trained on data from the TB5 free-space data set. Learning rates for each optimizer are drawn from a fixed range. A quick random search is first performed to find a few well-performing starting values for each optimizer; those values are then used as best guesses to seed a Hyperopt search, which is allowed to run 25 iterations to identify the best-performing learning rate.
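A sketch of this search using Hyperopt's standard API is shown below. Here `train_and_evaluate` is a hypothetical helper standing in for the 5-epoch training run, and the search-space bounds are assumptions, informed in spirit by the random-search seeds described above:

```python
from hyperopt import fmin, tpe, hp, Trials

def objective(lr):
    """Train ResNeXt 50 on TB5 for 5 epochs; return a loss for Hyperopt to minimize."""
    acc = train_and_evaluate(lr=lr, epochs=5)  # hypothetical helper
    return 1.0 - acc                           # higher accuracy -> lower loss

# Log-uniform prior over learning rates (bounds are assumed, not the paper's).
space = hp.loguniform('lr', -12, -3)  # roughly 6e-6 to 5e-2

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=25, trials=Trials())     # 25 iterations, as in the paper
print(best)  # best-performing learning rate found
```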
Figure 7 plots the accuracies achieved using the four optimizers at learning rates selected by the Hyperopt search. The x-axis shows the learning rates on a log scale, while the y-axis represents the accuracy achieved on the holdout set after 5 epochs of training. It is interesting to note that all of the optimizers achieve similar peak accuracies, and the overall distributions of accuracies are very similar; the primary difference is the offset of the accuracy curves relative to the learning rate. These offsets are primarily due to how the learning rates are scaled within the optimizer algorithms.
Figure 8 shows an accuracy curve for each optimizer over the course of 60 epochs. The TB5 data set is used for training the ResNeXt architecture in this figure. Learning rates for each optimizer are derived from the peaks in Figure 7.
Figure 8 shows similar convergence rates for all of the optimizers. Nadam reaches peak accuracy the quickest, while AdaMax takes a few more epochs to reach the same level. This points to a potential compute-time advantage of Nadam over AdaMax.
As Figure 8 shows similar performance between the optimizers, a simple statistical analysis is employed to select which optimizer to use. Table 2 shows the averages and standard deviations of the accuracies for epochs 20–70 from Figure 8. The results show that Nadam achieves a higher average accuracy and a lower standard deviation than the other optimizers. Consequently, Nadam is selected as the default optimizer for all subsequent training in this paper.
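This selection criterion amounts to a small computation over the accuracy curves; a sketch is given below, where `accuracy_curves` is a hypothetical dictionary of per-epoch holdout accuracies, not the paper's data:

```python
import numpy as np

def summarize(curve, start=20, stop=70):
    """Mean and standard deviation of accuracy over epochs 20-70."""
    window = np.asarray(curve[start:stop])
    return window.mean(), window.std()

for name, curve in accuracy_curves.items():  # e.g. {'Adam': [...], 'Nadam': [...]}
    mean, std = summarize(curve)
    print(f'{name}: mean={mean:.3f}, std={std:.3f}')
# The optimizer with the highest mean and lowest std (here, Nadam) is kept.
```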
With the optimizer selected, the second tier of the hyperparameter search is to identify the best learning rates for each CNN architecture. Given that there are potential differences between the free-space and underwater data sets, this search is applied to each domain to see whether there are any significant differences in learning rate selection.
Figure 9a,b show Hyperopt results for accuracy vs. learning rate. Training is limited to 7 epochs, which is sufficient to generate curves showing the relative training responses at different learning rates. The rising portion of each curve is of primary interest, as it suggests the most efficient region from which to draw learning rates. As the learning rates increase, the curves also reveal the tipping points where training becomes unstable, weights diverge, and learning ceases. The ideal learning rates differ between architectures because their numbers of trainable parameters and the ways information flows through them differ.
For the underwater set, Figure 9a shows very similar curves for the ResNet family of architectures. ShuffleNet shows the most difference, as its learning rates are shifted to the right. Differences between the architectures are more pronounced in the free-space data shown in Figure 9b. Again, the ResNet family of architectures behave similarly over the same range of learning rates, while ShuffleNet's learning rate curve is again shifted far to the right. SqueezeNet appears to learn significantly more slowly than the other architectures; this graph turns out to be indicative of its overall performance later in the paper.
These curves provide an idea of what learning rate to use for training each architecture. Moving from the left side of the curve, learning rates are selected at approximately 95% of the peak value. This allows selection of learning rates with good efficiency that are not so high as to create convergence problems. This learning rate selection approach was established by Ref. [35].
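This selection rule is straightforward to express in code. The sketch below is an assumed implementation of the 95%-of-peak rule as described here, not code taken from Ref. [35]:

```python
import numpy as np

def select_learning_rate(lrs, accs, fraction=0.95):
    """Scan the accuracy-vs-learning-rate curve from low to high rates and
    return the first learning rate reaching `fraction` of the peak accuracy."""
    lrs, accs = np.asarray(lrs), np.asarray(accs)
    order = np.argsort(lrs)                         # sort by learning rate
    lrs, accs = lrs[order], accs[order]
    idx = np.argmax(accs >= fraction * accs.max())  # first index over threshold
    return lrs[idx]
```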
Final learning rates used for each architecture, in the underwater and free-space environments, are derived from these figures; the results are shown in Table 3. These are the learning rates used for the rest of the training in this paper. It is interesting to note that the learning rates for the two data sets are fairly similar to each other.
Using the established learning rates, accuracy curves were generated for each architecture. This provides an initial comparison of how quickly the architectures learn and the levels to which they converge. For this initial study, only one AL and one TB data set were used to provide a high-level comparison of the architectures. The middle set was selected for each environment to provide enough attenuation and turbulence to highlight differences in architecture performance.
Figure 10a shows accuracy-per-training-epoch curves for the underwater AL4 data set for each architecture (at the learning rates indicated in Table 3). It is apparent from the curves that, over time, all of the architectures achieve fairly similar accuracies. While there are some differences in the initial slopes of the accuracy curves, they all converge at a high accuracy.
Figure 10b shows accuracy-per-training-epoch curves for the free-space TB5 data set for each architecture. Most of the architectures settle at approximately the same final accuracy, the lone exception being SqueezeNet.
Aside from SqueezeNet, there does not appear, at this point, to be a great deal of difference from one architecture to another when using the AL4 (underwater) and TB5 (free-space) data sets. With hyperparameters selected, the architectures are ready for training against the data sets.
4.3. Inter-Set Performance Analysis
Section 4.2 shows that most of the architectures perform well when tested with the holdout test sets from the original data set. In real environments, however, trained classifiers are likely to be presented with images distorted by greater turbulence and attenuation than were present in the training set. The tests in this section explore how well the architectures perform when presented with images from outside their training set. For example, how well does an architecture trained on the AL0 data set classify attenuated images from the AL4, AL8, and AL12 holdout sets?
For this analysis, both underwater and free-space data sets are evaluated. For the underwater sets, the AL0- and AL0-4-trained architectures are used; these are evaluated against the AL0, AL4, AL8, and AL12 holdout sets. For the free-space sets, the TB5- and TB5-10-trained architectures are used; these are evaluated with the TB5, TB10, and TB15 holdout sets. In both cases, the results of interest are those for data sets that fall outside the training sets. In the following tables, the results are ordered by ascending accuracy.
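The evaluation itself is a straightforward accuracy computation over each holdout set. A sketch is given below, again assuming PyTorch, with the model and data loaders as hypothetical placeholders:

```python
import torch

@torch.no_grad()
def holdout_accuracy(model, loader, device='cuda'):
    """Fraction of holdout images assigned the correct class."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total

# e.g., evaluate the AL0-trained model on increasingly attenuated holdout sets
for name, loader in [('AL0', al0_loader), ('AL4', al4_loader),
                     ('AL8', al8_loader), ('AL12', al12_loader)]:
    print(name, holdout_accuracy(al0_model, loader))
```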
Table 7 includes results from architectures trained with the AL0 data set. The four columns give accuracies for the AL0, AL4, AL8, and AL12 holdout sets. Looking at performance on the AL4 test set, DenseNet and Wide ResNet give the best accuracies at 63.7% and 81.2%, respectively.
In evaluating architectures trained with the combined AL0 and AL4 data sets, Table 8 shows results ordered by ascending accuracy on the AL8 data set. DenseNet and ShuffleNet take the lead spots with 80.0% and 84.4%, respectively.
Table 9 and Table 10 have similarly organized results for the free-space data sets. ResNet and DenseNet (97.8% and 97.9%) have the best results for TB10 in Table 9, while DenseNet and ResNet (81.8% and 84.8%) take the lead spots in Table 10.