1. Introduction
In the realm of computer vision, image recognition has perennially stood as a cornerstone of scholarly inquiry. The advent and swift progression of deep learning technologies have catalyzed remarkable strides in image recognition methodologies, particularly those underpinned by Convolutional Neural Networks (CNNs). The watershed moment arrived in 2012 with Krizhevsky et al.'s unveiling of the AlexNet model, which not only clinched remarkable success in the ImageNet competition [1] but also heralded CNNs as the preeminent paradigm for a spectrum of tasks including image classification, object detection, and image segmentation [2,3,4,5].
However, despite the remarkable performance of Convolutional Neural Networks (CNNs) in certain image classification tasks, they are not without limitations. A significant challenge lies in the potential loss of fine details and the disappearance of critical features during the convolution and pooling processes [6,7,8]. The convolution operation, which extracts image features [9,10], and the pooling operation, which reduces computational complexity [11], both inherently risk discarding edge information and high-frequency details. This compromises the accuracy and precision of classification, particularly in tasks demanding fine-grained distinctions, such as the identification of small objects or the analysis of complex scenes containing multiple objects [12]. Consequently, the pursuit of methods to effectively extract discriminative features while preserving intricate image details has emerged as a pivotal research direction for advancing image classification performance.
In recent years, the escalating computational capabilities of deep learning models have spurred the development of novel architectures that preliminarily mitigate the issue of feature vanishing. Among these, U-Net [13] and its enhanced variant, U-Net++ [14], stand out as the most prominent examples. Originally introduced by Ronneberger et al. in 2015 [13], U-Net was initially designed for biomedical image segmentation. Its encoder–decoder architecture [13] adeptly integrates deep feature extraction with precise image reconstruction, enabling exceptional performance across a wide range of image segmentation tasks. Building on this foundation, subsequent research led to the development of U-Net++, which incorporates additional skip connections [15] and a modular design. These innovations significantly improve the model's ability to process fine details and fuse features effectively.
However, the skip connections in U-Net and U-Net++ are confined to linking features at the same hierarchical level. As the encoder performs downsampling, critical features may be lost, hindering their effective propagation to subsequent layers. To address this challenge, we propose an Asymmetric Skip Connection (ASC) mechanism. Unlike traditional skip connections, the ASC mechanism shifts connections forward from the same level, enabling the direct integration of shallow encoder features with deeper, more abstract features.
Related research [16,17,18,19] highlights that homogeneous ensemble learning offers a viable approach to addressing limitations in accuracy for specific tasks [18]. By aggregating the predictions of multiple models of the same type, ensemble learning effectively mitigates the bias and variance inherent in individual models, thereby enhancing overall performance. Compared to heterogeneous ensemble learning, however, homogeneous ensemble learning exhibits a notable limitation: reduced resistance to overfitting [20]. This vulnerability arises because homogeneous ensemble models, sharing identical architectures, are prone to converging on similar local optima during training [21]. To enhance model flexibility and counteract overfitting, we propose a Misaligned Merging Mechanism (MMM). This mechanism introduces misaligned connections between different encoder and decoder models, allowing the outputs of encoders to be merged by decoders other than their own. By disrupting the uniformity of connections, the MMM reduces the likelihood of the base models converging on identical local optima, thereby bolstering the model's resistance to overfitting. Rooted in homogeneous ensemble learning, this approach addresses the low correlation among base networks, effectively harnessing the strengths of diverse base networks while mitigating the information loss caused by excessive downsampling in single encoder–decoder architectures.
The overall architecture of Football Net comprises multiple encoder–decoder networks, with Convolutional Neural Networks (CNNs) serving as the backbone. These base networks, termed Football Edge Modules (FEMs), incorporate the Asymmetric Skip Connection (ASC) mechanism within each FEM to counteract feature vanishing during downsampling. By integrating the principles of ensemble learning, Football Net enhances model generalization through the parallel operation of multiple base learners. Each base network adopts a U-Net-like CNN structure, leveraging batch normalization and the ReLU6 activation function to maintain high accuracy while mitigating overfitting. Furthermore, Football Net introduces a Misaligned Merging Mechanism (MMM), which enables decoders to merge outputs from different FEM encoders rather than relying exclusively on outputs from their corresponding encoders. This innovative mechanism addresses the issue of low correlation among base networks. By effectively leveraging the strengths of diverse base networks, the MMM enhances the model’s resistance to overfitting and reduces information loss caused by excessive downsampling in single encoder–decoder architectures. Together, these design elements enable Football Net to achieve robust performance in complex image processing tasks.
The main contributions of this paper are as follows: (1) To improve processing accuracy through homogeneous ensemble learning while avoiding diminishing returns and enhancing error correction capabilities, a novel image classification model, Football Net, is proposed; (2) To address the issue of low correlation among base networks due to their identical structures in homogeneous ensemble learning, an effective mechanism called the Misaligned Merging Mechanism (MMM) is introduced. This mechanism leverages the advantages of other base networks, resolving the problem of information loss caused by excessive downsampling in a single deep encoder–decoder network; (3) To preserve fine details during feature transmission to deeper layers, a specialized skip connection path, Asymmetric Skip Connections (ASCs), is introduced, allowing shallow features to be directly passed to deeper networks.
2. Related Work
2.1. Shortcut and Skip Connections
Since the introduction of ResNet [15], the design of residuals has profoundly influenced the construction of deep neural networks. The mapping relationship it achieves is shown in Equation (1),

$$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}), \quad (1)$$

where $\mathbf{x}$ is the input and $f(\mathbf{x})$ is the output after passing through a layer. This design allows the input to propagate forward more efficiently through cross-layer data pathways. In expanding ResNet, DenseNet made a slight adjustment to the residual by replacing ResNet's simple addition with concatenation [22]. Consequently, after applying an increasingly complex sequence of functions, the mapping of $\mathbf{x}$ to its expansion follows Equation (2):

$$\mathbf{x} \rightarrow \left[\mathbf{x},\ f_1(\mathbf{x}),\ f_2\!\left(\left[\mathbf{x}, f_1(\mathbf{x})\right]\right),\ f_3\!\left(\left[\mathbf{x}, f_1(\mathbf{x}), f_2\!\left(\left[\mathbf{x}, f_1(\mathbf{x})\right]\right)\right]\right),\ \ldots\right]. \quad (2)$$

Incorporating the expansion into a multilayer perceptron further reduces the number of features.
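To make the contrast concrete, the two mapping styles can be written in a few lines of PyTorch; this is an illustrative sketch of Equations (1) and (2), not code from either original model.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)        # input feature map
g = nn.Conv2d(64, 64, 3, padding=1)   # a stand-in for the layer g(.)

# ResNet-style residual, Equation (1): shapes are preserved, outputs add
y_res = x + g(x)

# DenseNet-style expansion, Equation (2): channels grow by concatenation
y_dense = torch.cat([x, g(x)], dim=1)  # 64 + 64 = 128 channels
```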
In downstream image classification tasks, U-Net has shown impressive results in medical image segmentation. U-Net also introduces a skip connection mechanism [13], allowing the encoder and decoder at the same depth to use skip connections when passing feature maps rather than simply downsampling and then upsampling. This design partially mitigates the problem of feature loss during downsampling. Building on this, U-Net++ introduced additional skip connections [14] similar to DenseNet, further reducing feature loss.
However, the major limitation of previous studies, whether U-Net or U-Net++, is that they restore features only at the same layer level, overlooking feature loss during downsampling, which is the most pronounced limitation and one of the motivations for proposing ResNet. In the encoder–decoder structure of U-Net, the issue of feature loss within the encoder has not been effectively addressed.
To tackle this, we propose a novel skip connection mechanism based on the encoder–decoder architecture. The main idea is to introduce skip connections within the encoder, beyond the conventional feature transmission process. Using cross-layer data pathways, shallow-layer features are passed directly to the deeper network while retaining the shallow-layer pathways that transmit features to the decoder. Visually, this design is asymmetrical compared to U-Net's skip connections; hence, we call it the Asymmetric Skip Connection (ASC). This mechanism improves on previous skip connections by providing data pathways not only from the encoder to the decoder but also cross-layer pathways within the encoder itself. Through these pathways, input feature maps can propagate downward more quickly while retaining their original characteristics. As illustrated in Figure 1, this asymmetric encoder–decoder network design facilitates the fusion of multi-level features, significantly enhancing the model's capacity to mitigate feature vanishing and improve feature representation. Through the ASC, we effectively address the issue of feature loss during downsampling.
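A minimal sketch of this idea in PyTorch follows. It shows one deeper encoder stage receiving a shallow feature through an asymmetric cross-layer pathway; the module name and channel arguments are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASCStage(nn.Module):
    """A deeper encoder stage that also receives a shallow feature
    through an asymmetric, cross-layer pathway inside the encoder."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(deep_ch + shallow_ch, out_ch, 3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, deep, shallow):
        # Pool the high-resolution shallow feature to the deep feature's
        # spatial size, then fuse the two by concatenation.
        shallow = F.adaptive_max_pool2d(shallow, deep.shape[-2:])
        return F.relu6(self.bn(self.conv(torch.cat([deep, shallow], dim=1))))
```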
2.2. Homogeneous Ensemble Learning
In the development of machine learning, various methods have been proposed to improve the predictive or classification accuracy of models, with ensemble learning being one of the classic approaches. Ensemble learning improves performance by constructing and combining multiple learners. Common strategies for combining learners include averaging and voting. The traditional philosophy of ensemble learning often follows the principle that “to achieve a good ensemble, individual learners should have a certain level of accuracy and diversity among them” [23].
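For concreteness, the two combination strategies mentioned above can be sketched as follows; this is a generic illustration of averaging and voting, not the ensemble used later in this paper.

```python
import torch

def average_combine(logits_list):
    # Averaging: mean of the per-learner class probabilities
    probs = [torch.softmax(l, dim=1) for l in logits_list]
    return torch.stack(probs).mean(dim=0)

def vote_combine(logits_list):
    # Voting: majority vote over the per-learner predicted labels
    votes = torch.stack([l.argmax(dim=1) for l in logits_list])  # (k, batch)
    return votes.mode(dim=0).values
```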
Some researchers have introduced diversity among learners by using different base models, a technique known as heterogeneous ensemble learning [23]. Heterogeneous ensemble learning improves the accuracy of base learners to a certain extent. However, using different base learners reduces the correlation among them [18,24]. In other words, the different base learners can only improve the overall model accuracy through competition rather than by learning from each other's strengths. This limits the potential accuracy of heterogeneous ensembles to the accuracy limits of the base learners themselves.
The challenge this model aims to address is how to enable base learners with a certain level of accuracy to learn from each other while maintaining diversity. To facilitate mutual learning among base learners, our model adopts a homogeneous ensemble learning approach, using multiple identical base models, together with a proposed Misaligned Merging Mechanism (MMM), as shown in Figure 2. This mechanism allows each decoder to receive and merge feature map outputs from two base models. By decoding the merged feature maps of the two base models, this approach simultaneously considers both the deep and shallow features learned by the two different encoders. It combines the strengths of different base learners, effectively preventing the degradation in model performance caused by feature loss in a single encoder during downsampling. Moreover, it overcomes the limitation that the accuracy of an ensemble is constrained by the accuracy of its individual base learners.
3. Materials and Methods
In this section, we first introduce the overall architecture of Football Net to provide a quick understanding of how Football Net operates. Next, we present the key component of Football Net, the Football Edge Module, with a focus on the Asymmetric Skip Connections (ASCs) and the Misaligned Merging Mechanism (MMM).
In this section, all convolution operations consist of three steps:
The convolution layer with a kernel size of 3, stride of 1, and padding of 1;
The batch normalization layer, where the computation process is as follows:
- Step 1: Compute the mean and variance of the mini-batch. For input $\mathbf{x} \in \mathbb{R}^{m \times d}$, where $m$ denotes the current batch size and $d$ is the size of the input feature map, the mean $E$ and variance $\sigma^2$ are computed using Equations (3) and (4), respectively:

$$E = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad (3)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - E\right)^2. \quad (4)$$

- Step 2: Normalization process. To accelerate convergence, each element of the input vector is normalized individually. The normalization process is given by Equation (5):

$$\hat{x}_i = \frac{x_i - E}{\sqrt{\sigma^2 + \epsilon}}, \quad (5)$$

where $\epsilon$ is a small constant added for numerical stability. For a mini-batch $\mathbf{x}$, the scale and shift parameters $\gamma$ and $\beta$ are learned during batch normalization, giving the output $y_i = \gamma \hat{x}_i + \beta$.

After batch normalization, the ReLU6 activation function is applied. ReLU6 is a variant of the Rectified Linear Unit (ReLU), characterized by outputting 0 for negative input values and capping the output at 6 for positive inputs. The mathematical expression for ReLU6 is given in Equation (6):

$$\mathrm{ReLU6}(x) = \min\left(\max(0, x),\ 6\right). \quad (6)$$
Additionally, all pooling layers use the max pooling method to reduce the size of feature maps and alleviate position sensitivity, while all upsampling operations use bilinear interpolation to restore the size of the feature maps.
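Taken together, the convolution operation described above can be sketched as a small PyTorch module, with nn.BatchNorm2d standing in for Equations (3)-(5); the channel arguments are placeholders.

```python
import torch.nn as nn

class ConvOp(nn.Module):
    """Conv(3x3, stride 1, padding 1) -> BatchNorm -> ReLU6, as described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),  # Equations (3)-(5): mini-batch statistics,
                                     # normalization, learned gamma and beta
            nn.ReLU6(inplace=True),  # Equation (6): min(max(0, x), 6)
        )

    def forward(self, x):
        return self.block(x)

# Pooling and upsampling as stated in the text:
pool = nn.MaxPool2d(kernel_size=2)
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```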
3.1. Net Architecture Overview
Inspired by the truncated icosahedron, this paper presents a three-dimensional network structure resembling a soccer ball, named Football Net. The overall architecture is illustrated in Figure 3. Football Net is based on ensemble learning, with its base network composed of five modules called Football Edge Modules (FEMs). These five FEMs are arranged in a ring along the vertical axis. Horizontally, the model consists of 8 layers, where layers 1, 2, 7, and 8 each contain five modules, and layers 3, 4, 5, and 6 contain ten modules each. Data pass through the five 8-layer FEMs, generating five outputs. These outputs are merged along the channel dimension and fed into the classifier. The classifier, shown in Figure 4, is constructed from a convolution operation and a set of multilayer perceptrons.
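A hedged sketch of this merge-and-classify stage follows: the five FEM outputs are concatenated along the channel dimension, then passed through one convolution operation and an MLP. The channel counts and hidden size are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

class FootballClassifier(nn.Module):
    def __init__(self, fem_ch, num_classes, hidden=256):
        super().__init__()
        # One convolution over the five merged FEM outputs
        self.conv = nn.Conv2d(5 * fem_ch, fem_ch, kernel_size=3, stride=1, padding=1)
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(hidden),  # infers the flattened size at first call
            nn.ReLU6(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, fem_outputs):        # list of five (B, fem_ch, H, W) maps
        x = torch.cat(fem_outputs, dim=1)  # merge along the channel dimension
        return self.mlp(self.conv(x))
```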
3.2. FEM and ASCs
Since the introduction of U-Net, encoder–decoder networks have achieved state-of-the-art (SOTA) performance in many image classification and semantic segmentation tasks. Compared to traditional CNNs, their distinctive skip connection mechanism enables the transfer of image details between different layers and the combination of features across layers. Inspired by this, the Football Edge Module (FEM) incorporates an encoder–decoder structure and improves the skip connections by turning them into Asymmetric Skip Connections (ASCs). Within each FEM, the first six layers function as the encoder, while the last two layers form the decoder. The ASCs connect layers 1 and 2 of the encoder with layers 8 and 7 of the decoder and internally link layers 3 and 4 with layers 6 and 5 in the encoder. The detailed structure of the Football Edge Module is shown in Figure 5.
During the downsampling process in CNNs, image resolution gradually decreases, leading to the potential loss of high-frequency information and details. To address this issue, the model directly transfers high-resolution features from the encoder to deeper layers, allowing the model to retain and utilize high-frequency features even during downsampling.
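The wiring described above can be summarized in the following structural sketch, written under our reading of the layer assignments (encoder layers 1-6, decoder layers 7-8; ASCs from layers 4 and 3 into layers 5 and 6; conventional skips from layers 2 and 1 into layers 7 and 8). The per-layer stage modules and the exact fan-in of each skip are assumptions, not the paper's published configuration.

```python
import torch
import torch.nn.functional as F

def fuse(a, b):
    """Resize b to a's spatial size, then concatenate along channels."""
    b = F.interpolate(b, size=a.shape[-2:], mode="bilinear", align_corners=False)
    return torch.cat([a, b], dim=1)

def up2(x):
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

def fem_forward(x, enc, dec):
    """One FEM pass; enc/dec map a layer index to its stage module."""
    f1 = enc[1](x)                              # layer 1 (high resolution)
    f2 = enc[2](F.max_pool2d(f1, 2))            # layer 2
    f3 = enc[3](F.max_pool2d(f2, 2))            # layer 3
    f4 = enc[4](F.max_pool2d(f3, 2))            # layer 4
    f5 = enc[5](fuse(F.max_pool2d(f4, 2), f4))  # ASC: layer 4 -> layer 5
    f6 = enc[6](fuse(F.max_pool2d(f5, 2), f3))  # ASC: layer 3 -> layer 6
    f7 = dec[7](fuse(up2(f6), f2))              # skip: layer 2 -> layer 7
    f8 = dec[8](fuse(up2(f7), f1))              # skip: layer 1 -> layer 8
    return f8
```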
3.3. Misaligned Merging Mechanism, MMM
Since the structures of the five FEMs are identical, a Misaligned Merging Mechanism (MMM) is introduced between the encoder and decoder so that each base network can effectively leverage the advantages of the other FEMs and work together with them, while preventing the loss of detail in any single FEM from reducing its contribution to the overall model. The final layer of the encoder (layer 6) has two outputs, $A$ and $B$, which are passed to the upsampling modules of layer 7 in two different decoders. The first layer of the decoder (layer 7) has two inputs, $x_{i,7}^{1}$ and $x_{i,7}^{2}$, which come from the outputs of two different encoders. The detailed processes of the overall network and an individual FEM are illustrated in Figure 3 and Figure 5, respectively. The feature propagation process is described in Equation (7). Here, $x_{i,k}^{j}$ represents the $j$-th input of the $k$-th layer in the $i$-th FEM. When $k \neq 7$, since both outputs of each FEM are identical, the index $j$ is omitted.
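Under these definitions, the misaligned routing can be sketched as below. This is our hedged reading of Equation (7): decoder $i$ merges the output $A$ of its own encoder with the output $B$ of the neighbouring encoder in the ring, consistent with the ring arrangement in Figure 3; the choice of neighbour index is an assumption.

```python
import torch

def misaligned_merge(layer6_outputs):
    """layer6_outputs: the five layer-6 feature maps, one per FEM.
    Decoder i receives its own encoder's output (A) merged with the
    ring-neighbouring encoder's output (B)."""
    k = len(layer6_outputs)  # 5 FEMs arranged in a ring
    return [torch.cat([layer6_outputs[i], layer6_outputs[(i + 1) % k]], dim=1)
            for i in range(k)]
```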
4. Experiments
All experiments in this paper were conducted on a personal computer, with the main hardware consisting of a GeForce RTX 4080 GPU with 16 GB of memory (NVIDIA, Santa Clara, CA, USA), an Intel Core i9-14900K @ 3.20 GHz (Intel, Santa Clara, CA, USA), and 32 GB of RAM. The experiments were performed on the Python v3.9 programming platform, with the PyTorch v2.2.2 framework and CUDA v12.4. Due to GPU memory limitations, all datasets with image resolutions higher than 64 × 64 pixels were resized to a square with a maximum side length of 64 pixels.
4.1. Dataset and Data Augmentation
In this experiment, to evaluate the model's performance on both small and large training sets, three datasets were selected: CIFAR-10 [25], ImageNet-100, and ImageNet-1k [26]. For the CIFAR-10 dataset, which consists of 32-pixel square images, the original image size was maintained. However, due to the high resolution of the ImageNet-100 and ImageNet-1k datasets, all their images were uniformly resized to 64-pixel squares. Data augmentation techniques, including random horizontal flipping and random scale cropping, were applied to these datasets; it has been verified that scaling images has only a slight impact on prediction accuracy [27,28,29]. Additionally, random rotation was added as an augmentation method for all datasets, with the rotation angle bounded by a fixed maximum.
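A minimal torchvision pipeline matching the augmentations listed above might look as follows; the rotation bound is a placeholder, since the exact angle is not preserved in the text, and the 64-pixel crop applies to the ImageNet variants (CIFAR-10 keeps its 32-pixel size).

```python
from torchvision import transforms

MAX_ANGLE = 15  # placeholder: the exact rotation bound is not preserved here

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),       # random horizontal flipping
    transforms.RandomResizedCrop(64),        # random scale cropping to 64 x 64
    transforms.RandomRotation(MAX_ANGLE),    # random rotation up to the bound
    transforms.ToTensor(),
])
```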
4.2. Hyperparameter and Evaluation Metrics
Before the main experiments, the MNIST and FashionMNIST datasets were used as benchmarks to initially assess the range of hyperparameter settings. To validate the experiments, this paper compares the results of the proposed model with those of other models, including Vision Transformer (ViT) and its variant RepViT, U-Net and its variant U-Net++, as well as YOLOv8, under the same experimental setup.
All of the experiments used the SGD optimizer, which was found to outperform the Adam optimizer for this model, with a maximum of 70 training epochs. Due to hardware limitations, the maximum batch size for 64-pixel images was 20. Five different batch sizes were tested (4, 8, 16, 18, and 20), with 16 providing the highest efficiency.
For the learning rate, 0.001 and 0.01 were tested. Although 0.01 led to faster convergence, 0.001 sometimes resulted in the model getting stuck in local optima and failing to reach satisfactory accuracy, so an intermediate learning rate of 0.003 was ultimately adopted.
In terms of training epochs, based on the benchmark results, the CIFAR-10 dataset typically converged within 10 epochs, so a slightly higher range of 10 to 20 epochs was used. For the ImageNet-100 dataset, 30 epochs were sufficient to achieve good performance and avoid overfitting. The ImageNet-1k dataset, with its 1000 classes, posed potential memory issues due to large weight matrices, leading to a reduction in the batch size to 8.
After multiple rounds of training, the model with the highest accuracy was saved for testing on the test set. The evaluation of the experimental results was based on two metrics: accuracy and precision.
Accuracy: Accuracy is the proportion of correctly classified samples to the total number of samples, used to evaluate the overall classification capability of the model. The formula is

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N},$$

where $N_{\text{correct}}$ represents the number of correctly predicted samples and $N$ represents the total number of samples. Since the dataset in this experiment does not exhibit class imbalance, this metric can effectively evaluate the experimental results. Additionally, the Top-5 accuracy metric is introduced, which measures the proportion of samples where the correct class is within the top 5 predicted classes. Its formula is

$$\text{Top-5 Accuracy} = \frac{N_{\text{top5}}}{N},$$

where $N_{\text{top5}}$ represents the number of samples where the true class is among the top five predicted probabilities.
Precision: In multiclass classification tasks, precision refers to the proportion of samples predicted as a certain class that are actually of that class. Precision in multiclass problems is usually computed using either the macro average or the micro average. For class-balanced datasets, macro average precision is more suitable, and its formula is

$$\text{Precision}_{\text{macro}} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i},$$

where $TP_i$ and $FP_i$ represent the number of true positives and false positives for class $i$, respectively, and $N$ is the total number of classes.
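All three metrics can be computed directly from model outputs; the following is a minimal sketch consistent with the definitions above (the function names are ours).

```python
import torch

def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true class is among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices                 # (N, k)
    return (topk == labels.unsqueeze(1)).any(dim=1).float().mean().item()

def macro_precision(preds, labels, num_classes):
    """Mean over classes of TP_i / (TP_i + FP_i)."""
    per_class = []
    for i in range(num_classes):
        predicted_i = preds == i
        tp = (predicted_i & (labels == i)).sum().item()
        fp = predicted_i.sum().item() - tp
        per_class.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)
    return sum(per_class) / num_classes
```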
5. Results
In the comparative experiments, four other vision models—ViT, RepViT, U-Net, and U-Net++—were primarily used for image classification comparisons. Additionally, YOLOv8 [30] was included as a benchmark to further evaluate the performance of Football Net. Since Football Net was built from scratch, to ensure a fair comparison, the other models were also constructed from scratch, without using pretrained weights. As these networks are primarily designed for image segmentation, global pooling and fully connected layers were introduced in the final decoder stage to enable them to perform image classification. Tables 1 and 2 present the averaged results from 10 trials for each model.
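A hedged sketch of that adaptation: a global-pooling and fully connected head attached to a segmentation backbone's final decoder feature map (the names and channel arguments are illustrative).

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Turns a segmentation decoder's final feature map into class logits."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, feat):                  # feat: (B, in_ch, H, W)
        return self.fc(self.pool(feat).flatten(1))
```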
As shown in Table 1, experiments conducted on the CIFAR-10 dataset demonstrate that Football Net achieves a Top-5 accuracy of 98.7% and a Top-1 accuracy of 82.1%. In comparison to other models, Football Net exhibits competitive performance.
Experiments were also conducted on the ImageNet-100 and ImageNet-1k datasets, with results shown in Table 2. On ImageNet-100, Football Net achieved a Top-5 accuracy of 98.8% and a Top-1 accuracy of 92.2%. For datasets with fewer categories, accuracy tends to be more sensitive to dataset size. For large datasets like ImageNet-1k, training is often carried out using pretrained models followed by fine-tuning; as a result, when trained from scratch, the accuracy on ImageNet-1k is lower than on the ImageNet-100 and CIFAR-10 datasets.
During the 10 experiments, the training records of the different models on the various datasets were saved, resulting in three sets of comparison plots, shown in Figure 6, Figure 7 and Figure 8. In the experiments conducted on CIFAR-10, Football Net exhibited a faster convergence rate than the other models, while its accuracy remained comparable; its accuracy converges around the 15th epoch. On the ImageNet-100 dataset, the accuracy curve of Football Net converged progressively over the epochs, successively surpassing U-Net, RepViT, YOLOv8, and U-Net++.
The Football Net model showed performance comparable to the other models in terms of accuracy and macro-average precision, and even outperformed them on certain metrics. On the CIFAR-10 dataset, Football Net demonstrated a significant advantage in Top-5 accuracy. On the ImageNet-100 dataset, since the images were downscaled to 64 pixels and the models were trained from scratch, none of the models performed exceptionally well, but Football Net still outperformed the others on most evaluation metrics. The lower overall scores on ImageNet-100 were primarily due to the compressed images and the resulting loss of detail, which was expected.
The ViT model, besides being more affected by image compression, also exhibits a higher dependency on large datasets than the other models [31]. As a result, ViT's performance on the relatively simpler CIFAR-10 and ImageNet-100 datasets was less impressive. In both comparative experiments, Football Net showed strong performance, with accuracy comparable to the other models.
In terms of accuracy, Football Net is not outstanding compared with the other models, but it is highly competitive, ranking in the top three among the compared models, and it achieves the best precision on ImageNet-100. Another advantage of Football Net is its robustness: across multiple training runs, its accuracy, Top-5 accuracy, and precision show the lowest variance (although its accuracy on ImageNet-1k fails to exceed that of ViT). We attribute this exception to ViT being a strongly data-driven model; ImageNet-1k is large enough that it is reasonable for ViT to perform better on such large datasets.
6. Discussion
This paper proposes an image classification model, Football Net, inspired by the geometric structure of a truncated icosahedron. Its unique geometric design allows the model to extract features and classify more efficiently. Unlike traditional Convolutional Neural Networks (CNNs), Football Net’s modular structure reduces the loss of detail during convolution and pooling processes, thus enhancing its ability to handle complex images. This structural design is not only novel but also provides a more logical and interpretable framework for deep learning models through its two-dimensional expansion approach.
Additionally, this paper introduces an innovative combination of the Misaligned Merging Mechanism (MMM) with skip connections, addressing the common issue of detail loss when extracting deep features in CNNs. By aggregating features from different convolutional layers and directly passing early high-resolution information during the decoding process, the model retains efficient feature extraction while avoiding detail loss, thereby exhibiting stronger robustness and accuracy in image classification tasks.
Furthermore, the paper brings innovation to ensemble learning strategies by employing a multi-layer homogeneous ensemble learning approach. Five base learners process input data in parallel, and when combined with skip connections and the MMM, this significantly enhances the model’s generalization ability. This ensemble structure not only improves accuracy but also reduces the risk of overfitting during training, due to the independence of the base learners, offering a new perspective for CNN applications in image classification tasks.
6.1. Effectiveness and Applicability
These innovations enable Football Net to achieve competitive performance on several common datasets, such as CIFAR-10 and ImageNet-100, and it also demonstrates considerable advantages on more complex datasets like ImageNet-1k. This demonstrates the effectiveness and broad applicability of the model design.
6.2. Limitations
The limitations of this model lie in the fact that, as an image classification model, its parameter count varies with the input size and the number of classes. For image processing tasks with higher resolutions, more parameters may be required.
From the experimental results, it is evident that the model exhibits improved robustness compared to the other five models, which can be attributed to the use of ensemble learning. However, in terms of accuracy, the performance does not surpass that of ViT, indicating that there is still room for improvement in enhancing accuracy.