1. Introduction
In the realm of computer vision, image recognition has perennially stood as a cornerstone of scholarly inquiry. The advent and swift progression of deep learning technologies have catalyzed remarkable strides in image recognition methodologies, particularly those underpinned by Convolutional Neural Networks (CNNs). The watershed moment arrived in 2012 with Krizhevsky et al.'s unveiling of the AlexNet model, which not only clinched remarkable success in the ImageNet competition [1] but also heralded CNNs as the preeminent paradigm for a spectrum of tasks including image classification, object detection, and image segmentation [2,3,4,5].
However, despite the remarkable performance of Convolutional Neural Networks (CNNs) in certain image classification tasks, they are not without limitations. A significant challenge lies in the potential loss of fine details and the disappearance of critical features during the convolution and pooling processes [6,7,8]. The convolution operation, which extracts image features [9,10], and the pooling operation, which reduces computational complexity [11], both inherently risk discarding edge information and high-frequency details. This compromises the accuracy and precision of classification, particularly in tasks demanding fine-grained distinctions, such as the identification of small objects or the analysis of complex scenes containing multiple objects [12]. Consequently, the pursuit of methods to effectively extract discriminative features while preserving intricate image details has emerged as a pivotal research direction for advancing image classification performance.
In recent years, the escalating computational capabilities of deep learning models have spurred the development of novel architectures that preliminarily mitigate the issue of feature vanishing. Among these, U-Net [13] and its enhanced variant, U-Net++ [14], stand out as the most prominent examples. Originally introduced by Ronneberger et al. in 2015 [13], U-Net was initially designed for biomedical image segmentation. Its encoder–decoder architecture [13] adeptly integrates deep feature extraction with precise image reconstruction, enabling exceptional performance across a wide range of image segmentation tasks. Building on this foundation, subsequent research led to the development of U-Net++, which incorporates additional skip connections [15] and a modular design. These innovations significantly improve the model's ability to process fine details and fuse features effectively.
However, the skip connections in U-Net and U-Net++ are confined to linking features at the same hierarchical level. As the encoder performs downsampling, critical features may be lost, hindering their effective propagation to subsequent layers. To address this challenge, we propose an Asymmetric Skip Connection (ASC) mechanism. Unlike traditional skip connections, the ASC mechanism shifts connections forward from the same level, enabling the direct integration of shallow encoder features with deeper, more abstract features.
Related research [16,17,18,19] highlights that homogeneous ensemble learning offers a viable approach to addressing limitations in accuracy for specific tasks [18]. By aggregating the predictions of multiple models of the same type, ensemble learning effectively mitigates the bias and variance inherent in individual models, thereby enhancing overall performance. Compared to heterogeneous ensemble learning, however, homogeneous ensemble learning exhibits a notable limitation: reduced resistance to overfitting [20]. This vulnerability arises because homogeneous ensemble models, sharing identical architectures, are prone to converging on similar local optima during training [21]. To enhance model flexibility and counteract overfitting, we propose a Misaligned Merging Mechanism (MMM). This mechanism introduces misaligned connections between different encoder and decoder models, allowing the outputs of encoders to be merged by decoders other than their own. By disrupting the uniformity of connections, the MMM reduces the likelihood of the base models converging on identical local optima, thereby bolstering the model's resistance to overfitting. Rooted in homogeneous ensemble learning, this approach addresses the low correlation among base networks, effectively harnessing the strengths of diverse base networks while mitigating the information loss caused by excessive downsampling in single encoder–decoder architectures.
The overall architecture of Football Net comprises multiple encoder–decoder networks, with Convolutional Neural Networks (CNNs) serving as the backbone. These base networks, termed Football Edge Modules (FEMs), incorporate the Asymmetric Skip Connection (ASC) mechanism within each FEM to counteract feature vanishing during downsampling. By integrating the principles of ensemble learning, Football Net enhances model generalization through the parallel operation of multiple base learners. Each base network adopts a U-Net-like CNN structure, leveraging batch normalization and the ReLU6 activation function to maintain high accuracy while mitigating overfitting. Furthermore, Football Net introduces a Misaligned Merging Mechanism (MMM), which enables decoders to merge outputs from different FEM encoders rather than relying exclusively on outputs from their corresponding encoders. This innovative mechanism addresses the issue of low correlation among base networks. By effectively leveraging the strengths of diverse base networks, the MMM enhances the model’s resistance to overfitting and reduces information loss caused by excessive downsampling in single encoder–decoder architectures. Together, these design elements enable Football Net to achieve robust performance in complex image processing tasks.
The main contributions of this paper are as follows: (1) To improve processing accuracy through homogeneous ensemble learning while avoiding diminishing returns and enhancing error correction capabilities, a novel image classification model, Football Net, is proposed; (2) To address the issue of low correlation among base networks due to their identical structures in homogeneous ensemble learning, an effective mechanism called the Misaligned Merging Mechanism (MMM) is introduced. This mechanism leverages the advantages of other base networks, resolving the problem of information loss caused by excessive downsampling in a single deep encoder–decoder network; (3) To preserve fine details during feature transmission to deeper layers, a specialized skip connection path, Asymmetric Skip Connections (ASCs), is introduced, allowing shallow features to be directly passed to deeper networks.
2. Related Work
2.1. Shortcut and Skip Connections
Since the introduction of ResNet [15], the design of residuals has profoundly influenced the construction of deep neural networks. The mapping relationship it achieves is shown in Equation (1),

$$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}), \quad (1)$$

where $\mathbf{x}$ is the input and $f(\mathbf{x})$ is the output after passing through a layer. This design allows the input to propagate forward more efficiently through cross-layer data pathways. In expanding ResNet, DenseNet made a slight adjustment to the residual by replacing ResNet's simple addition with concatenation [22]. Consequently, after applying an increasingly complex sequence of functions, the mapping of $\mathbf{x}$ to its expansion follows Equation (2):

$$\mathbf{x} \rightarrow \left[\mathbf{x},\ f_1(\mathbf{x}),\ f_2\!\left(\left[\mathbf{x}, f_1(\mathbf{x})\right]\right),\ f_3\!\left(\left[\mathbf{x}, f_1(\mathbf{x}), f_2\!\left(\left[\mathbf{x}, f_1(\mathbf{x})\right]\right)\right]\right),\ \ldots\right]. \quad (2)$$

Incorporating the expansion into a multilayer perceptron further reduces the number of features.
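To make the contrast concrete, the two mapping styles can be written in a few lines of PyTorch; this is an illustrative sketch of Equations (1) and (2), not code from either original model.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)        # input feature map
g = nn.Conv2d(64, 64, 3, padding=1)   # a stand-in for the layer g(.)

# ResNet-style residual, Equation (1): shapes are preserved, outputs add
y_res = x + g(x)

# DenseNet-style expansion, Equation (2): channels grow by concatenation
y_dense = torch.cat([x, g(x)], dim=1)  # 64 + 64 = 128 channels
```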
In downstream image classification tasks, U-Net has shown impressive results in medical image segmentation. U-Net also introduces a skip connection mechanism [13], allowing the encoder and decoder at the same depth to use skip connections when passing feature maps rather than simply downsampling and then upsampling. This design partially mitigates the problem of feature loss during downsampling. Building on this, U-Net++ introduced additional skip connections [14] similar to DenseNet, further reducing feature loss.
However, the major limitation of previous studies, whether U-Net or U-Net++, is that they restore features only at the same layer level, overlooking feature loss during downsampling, which is the most pronounced limitation and one of the motivations for proposing ResNet. In the encoder–decoder structure of U-Net, the issue of feature loss within the encoder has not been effectively addressed.
To tackle this, we propose a novel skip connection mechanism based on the encoder–decoder architecture. The main idea is to introduce skip connections within the encoder, beyond the conventional feature transmission process. Using cross-layer data pathways, shallow-layer features are passed directly to the deeper network while retaining the shallow-layer pathways that transmit features to the decoder. Visually, this design is asymmetrical compared to U-Net's skip connections; hence, we call it the Asymmetric Skip Connection (ASC). This mechanism improves on previous skip connections by providing data pathways not only from the encoder to the decoder but also cross-layer pathways within the encoder itself. Through these pathways, input feature maps can propagate downward more quickly while retaining their original characteristics. As illustrated in Figure 1, this asymmetric encoder–decoder network design facilitates the fusion of multi-level features, significantly enhancing the model's capacity to mitigate feature vanishing and improve feature representation. Through the ASC, we effectively address the issue of feature loss during downsampling.
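A minimal sketch of this idea in PyTorch follows. It shows one deeper encoder stage receiving a shallow feature through an asymmetric cross-layer pathway; the module name and channel arguments are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASCStage(nn.Module):
    """A deeper encoder stage that also receives a shallow feature
    through an asymmetric, cross-layer pathway inside the encoder."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(deep_ch + shallow_ch, out_ch, 3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, deep, shallow):
        # Pool the high-resolution shallow feature to the deep feature's
        # spatial size, then fuse the two by concatenation.
        shallow = F.adaptive_max_pool2d(shallow, deep.shape[-2:])
        return F.relu6(self.bn(self.conv(torch.cat([deep, shallow], dim=1))))
```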
2.2. Homogeneous Ensemble Learning
In the development of machine learning, various methods have been proposed to improve the predictive or classification accuracy of models, with ensemble learning being one of the classic approaches. Ensemble learning improves performance by constructing and combining multiple learners. Common strategies for combining learners include averaging and voting. The traditional philosophy of ensemble learning often follows the principle that “to achieve a good ensemble, individual learners should have a certain level of accuracy and diversity among them” [23].
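For concreteness, the two combination strategies mentioned above can be sketched as follows; this is a generic illustration of averaging and voting, not the ensemble used later in this paper.

```python
import torch

def average_combine(logits_list):
    # Averaging: mean of the per-learner class probabilities
    probs = [torch.softmax(l, dim=1) for l in logits_list]
    return torch.stack(probs).mean(dim=0)

def vote_combine(logits_list):
    # Voting: majority vote over the per-learner predicted labels
    votes = torch.stack([l.argmax(dim=1) for l in logits_list])  # (k, batch)
    return votes.mode(dim=0).values
```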
Some researchers have introduced diversity among learners by using different base models, a technique known as heterogeneous ensemble learning [23]. Heterogeneous ensemble learning improves the accuracy of base learners to a certain extent. However, using different base learners reduces the correlation among them [18,24]. In other words, the different base learners can only improve the overall model accuracy through competition rather than by learning from each other's strengths. This limits the potential accuracy of heterogeneous ensembles to the accuracy limits of the base learners themselves.
The challenge this model aims to address is how to enable base learners with a certain level of accuracy to learn from each other while maintaining diversity. To facilitate mutual learning among base learners, our model adopts a homogeneous ensemble learning approach, using multiple identical base models, together with a proposed Misaligned Merging Mechanism (MMM), as shown in Figure 2. This mechanism allows each decoder to receive and merge feature map outputs from two base models. By decoding the merged feature maps of the two base models, this approach simultaneously considers both the deep and shallow features learned by the two different encoders. It combines the strengths of different base learners, effectively preventing the degradation in model performance caused by feature loss in a single encoder during downsampling. Moreover, it overcomes the limitation that the accuracy of an ensemble is constrained by the accuracy of its individual base learners.
3. Materials and Methods
In this section, we first introduce the overall architecture of Football Net to provide a quick understanding of how Football Net operates. Next, we present the key component of Football Net, the Football Edge Module, with a focus on the Asymmetric Skip Connections (ASCs) and the Misaligned Merging Mechanism (MMM).
In this section, all convolution operations consist of three steps:
The convolution layer with a kernel size of 3, stride of 1, and padding of 1;
The batch normalization layer, where the computation process is as follows:
- Step 1: Compute the mean and variance of the mini-batch. For input $\mathbf{x} \in \mathbb{R}^{m \times d}$, where $m$ denotes the current batch size and $d$ is the size of the input feature map, the mean $E$ and variance $\sigma^2$ are computed using Equations (3) and (4), respectively:

$$E = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad (3)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - E\right)^2. \quad (4)$$

- Step 2: Normalization process. To accelerate convergence, each element of the input vector is normalized individually. The normalization process is given by Equation (5):

$$\hat{x}_i = \frac{x_i - E}{\sqrt{\sigma^2 + \epsilon}}, \quad (5)$$

where $\epsilon$ is a small constant added for numerical stability. For a mini-batch $\mathbf{x}$, the scale and shift parameters $\gamma$ and $\beta$ are learned during batch normalization, giving the output $y_i = \gamma \hat{x}_i + \beta$.

After batch normalization, the ReLU6 activation function is applied. ReLU6 is a variant of the Rectified Linear Unit (ReLU), characterized by outputting 0 for negative input values and capping the output at 6 for positive inputs. The mathematical expression for ReLU6 is given in Equation (6):

$$\mathrm{ReLU6}(x) = \min\left(\max(0, x),\ 6\right). \quad (6)$$
Additionally, all pooling layers use the max pooling method to reduce the size of feature maps and alleviate position sensitivity, while all upsampling operations use bilinear interpolation to restore the size of the feature maps.
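Taken together, the convolution operation described above can be sketched as a small PyTorch module, with nn.BatchNorm2d standing in for Equations (3)-(5); the channel arguments are placeholders.

```python
import torch.nn as nn

class ConvOp(nn.Module):
    """Conv(3x3, stride 1, padding 1) -> BatchNorm -> ReLU6, as described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),  # Equations (3)-(5): mini-batch statistics,
                                     # normalization, learned gamma and beta
            nn.ReLU6(inplace=True),  # Equation (6): min(max(0, x), 6)
        )

    def forward(self, x):
        return self.block(x)

# Pooling and upsampling as stated in the text:
pool = nn.MaxPool2d(kernel_size=2)
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
```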
3.1. Net Architecture Overview
Inspired by the truncated icosahedron, this paper presents a three-dimensional network structure resembling a soccer ball, named Football Net. The overall architecture is illustrated in Figure 3. Football Net is based on ensemble learning, with its base network composed of five modules called Football Edge Modules (FEMs). These five FEMs are arranged in a ring along the vertical axis. Horizontally, the model consists of 8 layers, where layers 1, 2, 7, and 8 each contain five modules, and layers 3, 4, 5, and 6 contain ten modules each. Data pass through the five 8-layer FEMs, generating five outputs. These outputs are merged along the channel dimension and fed into the classifier. The classifier, shown in Figure 4, is constructed from a convolution operation and a set of multilayer perceptrons.
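A hedged sketch of this merge-and-classify stage follows: the five FEM outputs are concatenated along the channel dimension, then passed through one convolution operation and an MLP. The channel counts and hidden size are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

class FootballClassifier(nn.Module):
    def __init__(self, fem_ch, num_classes, hidden=256):
        super().__init__()
        # One convolution over the five merged FEM outputs
        self.conv = nn.Conv2d(5 * fem_ch, fem_ch, kernel_size=3, stride=1, padding=1)
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(hidden),  # infers the flattened size at first call
            nn.ReLU6(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, fem_outputs):        # list of five (B, fem_ch, H, W) maps
        x = torch.cat(fem_outputs, dim=1)  # merge along the channel dimension
        return self.mlp(self.conv(x))
```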
3.2. FEM and ASCs
Since the introduction of U-Net, encoder–decoder networks have achieved state-of-the-art (SOTA) performance in many image classification and semantic segmentation tasks. Compared to traditional CNNs, their distinctive skip connection mechanism enables the transfer of image details between different layers and the combination of features across layers. Inspired by this, the Football Edge Module (FEM) incorporates an encoder–decoder structure and improves the skip connections by turning them into Asymmetric Skip Connections (ASCs). Within each FEM, the first six layers function as the encoder, while the last two layers form the decoder. The ASCs connect layers 1 and 2 of the encoder with layers 8 and 7 of the decoder and internally link layers 3 and 4 with layers 6 and 5 in the encoder. The detailed structure of the Football Edge Module is shown in Figure 5.
During the downsampling process in CNNs, image resolution gradually decreases, leading to the potential loss of high-frequency information and details. To address this issue, the model directly transfers high-resolution features from the encoder to deeper layers, allowing the model to retain and utilize high-frequency features even during downsampling.
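The wiring described above can be summarized in the following structural sketch, written under our reading of the layer assignments (encoder layers 1-6, decoder layers 7-8; ASCs from layers 4 and 3 into layers 5 and 6; conventional skips from layers 2 and 1 into layers 7 and 8). The per-layer stage modules and the exact fan-in of each skip are assumptions, not the paper's published configuration.

```python
import torch
import torch.nn.functional as F

def fuse(a, b):
    """Resize b to a's spatial size, then concatenate along channels."""
    b = F.interpolate(b, size=a.shape[-2:], mode="bilinear", align_corners=False)
    return torch.cat([a, b], dim=1)

def up2(x):
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

def fem_forward(x, enc, dec):
    """One FEM pass; enc/dec map a layer index to its stage module."""
    f1 = enc[1](x)                              # layer 1 (high resolution)
    f2 = enc[2](F.max_pool2d(f1, 2))            # layer 2
    f3 = enc[3](F.max_pool2d(f2, 2))            # layer 3
    f4 = enc[4](F.max_pool2d(f3, 2))            # layer 4
    f5 = enc[5](fuse(F.max_pool2d(f4, 2), f4))  # ASC: layer 4 -> layer 5
    f6 = enc[6](fuse(F.max_pool2d(f5, 2), f3))  # ASC: layer 3 -> layer 6
    f7 = dec[7](fuse(up2(f6), f2))              # skip: layer 2 -> layer 7
    f8 = dec[8](fuse(up2(f7), f1))              # skip: layer 1 -> layer 8
    return f8
```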
3.3. Misaligned Merging Mechanism, MMM
Since the structures of the five FEMs are identical, a Misaligned Merging Mechanism (MMM) is introduced between the encoder and decoder so that each base network can effectively leverage the advantages of the other FEMs and work together with them, while preventing the loss of detail in any single FEM from reducing its contribution to the overall model. The final layer of the encoder (layer 6) has two outputs, $A$ and $B$, which are passed to the upsampling modules of layer 7 in two different decoders. The first layer of the decoder (layer 7) has two inputs, $x_{i,7}^{1}$ and $x_{i,7}^{2}$, which come from the outputs of two different encoders. The detailed processes of the overall network and an individual FEM are illustrated in Figure 3 and Figure 5, respectively. The feature propagation process is described in Equation (7). Here, $x_{i,k}^{j}$ represents the $j$-th input of the $k$-th layer in the $i$-th FEM. When $k \neq 7$, since both outputs of each FEM are identical, the index $j$ is omitted.
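Under these definitions, the misaligned routing can be sketched as below. This is our hedged reading of Equation (7): decoder $i$ merges the output $A$ of its own encoder with the output $B$ of the neighbouring encoder in the ring, consistent with the ring arrangement in Figure 3; the choice of neighbour index is an assumption.

```python
import torch

def misaligned_merge(layer6_outputs):
    """layer6_outputs: the five layer-6 feature maps, one per FEM.
    Decoder i receives its own encoder's output (A) merged with the
    ring-neighbouring encoder's output (B)."""
    k = len(layer6_outputs)  # 5 FEMs arranged in a ring
    return [torch.cat([layer6_outputs[i], layer6_outputs[(i + 1) % k]], dim=1)
            for i in range(k)]
```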
4. Experiments
All experiments in this paper were conducted on a personal computer, with the main hardware consisting of a GeForce RTX 4080 GPU with 16 GB of memory (NVIDIA, Santa Clara, CA, USA), an Intel Core i9-14900K @ 3.20 GHz (Intel, Santa Clara, CA, USA), and 32 GB of RAM. The experiments were performed on the Python v3.9 programming platform, with the PyTorch v2.2.2 framework and CUDA v12.4. Due to GPU memory limitations, all datasets with image resolutions higher than 64 × 64 pixels were resized to a square with a maximum side length of 64 pixels.
4.1. Dataset and Data Augmentation
In this experiment, to evaluate the model's performance on both small and large training sets, three datasets were selected: CIFAR-10 [25], ImageNet-100, and ImageNet-1k [26]. For the CIFAR-10 dataset, which consists of 32-pixel square images, the original image size was maintained. However, due to the high resolution of the ImageNet-100 and ImageNet-1k datasets, all their images were uniformly resized to 64-pixel squares. Data augmentation techniques, including random horizontal flipping and random scale cropping, were applied to these datasets; it has been verified that scaling images has only a slight impact on prediction accuracy [27,28,29]. Additionally, random rotation was added as an augmentation method for all datasets, with the rotation angle bounded by a fixed maximum.
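A minimal torchvision pipeline matching the augmentations listed above might look as follows; the rotation bound is a placeholder, since the exact angle is not preserved in the text, and the 64-pixel crop applies to the ImageNet variants (CIFAR-10 keeps its 32-pixel size).

```python
from torchvision import transforms

MAX_ANGLE = 15  # placeholder: the exact rotation bound is not preserved here

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),       # random horizontal flipping
    transforms.RandomResizedCrop(64),        # random scale cropping to 64 x 64
    transforms.RandomRotation(MAX_ANGLE),    # random rotation up to the bound
    transforms.ToTensor(),
])
```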
4.2. Hyperparameter and Evaluation Metrics
Before the main experiments, the MNIST and FashionMNIST datasets were used as benchmarks to initially assess the range of hyperparameter settings. To validate the experiments, this paper compares the results of the proposed model with those of other models, including Vision Transformer (ViT) and its variant RepViT, U-Net and its variant U-Net++, as well as YOLOv8, under the same experimental setup.
All of the experiments used the SGD optimizer, which was found to outperform the Adam optimizer for this model, with a maximum of 70 training epochs. Due to hardware limitations, the maximum batch size for 64-pixel images was 20. Five different batch sizes were tested (4, 8, 16, 18, and 20), with 16 providing the highest efficiency.
For the learning rate, 0.001 and 0.01 were tested. Although 0.01 led to faster convergence, 0.001 sometimes resulted in the model getting stuck in local optima and failing to reach satisfactory accuracy, so an intermediate learning rate of 0.003 was ultimately adopted.
In terms of training epochs, based on the benchmark results, the CIFAR-10 dataset typically converged within 10 epochs, so a slightly higher range of 10 to 20 epochs was used. For the ImageNet-100 dataset, 30 epochs were sufficient to achieve good performance and avoid overfitting. The ImageNet-1k dataset, with its 1000 classes, posed potential memory issues due to large weight matrices, leading to a reduction in the batch size to 8.
After multiple rounds of training, the model with the highest accuracy was saved for testing on the test set. The evaluation of the experimental results was based on two metrics: accuracy and precision.
Accuracy: Accuracy is the proportion of correctly classified samples to the total number of samples, used to evaluate the overall classification capability of the model. The formula is

$$\text{Accuracy} = \frac{N_{\text{correct}}}{N},$$

where $N_{\text{correct}}$ represents the number of correctly predicted samples and $N$ represents the total number of samples. Since the dataset in this experiment does not exhibit class imbalance, this metric can effectively evaluate the experimental results. Additionally, the Top-5 accuracy metric is introduced, which measures the proportion of samples where the correct class is within the top 5 predicted classes. Its formula is

$$\text{Top-5 Accuracy} = \frac{N_{\text{top5}}}{N},$$

where $N_{\text{top5}}$ represents the number of samples where the true class is among the top five predicted probabilities.
Precision: In multiclass classification tasks, precision refers to the proportion of samples predicted as a certain class that are actually of that class. Precision in multiclass problems is usually computed using either the macro average or the micro average. For class-balanced datasets, macro average precision is more suitable, and its formula is

$$\text{Precision}_{\text{macro}} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i},$$

where $TP_i$ and $FP_i$ represent the number of true positives and false positives for class $i$, respectively, and $N$ is the total number of classes.
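All three metrics can be computed directly from model outputs; the following is a minimal sketch consistent with the definitions above (the function names are ours).

```python
import torch

def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true class is among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices                 # (N, k)
    return (topk == labels.unsqueeze(1)).any(dim=1).float().mean().item()

def macro_precision(preds, labels, num_classes):
    """Mean over classes of TP_i / (TP_i + FP_i)."""
    per_class = []
    for i in range(num_classes):
        predicted_i = preds == i
        tp = (predicted_i & (labels == i)).sum().item()
        fp = predicted_i.sum().item() - tp
        per_class.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)
    return sum(per_class) / num_classes
```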
5. Results
In the comparative experiments, four other vision models—ViT, RepViT, U-Net, and U-Net++—were primarily used for image classification comparisons. Additionally, YOLOv8 [30] was included as a benchmark to further evaluate the performance of Football Net. Since Football Net was built from scratch, to ensure a fair comparison, the other models were also constructed from scratch, without using pretrained weights. As these networks are primarily designed for image segmentation, global pooling and fully connected layers were introduced in the final decoder stage to enable them to perform image classification. Tables 1 and 2 present the averaged results from 10 trials for each model.
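A hedged sketch of that adaptation: a global-pooling and fully connected head attached to a segmentation backbone's final decoder feature map (the names and channel arguments are illustrative).

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Turns a segmentation decoder's final feature map into class logits."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, feat):                  # feat: (B, in_ch, H, W)
        return self.fc(self.pool(feat).flatten(1))
```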
As shown in Table 1, experiments conducted on the CIFAR-10 dataset demonstrate that Football Net achieves a Top-5 accuracy of 98.7% and a Top-1 accuracy of 82.1%. In comparison to other models, Football Net exhibits competitive performance.
Experiments were also conducted on the ImageNet-100 and ImageNet-1k datasets, with results shown in Table 2. On ImageNet-100, Football Net achieved a Top-5 accuracy of 98.8% and a Top-1 accuracy of 92.2%. For datasets with fewer categories, accuracy tends to be more sensitive to dataset size. For large datasets like ImageNet-1k, training is often carried out using pretrained models followed by fine-tuning; as a result, when trained from scratch, the accuracy on ImageNet-1k is lower than on the ImageNet-100 and CIFAR-10 datasets.
During the 10 experiments, the training records of the different models on the various datasets were saved, resulting in three sets of comparison plots, shown in Figure 6, Figure 7 and Figure 8. In the experiments conducted on CIFAR-10, Football Net exhibited a faster convergence rate than the other models, while its accuracy remained comparable; its accuracy converges around the 15th epoch. On the ImageNet-100 dataset, the accuracy curve of Football Net converged progressively over the epochs, successively surpassing U-Net, RepViT, YOLOv8, and U-Net++.
The Football Net model showed performance comparable to the other models in terms of accuracy and macro-average precision, and even outperformed them on certain metrics. On the CIFAR-10 dataset, Football Net demonstrated a significant advantage in Top-5 accuracy. On the ImageNet-100 dataset, since the images were downscaled to 64 pixels and the models were trained from scratch, none of the models performed exceptionally well, but Football Net still outperformed the others on most evaluation metrics. The lower overall scores on ImageNet-100 were primarily due to the compressed images and the resulting loss of detail, which was expected.
The ViT model, besides being more affected by image compression, also exhibits a higher dependency on large datasets than the other models [31]. As a result, ViT's performance on the relatively simpler CIFAR-10 and ImageNet-100 datasets was less impressive. In both comparative experiments, Football Net showed strong performance, with accuracy comparable to the other models.
In terms of accuracy, Football Net is not outstanding compared with the other models, but it is highly competitive, ranking in the top three among the compared models, and it achieves the best precision on ImageNet-100. Another advantage of Football Net is its robustness: across multiple training runs, its accuracy, Top-5 accuracy, and precision show the lowest variance (although its accuracy on ImageNet-1k fails to exceed that of ViT). We attribute this exception to ViT being a strongly data-driven model; ImageNet-1k is large enough that it is reasonable for ViT to perform better on such large datasets.
6. Discussion
This paper proposes an image classification model, Football Net, inspired by the geometric structure of a truncated icosahedron. Its unique geometric design allows the model to extract features and classify more efficiently. Unlike traditional Convolutional Neural Networks (CNNs), Football Net’s modular structure reduces the loss of detail during convolution and pooling processes, thus enhancing its ability to handle complex images. This structural design is not only novel but also provides a more logical and interpretable framework for deep learning models through its two-dimensional expansion approach.
Additionally, this paper introduces an innovative combination of the Misaligned Merging Mechanism (MMM) with skip connections, addressing the common issue of detail loss when extracting deep features in CNNs. By aggregating features from different convolutional layers and directly passing early high-resolution information during the decoding process, the model retains efficient feature extraction while avoiding detail loss, thereby exhibiting stronger robustness and accuracy in image classification tasks.
Furthermore, the paper brings innovation to ensemble learning strategies by employing a multi-layer homogeneous ensemble learning approach. Five base learners process input data in parallel, and when combined with skip connections and the MMM, this significantly enhances the model’s generalization ability. This ensemble structure not only improves accuracy but also reduces the risk of overfitting during training, due to the independence of the base learners, offering a new perspective for CNN applications in image classification tasks.
6.1. Effectiveness and Applicability
These innovations enable Football Net to achieve competitive performance on several common datasets, such as CIFAR-10 and ImageNet-100, and it also demonstrates considerable advantages on more complex datasets like ImageNet-1k. This demonstrates the effectiveness and broad applicability of the model design.
6.2. Limitations
The limitations of this model lie in the fact that, as an image classification model, its parameter count varies with the input size and the number of classes. For image processing tasks with higher resolutions, more parameters may be required.
From the experimental results, it is evident that the model exhibits improved robustness compared to the other five models, which can be attributed to the use of ensemble learning. However, in terms of accuracy, the performance does not surpass that of ViT, indicating that there is still room for improvement in enhancing accuracy.