2.3.1. Basic Model
ResNet164 is a member of the deep residual network (ResNet) family and a variant of the deep learning model proposed by Kaiming He et al. in 2015 [14]. In the ImageNet and MS COCO competitions, ResNets with depths of more than 100 layers achieved state-of-the-art accuracy in several challenging recognition tasks.
The original ResNet paper introduced multiple depths (ResNet18, ResNet34, ResNet50, ResNet101, ResNet152, etc.); the commonly cited ResNet164 refers to a deeper variant introduced in follow-up work. It uses a bottleneck structure similar to the one adopted from ResNet50 onward: within each residual module, 1 × 1 convolutions shrink the number of channels, reducing the computational complexity while preserving the representational capacity of the model.
As a member of the deep residual network family, ResNet164 has significant advantages over shallow models.
ResNet164 solves the deep network degradation problem. As network depth increases, traditional neural networks often suffer from performance saturation and even decline, a phenomenon known as ‘deep network degradation’. By introducing residual learning units, ResNet enables the network to learn deeper representations without encountering serious degradation. Each residual block allows the network to learn the residual between its input and output directly; if the residual is zero, the block becomes an identity mapping, which guarantees that the deep network can perform at least as well as a shallower one and thus addresses the difficulty of training very deep networks.
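Formally, following He et al. [14], a residual block computes
$$y = \mathcal{F}(x, \{W_i\}) + x,$$
where $x$ and $y$ are the input and output of the block and $\mathcal{F}$ is the residual mapping learned by the stacked layers. When $\mathcal{F}(x, \{W_i\}) = 0$, the block reduces exactly to the identity mapping described above.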
ResNet164 improves the model’s representational power, and depth is one of the critical factors in this improvement. With a depth of 164 layers, ResNet164 can learn more complex feature representations than shallow models, capturing multi-level abstract features in images or data and thereby achieving better performance in image classification, object detection, and other tasks.
ResNet164 optimizes the training process. Through the residual structure, ResNet164 effectively alleviates the vanishing- and exploding-gradient problems, making model training more stable and faster. The residual connections effectively provide a ‘highway’ for the gradient, ensuring that gradients can propagate directly across layers during backpropagation without being attenuated as network depth increases.
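This ‘highway’ can be made precise. In the pre-activation analysis by He et al. (the formulation on which ResNet v2, and hence ResNet164, is based), identity shortcuts give, for any deeper unit $L$ and shallower unit $l$,
$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i), \qquad \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)\right),$$
where $\mathcal{E}$ is the loss. The additive term $1$ means the gradient $\partial \mathcal{E} / \partial x_L$ reaches layer $l$ directly, without passing through any weight layers, regardless of depth.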
ResNet164 also has better generalization ability: deep models typically perform better on unseen data because of their capacity to learn richer features. Moreover, although ResNet164 is deep, techniques such as the bottleneck design optimize the use of computing resources while maintaining depth, keeping the model competitive in computational efficiency.
Compared with shallow models, ResNet164 overcomes the challenges of deep network training through its residual structure and deep design, significantly improving performance on complex tasks; it is a milestone architecture in deep learning.
Table 1 compares the ResNet164 model and other models regarding their effectiveness on the classification task.
Figure 1 shows the ResNet164 model structure.
The two 1 × 1 convolutional layers in the bottleneck [15] are used to reduce and then restore the feature dimension, respectively. Their primary purpose is to reduce the number of parameters and, with it, the amount of computation required. After dimensionality reduction, training and feature extraction can be performed more effectively.
In deep learning, a ‘bottleneck’ refers to a network module or design that is mainly used to reduce the number of computations and parameters, thereby improving the performance and efficiency of the model. This design first appeared in ResNet and was widely used in ResNet v2. Here, Conv denotes a convolution operation. In the model table, each convolution group from Conv to BatchNorm2d to ReLU includes one downsampling operation, which halves the size of the feature map and is realized through max pooling.
Specifically, the bottleneck design is used in ResNet to replace traditional plain convolutional layers. A traditional convolutional layer applies relatively large filters (such as 3 × 3 or 5 × 5) at each position to extract local features. However, such convolutional layers can generate too many computations and parameters, especially in deep networks, slowing the training process and making it prone to vanishing or exploding gradients.
The idea of the bottleneck design is to introduce a bottleneck layer consisting of a sequence of convolutions with different kernel sizes, usually a 1 × 1, 3 × 3, 1 × 1 sequence. This sequence first uses a 1 × 1 convolution kernel to reduce the dimension, then a 3 × 3 kernel to extract features, and finally a 1 × 1 kernel to restore the dimension. This design effectively reduces the dimensionality of the intermediate feature maps, thereby reducing the number of computations and parameters. In addition, the 1 × 1 convolutional layers introduce extra nonlinear transformations. Such a structure enables the model to train and infer more efficiently while maintaining good performance, especially in deep networks.
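To make the savings concrete: for a 256-channel input, a single 3 × 3 convolution preserving the channel count needs $3 \cdot 3 \cdot 256 \cdot 256 \approx 5.9 \times 10^5$ weights, whereas a 1 × 1 (256→64), 3 × 3 (64→64), 1 × 1 (64→256) bottleneck needs $256 \cdot 64 + 3 \cdot 3 \cdot 64 \cdot 64 + 64 \cdot 256 \approx 7.0 \times 10^4$, roughly 8.5 times fewer. The following PyTorch sketch illustrates a pre-activation (ResNet v2-style) bottleneck; the module names and channel widths are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class PreActBottleneck(nn.Module):
    """Pre-activation bottleneck (BN -> ReLU -> Conv), as in ResNet v2.

    The 1x1 convolutions reduce and then restore the channel dimension,
    so the expensive 3x3 convolution runs on fewer channels.
    """
    expansion = 4  # output channels = planes * expansion

    def __init__(self, in_channels, planes, stride=1):
        super().__init__()
        out_channels = planes * self.expansion
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, planes, kernel_size=1, bias=False)   # reduce
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)                            # extract
        self.bn3 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, out_channels, kernel_size=1, bias=False)  # restore
        # Projection shortcut when the shape changes; identity otherwise.
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                      stride=stride, bias=False)

    def forward(self, x):
        out = torch.relu(self.bn1(x))
        identity = self.shortcut(out) if self.shortcut is not None else x
        out = self.conv1(out)
        out = self.conv2(torch.relu(self.bn2(out)))
        out = self.conv3(torch.relu(self.bn3(out)))
        return out + identity
```

In the standard CIFAR formulation, stacking 18 such blocks in each of three stages (3 × 18 × 3 = 162 convolutional layers, plus the stem convolution and the final classifier) yields the 164 layers of ResNet164.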
2.3.2. Hierarchical Strategy
Previous studies have shown that greedy methods [16] can draw conclusions from analyzing shallow models, and greedy hierarchical methods can map these results to larger architectures. Compared with traditional end-to-end training, the greedy hierarchical strategy dramatically reduces the dependence on obtaining full gradient information: most intermediate gradients do not need to be stored or computed, which makes the strategy particularly useful in memory-constrained scenarios.
The core idea of the hierarchical greedy learning method is to decompose the training task of deep neural networks into multiple tasks involving the training of shallow networks. The entire network is built layer-by-layer, with each layer being an independently trained shallow module that relies on the previous layer’s output as the input. By combining these modules, a deep network is ultimately formed.
Training starts with a shallow model that is trained until it converges. Then, a new layer is added to the converged model, and only this new layer is trained. Typically, a new auxiliary classifier is built for each added layer, which outputs predictions and is used to calculate the training loss. These classifiers therefore provide multiple exits for the inference process, with each layer corresponding to one exit (see the sketch below).
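A minimal sketch of this greedy procedure, assuming PyTorch and treating `stages` and `aux_classifiers` as parallel lists of modules (both hypothetical names, not the paper's exact implementation), could look as follows; only the newest stage and its auxiliary classifier receive gradients:

```python
import torch
import torch.nn as nn

def train_layerwise(stages, aux_classifiers, loader, epochs_per_stage=10, lr=0.01):
    """Greedy layer-wise training: each stage is trained with its own
    auxiliary classifier while all earlier stages stay frozen."""
    criterion = nn.CrossEntropyLoss()
    for j, (stage, head) in enumerate(zip(stages, aux_classifiers)):
        # Only the current stage and its auxiliary classifier are optimized.
        params = list(stage.parameters()) + list(head.parameters())
        optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
        for _ in range(epochs_per_stage):
            for x, y in loader:
                with torch.no_grad():               # frozen prefix: stages 0..j-1
                    for prev in stages[:j]:
                        x = prev(x)
                out = stage(x)                      # current stage (trainable)
                loss = criterion(head(out), y)      # auxiliary exit for layer j
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return stages, aux_classifiers
```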
In typical deep learning application scenarios such as image recognition [17], there are shared knowledge resources, such as pre-trained models or public datasets whose characteristics are similar to users’ private data. These public resources serve as ‘prior knowledge’ that effectively guides and accelerates model training. Much of this knowledge is contained in the first layer of the model, which is usually responsible for capturing basic features of the data, such as edges, textures, and other low-level visual elements. These features are generally applicable to a wide variety of tasks. In particular, in deep models such as ResNet164, the initial layer has already learned these essential, universal feature representations on large-scale datasets, and these low-level features form the basis for the more advanced abstractions in subsequent layers. Therefore, we freeze the pre-trained first-layer parameters and only train the last few layers of the global model on the client side. This has several significant advantages. First, a reduced training burden: retraining these low layers on each client device is avoided, significantly reducing the consumption of computational resources, especially on resource-limited edge devices. Second, prevention of overfitting: stable features trained on a wide range of data are retained, which helps reduce the risk of overfitting when the model faces private user data. Third, accelerated convergence: by fixing a known-good feature extractor, the model can quickly focus on the high-level features relevant to the specific task, accelerating training. Fourth, improved model consistency: all client models remain consistent in their low-level feature extraction, which improves the overall coordination and performance of federated learning.
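In PyTorch terms, freezing the pre-trained first layer amounts to disabling its gradients and excluding it from the optimizer; `model.conv1` below is an assumed attribute name for the initial convolution, so this is a sketch rather than the paper's exact code:

```python
import torch

def freeze_first_layer(model):
    """Freeze the pre-trained stem convolution (assumed to be `model.conv1`)
    and return an optimizer over the remaining trainable parameters only."""
    for p in model.conv1.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=0.01, momentum=0.9)
```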
In summary, the strategy of freezing the first-layer parameters is based on the effective reuse of pre-training knowledge. It aims to optimize resource utilization, accelerate training, and maintain the model’s generalization ability, balancing efficiency and privacy protection in federated learning.
Therefore, we designed a hierarchical strategy for the ResNet164 model: freezing the parameters of the first convolutional layer and dividing the three bottleneck modules into separate layers. The structure of the model after stratification is shown in Figure 2.
The tiering strategy is as follows: Firstly, the parameters of the first convolutional layer are frozen (this layer does not participate in updates in any subsequent training step, because the first layer is closest to the raw data and can best reuse the low-level features learned during pre-training). Secondly, each of the three bottleneck stages forms one layer. Lastly, each layer is followed by an auxiliary classifier that outputs the prediction results for that layer, as sketched below.
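The following PyTorch sketch illustrates this tiering, assuming the stem and the three bottleneck stages are given as modules; the attribute names, channel widths, and classifier design are assumptions of the sketch, not the paper's exact implementation:

```python
import torch.nn as nn

class TieredResNet(nn.Module):
    """Illustrative tiering of a CIFAR-style ResNet (e.g., ResNet164): a frozen
    stem convolution, three bottleneck stages treated as separate layers, and
    one auxiliary classifier per stage."""
    def __init__(self, stem, stage1, stage2, stage3, num_classes=10,
                 widths=(64, 128, 256)):
        super().__init__()
        self.stem = stem                       # first conv layer, frozen
        for p in self.stem.parameters():
            p.requires_grad = False
        self.stages = nn.ModuleList([stage1, stage2, stage3])
        # One auxiliary classifier (global average pool + linear) per stage;
        # `widths` must match each stage's output channels (assumed values).
        self.aux_heads = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(w, num_classes))
            for w in widths
        ])

    def forward(self, x):
        x = self.stem(x)
        exits = []
        for stage, head in zip(self.stages, self.aux_heads):
            x = stage(x)
            exits.append(head(x))              # early-exit prediction per tier
        return exits                           # one prediction per tier
```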
The training process is as follows: First, the network is built layer by layer. The initial input $x_0$ passes through the frozen convolutional layer and enters the first bottleneck layer $f_1$, yielding the first-layer output $x_1 = f_1(x_0)$. The output $x_1$ is then used as the input of the second bottleneck layer $f_2$, yielding the second-layer output $x_2 = f_2(x_1)$. In the same way, the output of each layer is recursively used as the input of the next layer, $x_j = f_j(x_{j-1})$, until all layers of the network have been constructed. Second, for each layer $j$, the parameters of the previous layers $f_1, \dots, f_{j-1}$ are fixed, and only the parameters of the current layer $f_j$ and of its auxiliary classifier $c_j$ are optimized. The output of the current layer is classified by the auxiliary classifier $c_j$, and the classification loss is calculated.
The optimization goal is to minimize the classification loss of the current layer:
$$L_j = \ell\big(c_j(x_j),\, y_n\big),$$
where $\ell$ is the loss function (such as the cross-entropy loss), $x_j$ is the output of the current layer, and $y_n$ is the corresponding label; the minimization is over the parameters of $f_j$ and $c_j$ only. The role of the auxiliary classifiers is as follows: the output $c_j(x_j)$ of the auxiliary classifier is the prediction result of the current layer. By optimizing the loss of the auxiliary classifier, the feature extraction of each layer is directly supervised, which improves the representational ability of each layer. Within each layer, the BatchNorm and ReLU functions form a residual block group, and the output data are processed by a global average pooling layer (AvgPool) and fed to a fully connected layer (Linear).
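This per-layer objective translates directly into code; in the hedged sketch below, `stage_j` stands for the $j$-th bottleneck tier $f_j$ and `head_j` for its auxiliary classifier $c_j$ (both assumed to be nn.Module instances):

```python
import torch.nn as nn

def layer_loss(stage_j, head_j, x_prev, y):
    """Compute l(c_j(f_j(x_{j-1})), y_n) with cross-entropy as the loss l.
    Earlier tiers are assumed frozen and already applied to produce x_prev."""
    x_j = stage_j(x_prev)                               # current-layer output x_j
    return nn.functional.cross_entropy(head_j(x_j), y)  # classification loss L_j
```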
The hierarchical aggregation method forms the global model by gradually merging model updates level by level (the term is also used for hierarchical agglomerative clustering, HAC, in data mining and statistical analysis, but here it refers to the layer-wise aggregation of model updates). Its advantage is that it avoids the direct transmission and centralized storage of raw data, protecting data privacy. At the same time, hierarchical aggregation can also improve the accuracy and stability of the model, because the model updates at different levels complement each other to yield a better global model.
In layer selection, the decision to freeze specific layers is based on their role in feature extraction. The initial layers, responsible for capturing fundamental features (e.g., edges and textures), are frozen. This prevents retraining these layers across different clients, which conserves computational resources and mitigates overfitting. The later layers, which capture more task-specific features, are unfrozen and optimized further. The optimization pathway is as follows: each layer is optimized sequentially by fixing the parameters of all previous layers and focusing the training on the current layer. This allows for a more manageable memory footprint, particularly in environments with limited resources like trusted execution environments (TEEs). The optimization objective at each step is to minimize the classification loss using an auxiliary classifier, ensuring that the features learned at each layer contribute effectively to the overall model performance.
It should be noted that in the hierarchical aggregation method, parameters, such as the number of layers and the importance of each layer, need to be adjusted according to the actual situation. In addition, in the hierarchical aggregation method, factors such as the computing power and communication bandwidth of the participants also need to be considered to maintain the training efficiency and accuracy of the model.
2.3.3. Model Application
We pre-trained and pruned the ResNet164 basic model and then designed its hierarchical model, which was finally applied in federated learning based on the Intel SGX trusted execution environment.
Figure 3 shows the process of model application.
①–② Server public knowledge learning: The original model is trained on the public dataset until it converges, yielding a pre-trained model based on this dataset. During initialization, the server therefore selects a model pre-trained on public data whose distribution is similar to that of the private data.
③ Broadcasting specific layer parameters: The server checks all available devices and constructs a set of participating clients, ensuring that the TEE’s memory exceeds the memory these clients’ updates will occupy. Then, the layer parameters of the trained model are broadcast to the participating clients.
④ After model transmission and configuration via gRPC remote communication, each client starts local training of this layer on its private data.
⑤ After completing the local training of the layer, all participating clients encrypt the layer parameters and upload them to the server via gRPC remote communication.
⑥ Finally, the server decrypts the received parameters inside its TEE and applies the FedAvg algorithm to aggregate them securely, generating a new global model layer.
Steps ③–⑥ of global model training are repeated until all layers of the hierarchical model have been trained, as sketched below.
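A minimal sketch of the per-layer FedAvg aggregation in step ⑥, assuming the server has already decrypted the uploaded updates inside its TEE (`fedavg_layer` is a hypothetical helper; `client_states` holds one state_dict per client for the layer trained this round):

```python
def fedavg_layer(client_states, client_sizes):
    """FedAvg over one tier's parameters.

    client_states: list of state_dicts (name -> torch.Tensor) for this layer.
    client_sizes:  local sample counts used as aggregation weights.
    """
    total = float(sum(client_sizes))
    avg = {}
    for name in client_states[0]:
        # Weight each client's tensor by its share of the total data.
        avg[name] = sum(state[name] * (n / total)
                        for state, n in zip(client_states, client_sizes))
    return avg
```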
During the experiments, we observed the following characteristics of the hierarchical model: the number of parameters grows rapidly in the deeper layers, those layers’ correlation with the original data features weakens, and the data features there are consequently less vulnerable to attack.
Therefore, the following security decision was made: the third-layer parameters are aggregated locally, which optimizes TEE memory usage while preserving overall security and maintaining both computational efficiency and privacy protection.