3.4.1. Design of Detection Model
This section introduces the neural network model designed in this paper, which is named MHSA-ResNet according to its characteristics and is used to detect COVID-19 in X-ray images. The model is composed of convolutional layers, pooling layers, bottleneck layers, and fully connected layers. The convolution layers of the residual network are divided into several groups to facilitate the calculation of the residual values. In the bottleneck layer, the 1 × 1 convolution kernel is replaced by a multi-head self-attention module to facilitate feature extraction, shorten the training time, and improve the operating efficiency of the neural network. This section also describes the model training process in detail. A schematic diagram of the MHSA-ResNet neural network is shown in Figure 7.
Convolution layer: The convolution layer performs a convolution operation on the input data and extracts useful information from it. Its parameters include the size of the convolution kernel, the stride (the step size of each movement of the kernel), and other settings.
Pooling layer: The pooling layer performs a nonlinear operation that summarizes a window of values in the input matrix and returns a single value. Its parameters include the kernel size, which specifies the length and width of the pooling window; the stride, which represents the distance the pooling window moves at each step (in the model, a stride of 2 means the window moves two pixels after each pooling operation); and the pooling type, which is either maximum pooling or average pooling. Maximum pooling retains the maximum value in the pooling window matrix, whereas average pooling retains the average of all values in the window.
Fully connected layer: The num_output parameter represents the number of neurons in the fully connected layer, and the activation function parameter specifies its activation function. The ReLU activation function is used in fully connected layers 1 and 2, and the Softmax activation function is used in the last fully connected layer. L2 regularization, i.e., a weight-decay penalty based on the Euclidean norm, is applied in the fully connected layers: a penalty term proportional to the sum of squares of the weights is added to the objective function so that the weight parameters are drawn closer to the origin. A small illustrative sketch of the pooling and fully connected operations is given after these layer descriptions.
Bottleneck layer: The bottleneck layer has the structure of a bottleneck block. Specifically, it uses convolution blocks placed between two convolution layers of different sizes and different channel numbers in the convolutional neural network.
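As a small illustrative sketch of the pooling and fully connected layers described above (the kernel size, layer widths, and number of output classes below are assumptions for demonstration only; the actual values follow Table 2):

import torch
import torch.nn as nn
import torch.optim as optim

# Pooling: maximum vs. average pooling with a stride of 2 (kernel size here is illustrative)
x = torch.randn(1, 64, 56, 56)                                # a 64-channel feature map
max_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)   # keeps the largest value in each window
avg_pool = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)   # averages the values in each window
print(max_pool(x).shape, avg_pool(x).shape)                   # stride 2 halves the spatial size: [1, 64, 28, 28]

# Fully connected head: ReLU in layers 1 and 2, Softmax in the last layer (widths are hypothetical)
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 28 * 28, 512), nn.ReLU(),   # fully connected layer 1
    nn.Linear(512, 128), nn.ReLU(),            # fully connected layer 2
    nn.Linear(128, 2), nn.Softmax(dim=1),      # last fully connected layer (2 output classes assumed)
)
probs = fc_head(max_pool(x))                   # -> shape [1, 2], class probabilities

# L2 regularization: weight_decay adds the squared-Euclidean-norm penalty on the weights to the objective
optimizer = optim.SGD(fc_head.parameters(), lr=1e-3, weight_decay=1e-4)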
The training process is as follows: in the first convolution layer, the convolution kernel size is set to 7 × 7 and the stride is 2, so the spatial size of the output image is half that of the input. The feature map produced by this convolution layer is fed into a 3 × 3 maximum pooling layer whose stride is also 2. The same convolution operation is used in the second convolutional layer, but the bottleneck structure is added in this layer. The bottleneck structure is composed of three convolution blocks: a 1 × 1 convolution block is first applied to change the number of channels, a 3 × 3 convolution block then performs the convolution operation, and finally a 1 × 1 convolution block restores the number of channels. There are multiple bottleneck blocks in the residual network, which complete the task of calculating the residual values. The same operation is performed in the third convolutional layer, differing only in the number of bottleneck blocks and the number of channels. The multi-head self-attention module is added to convolutional layers 4 and 5. The specific model architecture is shown in Table 2.
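A minimal PyTorch sketch of the standard bottleneck block described above (illustrative channel numbers; this is not the paper's released code):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Standard ResNet-style bottleneck: 1x1 reduce -> 3x3 conv -> 1x1 restore, with a residual shortcut."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.reduce = nn.Sequential(                      # 1x1 block: change the number of channels
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(                        # 3x3 block: perform the main convolution
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.restore = nn.Sequential(                     # 1x1 block: restore the number of channels
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels))
        # Projection shortcut when the shape changes, identity otherwise
        self.shortcut = (nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels))
            if stride != 1 or in_channels != out_channels else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.restore(self.conv(self.reduce(x)))
        return self.relu(out + self.shortcut(x))          # residual connection

# Example: one block from a hypothetical conv2-style stage
block = Bottleneck(64, 64, 256)
y = block(torch.randn(1, 64, 56, 56))   # -> shape [1, 256, 56, 56]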
According to the model listed in Table 2, the residual network is divided into five convolutional layers. Apart from the first convolutional layer, the remaining four convolutional layers calculate the residual values through bottleneck blocks, and they do not differ significantly in parameters such as the convolution kernel size, the kernel stride, and the pooling mode. Compared with ResNet50, ResNet152 deepens the third and fourth convolutional layers by increasing the number of bottleneck blocks in those stages (from 4 to 8 and from 6 to 36 blocks, respectively), while the number of output channels and the size of the output images remain unchanged.
The MHSA-ResNet model designed in this paper is based on ResNet152; a multi-head self-attention module is added to the bottleneck blocks of the last two convolutional layers so as to reduce the amount of computation and improve the training accuracy of the neural network model without changing the number of channels or the size of any convolutional layer.
3.4.2. Multi-Head Self-Attention Module Design
The attention mechanism adopted in this paper is the multi-head self-attention mechanism [30], which combines the characteristics of self-attention and multi-head attention; its structure is shown in Figure 8. ⨁ denotes the element-wise sum of matrices, and ⨂ denotes matrix multiplication. When attention is applied to the 2D feature map of an image, the height and width of the image features are used to calculate the range of the segmented receptive field and thus obtain the relative position codes R_h and R_w. The relative position code on the left side of the figure is the matrix computed from R_h and R_w. The attention calculation is then completed by combining this relative position matrix with the query matrix and the key matrix. The multi-head self-attention formula is as follows:
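A reconstruction of this formula, following the relative-position self-attention of [30] (written in single-head form; the multi-head version applies the same computation over several heads in parallel and concatenates the results), is

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( Q K^{\top} + Q R^{\top} \right) V, \qquad R = R_h \mathbin{⨁} R_w ,
\]

where Q, K, and V denote the query, key, and value matrices obtained from the input feature map, and ⨁ is the element-wise sum shown in Figure 8.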
R in the formula is the relative position matrix obtained by the element-wise addition of the width and height position codes of the 2D feature map [31]. Using the relative position matrix to help recognize attention features in the image can effectively improve the efficiency of attention, and the positional coding gives the model the ability to capture sequence order. In the multi-head self-attention, each matrix is produced by pointwise (1 × 1) convolution to ensure the recognition accuracy of the attention.
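A minimal PyTorch sketch of such a module, under the assumptions that q, k, and v are produced by pointwise (1 × 1) convolutions and that R_h and R_w are learned per-axis position codes (the class name MHSA2d and all sizes are illustrative, not taken from the paper):

import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """2D multi-head self-attention with learned relative position codes (illustrative sketch)."""
    def __init__(self, channels, height, width, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        # q, k, v are produced by pointwise (1 x 1) convolutions
        self.q_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.k_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.v_conv = nn.Conv2d(channels, channels, kernel_size=1)
        # learned relative position codes R_h and R_w along the feature-map height and width
        self.rel_h = nn.Parameter(torch.randn(1, num_heads, self.head_dim, height, 1) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(1, num_heads, self.head_dim, 1, width) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape                       # h and w must match the sizes given at construction
        def split_heads(t):                        # (B, C, H, W) -> (B, heads, head_dim, H*W)
            return t.view(b, self.num_heads, self.head_dim, h * w)
        q = split_heads(self.q_conv(x))
        k = split_heads(self.k_conv(x))
        v = split_heads(self.v_conv(x))
        # content-content term: q^T k
        content = torch.matmul(q.transpose(-2, -1), k)
        # content-position term: q^T (R_h + R_w)
        r = (self.rel_h + self.rel_w).view(1, self.num_heads, self.head_dim, h * w)
        position = torch.matmul(q.transpose(-2, -1), r)
        attn = torch.softmax(content + position, dim=-1)
        out = torch.matmul(v, attn.transpose(-2, -1))   # weighted sum of values
        return out.view(b, c, h, w)

# Example on a 14 x 14 feature map with 512 channels (illustrative sizes)
attn = MHSA2d(channels=512, height=14, width=14)
y = attn(torch.randn(1, 512, 14, 14))   # -> shape [1, 512, 14, 14]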
In order to fuse the multi-head self-attention module with the bottleneck layer of the residual network, the MHSA attention module is designed into the bottleneck layer. In the design of the MHSA module, the feature map must be generated according to the length and width of the image, and the channel dimension and other features are also used as parameters of the feature map when calculating the position matrix. To combine the MHSA module with the residual network, this paper embeds the MHSA module in the bottleneck layers of the last two layers of the residual neural network, completing the design of the neural network.
Figure 9 shows the structure of the bottleneck layer.
The calculation of the residual value needs to go through the convolution layer, and the tasks of modifying the number of channels and the shape of the image are completed by the convolution blocks in the convolution layer. In the middle is the attention-based MHSA bottleneck block. After feature extraction through the attention mechanism, the last convolution block is used to restore the number of channels. The output of the bottleneck block after this last operation is denoted F(x). At the same time, the part of the input that is not processed by the convolution blocks is denoted x, and the sum of the two is F(x) + x. After the activation function and the regularization operation, we obtain the computation of the residual shortcut.
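Continuing the sketches above (and reusing the illustrative MHSA2d class), the MHSA bottleneck block with its residual shortcut ReLU(F(x) + x) could be assembled roughly as follows; this follows the structure described for Figure 9 and is not the paper's exact implementation:

import torch
import torch.nn as nn

class MHSABottleneck(nn.Module):
    """Bottleneck block whose middle convolution is replaced by multi-head self-attention (sketch)."""
    def __init__(self, in_channels, mid_channels, out_channels, height, width, num_heads=4):
        super().__init__()
        self.reduce = nn.Sequential(                      # 1x1 block: change the number of channels
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True))
        self.mhsa = MHSA2d(mid_channels, height, width, num_heads)   # attention-based feature extraction
        self.restore = nn.Sequential(                     # 1x1 block: restore the number of channels
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels else
                         nn.Sequential(nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
                                       nn.BatchNorm2d(out_channels)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        fx = self.restore(self.mhsa(self.reduce(x)))      # F(x): reduce -> MHSA -> restore
        return self.relu(fx + self.shortcut(x))           # residual shortcut: ReLU(F(x) + x)

# Illustrative sizes for a late stage of the network
block = MHSABottleneck(1024, 512, 2048, height=14, width=14)
y = block(torch.randn(1, 1024, 14, 14))   # -> shape [1, 2048, 14, 14]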
3.4.3. Evaluating the Complexity of the Detection Model
In order to verify the effectiveness of the neural network detection model designed in this paper, this section evaluates the complexity of the neural network through experiments. The complexity of a model is generally evaluated by its computational quantity and its parameter quantity. The algorithm complexity is expressed as follows:
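Written in a standard form that is consistent with the symbol definitions below (this is a reconstruction of the usual expressions rather than necessarily the paper's exact notation), the two measures are

\[
\text{Time} \sim O\!\left( \sum_{l=1}^{L} M_l^{2} \cdot K_l^{2} \cdot C_{l-1} \cdot C_{l} \right) \quad (5)
\]
\[
\text{Space} \sim O\!\left( \sum_{l=1}^{L} K_l^{2} \cdot C_{l-1} \cdot C_{l} \;+\; \sum_{l=1}^{L} M_l^{2} \cdot C_{l} \right) \quad (6)
\]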
Formula (5) describes the computational time complexity (computational quantity), and Formula (6) describes the computational space complexity (parameter quantity), where K represents the size of the convolution kernel, C represents the number of channels, L represents the number of layers in the neural network, and M represents the size of the output feature map. The time complexity and space complexity of the neural network calculated by these formulas still need to be assessed with relevant indicators. The evaluation indicators include:
Params: The number of parameters of the model refers to the total number of parameters to be trained in the neural network; it directly determines the size of the model and also affects the memory usage during inference. The unit is generally M (millions of parameters).
FLOPs: Floating-point operations (FLOPs) refers to the number of floating-point operations, which measures the time complexity of a network model and is expressed here in giga floating-point operations (GFLOPs).
In the convolution layer, because the weights in the convolution kernel are shared, the calculation formula for the number of parameters and the calculation formula for the FLOPs are as follows:
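Stated as a reconstruction in a common form (with K_h and K_w the kernel height and width, C_in and C_out the numbers of input and output channels, H_out and W_out the output feature-map size, and the bias term included):

\[
\mathrm{Params}_{\mathrm{conv}} = \left( K_h \times K_w \times C_{\mathrm{in}} + 1 \right) \times C_{\mathrm{out}}, \qquad
\mathrm{FLOPs}_{\mathrm{conv}} = \mathrm{Params}_{\mathrm{conv}} \times H_{\mathrm{out}} \times W_{\mathrm{out}}
\]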
The number of parameters in the convolution layer is calculated by multiplying the numbers of input and output channels by the height and width of the convolution kernel (plus the bias terms). When computing the FLOPs, the height, width, and number of channels of the output feature map are also needed to complete the calculation.
In the fully connected layer, because there is no weight sharing, the FLOPs of the layer are equal to the number of parameters in the layer. The fully connected layer mainly performs the addition and multiplication operations in each neuron. The calculation formulas for the number of parameters and the FLOPs are as follows:
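In the same reconstructed notation (N_in input features and N_out output neurons, with one bias per output neuron):

\[
\mathrm{Params}_{\mathrm{fc}} = \left( N_{\mathrm{in}} + 1 \right) \times N_{\mathrm{out}}, \qquad
\mathrm{FLOPs}_{\mathrm{fc}} = \mathrm{Params}_{\mathrm{fc}}
\]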
Using the above formulas, the number of parameters of each module of the neural network can be counted with the tensorboard package. Table 3 lists the parameter counts of the neural network designed in this paper.
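As a minimal alternative sketch of such parameter counting (a plain PyTorch loop rather than the tensorboard-based workflow used in the paper; the torchvision ResNet152 baseline is only an illustration):

import torch.nn as nn
from torchvision import models   # used only to provide a ResNet152 baseline for illustration

def count_parameters(model: nn.Module) -> float:
    """Return the number of trainable parameters in millions (M)."""
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total / 1e6

print(f"ResNet152 baseline: {count_parameters(models.resnet152()):.2f} M parameters")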
The experiments show that, through the attention mechanism, the MHSA-ResNet neural network designed in this paper reduces the number of parameters that must be computed and increases the operating speed of the network. This demonstrates the optimizing effect of the attention mechanism on the neural network and makes the recognition of COVID-19 images in this paper more efficient.