3.2. MSSCNN Fault Diagnosis Model
As shown in Figure 6, the MSSCNN fault diagnosis model proposed in this paper initially uses a 3 × 1 convolution to extract shallow features from the sample data. With a stride of 2, the convolution halves the data length to 125, so the output data size of this layer is 24 (number of channels) × 125 (data length) × 1 (dimension). Batch normalization (BN) is then applied to normalize these shallow features [24]. The BN layer unifies parameter magnitudes, which accelerates convergence and prevents network overfitting; it can be expressed as follows
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i^{l,k} \tag{7}$$
$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i^{l,k} - \mu\right)^2 \tag{8}$$
$$\hat{x}_i^{l,k} = \frac{x_i^{l,k} - \mu}{\sqrt{\sigma^2 + \varepsilon}} \tag{9}$$
$$y_i^{l,k} = \gamma\,\hat{x}_i^{l,k} + \beta \tag{10}$$
where m represents the number of samples computed at each iteration, which is 5 in this paper; μ is the sample mean; σ² is the sample variance; $x_i^{l,k}$ represents the i-th data point of the k-th channel in the l-th layer of the model, k ∈ {1, 2, …, N}; γ and β are, respectively, the scale and shift parameters, which can be learned through the network; and ε is a small constant that prevents the denominator in (9) from being zero. The output data size of this layer remains unchanged.
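As a brief numerical illustration (the values here are chosen for exposition and do not come from the paper), consider the m = 5 data points (1, 2, 3, 4, 5) at one position of a single channel. Equations (7) and (8) give
$$\mu = \tfrac{1}{5}(1+2+3+4+5) = 3, \qquad \sigma^2 = \tfrac{1}{5}(4+1+0+1+4) = 2,$$
so that, with ε ≈ 0, (9) maps the data to approximately (−1.41, −0.71, 0, 0.71, 1.41), which (10) then rescales to γx̂ + β.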
The rectified linear unit (ReLU) activation function is used for non-linearity processing [25]; it helps alleviate the problem of gradient vanishing and is relatively simple to implement. ReLU can be expressed as follows
$$f(x) = \max(0, x) \tag{11}$$
The output data size of this layer is 24 × 125 × 1. Finally, the features are dimensionally reduced through a maximum pooling (MaxPool) layer [26] with a stride of 2, which can be expressed as follows
$$y_j^{l,k} = \max_{i \in R_j} x_i^{l,k} \tag{12}$$
where $R_j$ denotes the j-th pooling window and $L_{\mathrm{MaxPool}}$ represents its length, which is 3 in this paper. The output size of this layer is 24 × 63 × 1. The above steps complete the preliminary extraction of current feature information.
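A minimal PyTorch sketch of this stem, assuming a 3-channel (three-phase) current input of length 250 and a padding of 1 (neither is stated explicitly above; both are inferred from the quoted output sizes):
```python
import torch
import torch.nn as nn

# Sketch of the preliminary feature-extraction stem described above.
# Assumptions: 3 input channels (three-phase currents), input length 250,
# and padding=1 so that stride 2 yields the stated lengths of 125 and 63.
stem = nn.Sequential(
    nn.Conv1d(3, 24, kernel_size=3, stride=2, padding=1),  # 3x1 conv: length 250 -> 125
    nn.BatchNorm1d(24),                                    # BN layer, Equations (7)-(10)
    nn.ReLU(inplace=True),                                 # ReLU, Equation (11)
    nn.MaxPool1d(kernel_size=3, stride=2, padding=1),      # L_MaxPool = 3: length 125 -> 63
)

x = torch.randn(5, 3, 250)  # batch of m = 5 samples
print(stem(x).shape)        # torch.Size([5, 24, 63])
```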
To better distinguish open-circuit faults caused by damage to different switching devices, it is necessary to extract higher-dimensional current fault feature information. CNN models such as ResNet and DenseNet exhibit excellent performance in fields such as image recognition and object detection, but these models often have high complexity. To achieve a lightweight network model while maintaining high accuracy, rapid diagnosis, and strong noise resistance in the open-circuit fault diagnosis of three-level NPC inverters, this paper designs the MSSCNN basic module and downsampling module. These modules mainly comprise 1 × 1 convolution layers, 3 × 1 and 9 × 1 depthwise separable convolution layers, BN layers, ReLU activation functions, and channel shuffle. The overall structure is shown in Figure 7.
Table 5 presents a comparison of the MSSCNN basic module and downsampling module designed in this paper with four other common CNN models. In traditional CNN architectures, each layer extracts input feature information through ordinary convolutions, but the feature-learning capacity is insufficient. In ResNet, low-level feature information is directly mapped to high-level layers through shortcut connections, which greatly improves the convergence speed and accuracy of the network; however, the large number of addition operations leads to high computational complexity. ShuffleNet V2 is a lightweight network that replaces the addition operations of ResNet with concat operations, thereby reducing the model's computational load. MobileNet V3 replaces ordinary convolutions with depthwise separable convolutions [27] to reduce the computational load while maintaining good classification performance. The basic module and downsampling module designed in this paper use depthwise separable convolutions instead of the standard convolutions of traditional CNNs. Compared with ShuffleNet V2 and MobileNet V3, the 1 × 1 convolutions that precede the depthwise separable convolutions are omitted to reduce computation. Moreover, concat operations replace the addition operations of ResNet and MobileNet V3 to further decrease computational complexity. In the downsampling module, a combination of 3 × 1 convolutions and 9 × 1 large-kernel convolutions effectively extracts current fault features at various scales, enhancing the model's robust fault feature extraction capability.
The depthwise separable convolution adopted by the proposed model is divided into two steps: depthwise convolution and pointwise convolution. As shown in Figure 7, in the depthwise step each input channel is convolved by only one convolution kernel, so the number of output channels is exactly equal to the number of channels in the previous layer. Since each channel is convolved separately, the features in the channel direction remain independent of one another. Therefore, the second step of depthwise separable convolution uses pointwise convolution, i.e., 1 × 1 convolution, to fuse cross-channel information. The number of multiplications of a standard convolution can be expressed as follows
$$C_{\mathrm{SC}} = K \cdot M \cdot N \cdot F \tag{13}$$
where K and F are the sizes of the convolution kernel and the output feature map, respectively, and M and N are the numbers of input and output channels, respectively.
The numbers of multiplications of the depthwise convolution and the pointwise convolution that make up a depthwise separable convolution can be expressed as follows
$$C_{\mathrm{DW}} = K \cdot M \cdot F \tag{14}$$
$$C_{\mathrm{PW}} = M \cdot N \cdot F \tag{15}$$
The computational complexity of standard convolution and depthwise separable convolution can then be compared as follows
$$\frac{C_{\mathrm{DW}} + C_{\mathrm{PW}}}{C_{\mathrm{SC}}} = \frac{K \cdot M \cdot F + M \cdot N \cdot F}{K \cdot M \cdot N \cdot F} = \frac{1}{N} + \frac{1}{K} \tag{16}$$
According to (16), it is evident that the computational complexity of the depthwise separable convolution used in this paper is significantly reduced compared to standard convolution.
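As an illustrative calculation (the sizes here are chosen for exposition only): with K = 3, M = N = 48, and F = 32, a standard convolution requires 3 × 48 × 48 × 32 = 221,184 multiplications, whereas the depthwise separable convolution requires 3 × 48 × 32 + 48 × 48 × 32 = 78,336, i.e., 1/48 + 1/3 ≈ 35% of the cost. A minimal PyTorch sketch of a 1-D depthwise separable convolution, under the same illustrative sizes:
```python
import torch
import torch.nn as nn

# Sketch of a 1-D depthwise separable convolution: a depthwise step
# (groups = in_channels, one kernel per channel) followed by a 1x1 pointwise
# step that fuses cross-channel information. Channel counts are illustrative.
class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)  # K*M*F multiplications, Eq. (14)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)            # M*N*F multiplications, Eq. (15)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

dsc = DepthwiseSeparableConv1d(48, 48, kernel_size=3)
print(dsc(torch.randn(1, 48, 32)).shape)  # torch.Size([1, 48, 32])
```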
As shown in Figure 7, the first step in deep feature extraction is to use a downsampling module to reduce dimensionality and extract information. The current feature information, with an input size of 24 × 63 × 1, is convolved by a 3 × 1 depthwise separable convolution and a 9 × 1 depthwise separable convolution, respectively, which enables the network to capture current information features at different scales. During the pointwise convolution, the numbers of input and output channels are kept equal, because equal channel counts minimize memory access cost and speed up computation. After the convolutions, the two branches are concatenated, doubling the output channels and significantly enhancing the feature-learning capability of the network. Another downsampling module is then used to extract feature information with an output size of 96 × 16 × 1.
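A sketch of the downsampling module under these assumptions (padding and the BN/ReLU placement are not specified above and are assumed here):
```python
import torch
import torch.nn as nn

# Sketch of the downsampling module: two depthwise separable branches
# (3x1 and 9x1 kernels) with stride 2, each keeping its channel count through
# the pointwise step, concatenated to double the channels.
class DownsamplingModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        def branch(kernel_size: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size, stride=2,
                          padding=kernel_size // 2, groups=channels),  # depthwise: halves the length
                nn.Conv1d(channels, channels, kernel_size=1),          # pointwise: channels unchanged
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            )
        self.branch3 = branch(3)  # 3x1 depthwise separable convolution
        self.branch9 = branch(9)  # 9x1 large-kernel depthwise separable convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.branch3(x), self.branch9(x)], dim=1)  # concat doubles channels

x = torch.randn(1, 24, 63)
print(DownsamplingModule(24)(x).shape)  # torch.Size([1, 48, 32])
```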
In the basic module, the feature information is first subjected to channel shuffle. Channel shuffle [22] rearranges and divides the feature channels into two groups, with each group having half the number of input feature channels. The feature channels in one branch pass directly to the next layer without any operation, thereby establishing connections between different layers and allowing each layer to reuse half of the features from the previous layer; this characteristic, similar to DenseNet, contributes to the model's high accuracy. The other branch uses a 3 × 1 depthwise separable convolution. The outputs of the two branches are concatenated, maintaining the channel count at 96. The final high-dimensional feature extraction is accomplished by using the downsampling module again, as sketched below.
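A sketch of channel shuffle and the basic module following the description above; the branch structure and BN/ReLU placement are assumptions where the text does not pin them down:
```python
import torch
import torch.nn as nn

# Channel shuffle: interleave the channels across two groups so that the
# subsequent split mixes features produced by different branches.
def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    b, c, n = x.shape
    return x.view(b, groups, c // groups, n).transpose(1, 2).reshape(b, c, n)

class BasicModule(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv1d(half, half, kernel_size,
                      padding=kernel_size // 2, groups=half),  # 3x1 depthwise convolution
            nn.Conv1d(half, half, kernel_size=1),              # pointwise convolution
            nn.BatchNorm1d(half),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = channel_shuffle(x)
        identity, transformed = x.chunk(2, dim=1)  # two half-channel groups
        # one group is reused unchanged; the other passes through the DSC branch
        return torch.cat([identity, self.branch(transformed)], dim=1)

x = torch.randn(1, 96, 16)
print(BasicModule(96)(x).shape)  # torch.Size([1, 96, 16])
```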
The final step of deep feature extraction involves a simple concatenation of the two branches. Therefore, in the feature aggregation and output part, a 1 × 1 convolution is first used to enhance information exchange between the two branches, as shown in Figure 8. To better distinguish the 13 conditions in total, comprising the various single-switch open-circuit faults and normal operation of the three-level NPC inverter, a global average pooling (GAP) layer [26] is utilized to integrate the global information of the features, which can be expressed as follows
$$y_k = \frac{1}{n_k}\sum_{i=1}^{n_k} x_i^{k} \tag{17}$$
where $n_k$ represents the total amount of data in the k-th channel. Finally, the 13 operating states are output through a fully connected (FC) layer.
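A minimal sketch of this aggregation-and-output head, assuming the final downsampling stage doubles the 96 channels to 192 at length 8 (inferred from the stated module behavior, not given explicitly in the text):
```python
import torch
import torch.nn as nn

# Sketch of the feature aggregation and output head: 1x1 convolution for
# cross-branch information exchange, global average pooling (Equation (17)),
# and a fully connected layer over the 13 operating states. The width of 192
# is an inferred assumption (96 channels doubled by the final downsampling).
head = nn.Sequential(
    nn.Conv1d(192, 192, kernel_size=1),  # 1x1 conv: fuse the two concatenated branches
    nn.AdaptiveAvgPool1d(1),             # GAP: averages the n_k points of each channel
    nn.Flatten(),
    nn.Linear(192, 13),                  # FC layer: 12 single-switch OC faults + normal state
)

features = torch.randn(1, 192, 8)  # high-dimensional features from the last module
print(head(features).shape)        # torch.Size([1, 13])
```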