1. Introduction
Fetal malformation is the structural and functional abnormality of the body due to factors including but not limited to the environment, infection, and diabetes [
1]. Fetal malformation can lead to perinatal death and life-long disability [
2]. Hence, prenatal diagnosis and management are necessary. The methods of prenatal diagnosis mainly include magnetic resonance imaging (MRI) and ultrasound screening techniques [
3,
4]. Ultrasound screening is widely used because of its low cost and real-time visualization capability [
5,
6]. During ultrasound imaging, biometric parameters such as the biparietal diameter and head circumference can be used to assess fetal development and diagnose fetal malformation [
7]. Therefore, the fetal ultrasound standard planes used to measure such biometric parameters deserve close attention to ensure the reproducibility of the diagnosis [
7,
8]. In clinical settings, standard planes are mainly obtained through clinicians’ manual selection while scanning pregnant women with ultrasound probes [
9]. In addition, the variety of standard planes and the presence of interfering anatomical structures that resemble those in the standard planes make manual detection challenging [
6,
8].
In recent years, deep-learning techniques including convolutional neural networks (CNNs) have been applied to medical image processing, including fetal ultrasound standard plane detection. For instance, Chen et al. proposed the multi-organ foundation (MOFO) model for ultrasound image segmentation [
10]. Chernyshov et al. introduced a U-Net based network for the segmentation and quantification of echocardiography [
11]. Su et al. proposed JANet for the segmentation of the left ventricle in ultrasound videos based on ResNet and U-Net [
12]. Standard plane detection has been formulated as the classification of several kinds of standard planes. Chen et al. combined different kinds of networks and proposed N-CNN and T-RNN [
13,
14]. Baumgartner et al. proposed SonoNet based on the Visual Geometry Group (VGG) network, which achieved an accuracy of 90.1% in the classification of 13 standard planes [
15]. Ye et al. proposed a network combining YOLOv3 and ResNeXt [
16]. Pu et al. proposed the FUSPR network and achieved an accuracy of 87.38% in the classification of four categories, including the fetal abdominal standard plane, fetal thalamus standard plane, fetal cerebellum standard plane, and fetal lumbosacral spine standard plane [
17]. Kong et al. proposed the MSDNet based on DenseNet, which was able to extract features from various scales, and achieved an accuracy of 98.26% [
18]. In addition to fetal plane detection, deep learning has been used in other tasks. Lin et al. proposed a method based on Faster R-CNN and MFR-CNN for standard plane and inner tissue detection [
19,
20]. The USPD proposed by Zhao et al. was able to detect standard planes and simultaneously explain the detection results [
8]. Cai et al. proposed the multi-task SonoEyeNet as an AI-powered tool that uses sonographer eye movements to create visual cues that help automate the process of finding the correct abdominal circumference measurement plane in ultrasound exams [
21].
In 2020, a dataset [
22] was made public by Burgos-Artizzu et al., encouraging research in fetal ultrasound standard plane detection. This dataset consists of six categories: the fetal abdomen, fetal brain, fetal femur, fetal thorax, maternal cervix, and other. The fetal brain category can be further divided into four sub-categories (trans-ventricular, trans-thalamic, trans-cerebellum, and other brain standard planes), which brings the total number of categories to nine.
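The relation between the six-category and nine-category labelings can be written down explicitly. The following is a minimal sketch for illustration only: the label strings, the `fetal brain: ` prefix, and the helper name `to_six` are assumptions introduced here and are not identifiers taken from the dataset files.

```python
# Hypothetical label strings; the dataset's own annotation labels may differ.
NON_BRAIN = ["fetal abdomen", "fetal femur", "fetal thorax", "maternal cervix", "other"]
BRAIN_SUBTYPES = ["trans-ventricular", "trans-thalamic", "trans-cerebellum", "other brain"]

# Six-category scheme: five non-brain categories plus one fetal brain category.
SIX = NON_BRAIN + ["fetal brain"]

# Nine-category scheme: the fetal brain category is replaced by its four sub-categories.
NINE = NON_BRAIN + [f"fetal brain: {s}" for s in BRAIN_SUBTYPES]


def to_six(nine_label: str) -> str:
    """Collapse a nine-category label onto the six-category scheme."""
    return "fetal brain" if nine_label.startswith("fetal brain: ") else nine_label
```

Under this mapping, a nine-category prediction can always be reduced to a six-category one, whereas the reverse requires an additional brain sub-plane classifier.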
With Burgos-Artizzu et al.’s dataset [
22], Krishna and Kokil proposed three deep-learning networks that combined AlexNet, VGG, and ResNet, achieving accuracies of 95.1%, 95.5%, and 95.7%, respectively, in the classification of six categories of standard planes [
5,
23,
24]. In addition to the classification of six categories, some researchers focused on the classification of three categories of brain standard planes, i.e., trans-ventricular, trans-thalamic, and trans-cerebellum. Coronado-Gutiérrez et al. used a ResNet-18 pretrained on the ImageNet dataset to classify the three categories of brain standard planes, with an accuracy of 98.1% [
25]. Vetriselvi and Thenmozhi designed a binary-channel CNN and achieved an accuracy of 97.0% in the same classification task [
26]. Other researchers designed a single model for both classification tasks (i.e., the classification of six categories of standard planes and the classification of three categories of brain standard planes). Annamalai and Sindhu proposed an ensemble network with InceptionResNetV2, DenseNet121, and Xception and achieved accuracies of 96.9% and 93.7% in the classification of six categories and three categories, respectively [
27]. Zamojski et al. combined EfficientNetV2 and a recurrent neural network (RNN) to classify three and six categories of standard planes [
28].
It can be seen that ensemble frameworks were preferred in the classification of six categories and achieved excellent performance owing to their ability to extract features from various scales. However, ensemble frameworks lead to large parameter sizes and long inference times. In this paper, we proposed a lightweight network based on SonoNet [
15] and introduced light pyramid convolution (LPC) blocks inspired by the Simplified Spatial Pyramid Pooling Fast (SimSPPF) module of YOLOv6 [
29]. The proposed network, termed LPC-SonoNet, was trained and tested using Burgos-Artizzu’s dataset [
22]. While the Burgos-Artizzu dataset [
22] encompasses nine distinct image categories, most of the research has focused on classifying either six or three categories. This preference for a smaller number of categories necessitates additional classification steps to identify specific standard planes, such as the trans-ventricular plane. Recognizing this limitation, we applied the proposed LPC-SonoNet to the classification of all nine categories of fetal ultrasound standard planes. The main contributions of this paper are as follows:
We proposed a lightweight deep-learning model based on LPC and SonoNet. Compared to SonoNet, the proposed LPC-SonoNet demonstrates a slight improvement in classifying six categories on the Burgos-Artizzu dataset [
22], while simultaneously reducing network complexity (i.e., requiring fewer parameters).
The proposed LPC-SonoNet was applied to the classification of nine categories on Burgos-Artizzu’s dataset [
22], enabling the direct identification of each of the nine kinds of standard planes.
4. Discussion
In this paper, we incorporated LPC blocks into SonoNet64 and proposed LPC-SonoNet for fetal ultrasound standard plane detection. The proposed network replaces the convolutional blocks of SonoNet64 with LPC blocks. The pyramid architecture of the LPC blocks can leverage features from various scales and fuse them with few convolutional layers. The proposed LPC-SonoNet model was trained and tested on a public dataset containing six categories of standard planes, i.e., Burgos-Artizzu et al.’s dataset [
22]. Experimental results showed that LPC-SonoNet slightly outperformed SonoNet64 with far fewer network parameters. In addition, we further divided the dataset into nine categories and pioneered the nine-category classification using LPC-SonoNet, with promising detection performance. This study has provided a lightweight network for deep-learning-based fetal ultrasound standard plane detection.
Compared with the convolutional layers in SonoNet64, the pyramid architecture in the proposed LPC-SonoNet enables most convolutional layers to process tensor data with fewer channels. The average number of channels of the tensor data that SonoNet64 needs to process is about 307, while the counterpart for the proposed network is about 230. Therefore, the proposed network has a much smaller parameter size than SonoNet64. However, the small parameter size may lead to disadvantages such as low sensitivity in the classification of nine categories (
Table 7). In addition, the proposed network performed less well on the category of other brain standard planes, possibly because of the small number of images in this category.
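To see why channel width has such a strong effect on parameter size, recall that a standard 2D convolution with a k × k kernel, C_in input channels, and C_out output channels has k · k · C_in · C_out weights plus C_out biases. The sketch below applies this formula to a single hypothetical 3 × 3 layer at the two average widths quoted above. It is a back-of-the-envelope illustration under the assumption that input and output widths are equal, not a parameter count of either network, whose layer configurations are not reproduced here; the function name `conv2d_params` is introduced only for this example.

```python
def conv2d_params(c_in: int, c_out: int, k: int = 3, bias: bool = True) -> int:
    """Number of learnable parameters in a standard (non-grouped) 2D convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


# Hypothetical single layers at the average channel widths quoted in the text.
wide = conv2d_params(307, 307)    # SonoNet64-like average width
narrow = conv2d_params(230, 230)  # LPC-SonoNet-like average width

# Fraction of parameters the narrower layer needs relative to the wider one.
ratio = narrow / wide
```

Because the weight count grows with the product of the input and output widths, a reduction of about 25% in channel width removes roughly 44% of the parameters of such a layer.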
In previous work, ensemble networks tended to combine the predictions of various base networks such as VGG [
30] and ResNet [
33] to reach the final prediction. For example, the three networks proposed by Krishna and Kokil [
5,
23,
24] combined the feature vectors from VGG-19, ResNet-50, AlexNet, and DarkNet19 and fused these vectors with support vector machines or a multi-layer perceptron. The network proposed by Annamalai and Sindhu combined InceptionResNetV2, DenseNet121, and Xception [
27]. These ensemble architectures leverage features from different scales, which brings excellent classification performance but leads to larger model parameter sizes and a requirement for powerful hardware [
5]. In contrast, the proposed LPC-SonoNet contains a single base network, and this design results in a much smaller parameter size (
Table 6).
In order to address the data imbalance issue in Burgos-Artizzu et al.’s dataset [
22], data augmentation was applied in this work.
Table 8 compares the performance of the proposed network trained with and without data augmentation in the classification of six categories of standard planes. It can be seen that data augmentation slightly improved the classification performance. Note that the compared methods in
Table 6 did not use data augmentation. We argue that, had these methods used data augmentation, their performance might have improved. In addition, the compared methods in
Table 6 used stochastic gradient descent with momentum (SGDM) as the optimizer. In this study, we used the Adam optimizer because we experimentally found that it yielded better performance for the proposed LPC-SonoNet than SGDM.
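One common way to use augmentation against class imbalance is to generate more augmented copies for under-represented classes. The sketch below computes a per-class multiplier under that assumption. It is not the augmentation scheme used in this work, whose settings are not reproduced here; the function name `augmentation_factors` and the class counts are made up for illustration and are not the dataset's statistics.

```python
import math
from typing import Dict


def augmentation_factors(counts: Dict[str, int]) -> Dict[str, int]:
    """Per-class multiplier so that each class reaches at least the largest class size."""
    target = max(counts.values())
    return {name: math.ceil(target / n) for name, n in counts.items()}


# Made-up counts for illustration only; NOT taken from the dataset.
toy_counts = {"class_a": 1000, "class_b": 400, "class_c": 90}
factors = augmentation_factors(toy_counts)
```

A factor of 1 means the class is left as is, while a larger factor indicates how many variants (original plus augmented) per image would be needed to approach balance.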
As described in
Section 1, previous studies have focused on classifying either six or three categories and have not addressed all nine. This study pioneers the classification of nine categories. A possible reason why previous studies have not considered nine-category classification is that the sub-categories of the fetal brain standard plane, particularly the fetal trans-ventricular standard plane and the other brain standard plane (
Table 2), contain relatively few images. This may pose challenges for the direct classification of nine categories. In this study, the issue was addressed by data augmentation (
Table 4).
Compared with SonoNet64, the proposed network was better at detecting the fetal abdomen and fetal thorax standard planes (
Figure 4 and
Figure 5). To explore the interpretability of the proposed LPC-SonoNet and SonoNet64, the gradient-weighted class activation mapping (GradCAM) [
34] technique was used. The resulting heatmaps, in which warm colors depict the attention of the network on the input data, are shown in
Figure 6. The heatmaps of LPC-SonoNet are more concentrated on the relevant regions of the fetal abdomen and fetal thorax standard planes than those of SonoNet64. It is possible that the pyramid architecture in the LPC blocks gives the proposed network a large receptive field so that it can focus on the regions related to the class of standard plane. However, this architecture makes the proposed network ignore tissue boundaries, so that it performed worse than SonoNet64 in detecting the outlines of the skull and femur, which are important in the classification of brain and femur standard planes.
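For reference, the heatmap computation in GradCAM, as it is commonly formulated, can be summarized as follows. The notation is introduced here for illustration and is not taken from this paper: A^k denotes the k-th feature map of the chosen convolutional layer, y^c the score of class c before the softmax, and Z the number of spatial positions in A^k.

\[
\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^k},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)
\]

The weights \alpha_k^c average the gradients over spatial positions, and the ReLU keeps only the regions that contribute positively to class c, which is why the heatmap highlights the areas supporting the predicted standard plane.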
This study has limitations. First, LPC-SonoNet has a weak generalization ability. Trained on the high-quality images from Burgos-Artizzu et al.’s dataset, which were collected with devices such as the Voluson S8 and Voluson S10 [
22], the proposed LPC-SonoNet performed much worse on the low-quality images from another public dataset by Sendra-Balcells [
35] (
Table 9). This kind of reduction in detection accuracy can also be observed for SonoNet64 in
Table 9. The reason may be that the Sendra-Balcells dataset [
35] is quite different from Burgos-Artizzu et al.’s dataset [
22]. The images in the Sendra-Balcells dataset [
35] were collected with devices including but not limited to the Mindray DC-N2 and Voluson P8 in resource-limited countries including Algeria, Egypt, and Malawi. The categories of images from this dataset included the fetal abdomen, fetal brain, fetal thorax, and fetal femur. Second, although the parameter size of the proposed network is much smaller than that of SonoNet64, it fails to significantly reduce the inference time (
Table 5). It is probable that the frequent searching and merging of tensors in the concatenation operations consumes considerable time in LPC-SonoNet. In future work, the generalization ability of the proposed LPC-SonoNet can be improved by methods such as adding low-quality images to the training set. In addition, the inference time may be further decreased, possibly by optimizing the architecture of the network and decreasing the number of concatenation layers.