Article

MAPPNet: A Multi-Scale Attention Pyramid Pooling Network for Dental Calculus Segmentation

1 School of Geography and Information Engineering, China University of Geosciences, Wuhan 430074, China
2 Hangzhou Jiesao Technology Co., Ltd., Hangzhou 311121, China
3 Hangzhou Eyar Digital Dental Co., Ltd., Hangzhou 311121, China
4 School of Computer Science, China University of Geosciences, Wuhan 430074, China
5 Engineering Research Center of Natural Resource Information Management and Digital Twin Engineering Software, Ministry of Education, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7273; https://doi.org/10.3390/app14167273
Submission received: 28 June 2024 / Revised: 11 August 2024 / Accepted: 16 August 2024 / Published: 19 August 2024

Abstract

Dental diseases are among the most prevalent diseases globally, and accurate segmentation of dental calculus images plays a crucial role in periodontal disease diagnosis and treatment planning. However, current methods are not sufficiently stable and reliable owing to the variable morphology of dental calculus and the blurred boundaries between tooth edges and the surrounding tissues; we therefore aim to propose an accurate and reliable calculus segmentation algorithm to improve the efficiency of clinical detection. We propose a multi-scale attention pyramid pooling network (MAPPNet) to enhance the performance of dental calculus segmentation. The network incorporates a multi-scale fusion strategy in both the encoder and the decoder, forming a model with a dual-ended multi-scale structure. In contrast to employing a multi-scale fusion scheme at a single end, this design captures features from diverse scales more effectively. Furthermore, the attention pyramid pooling module (APPM) reconstructs the fused feature map by leveraging a spatial-first, channel-second attention mechanism. APPM enables the network to adaptively adjust the weights of different locations and channels in the feature map, thereby enhancing the perception of important regions and key features. Experimental evaluation on our collected dental calculus segmentation dataset demonstrates the superior performance of MAPPNet, which achieves an intersection-over-union of 81.46% and an accuracy of 98.35%. Additionally, on two publicly available datasets, ISIC2018 (skin lesion dataset) and Kvasir-SEG (gastrointestinal polyp segmentation dataset), MAPPNet achieves an intersection-over-union of 76.48% and 91.38%, respectively. These results validate the effectiveness of our proposed network in accurately segmenting lesion regions and achieving high accuracy, surpassing many existing segmentation methods.

1. Introduction

Dental calculus, also referred to as tartar, is a prevalent oral condition characterized by the solidification of plaque or soft deposits on teeth [1]. If this condition is not promptly addressed, it can persistently stimulate periodontal tissues, trigger bacterial infections, and induce inflammation and shrinkage of the gums, resulting in periodontitis and gingivitis [2]. However, the diagnosis of dental calculus is a manual, labor-intensive, and time-consuming task. Furthermore, dentists’ visual fatigue, coupled with the diverse forms of dental calculus, heightens the risk of overlooking minor lesions. Such missed lesions can result in less efficient and potentially more painful treatment for patients. Therefore, automated detection and segmentation methods for dental calculus must be explored to aid in treatment [3] and enhance screening precision.
Over the past few years, medical image segmentation techniques have become increasingly prevalent in aiding diagnoses. The traditional segmentation methods employed in this field include thresholding [4], edge detection [5,6], clustering [7,8], and region-growing [9,10]. Although these methods have attained a degree of success in dental calculus segmentation, they possess certain limitations. For instance, manual threshold selection is typically required, and they can be sensitive to noise and texture, among other factors. These limitations can affect the overall performance of segmentation results.
Convolutional neural networks (CNNs) have become an important tool for medical image segmentation due to the development of deep learning. CNNs offer the potential for more accurate and efficient segmentation compared with traditional segmentation techniques. A prominent CNN architecture is PSPNet [11], which leverages global pooling and a pyramidal structure. This model effectively enhances scene resolution and localization. The multi-scale pyramidal structure proposed by PSPNet proves particularly suitable for medical image segmentation tasks [12]. Additionally, a convolutional block attention module (CBAM) can be employed to extract image information before convolution and upsampling [13]. Moreover, supervised probabilistic gate modules can be constructed to dynamically and adaptively adjust the influence of different features [14]. For the task of dental calculus segmentation, an improved image cascade network [15] solves the problem of inaccurate lesion segmentation by introducing CBAM and asymmetric convolution. Another approach uses the YOLOv8 model together with image enhancement techniques on bitewing (BW) radiographs to improve diagnostic accuracy [16]. However, these methods still have much room for improvement in the prediction of dental calculus and lack validation of the models’ generalization ability.
PSPNet aims to tackle inaccuracies stemming from a lack of contextual information. While the pyramid pooling module (PPM) in the decoder effectively integrates features from various scales, it falls short of capturing multi-scale features within the encoder. This limitation impedes the network’s potential to grasp crucial information comprehensively. Building upon this research, our work introduces a novel approach termed the multi-scale attention pyramid pooling network (MAPPNet). This method is based on the PSPNet architecture; it incorporates multi-scale processing techniques in both the encoder and the decoder, establishing a dual-ended multi-scale processing structure for the model. Additionally, MAPPNet integrates attention modules to enhance its capability to precisely locate and recognize dental calculus. This comprehensive approach provides robust support for subsequent automatic removal procedures.
The main contributions of this paper can be summarized as follows:
  • An algorithm called MAPPNet is proposed for dental calculus image segmentation. This algorithm effectively captures global information from the images. Accordingly, targets with different scales can be accurately segmented.
  • In both the encoder and decoder stages, a multi-scale fusion strategy is employed to build a network with a dual-ended multi-scale structure. This comprehensive approach significantly enhances the model’s capability to acquire contextual features.
  • We incorporate channel attention, spatial attention, and combinations of these mechanisms to optimize the effectiveness of the pyramid pooling structure in dental calculus segmentation.

2. Related Work

Deep-learning-based medical image segmentation constitutes a vital component of computer-aided diagnosis, playing a significant role in clinical diagnosis. Its purpose is to identify pixels corresponding to organs or lesions in medical images, assisting doctors in making more accurate judgments. In the context of dental imaging, cone beam CT (CBCT), X-ray, and visible light are commonly employed. Dentists can diagnose lesion locations in various types of imaging by utilizing image segmentation techniques [15,17,18]. The development of image segmentation methods for dental applications frequently revolves around three key aspects: encoder–decoder structures, multi-scale feature fusion, and attention mechanisms. The following sections outline the work undertaken in these areas.

2.1. Encoder–Decoder Architecture

In the field of image segmentation, the encoder is responsible for extracting feature information from the input data, while the decoder takes the feature map generated by the encoder and optimizes the features for the specific segmentation task, ultimately annotating each pixel. Encoders commonly utilize various backbone networks, including VGG [19], ResNet [20], and MobileNet [21]. Meanwhile, decoders are designed based on the specific requirements of the segmentation task. Fully Convolutional Networks (FCNs) [22] downsample the input information through an encoder to obtain a feature map with high semantic information. This network then utilizes a decoder to upsample the feature map, recover the original size, and produce the final output by mapping it back to the original format. SegNet [23], an improvement over FCN, addresses the issue of edge information loss by establishing a maximum pooling index and has found applications in certain areas, such as autonomous driving. U-Net [24], widely employed in 2D medical image segmentation, follows a compact and symmetric encoder–decoder structure. This network incorporates skip connections to preserve high-resolution features and spatial information, thereby enhancing segmentation accuracy. Furthermore, notable advancements have emerged in various medical image processing domains, including 3D medical image segmentation, neuronal structure segmentation, liver segmentation, etc., with the introduction of networks, such as 3D-U-Net [25], MDU-Net [26], U-Net++ [27], and V-Net [28], which are built upon the foundation of U-Net.
Nevertheless, these models are confined by the restrictions imposed by symmetric coding and decoding structures, limiting the algorithm’s design flexibility. By contrast, our proposed algorithm embraces an asymmetric coding and decoding structure. Here, global context information is harnessed during the coding phase, while the decoding phase processes feature maps of various scales in parallel, yielding comprehensive segmentation outcomes.

2.2. Multi-Scale Feature Fusion

The integration of feature information at different scales is essential because it captures the diverse semantic details of an image. This approach enables the extraction of contextual information, which has been proven to be effective in various tasks. The Feature Pyramid Network (FPN) structure [29] processes images into feature maps at different scales, which are then fused to obtain global contextual information. The Spatial Pyramid Pooling (SPP) structure [30] addresses the problem of image distortion caused by cropping and scaling operations. SPP achieves this task by utilizing multiple windows and outputs at a fixed size. DeepLabV2 [31] introduces the Atrous Spatial Pyramid Pooling (ASPP) structure, which builds upon the SPP concept. ASPP incorporates parallel atrous convolutions with different sampling rates. This approach allows for the extraction of feature maps with large receptive fields while preserving the resolution to a significant extent. In PSPNet, the pyramid pooling module (PPM) is proposed to address the feature loss issue associated with pooling operations. PPM conducts multi-scale pooling operations, followed by convolution and upsampling to recover the original size. Finally, the features from four different scales are fused, resulting in richer and more comprehensive feature information.
Many network structures incorporate only a single multi-scale fusion strategy, which, although capturing more feature information to some extent, still faces the challenge of information loss. To tackle this issue, we propose a multi-scale fusion structure at both the encoding and decoding ends. On the encoding part, the multi-scale fusion (MSF) module is employed to integrate contextual information. On the decoding part, the introduction of the APPM aids in synthesizing feature information at different scales, enabling a more comprehensive capture of context and details related to the target.

2.3. Attention Mechanism

The attention mechanism is a resource allocation mechanism in which the importance of features varies in different layers of the neural network. Meanwhile, the attention mechanism is used to generate a set of weights for resource allocation; accordingly, important feature information receives more attention. SENet [32] and ECANet [33] use the channel attention mechanism to obtain the importance of each channel of the feature graph by automatic learning and use this importance to assign a weight value to each feature. Accordingly, the neural network focuses on certain feature channels. STN [34] and GENet [35] use a spatial attention mechanism that assigns varying weights to different spatial locations in the feature map to find the most important parts of the network for processing. CBAM [36] is a lightweight convolutional attention module. This module combines the channel and spatial attention mechanisms to perform adaptive feature refinement on the input feature map and consider the feature map information from two dimensions to achieve a more comprehensive resource allocation.
Different datasets display different levels of sensitivity toward attention mechanisms, so a universal choice of attention-assisted model is inadvisable. This study therefore takes CBAM as its cornerstone and, through a methodical process of experimentation, determines the attention-aided model that best aligns with the dataset characteristics and consistently produces predictions with a notable level of confidence.

2.4. Medical Segmentation Model

With the development of deep learning technology, research on medical segmentation models has grown rapidly. TransUNet [37] is the earliest medical segmentation model to combine the Transformer structure with the UNet structure and achieves excellent results on a variety of medical segmentation datasets. MedT [38] proposes a gated axial-attention model and introduces a local-global training strategy to improve model performance. Another line of research focuses on lightweight models; for example, UNeXt [39] combines MLPs with UNet to reduce the number of model parameters, while MALUNet [40] and EGE-UNet [41], which are also based on the U-shaped structure, employ specialized attention mechanisms and grouping strategies, respectively, to capture feature information and balance accuracy against model size.

3. Methods

This section outlines the specifics of our proposed network, MAPPNet, encompassing two pivotal elements: the MSF module and the APPM. The loss function employed by the network is then discussed.

3.1. MAPPNet Architecture

In the dental calculus dataset, the images of dental calculus reveal intricate features alongside considerable spatial location uncertainty. Consequently, achieving superior segmentation outcomes necessitates enhancing the network’s proficiency in extracting detailed semantic features and spatial localization during the network model construction. To address this challenge, the multi-scale attention pyramid pooling network (MAPPNet) is proposed, and channel attention (CA) and spatial attention (SA) are integrated within the model to improve feature perception; the model structure is depicted in Figure 1.
MAPPNet follows an encoder–decoder architecture. In the encoder part, ResNet50 is chosen as the backbone network, featuring five convolutional blocks. However, in our design, the 7 × 7 convolution used in ResNet50 is replaced by three consecutive 3 × 3 convolutions, followed by a maximum pooling operation to reduce the output size by half.
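As a rough illustration of this modified stem, the following PyTorch sketch replaces the original 7 × 7 convolution with three consecutive 3 × 3 convolutions and a max pooling layer; the exact layer widths and strides are assumptions (a standard deep-stem layout), not the authors' released code.

```python
import torch.nn as nn

class ModifiedStem(nn.Module):
    """Sketch of the ResNet50 stem variant: three 3x3 convs, then max pooling."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        mid = out_channels // 2  # assumed intermediate width
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, mid, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_channels, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )
        # Max pooling halves the spatial size of the stem output.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.stem(x))
```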
Subsequently, four residual convolutional blocks labeled as B1–B4 are incorporated. These blocks generate output feature maps with channel dimensions of 256, 512, 1024, and 2048. Based on the FPN structure, we establish connections among the outputs of four distinct blocks, thereby fostering the seamless integration of feature maps across diverse network layers, enhancing the network’s ability to extract global contextual information. Furthermore, we control the number of channels in the multi-scale fusion output to simplify the parameter quantity of the network model. Thus, four feature maps are fused to yield the ultimate output of the encoder part.
In the decoder part, APPM is tailored to manage the complexities of high-dimensional feature data, facilitating a comprehensive analysis of the global and local attributes within the feature map. Consequently, APPM significantly augments the overall segmentation performance. However, the intricate semantic information contained within these high-dimensional features can potentially result in information overload, hampering segmentation accuracy. An attention mechanism is integrated into the PPM to mitigate this challenge and achieve superior segmentation outcomes. This combination approach can help the model suppress or highlight certain feature information. The fusion of spatial and channel attention mechanisms is orchestrated to ensure optimal feature selection and incorporation. The goal is to emphasize relevant feature information while suppressing irrelevant or redundant data by adopting this approach, ultimately improving the accuracy of segmentation results.

3.2. MSF Module

In the context of semantic segmentation, global context information assumes a crucial role. To effectively discern salient regions, the introduction of an MSF module proves instrumental in merging the outputs (B1–B4) from the four residual blocks within the ResNet backbone network. The original channel numbers of these feature maps are 256, 512, 1024, and 2048. To initiate the process, a 1 × 1 convolution is applied to the feature maps for dimensionality reduction. This maneuver leads to adjusted channel numbers of 64, 128, 256, and 512. Such an alteration serves a twofold purpose: extracting pertinent feature information for subsequent operations and facilitating a reduction in the model’s parameter count. Subsequently, feature maps B2–B4 undergo a similar dual-step fusion. Taking B4 as an illustrative example, the initial step involves a 1 × 1 convolution to modify the channel count to 128. The subsequent stride includes resizing B4 to match B3’s dimensions using bilinear interpolation-based upsampling. Feature fusion is accomplished through an element-wise addition, effectively merging the upper and lower feature maps of comparable size. This phase finalizes the first fusion pathway. In the secondary pathway, the feature maps are directly integrated through a 3 × 3 convolution process. Herein, a control mechanism governs the number of output channels, resulting in further parameter reduction. This parallel process is replicated for B2 and B3, where they merge lower-level feature maps based on identical principles. One pathway advances through the fusion facilitated by 1 × 1 convolution and upsampling until its mergence with B1. The other pathway employs a 3 × 3 convolution technique while managing the output channel count at 128. B1 merges features from B2 to B4, encapsulating the most comprehensive semantic information.
After the four feature maps, each with 128 channels, are obtained, we use “Concat” to enhance the propagation of features by upsampling and merging them step by step to obtain a result with 512 channels as the output of the encoder part. In comparison with directly using the last layer of the ResNet backbone network with 2048 channels as the output, our MSF module offers several advantages. First, it reduces the number of parameters in the model. Additionally, the module takes into account feature information at different scales by aggregating global contextual information. This ensures that targets of varying sizes have appropriate feature representations at the corresponding scales, yields good results for recognizing dental calculus of different sizes, and improves the segmentation performance of the network.
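The following PyTorch sketch illustrates one plausible reading of the MSF module described above; the lateral channel widths and the exact fusion order are assumptions, and this is not the authors' released implementation. Each block output B1–B4 is reduced with a 1 × 1 convolution, fused top-down by bilinear upsampling and element-wise addition, smoothed with a 3 × 3 convolution to 128 channels, and finally upsampled and concatenated into a 512-channel encoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSF(nn.Module):
    """Sketch of the multi-scale fusion module over backbone outputs B1-B4."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), mid_channels=128):
        super().__init__()
        # 1x1 lateral convolutions for dimensionality reduction (widths assumed)
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels])
        # 3x3 convolutions that smooth each fused map, keeping 128 channels
        self.smooth = nn.ModuleList(
            [nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, b1, b2, b3, b4):
        feats = [lat(x) for lat, x in zip(self.lateral, (b1, b2, b3, b4))]
        # Top-down pathway: upsample the deeper map and add it to the shallower one.
        for i in range(3, 0, -1):
            up = F.interpolate(feats[i], size=feats[i - 1].shape[-2:],
                               mode="bilinear", align_corners=False)
            feats[i - 1] = feats[i - 1] + up
        feats = [conv(f) for conv, f in zip(self.smooth, feats)]
        # Upsample every map to the B1 resolution and concatenate: 4 x 128 = 512 channels.
        target = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                 for f in feats]
        return torch.cat(feats, dim=1)
```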

3.3. APPM

CBAM is a lightweight attention module that enhances the network’s ability to identify targets. This module achieves this task by computing attention maps sequentially along two independent dimensions: channel and spatial. Distinct attention modules have varying capabilities for feature refinement. Accordingly, we can construct four different APPMs as decoders for the network. These modules include CA only, SA only, SA first and then CA, and CA first and then SA (i.e., CBAM). Figure 2 illustrates the aforementioned four different APPMs, showcasing the sequential computation of attention maps along the channel and spatial dimensions.
The CA module aims to identify meaningful information within the input feature map, effectively extracting semantic information from the image. This module achieves this task by compressing the spatial dimension of the input feature map to compute the channel attention. Specifically, the input dimensions are initially downscaled by using maximum pooling and average pooling operations, resulting in two pooled feature maps, F^c_max and F^c_avg, respectively. These pooled feature maps aggregate spatial information from the original feature maps. Subsequently, these results are fed into a shared network comprising a multilayer perceptron (MLP). The first layer of the MLP consists of C/r neurons (where C represents the number of channels, and r denotes the reduction ratio), utilizing the ReLU activation function. The second layer comprises C neurons, and the weight information of these two layers is shared. The outputs of the MLP are then combined through element-by-element summation, yielding the channel attention M_c ∈ R^{C×1×1}. This CA is generated using the sigmoid activation operation, as depicted in Figure 3a. The formula for the CA is presented as follows:
M_C(F) = \sigma\big( W_1(W_0(F^{c}_{\max})) + W_1(W_0(F^{c}_{\mathrm{avg}})) \big),
where M_c(F) denotes the output of feature map F after channel attention M_c, σ denotes the sigmoid function, W_0 and W_1 represent the weight matrices in the MLP, W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r}, and F^c_max and F^c_avg represent the outputs of feature map F after maximum pooling and average pooling, respectively. In the architecture APPM-CA, we incorporate the CA module into the feature maps of various sizes, along with the inclusion of the skip connection component of the PPM, as illustrated in Figure 2a.
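A minimal PyTorch sketch of the CA formula above is given below; the reduction ratio r = 16 is an assumed default, and the two 1 × 1 convolutions stand in for the shared MLP weights W_0 and W_1.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the CA module: shared MLP over max- and average-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W1
        )

    def forward(self, x):
        max_pool = torch.amax(x, dim=(2, 3), keepdim=True)   # F^c_max
        avg_pool = torch.mean(x, dim=(2, 3), keepdim=True)   # F^c_avg
        # Element-wise sum of the two MLP outputs, then sigmoid: shape (B, C, 1, 1).
        return torch.sigmoid(self.mlp(max_pool) + self.mlp(avg_pool))
```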
The SA module is designed to emphasize the location information of the target. Unlike the CA module, which compresses the spatial dimension of the input, the SA module compresses the channel dimension and computes attention over the spatial dimension. First, the input is max-pooled and average-pooled across channels to obtain two feature maps, F^s_max and F^s_avg, representing the max-pooled and average-pooled features across channels, respectively. These two maps are then concatenated along the channel dimension, and the number of channels is reduced to 1 by a single 7 × 7 convolution to integrate the features. Finally, the spatial attention M_s ∈ R^{H×W} is obtained by sigmoid activation, as shown in Figure 3b. The SA is calculated as follows:
M_S(F) = \sigma\big( f^{7 \times 7}([F^{s}_{\max}; F^{s}_{\mathrm{avg}}]) \big),
where M_s(F) denotes the output of feature map F after spatial attention M_s, σ denotes the sigmoid function, f^{7×7} represents a standard 7 × 7 convolution, F^s_max and F^s_avg represent the outputs of feature map F after maximum pooling and average pooling across channels, and H and W represent the height and width of the feature map, respectively. Similarly, we incorporate the SA module into the pyramid pooling structure, resulting in the architecture APPM-SA, as illustrated in Figure 2b.
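Analogously, the SA formula can be sketched in PyTorch as follows (a minimal illustration, not the released code):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the SA module: 7x7 conv over channel-wise max and mean maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # F^s_max
        avg_map = torch.mean(x, dim=1, keepdim=True)     # F^s_avg
        # Concatenate along channels, convolve, and apply sigmoid: shape (B, 1, H, W).
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
```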
The CBAM module, depicted in Figure 4a, follows a tandem approach where CA is followed by SA. This order allows the network to learn and attend to the feature and location information in the channel and spatial dimensions. The process can be summarized using the following equation:
F' = M_C(F) \otimes F, \qquad F'' = M_S(F') \otimes F',
where Mc and Ms represent CA and SA, respectively; F represents the input feature map; and ⊗ denotes element-by-element multiplication. The input F of the first step is multiplied with the result Mc(F) after CA to obtain the intermediate result F′ along the channel dimension. Thereafter, F′ is multiplied with the output Ms(F′) of SA to obtain the final attention map F″. This result has the same dimension as the initial input F. Finally, a third structure APPM-CBAM can be obtained, as shown in Figure 2c.
The final structure (Figure 4b), APPM, utilizes SA, followed by CA. The calculation process is as follows:
F' = M_S(F) \otimes F, \qquad F'' = M_C(F') \otimes F'.
This approach combines attention in an inverted manner based on the CBAM module, and it proves to be the most effective structure for the dental calculus dataset, as illustrated in Figure 2d. Input information generates four feature maps of sizes 1 × 1, 2 × 2, 3 × 3, and 6 × 6 through adaptive averaging pooling. These feature maps then undergo refinement using the inverted CBAM attentional approach. After refinement, four outputs of the same size are obtained and combined through 1 × 1 convolutional dimensionality reduction and bilinear interpolation upsampling. In the residual connection, the inverted CBAM module directly produces the corresponding outputs. These outputs are combined to yield the results of the APPM module.
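Putting the pieces together, the following sketch shows one possible implementation of the APPM branch, reusing the ChannelAttention and SpatialAttention sketches above; the per-branch channel width and the exact placement of the attention blocks are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APPM(nn.Module):
    """Sketch of the APPM: pyramid pooling with spatial-first, channel-second attention."""
    def __init__(self, in_channels=512, bins=(1, 2, 3, 6)):
        super().__init__()
        branch_channels = in_channels // len(bins)  # assumed 128 for a 512-channel input
        self.bins = bins
        self.branches = nn.ModuleList()
        for _ in bins:
            self.branches.append(nn.ModuleDict({
                "sa": SpatialAttention(),
                "ca": ChannelAttention(in_channels),
                "reduce": nn.Conv2d(in_channels, branch_channels, 1, bias=False),
            }))
        # Residual (skip) path is also refined by the inverted CBAM order.
        self.skip_sa = SpatialAttention()
        self.skip_ca = ChannelAttention(in_channels)

    def _refine(self, x, sa, ca):
        x = sa(x) * x   # spatial attention first
        x = ca(x) * x   # channel attention second
        return x

    def forward(self, x):
        size = x.shape[-2:]
        outs = [self._refine(x, self.skip_sa, self.skip_ca)]  # residual connection
        for bin_size, branch in zip(self.bins, self.branches):
            y = F.adaptive_avg_pool2d(x, bin_size)             # 1x1, 2x2, 3x3, 6x6 pooling
            y = self._refine(y, branch["sa"], branch["ca"])    # inverted CBAM refinement
            y = branch["reduce"](y)                            # 1x1 dimensionality reduction
            y = F.interpolate(y, size=size, mode="bilinear", align_corners=False)
            outs.append(y)
        return torch.cat(outs, dim=1)                          # combined APPM output
```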

3.4. Loss Function

In terms of the dental calculus image segmentation problem, which involves binary classification of identifying the target dental calculus and the background, we utilize the Binary Cross Entropy (BCE) loss function during network training [42]. The BCE loss function is defined as follows:
\mathrm{Loss} = -\frac{1}{n} \sum \big( y \ln y' + (1 - y) \ln(1 - y') \big),
where n represents the number of samples in a batch, y denotes the true value, and y′ represents the predicted value. Before applying the BCE loss function, the predicted value y′ must be passed through the sigmoid activation function. This step ensures that each element of the input vector is transformed into a probability value, thereby guaranteeing that the predicted value falls within the range of zero to one.
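In PyTorch, this criterion corresponds to combining a sigmoid with binary cross-entropy; a minimal sketch with an illustrative tensor shape is shown below.

```python
import torch
import torch.nn as nn

# Combines the sigmoid and BCE terms in a numerically stable way.
criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, 1, 473, 473)                      # raw network outputs (example shape)
target = torch.randint(0, 2, (4, 1, 473, 473)).float()    # binary ground-truth masks
loss = criterion(logits, target)
```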

3.5. Experimental Configuration

The dental calculus dataset used in this study was obtained at Hangzhou Tongce Hospital by using an endoscope. Any blurred images were discarded to ensure data quality. Every dental calculus image was annotated using the image annotation software LabelMe 3.16.7 with the assistance of experienced dentists. A total of 1033 high-quality dental calculus images with corresponding labels were obtained, each with a size of 1280 × 720 pixels. The collected cases of dental calculus include different characteristics such as irregularity, oversized borders, and blurred areas. The dataset was randomly divided into three sets to facilitate the training and evaluation of the model: a training set, a validation set, and a test set, following a ratio of 6:2:2. This division ensures that the model is trained on a sufficient amount of data, while also providing separate subsets for model validation and performance testing.
The experiment is conducted on a Windows 10 PC using the PyTorch 1.12.1 framework with an NVIDIA GeForce RTX 3090 TI graphics card, which has 24 GB of video memory, for training the model. Stochastic gradient descent (SGD) is selected as the optimizer, with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0001. The learning rate is minimized by using the cosine annealing method. The batch size is set to 16, and the training is performed for 1000 epochs. Data augmentation techniques, such as random flip, Gaussian blur, and rotation, are applied during training to prevent overfitting and improve generalization. Additionally, a gray border is added to fill the image to maintain its aspect ratio, and it is resized to a fixed input size of 473 × 473 pixels.
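A condensed sketch of the reported optimizer and schedule settings is shown below; `model`, `train_loader`, and `criterion` are placeholders for the network, data pipeline, and loss described above.

```python
import torch

def train(model, train_loader, criterion, epochs=1000):
    """Training-loop sketch with the hyperparameters reported in Section 3.5."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    # Cosine annealing decays the learning rate over the full training run.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, masks in train_loader:   # batch size 16, masks as float tensors
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
        scheduler.step()
```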

3.6. Evaluation Metrics

We use five evaluation metrics to measure the segmentation performance of MAPPNet, including intersection-over-union (IoU), sensitivity (Sen), specificity (Spe), positive predictive value (PPV), and accuracy (Acc), to objectively evaluate the proposed method in this study. The IoU metric calculates the overlap ratio between the predicted segmentation mask and the true value, representing the similarity between the two. Sensitivity, also known as the true-positive rate, measures the ratio of correctly predicted positive samples among all samples that are positive. Specificity, or the true-negative rate, calculates the ratio of correctly predicted negative samples among all samples that are negative. PPV represents the percentage of correctly predicted positive samples among all samples predicted to be positive. Accuracy refers to the overall percentage of correctly predicted samples among all samples. The formulas of these metrics are provided as follows:
\mathrm{IoU} = \frac{TP}{TP + FP + FN},
\mathrm{Sen} = \frac{TP}{TP + FN},
\mathrm{Spe} = \frac{TN}{TN + FP},
\mathrm{PPV} = \frac{TP}{TP + FP},
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FN + FP}.
TP, FN, TN, and FP are all parameters in the confusion matrix, where TP represents the number of true-positives, FP represents the number of false-positives, FN represents the number of false-negatives, and TN represents the number of true-negatives. For example, a higher TP indicates a higher number of correctly identified dental calculus regions, while a higher TN denotes a higher accuracy in identifying the background. These indicators have values between zero and one, where a higher value indicates better segmentation performance.
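The five metrics can be computed directly from the confusion-matrix counts; a minimal NumPy sketch over flattened binary masks is given below.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute IoU, Sen, Spe, PPV, and Acc from binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "IoU": tp / (tp + fp + fn),
        "Sen": tp / (tp + fn),
        "Spe": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }
```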

4. Experiments

In this section, we perform ablation experiments on the proposed segmentation model MAPPNet and compare it with other state-of-the-art models to evaluate the effectiveness of this work.

4.1. Ablation Experiments

In our ablation study, the performance of the various networks on the dental calculus dataset is evaluated. The baseline network for this study is PSPNet, which is combined with four variants of the APPM to form four distinct network setups: Baseline+APPM-CA, Baseline+APPM-SA, Baseline+APPM-CBAM, and Baseline+APPM. An MSF is integrated into the Baseline network for further performance enhancement, resulting in the creation of the Baseline+MSF network. This integration is followed by systematically incorporating the four APPM structures into the Baseline+MSF network, resulting in the establishment of four additional networks: Baseline+MSF+APPM-CA, Baseline+MSF+APPM-SA, Baseline+MSF+APPM-CBAM, and Baseline+MSF+APPM (MAPPNet). These diverse combinations of attention mechanisms and fusion modules aim to enhance the accuracy and overall performance of the baseline network. Each network configuration is trained under uniform experimental conditions to facilitate unbiased comparisons. The performance evaluation of each network configuration spans multiple metrics, offering comprehensive insights into the effectiveness of the proposed enhancements. The outcomes of this thorough ablation study are concisely tabulated in Table 1, providing a comprehensive summary of the achieved results.
Table 1, lines 1–5, demonstrate that the baseline network exhibits remarkable improvements in all four metrics when the four variants of the APPM are added. In the specificity metric, the attention mechanism effectively enhances the segmentation ability of the model. In particular, the Baseline+APPM-CA variant shows the most significant improvement compared with the baseline, indicating that the CA mechanism is highly effective in aggregating feature capabilities.
Table 1, lines 3–5 (the other three combination methods), yield a slightly lower performance. We attribute this result to the last layer of the output obtained by the encoder part, which represents the deepest layer of the feature map, containing rich semantic information but less spatial location information. The CA mechanism effectively handles this semantic information, resulting in better performance. By contrast, the SA mechanism has a lower processing capability due to its limited location information.
Table 1, lines 4–5, illustrate that the combination of both attention mechanisms can have a mutually detrimental effect, leading to weaker results compared with the combination with channel attention alone. Moreover, these four attention combinations have a minimal influence on the number of model parameters. This notion implies that these combinations can improve the segmentation performance without significantly increasing the network complexity. This approach is advantageous because it allows for enhanced performance without excessive computational requirements.
Table 1, line six, demonstrates that the baseline network has significant improvements in all five metrics after incorporating the MSF module, with particularly notable gains in IoU (4.68%) and Sen (3.39%). This result indicates that obtaining global contextual information greatly enhances the performance of the segmentation network.
Table 1, lines 7–10, show that, building upon this improvement, the four combined forms of APPM further enhance the model’s performance. Baseline+MSF+APPM-SA and Baseline+MSF+APPM show the most substantial improvements, while the other two combinations exhibit relatively decreased performance. The combination form with the most significant improvement is Baseline+MSF+APPM, which is referred to as MAPPNet. MAPPNet achieves improvements of 5.25% in IoU, 4.29% in Sen, 0.17% in Spe, 3.28% in PPV, and 0.5% in Acc compared with the baseline. In addition, MAPPNet shows improvements in three metrics compared with the Baseline+MSF approach. Baseline+MSF+APPM-SA ranks as the second-best combination form after MAPPNet. Furthermore, the number of parameters in MAPPNet is 27.78 M, which is 40.5% lower than the 46.71 M of the baseline network. This reduction not only decreases the computational load of the network but also makes the model more compact for subsequent deployment and maintenance.
Based on the above research and analysis, the feature map output at the encoder end can contain semantic and spatial position information in the feature extraction stage by adopting a multi-scale fusion strategy. This situation addresses the limitation of insufficient spatial information caused by solely outputting the last layer of feature maps. The CA and SA mechanisms can optimally function when the semantic and spatial information is adequate, complementing each other and achieving the best performance.
In the case of dental calculus images, unlike complex street view and portrait datasets, the importance of semantic information is lower, while spatial location information is more crucial. Accordingly, Baseline+MSF+APPM-SA, which incorporates SA mechanisms, improves upon Baseline+MSF. Meanwhile, the combination of SA first, followed by CA, allows for the processing of critical spatial information before aiding in the aggregation of semantic information. This spatial-first and then channel attention approach is the most suitable. In the research of CBAM, a dataset with complex semantic information is used. Consequently, the attention combination form of channel-first and spatial-second is considered optimal. However, the spatial-first and then channel attention form for the medical–dental calculus data with lower weightage on semantic information is the most appropriate for the feature map with global information obtained. The above-mentioned experimental results confirm this statement.
The visualization of each network in the ablation experiment is shown in Figure 5, where GT stands for Ground Truth, and BL means Baseline. In the first row of Figure 5a, the baseline network can only segment the upper half of the calculus region, as it is influenced by the surrounding tissues and fails to capture the complete region. In the first row of Figure 5b, the segmentation model, with the addition of the MSF and APPMs, can identify more regions and capture more detailed information. Baseline+MSF+APPM yields the best result among the combinations tested. In the second row of Figure 5a, for more complex dental calculus images, discontinuous segmentation masks and unclear edge information persist even with the combined APPM, and the improvement is insignificant. However, the feature map acquires global information and increases the perceptual field with the addition of the MSF module. This phenomenon solves the issue of segmentation region discontinuity and brings the prediction result closer to the true value, supporting our previous notion. These visualizations highlight the importance of the combination of the MSF and APPM modules in dental calculus segmentation. The way these modules are combined plays a crucial role in achieving accurate segmentation results.

4.2. Comparison with Other Methods

We conduct a comparative analysis with several state-of-the-art methods to provide further evidence of the outstanding performance of MAPPNet in dental calculus segmentation. The selected methods for comparison include BiseNetV1 [43], CGNet [44], ICNet [45], MobileNetV3 [21], UNeXt [39], BiseNetV2 [46], ERFNet [47], Fast-SCNN [48], PSPNet [11], DMNet [49], DeeplabV3+ [50], MALUNet [40], EGE-UNet [41], HRNet [51], FastFCN [52], EncNet [53], CCNet [54], TransUNet [37], Segmenter [55], ViT [56], SegFormer [57], Swin [58], DPT [59], SwinUNet [60], and MedT [38]. Table 2 presents the quantitative evaluations of these networks. Our proposed method, MAPPNet, achieves the highest ratings in multiple metrics, including an IoU of 81.46%, a sensitivity of 90.2%, and an accuracy of 98.35%. The overall comparison highlights the substantial advantages of the MAPPNet algorithm over existing segmentation approaches. Figure 6 illustrates the loss and accuracy variation curves of MAPPNet.
Figure 7 shows that, in the first row of images, melanin deposits on the surface of the dental calculus make it challenging for the models to identify the lesion area; nevertheless, our proposed algorithm identifies the calculus with higher accuracy, and compared with the other models, MAPPNet’s predictions are closest to the real labels, presenting the best segmentation results. In the other three rows, the dental calculus exhibits fine and continuous morphology. The results of the other segmentation networks generally show discontinuity, while our proposed algorithm still shows significant advantages in detecting dental calculus.

4.3. Generalization Ability

To evaluate the generalization ability of the proposed method, we perform an experimental evaluation on two medical public datasets, ISIC2018 and Kvasir-SEG; the datasets are described as follows:
ISIC2018: this dataset is mainly used for dermatological diagnosis and research and contains a total of 3694 skin lesion images, of which 2594 are training images, 1000 are test images, and 100 are validation images.
Kvasir-SEG: this is an endoscopic dataset for gastrointestinal polyp segmentation, comprising 1000 gastrointestinal polyp images and their corresponding segmentation masks. We divide it into 800, 100, and 100 images for training, validation, and testing, respectively.
As shown in Table 3, our method leads in most of the metrics, indicating that MAPPNet maintains excellent prediction ability on different datasets and exhibits good generalization performance. In Figure 8 and Figure 9, we select several representative methods for comparison with MAPPNet; our method achieves better predictions than the other state-of-the-art algorithms, and the advantage is especially pronounced for lesions with irregular shapes and fuzzy boundaries.

5. Conclusions

Recognizing dental calculus in oral cavity images poses numerous challenges due to occlusion by surrounding tissues, such as the teeth and tongue, and the varied morphology of dental calculus itself. This study introduces a novel approach, MAPPNet, for dental calculus image segmentation. MAPPNet leverages an MSF module to perform feature extraction, progressively merging information to capture global context and generate a high-quality feature map. The PPM is also enhanced with different attention mechanisms to prioritize important feature information for accurate segmentation. Extensive ablation experiments demonstrate the efficacy of these modules in improving calculus segmentation performance, and the optimal combination of modules for dental calculus image segmentation is determined. Comparative experiments against several state-of-the-art segmentation networks further validate the superior performance of MAPPNet across multiple metrics. These results demonstrate the superiority of our proposed algorithm, which makes significant progress in improving the accuracy of edge information, showing that it can accurately segment the calculus region and provide a reliable solution for future clinical diagnosis.
However, the algorithm still faces certain challenges in accuracy when dealing with irregular lesion area segmentation. Future research will focus on optimizing the network, particularly in improving the identification of detailed parts in dental calculus images. The applicability of MAPPNet in various medical segmentation domains will also be explored to assist healthcare professionals in making precise judgments.

Author Contributions

T.N.: Conceptualization, Methodology, Software, Validation, Writing—original draft, Visualization. S.Y.: Writing—Review and Editing. D.W.: Resources, Funding acquisition. C.W.: Data Curation. Y.Z.: Writing—Review and Editing, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

The research work reported in this paper is supported by the National Natural Science Foundation of China under Grant 62072342.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy issues.

Conflicts of Interest

Author Di Wang was employed by the company Hangzhou Jiesao Technology Co., Ltd. Author Conger Wang was employed by the company Hangzhou Eyar Digital Dental Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Aghanashini, S.; Puvvalla, B.; Mundinamane, D.B.; Apoorva, S.; Bhat, D.; Lalwani, M. A Comprehensive Review on Dental Calculus. J. Health Sci. Res. 2016, 7, 42–50. [Google Scholar] [CrossRef]
  2. Dumitrescu, A.L.; Kawamura, M. Etiology of Periodontal Disease: Dental Plaque and Calculus. In Etiology and Pathogenesis of Periodontal Disease; Springer: Berlin/Heidelberg, Germany, 2010; pp. 1–38. [Google Scholar]
  3. Lee, C.Y.; Chuang, C.C.; Chen, G.J.; Huang, C.C.; Lee, S.Y.; Lin, Y.H. Automated Segmentation of Dental Calculus in Optical Coherence Tomography Images. Sens. Mater. 2018, 30, 2517. [Google Scholar] [CrossRef]
  4. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  5. Khan, J.F.; Bhuiyan, S.M.A.; Adhami, R.R. Image Segmentation and Shape Analysis for Road-Sign Detection. IEEE Trans. Intell. Transp. Syst. 2011, 12, 83–96. [Google Scholar] [CrossRef]
  6. Yang, L. An improved Prewitt algorithm for edge detection based on noised image. In Proceedings of the 2011 4th International Congress on Image and Signal Processing, Shanghai, China, 15–17 October 2011; pp. 1197–1200. [Google Scholar]
  7. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef] [PubMed]
  8. Sheikh, Y.A.; Khan, E.A.; Kanade, T. Mode-seeking by Medoidshifts. In Proceedings of the IEEE 11th International Conference on Computer Vision, Rio De Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar]
  9. Pham, D.L.; Xu, C.; Prince, J.L. Current Methods in Medical Image Segmentation. Annu. Rev. Biomed. Eng. 2000, 2, 315–337. [Google Scholar] [CrossRef]
  10. Tremeau, A.; Borel, N. A region growing and merging algorithm to color segmentation. Pattern Recognit. 1997, 30, 1191–1203. [Google Scholar] [CrossRef]
  11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  12. Du, X.; Wang, J.; Sun, W. UNet retinal blood vessel segmentation algorithm based on improved pyramid pooling method and attention mechanism. Phys. Med. Biol. 2021, 66, 175013. [Google Scholar] [CrossRef] [PubMed]
  13. Yu, J.; Cheng, T.; Cai, N.; Zhou, X.G.; Diao, Z.; Wang, T.; Du, S.; Liang, D.; Zhang, D. Wheat Lodging Segmentation Based on Lstm_PSPNet Deep Learning Network. Drones 2023, 7, 143. [Google Scholar] [CrossRef]
  14. Zhao, X.; Huang, M.; Li, L.; Qi, X.S.; Tan, S. Multi-to-binary network (MTBNet) for automated multi-organ segmentation on multi-sequence abdominal MRI images. Phys. Med. Biol. 2020, 65, 165013. [Google Scholar] [CrossRef]
  15. Ma, T.; Zhou, X.; Yang, J.; Meng, B.; Qian, J.; Zhang, J.; Ge, G. Dental Lesion Segmentation Using an Improved ICNet Network with Attention. Micromachines 2022, 13, 1920. [Google Scholar] [CrossRef]
  16. Lin, T.-J.; Lin, Y.-T.; Lin, Y.-J.; Tseng, A.-Y.; Lin, C.-Y.; Lo, L.-T.; Chen, T.-Y.; Chen, S.-L.; Chen, C.-A.; Li, K.-C.; et al. Auxiliary Diagnosis of Dental Calculus Based on Deep Learning and Image Enhancement by Bitewing Radiographs. Bioengineering 2024, 11, 675. [Google Scholar] [CrossRef] [PubMed]
  17. Cui, Z.; Li, C.; Wang, W. ToothNet: Automatic Tooth Instance Segmentation and Identification From Cone Beam CT Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6368–6377. [Google Scholar]
  18. Koch, T.L.; Perslev, M.; Igel, C.; Brandt, S.S. Accurate Segmentation of Dental Panoramic Radiographs with U-NETS. In Proceedings of the IEEE 16th International Symposium on Biomedical Imaging, Venice, Italy, 8–11 April 2019; pp. 15–19. [Google Scholar]
  19. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  21. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  22. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  23. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  24. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  25. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Athens, Greece, 17–21 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 424–432. [Google Scholar]
  26. Zhang, J.; Zhang, Y.; Jin, Y.; Xu, J.; Xu, X. MDU-Net: Multi-scale densely connected U-Net for biomedical image segmentation. Health Inf. Sci. Syst. 2023, 11, 13. [Google Scholar] [CrossRef]
  27. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
  28. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  29. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  31. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  32. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  33. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  34. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial Transformer Networks. Neural Inf. Process. Syst. 2015, 2, 2017–2025. [Google Scholar]
  35. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2018, 31, 9401–9411. [Google Scholar]
  36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  37. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Le, L.; Yuille, A.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  38. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2021; pp. 36–46. [Google Scholar]
  39. Valanarasu, J.M.J.; Patel, V.M. Unext: MLP-based rapid medical image segmentation network. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 23–33. [Google Scholar]
  40. Ruan, J.; Xiang, S.; Xie, M.; Liu, T.; Fu, Y. MALUNet: A Multi-Attention and Light-weight UNet for Skin Lesion Segmentation. In Proceedings of the IEEE International Conference Bioinformatics Biomedicine, Las Vegas, NV, USA, 6–8 December 2022; pp. 1150–1156. [Google Scholar]
  41. Ruan, J.; Xie, M.; Gao, J.; Liu, T.; Fu, Y. EGE-UNet: An Efficient Group Enhanced UNet for Skin Lesion Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 481–490. [Google Scholar]
  42. Creswell, A.; Arulkumaran, K.; Bharath, A.A. On denoising autoencoders trained to minimise binary cross-entropy. arXiv 2017, arXiv:1708.08487. [Google Scholar]
43. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
44. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179.
45. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 405–420.
46. Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068.
47. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272.
48. Poudel, R.P.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast semantic segmentation network. arXiv 2019, arXiv:1902.04502.
49. He, J.; Deng, Z.; Qiao, Y. Dynamic multi-scale filters for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3562–3572.
50. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
51. Sun, K.; Xiao, B.; Liu, D. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
52. Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. FastFCN: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv 2019, arXiv:1903.11816.
53. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160.
54. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
55. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272.
56. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
57. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
58. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
59. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12179–12188.
60. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: UNet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218.
Figure 1. Overall architecture of MAPPNet.
Figure 2. Four architectures of the APPM: (a) PPM with CA; (b) PPM with SA; (c) PPM with CA first and then SA; and (d) PPM with SA first and then CA.
Figure 3. Architecture of two different attention modules: (a) CA module; (b) SA module.
Figure 4. Two different sequential combinations of the attention modules: (a) processed first by the CA module and then by the SA module; (b) processed first by the SA module and then by the CA module.
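For readers who prefer code to block diagrams, the following PyTorch sketch illustrates the two attention orderings contrasted in Figure 4, using a generic squeeze-and-excitation-style channel-attention block and a CBAM-style spatial-attention block as stand-ins. The class names, reduction ratio, and kernel size are illustrative assumptions and do not reproduce the CA and SA modules used in MAPPNet.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic squeeze-and-excitation-style channel attention (illustrative stand-in for CA)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (N, C, 1, 1) channel descriptor
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel weights in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)

class SpatialAttention(nn.Module):
    """Generic CBAM-style spatial attention (illustrative stand-in for SA)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)               # (N, 1, H, W)
        max_map, _ = x.max(dim=1, keepdim=True)             # (N, 1, H, W)
        weights = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weights                                  # per-location weights in [0, 1]

class SpatialThenChannel(nn.Module):
    """Ordering (b) in Figure 4: spatial attention first, then channel attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.sa = SpatialAttention()
        self.ca = ChannelAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ca(self.sa(x))                          # swap the two calls for ordering (a)

if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(SpatialThenChannel(64)(x).shape)                  # torch.Size([1, 64, 32, 32])
```

Both wrappers contain exactly the same parameters; only the composition order differs, which is the design choice the two panels of Figure 4 isolate.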
Figure 5. Visualization of the predicted results of the ablation experiments; (a) and (b) denote two different example rows.
Figure 6. Training loss (left) and accuracy (right) curves of MAPPNet on the dental calculus dataset.
Figure 7. Visualization of the comparison results with other segmentation networks.
Figure 8. Visual comparisons with different state-of-the-art methods on the skin lesion segmentation task.
Figure 9. Visual comparisons with different state-of-the-art methods on the gastrointestinal polyp segmentation task.
Table 1. Quantitative results of the ablation experiments on the dental calculus dataset.

Items | Methods | IoU (%) | Sen (%) | Spe (%) | PPV (%) | Acc (%) | Params (M)
1 | Baseline | 76.21 | 85.91 | 98.89 | 87.09 | 97.85 | 46.71
2 | Baseline+APPM-CA | 77.95 | 87.85 | 98.89 | 87.37 | 98.0 | 47.26
3 | Baseline+APPM-SA | 77.32 | 86.66 | 98.95 | 87.77 | 97.96 | 44.64
4 | Baseline+APPM-CBAM | 77.70 | 87.64 | 98.88 | 87.27 | 97.98 | 47.26
5 | Baseline+APPM | 77.91 | 87.85 | 98.89 | 87.32 | 98.0 | 47.26
6 | Baseline+MSF | 80.89 | 89.30 | 99.09 | 89.58 | 98.31 | 27.62
7 | Baseline+MSF+APPM-CA | 80.11 | 89.22 | 99.01 | 88.70 | 98.22 | 27.78
8 | Baseline+MSF+APPM-SA | 81.20 | 89.23 | 99.14 | 90.02 | 98.34 | 27.62
9 | Baseline+MSF+APPM-CBAM | 80.51 | 88.67 | 99.01 | 88.75 | 98.26 | 27.78
10 | Baseline+MSF+APPM | 81.46 | 90.20 | 99.06 | 89.37 | 98.35 | 27.78
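The metric columns in Tables 1–3 (IoU, Sen, Spe, PPV, Acc) are assumed here to follow the standard binary confusion-matrix definitions; the short NumPy helper below computes them for a predicted and ground-truth mask. It is a minimal sketch for reference and is not taken from the authors' evaluation code.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute IoU, Sen, Spe, PPV, and Acc for binary masks (foreground = True/1)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.count_nonzero(pred & gt)     # true positives
    fp = np.count_nonzero(pred & ~gt)    # false positives
    fn = np.count_nonzero(~pred & gt)    # false negatives
    tn = np.count_nonzero(~pred & ~gt)   # true negatives
    eps = 1e-8                           # guard against division by zero on empty masks
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "Sen": tp / (tp + fn + eps),                   # sensitivity (recall)
        "Spe": tn / (tn + fp + eps),                   # specificity
        "PPV": tp / (tp + fp + eps),                   # positive predictive value (precision)
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
    }

if __name__ == "__main__":
    pred = np.array([[1, 1, 0], [0, 1, 0]])
    gt = np.array([[1, 0, 0], [0, 1, 1]])
    print(segmentation_metrics(pred, gt))   # e.g., IoU = 2 / 4 = 0.5
```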
Table 2. Quantitative results of the different networks on the dental calculus dataset.

Methods | IoU (%) | Sen (%) | Spe (%) | PPV (%) | Acc (%) | Params (M)
BiseNetV1 | 41.45 | 56.91 | 96.74 | 60.41 | 93.55 | 56.856
CGNet | 53.29 | 65.51 | 98.00 | 74.07 | 95.39 | 0.492
ICNet | 57.67 | 70.16 | 98.11 | 76.41 | 95.86 | 47.528
MobileNetV3 | 58.28 | 72.55 | 97.86 | 74.77 | 95.83 | 3.282
UNext | 59.06 | 72.52 | 98.01 | 76.09 | 95.96 | 1.471
BiseNetV2 | 60.44 | 70.55 | 98.54 | 80.84 | 96.29 | 3.341
ERFNet | 63.17 | 78.93 | 97.82 | 75.98 | 96.31 | 2.082
Fastscnn | 67.68 | 78.74 | 98.57 | 82.82 | 96.98 | 1.398
PSPNet | 76.21 | 85.91 | 98.89 | 87.09 | 97.85 | 46.602
DMNet | 76.37 | 82.37 | 99.31 | 91.28 | 97.95 | 50.803
DeeplabV3+ | 76.87 | 83.14 | 99.29 | 91.07 | 97.99 | 41.216
MALUNet | 77.36 | 85.31 | 99.16 | 89.12 | 98.07 | 0.175
EGEUNet | 77.86 | 86.12 | 99.08 | 89.01 | 98.16 | 0.053
HRNet | 78.57 | 87.46 | 99.01 | 88.55 | 98.09 | 9.636
Fastfcn | 79.61 | 88.13 | 99.07 | 89.17 | 98.19 | 66.338
ENCNet | 80.06 | 88.85 | 99.04 | 88.99 | 98.22 | 33.510
CCNet | 80.41 | 88.80 | 99.09 | 89.49 | 98.26 | 47.453
TransUNet | 37.24 | 46.70 | 97.78 | 64.77 | 93.68 | 66.815
Segmenter | 39.37 | 54.35 | 96.68 | 58.81 | 93.28 | 6.685
Vit | 47.23 | 79.80 | 93.98 | 53.64 | 92.84 | 142
Segformer | 55.66 | 64.17 | 98.67 | 80.76 | 95.90 | 3.716
Swin | 56.49 | 67.26 | 98.34 | 77.92 | 95.84 | 58.942
DPT | 58.71 | 78.59 | 97.04 | 69.89 | 95.56 | 110
SwinUNet | 66.94 | 78.91 | 98.13 | 80.74 | 96.18 | 27.145
MedT | 67.39 | 79.11 | 98.45 | 82.57 | 96.81 | 1.604
MAPPNet | 81.46 | 90.20 | 99.06 | 89.37 | 98.35 | 27.784
Table 3. Comparison of different state-of-the-art methods for segmentation of skin lesions and gastrointestinal polyp tasks.

Methods | ISIC2018: IoU (%) | ISIC2018: Sen (%) | ISIC2018: Spe (%) | ISIC2018: PPV (%) | ISIC2018: Acc (%) | Kvasir-SEG: IoU (%) | Kvasir-SEG: Sen (%) | Kvasir-SEG: Spe (%) | Kvasir-SEG: PPV (%) | Kvasir-SEG: Acc (%)
BiseNetV1 | 67.01 | 86.55 | 87.74 | 74.80 | 87.38 | 69.69 | 76.31 | 97.64 | 88.92 | 93.40
CGNet | 67.78 | 76.96 | 94.30 | 85.04 | 89.17 | 84.94 | 94.15 | 97.31 | 89.67 | 96.68
ICNet | 69.65 | 81.23 | 93.01 | 83.01 | 89.52 | 85.89 | 91.57 | 98.36 | 92.36 | 97.01
MobileNetV3 | 73.42 | 78.84 | 96.89 | 91.43 | 91.55 | 78.19 | 80.76 | 99.19 | 96.10 | 95.53
UNext | 74.46 | 84.17 | 94.52 | 86.59 | 91.45 | 83.48 | 88.18 | 98.60 | 93.99 | 96.53
BiseNetV2 | 71.22 | 76.57 | 96.84 | 91.06 | 90.84 | 81.96 | 90.67 | 97.36 | 89.50 | 96.03
ERFNet | 73.81 | 81.65 | 95.53 | 88.49 | 91.42 | 87.53 | 92.77 | 98.52 | 93.94 | 97.37
Fastscnn | 72.39 | 80.23 | 95.44 | 88.11 | 90.94 | 83.32 | 91.48 | 97.57 | 90.33 | 96.36
PSPNet | 67.32 | 73.66 | 96.04 | 88.66 | 89.41 | 88.19 | 91.70 | 99.02 | 95.85 | 97.56
DMNet | 67.20 | 75.66 | 94.71 | 85.74 | 89.07 | 88.36 | 92.50 | 98.84 | 95.17 | 97.58
DeeplabV3+ | 64.70 | 73.02 | 94.59 | 85.02 | 88.20 | 88.46 | 90.70 | 99.37 | 97.28 | 97.65
HRNet | 70.30 | 76.36 | 96.38 | 89.87 | 90.45 | 87.34 | 90.37 | 99.14 | 96.29 | 97.40
Fastfcn | 65.28 | 70.39 | 96.71 | 89.99 | 88.91 | 88.58 | 91.06 | 99.30 | 97.01 | 97.67
ENCNet | 71.04 | 85.91 | 91.19 | 80.40 | 89.63 | 89.36 | 92.37 | 99.17 | 96.49 | 97.82
CCNet | 73.36 | 83.82 | 94.00 | 85.46 | 90.99 | 89.73 | 92.41 | 99.26 | 96.87 | 97.90
TransUNet | 68.00 | 74.29 | 96.11 | 88.92 | 89.65 | 71.71 | 77.59 | 97.97 | 90.45 | 93.92
Segmenter | 67.34 | 84.98 | 88.98 | 76.44 | 87.80 | 38.76 | 46.58 | 94.99 | 69.76 | 85.37
Vit | 73.75 | 85.69 | 93.19 | 84.11 | 90.97 | 53.47 | 61.44 | 96.30 | 80.46 | 89.37
Segformer | 73.13 | 81.38 | 95.25 | 87.82 | 91.05 | 67.80 | 71.79 | 98.54 | 92.43 | 93.23
Swin | 73.75 | 85.20 | 93.47 | 84.58 | 91.02 | 66.40 | 71.46 | 98.11 | 90.36 | 92.82
DPT | 71.34 | 83.74 | 92.69 | 82.81 | 90.04 | 76.14 | 82.65 | 97.88 | 90.62 | 94.85
MAPPNet | 76.48 | 87.08 | 94.17 | 86.27 | 92.07 | 91.38 | 94.16 | 99.25 | 96.87 | 98.23