Article

Fast Tongue Detection Based on Lightweight Model and Deep Feature Propagation

1 Knowledge and Data Engineering Laboratory of Chinese Medicine, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
2 Innovation Center for Electronic Information & Traditional Chinese Medicine, University of Electronic Science and Technology of China, Chengdu 610054, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(7), 1457; https://doi.org/10.3390/electronics14071457
Submission received: 6 March 2025 / Revised: 31 March 2025 / Accepted: 2 April 2025 / Published: 3 April 2025
(This article belongs to the Special Issue Mechanism and Modeling of Graph Convolutional Networks)

Abstract

While existing tongue detection methods achieve good accuracy, low detection speed and excessive noise in the background area remain problems. To address them, a fast tongue detection model based on a lightweight model and deep feature propagation (TD-DFP) is proposed. Firstly, a color channel is added to the RGB tongue image to introduce more prominent tongue features. To reduce computational complexity, keyframes are selected through inter-frame differencing, while optical flow maps are used to align features between non-keyframes and keyframes. Secondly, a convolutional neural network with a feature pyramid structure is designed to extract multi-scale features, and object detection heads based on depth-wise convolutions are adopted to achieve real-time tongue region detection. In addition, a knowledge distillation module is introduced during the training phase to improve training performance. TD-DFP achieved a mean average precision (mAP) of 82.8% and 61.88 frames per second (FPS) on the tongue dataset. The experimental results indicate that TD-DFP achieves efficient, accurate, and real-time tongue detection.

1. Introduction

Tongue diagnosis is an important branch of traditional Chinese medicine (TCM) and one of its four diagnostic methods (observation, auscultation/olfaction, inquiry, and pulse feeling/palpation), commonly used to identify diseases. Tongue diagnosis is a key process in disease diagnosis, which judges the state of the human body by observing tongue images [1]. With the objectification and modernization of TCM, the digitization of patient information is also accelerating. For tongue image acquisition, traditional methods require professionals to manually capture patients’ tongue images using instruments such as tongue diagnostic devices [2]. This process is inefficient and depends on the skill of the collection personnel, who must photograph each patient one by one, consuming a great deal of time and manpower. On the other hand, auxiliary tongue diagnosis models also rely on high-quality tongue body images. Therefore, obtaining high-quality tongue body images is one of the important tasks in promoting the modernization of tongue diagnosis, and achieving efficient, accurate, and automatic detection of the tongue body through computer vision and machine learning is of great significance [3].
Tongue detection first extracts features from tongue images; the extracted feature information is then fed into a detection head to output the type and position information. Existing tongue detection methods are mainly divided into traditional object detection algorithms and deep learning-based object detection algorithms. (1) Traditional object detection algorithms require manually designed features combined with machine learning classifiers to achieve object classification and localization. Bo et al. [4] used the HOG-SVM object detection algorithm to extract multiple sets of sample features from different angles, addressing the random and overlapping positions and angles of electronic components after flipping by the vibrating feeder. The experiment showed that feature extraction at every 10 degrees of rotation can achieve a classification accuracy of 79.30%. Wang [5] proposed a parking event detection scheme based on improved Haar-like feature extraction and an Adaboost cascade classifier to address the impact of lighting, viewpoint, and scale changes on parking event detection in complex scenes. The experimental results showed that the method can achieve a positive detection rate of 80.97% even in video scenes with high interference. Zheng [6] binarized the tongue image and performed morphological filtering to extract color, contour, and texture features; a random forest classifier was then established to detect the tongue in the image, achieving an accuracy of 92.10%. Fu et al. [7] proposed a tongue image detection algorithm based on radial edge detection and the Snake model, which determines the initial value of the Snake model through the radial edge detection algorithm to improve detection accuracy. On the established dataset, the success rate of tongue segmentation was 94.00%. (2) Deep learning-based object detection algorithms can automatically extract features and achieve object classification and localization without manual feature design [8]. Tang et al. [9] proposed a two-stage deep learning method for fine-grained classification of tooth-marked tongues: in the first stage, a cascaded convolutional neural network simultaneously detects tongue regions and tongue landmarks; in the second stage, the detected tongue regions and landmarks are fed into a fine-grained classification network for final recognition. The experimental results showed that the detection accuracy and F1 score for the target tongue body reached 94.20% and 94.80%, respectively. Liu [10] proposed an object detection algorithm based on attention cross-layer feature fusion for the detection of tongue crack and tooth mark areas in tongue images; the attention module strengthens feature extraction, multi-scale feature map connections are established, and the GIoU (Generalized Intersection over Union) loss function is improved to solve the problem of anchor box overlap in small object detection. Experimental results showed that the average accuracy for detecting tongue cracks and tooth marks reaches 86.37% and 89.40%, respectively. On the basis of the YoloV4-tiny model, Zhu [11] integrated the Coord attention mechanism to increase the receptive field of the model and achieved an accuracy of 98.89% on the collected tongue body image dataset. Zendehdel et al. [12] studied the classification and localization of tools in industry with an improved YoloV5 model trained on a lightweight industrial tools image dataset; the model achieved 98.3% mAP. With respect to security in AI, object detection models without corresponding preparation can be easily deceived. Kwon et al. [13] studied adversarial example generation for evasion attacks on GNNs (Graph Neural Networks). Based on a two-stage generation process, the method can generate adversarial samples similar to real samples, achieving a 92.4% attack success rate.
Traditional and deep learning-based object detection algorithms have achieved significant results, but existing methods have certain limitations: (1) The large number of model parameters limits the deployment hardware, making deployment on devices with low compute capability difficult [14]. (2) The detection speed needs to be improved; tongue detection models mainly target static photo data, while their capabilities on dynamic tongue videos are limited [15]. In response to the above issues, this article proposes a fast tongue detection algorithm, TD-DFP, based on lightweight object detection and deep feature propagation. The main innovations are as follows: (1) We convert keyframe images into four-channel images of R, G, B, and V to introduce more tongue features, and combine the NanoDet model with a pyramid attention feature fusion network based on depth-wise convolution to fuse multi-scale feature maps, reducing parameter count and computational complexity while achieving tongue region detection in video images. (2) Inspired by Label Assignment Distillation [16], we propose a grouping attribute auxiliary guidance module based on knowledge distillation, in which a teacher model generates guidance for the student, and a multi-path convolutional neural network extracts color, shape, texture, and other features of tongue images separately, in order to promote model learning and convergence. (3) By calculating the feature optical flow map between non-keyframes and keyframes of the tongue image, we align and propagate the features of the keyframes to the non-keyframes through the optical flow map, and output the tongue detection boxes in the keyframe and non-keyframe images through the detection head, achieving real-time video-stream tongue detection.

2. Materials and Methods

2.1. Dataset

The tongue body dataset used in this experiment consists of tongue body images of patients recorded during clinical diagnosis and the corresponding patient information. The data inclusion criteria are as follows: (1) patients who meet the requirements of TCM tongue diagnosis and undergo the diagnostic process; (2) patients with clear syndrome types; (3) patients who signed the informed consent form. The data exclusion criterion was incomplete or unclear tongue body image acquisition. From December 2021 to January 2022, a total of 1452 clinical cases with an average age of 28.52 years were collected and organized, as shown in Table 1. The photos were in JPG format, and 1452 patients participated in the collection. For each patient, lingual and sublingual images were collected, totaling 2904 tongue body images. The study was approved by the Ethics Committee of the University of Electronic Science and Technology of China on 11 March 2025 (approval code 106142025031133093), and written informed consent was obtained from the patients.
The detection box positions identifying the tongue region were annotated manually, and the annotated data were saved in JSON format to provide ground-truth values for the proposed object detection network. For annotation, we adopted the open-source annotation software Labelme 3.13.0 to mark the position and category of each tongue body image. Each tongue body image was annotated as shown in Figure 1, and after annotation, the position and category information of each image’s tongue body was output as a JSON file; the critical content is shown in Table 2. The labels of the tongue images include three parts: images, categories, and annotations. The images part contains information about the image file, such as the height, width, id, and filename. The categories part contains information about the set of categories in the dataset, such as the super-category (the parent class of the category), the id, and the name of the category. The annotations part contains information about the tongue body region and its category, such as the id and area of the tongue body. The categoryid field gives the category of the image, and the bbox field contains the coordinates of the top left corner, the height, and the width of the tongue body region.
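For illustration, the following minimal Python sketch shows how a JSON annotation file with the layout of Table 2 could be read; the file name and the exact ordering of the bbox fields are assumptions and not part of the released annotation tooling.

```python
import json

# Illustrative file name; the actual annotation file name is not specified in the paper.
with open("tongue_annotations.json", "r", encoding="utf-8") as f:
    labels = json.load(f)

# Map image id -> file name for convenience.
id_to_file = {img["id"]: img["filename"] for img in labels["images"]}

for ann in labels["annotations"]:
    box = ann["bbox"]                  # top-left corner plus box extents, as stored in the file
    category_id = ann["categoryid"]    # lingual or sublingual
    image_file = id_to_file[ann["imageid"]]
    print(f"{image_file}: category {category_id}, bbox {box}, area {ann['area']}")
```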
In this experiment, 70% of the tongue image data were randomly used as the training set, 20% as the validation set, and the remaining data as the test set. Given an input tongue image, the model infers the position and category of the tongue. The performance of TD-DFP was evaluated by the mean average precision (mAP), the number of parameters, memory usage, and FPS. The Adam optimizer was used for gradient descent, and GIoU loss was selected as the loss function for the backpropagation of gradients.

2.2. Method

2.2.1. Overview

A fast tongue detection model, TD-DFP, based on lightweight object detection and deep feature propagation is proposed to achieve real-time tongue detection in video streams. The overall framework of TD-DFP is shown in Figure 2 and mainly includes three parts: tongue detection based on improved NanoDet, auxiliary training based on knowledge distillation, and a keyframe detection module based on the inter-frame difference. (1) In the detection process, the frame difference method is first used to select appropriate keyframes, and these keyframe images are converted into four-channel images of R, G, B, and V. Then, the NanoDet network with a layered convolution module predicts the category and position of the tongue region in the image. (2) To further improve detection performance, a knowledge distillation-based grouping attribute auxiliary guidance module is introduced, and a multi-path convolutional neural network is designed to extract distinctive features, such as color, shape, and texture, of keyframe tongue images. (3) The feature optical flow map between non-keyframes and keyframes of the tongue body image is calculated, and the features of the keyframes are aligned and propagated to the non-keyframes through the optical flow map. After this, the detection head outputs the tongue body detection boxes in the keyframe and non-keyframe images.

2.2.2. Tongue Detection Based on Improved NanoDet

The core of tongue detection and recognition is the tongue image feature extractor. The essence of the feature extraction process is to train a neural network using a training dataset, and continuously adjust the weights of the convolution kernel through a backpropagation algorithm to filter out non-critical features and extract unique feature information of the tongue image. The extracted high-dimensional features are compressed and transformed, and expressed in the form of probability distributions corresponding to the classification and regression of the tongue body.
Before extracting features from the tongue body image, the tongue body region’s distinct characteristics in the V space of the HSV (hue, saturation, value) color space are utilized to convert the tongue body image from the three channels of R, G, and B to the four channels of RGBV. The value range of each RGB component is 0 to 255, and the value range of the V component is [0, 1]. The conversion formula is shown in Equation (1).
V = max(R, G, B) / 255 × 100%
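As an illustration of Equation (1), the following sketch (assuming 8-bit RGB input; the function name is illustrative) appends the V channel to form the four-channel RGBV image:

```python
import numpy as np

def rgb_to_rgbv(rgb: np.ndarray) -> np.ndarray:
    """Append the HSV value channel of Equation (1) to an 8-bit RGB image.

    rgb: array of shape (H, W, 3) with values in [0, 255].
    Returns an array of shape (H, W, 4); the extra channel lies in [0, 1].
    """
    v = rgb.max(axis=2).astype(np.float32) / 255.0   # V = max(R, G, B) / 255
    return np.concatenate([rgb.astype(np.float32), v[..., None]], axis=2)

# Example with a random test image.
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
print(rgb_to_rgbv(image).shape)   # (224, 224, 4)
```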
In order to accurately extract feature information of four channel tongue images with fewer network parameters, the ShuffleNet [17] network is used as the backbone layer to extract tongue image features. The feature extraction blocks of each layer in the backbone are shown in Figure 3.
Group convolution and channel rearrangement techniques are used in the feature extraction block to reduce the model’s parameter count and computational load, and different convolution operations share the same set of tongue image feature maps to reduce computational complexity. Specifically, grouped convolution divides the convolution kernels into two groups. One group applies a 3 × 3 depth-wise convolution (DWConv) with a stride of 2 to the tongue image feature map; after a Batch Normalization (BN) layer, a 1 × 1 convolution is performed. The calculation process is shown in Equation (2). The other group first applies a 1 × 1 convolution kernel (Conv) to the feature maps, followed by BN and ReLU layers; then, a 3 × 3 depth-wise convolution with stride 2 is performed, and after BN, the feature maps are convolved with a 1 × 1 convolution kernel. The calculation process is shown in Equation (3). The two sets of calculated tongue image feature maps are passed through BN and ReLU layers and concatenated to restore the channel size of the feature maps. Finally, channel shuffling (CS) rearranges the output channels of the concatenated feature maps along the channel dimension, allowing information exchange between different groups of features and enhancing the network’s feature fusion ability. The calculation process is shown in Equation (4).
F_1 = Conv(BN(DWConv(F_in)))
F_2 = Conv(BN(DWConv(ReLU(BN(Conv(F_in))))))
F_out = CS(Concat(ReLU(BN(F_2)), ReLU(BN(F_1))))
F_1 and F_2 are the intermediate outputs of the two branches in the backbone block shown in Figure 3, F_in is the input feature map, and F_out is the output feature map. CS represents the channel rearrangement operation, and Concat represents concatenation along the channel dimension. ShuffleNetV2 consists of four layers of feature extraction blocks, and the parameters of each layer are shown in Table 3. Finally, tongue image feature maps of different scales, {F_1, F_2, F_3, F_4}, are output from the feature extraction layers {Stage1, Stage2, Stage3, Stage4}, respectively.
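The following PyTorch sketch illustrates the two-branch, stride-2 block of Equations (2)–(4) together with the channel shuffle operation; the channel sizes are illustrative and do not reproduce the exact configuration of Table 3.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    # Rearrange channels so that the two branches exchange information (the CS operation).
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ShuffleDownBlock(nn.Module):
    """Stride-2 block: Equation (2) on branch 1, Equation (3) on branch 2, Equation (4) to merge."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 2
        # Branch 1: 3x3 depth-wise conv (stride 2) -> BN -> 1x1 conv -> BN -> ReLU.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.ReLU(inplace=True),
        )
        # Branch 2: 1x1 conv -> BN -> ReLU -> 3x3 depth-wise conv (stride 2) -> BN -> 1x1 conv -> BN -> ReLU.
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(branch_ch, branch_ch, 3, stride=2, padding=1, groups=branch_ch, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.Conv2d(branch_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return channel_shuffle(out, groups=2)

# Example: a 4-channel RGBV input, as used by TD-DFP.
block = ShuffleDownBlock(in_ch=4, out_ch=32)
print(block(torch.randn(1, 4, 224, 224)).shape)   # torch.Size([1, 32, 112, 112])
```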
In the PANet Feature Pyramid Network (PAFPN), the network module from GhostNet [18] is used to process feature fusion between layers. The basic structural unit consists of a 1 × 1 convolution and a 3 × 3 depth-wise convolution, with very few parameters and low computational complexity. The fusion process of the feature pyramid is shown in Figure 4, where down-sampling is used to reduce the dimensionality of the input feature map. The tongue image feature maps scaled to different scales are fused through concatenation, and then a composite operation of up-sampling, skip connection, and convolution combines low-level tongue image features with high-level semantic information. Finally, the output multi-scale feature maps are fed into the detection head of the model for subsequent tongue detection tasks.
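As a rough illustration of the Ghost-style fusion unit described above (a 1 × 1 point-wise convolution plus a cheap 3 × 3 depth-wise convolution whose outputs are concatenated), the sketch below assumes a half-and-half channel split; it is not the exact module used in the paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost-style unit: half the output channels come from a 1x1 conv,
    the other half from a cheap 3x3 depth-wise conv applied to the first half."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        primary_ch = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, 1, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, out_ch - primary_ch, 3, padding=1, groups=primary_ch, bias=False),
            nn.BatchNorm2d(out_ch - primary_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Example: fusing a 128-channel feature map down to 96 channels inside the PAFPN.
fuse = GhostConv(128, 96)
print(fuse(torch.randn(1, 128, 28, 28)).shape)   # torch.Size([1, 96, 28, 28])
```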
In the detection head, based on the anchor-free strategy [19], different convolutions predict detection boxes from the multi-scale tongue image feature maps obtained from PAFPN. Depth-wise convolution is used instead of ordinary convolution, and the number of stacked convolutions is reduced from 4 to 2, greatly reducing the time spent calculating the center point during detection. In addition, the depth-wise part of the depth-wise convolution kernel is enlarged from 3 × 3 to 5 × 5, which improves the receptive field and performance of the detector with only a small increase in parameters. The loss function is designed based on the predicted classification (loss_cls), regression (loss_iou), and center point distance (loss_dis); the overall loss L_1 is shown in Equation (5). The classification loss (loss_cls), based on cross-entropy loss, is shown in Equation (6). The regression loss (loss_iou), based on the IoU (Intersection over Union) between the prediction and the ground truth, is shown in Equation (7). The center point distance loss (loss_dis), based on the L2-norm between the predicted and ground-truth center points, is shown in Equation (8).
L_1 = loss_cls + loss_iou + loss_dis
loss_cls = CE(P_c, Y_gt^c) × (Y_gt^c − P_c)^2
loss_iou = −log(IoU(P_bbox, Y_gt^bbox))
loss_dis = a‖P_pred − x_gt‖^β
P_c and P_bbox denote the category and bounding box predicted by the model, and Y_gt^c and Y_gt^bbox denote the ground-truth category and bounding box. IoU is the intersection area over the union area of P_bbox and Y_gt^bbox. ‖·‖ is the L2-norm; a and β are hyperparameters used to control the loss. P_pred is the center point of the predicted bounding box, and x_gt is the ground-truth center point.
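A simplified sketch of the composite loss of Equations (5)–(8) is given below; the IoU variant, the per-sample reduction, and the values of a and β are assumptions for illustration only.

```python
import torch

def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """IoU for boxes given as (x1, y1, x2, y2)."""
    x1 = torch.maximum(box_a[..., 0], box_b[..., 0])
    y1 = torch.maximum(box_a[..., 1], box_b[..., 1])
    x2 = torch.minimum(box_a[..., 2], box_b[..., 2])
    y2 = torch.minimum(box_a[..., 3], box_b[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    return inter / (area_a + area_b - inter).clamp(min=1e-6)

def detection_loss(p_cls, y_cls, p_box, y_box, p_center, y_center, a=1.0, beta=2.0):
    # Equation (6): cross entropy weighted by the squared probability gap.
    ce = -(y_cls * torch.log(p_cls.clamp(min=1e-6))).sum(dim=-1)
    loss_cls = (ce * (y_cls - p_cls).pow(2).sum(dim=-1)).mean()
    # Equation (7): -log(IoU) between predicted and ground-truth boxes.
    loss_iou = (-torch.log(iou(p_box, y_box).clamp(min=1e-6))).mean()
    # Equation (8): scaled L2 distance between predicted and ground-truth centers.
    loss_dis = (a * torch.norm(p_center - y_center, dim=-1).pow(beta)).mean()
    return loss_cls + loss_iou + loss_dis      # Equation (5)

# Example with random tensors (2 predictions, 2 classes).
p_cls = torch.softmax(torch.randn(2, 2), dim=-1)
y_cls = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
p_box = torch.tensor([[10.0, 10.0, 50.0, 60.0], [5.0, 5.0, 30.0, 40.0]])
y_box = torch.tensor([[12.0, 8.0, 52.0, 58.0], [4.0, 6.0, 28.0, 42.0]])
p_center = (p_box[:, :2] + p_box[:, 2:]) / 2
y_center = (y_box[:, :2] + y_box[:, 2:]) / 2
print(detection_loss(p_cls, y_cls, p_box, y_box, p_center, y_center))
```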

2.2.3. Group Attribute Auxiliary Guidance Module Based on Knowledge Distillation

The group attribute auxiliary guidance module based on knowledge distillation is an auxiliary training method for the object detection algorithm, which improves the accuracy and robustness of object detection by using knowledge distillation to perform auxiliary tasks alongside the detection task. The extra branches are shown at the bottom of Figure 2 and are used to produce extra labels for the model; they can be seen as a teacher model that promotes model learning and convergence. We include three branches to encode more information about the tongue for the three attributes (color, shape, texture) used in tongue diagnosis. Specifically, the auxiliary guidance module (AGM) used in this method consists of three parallel convolution blocks and two 3 × 3 convolutions. The three parallel convolution blocks extract features of tongue color, tongue body, and tongue texture, respectively. After each parallel convolution block, a fully connected layer calculates the loss function of the corresponding feature attribute. In the design of the module, GN (Group Normalization) is used as the normalization layer, and parameters are shared between tongue image feature maps of different scales to promote consistency and integrity in feature learning. The auxiliary guidance module is only used during the training process and is discarded after training is completed.
The parallel convolution blocks are divided into three groups based on the attributes of the tongue image, namely tongue color (tongue color, coating color, sublingual vein color), tongue body (fat and thin tongue, tooth-marked tongue, prickly tongue, cracked tongue, sublingual vein morphology), and tongue texture (greasy and rotten coating, moist and dry coating, thin and thick tongue, peeled coating, partially whole coating). Each attribute group has a corresponding loss function, and the goal of optimizing these loss functions is to transfer prior knowledge from the auxiliary classifier to the object detection algorithm through knowledge distillation, in order to improve the perception and robustness of the object detection algorithm regarding different attributes. The loss function of the three feature groups is cross-entropy loss, and the final loss of the auxiliary guidance module, L_2, is shown in Equation (9). L_color, L_body, and L_nature are the KL (Kullback–Leibler) divergence losses computed between the last output of the PAFPN of the model and the outputs of the three parallel convolution blocks.
L_2 = L_color + L_body + L_nature
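The sketch below illustrates one possible form of the three attribute branches and the KL-divergence guidance of Equation (9); the channel sizes, pooling, and the way the detector’s attribute prediction is obtained are assumptions, not the exact AGM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeBranch(nn.Module):
    """One parallel convolution block of the AGM followed by a fully connected head."""
    def __init__(self, in_ch: int, num_attrs: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.GroupNorm(8, in_ch),          # GN is used as the normalization layer
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(in_ch, num_attrs)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(feat).flatten(1))

def agm_loss(pafpn_logits, color_b, body_b, nature_b, feat):
    """Equation (9): L_2 = L_color + L_body + L_nature, each a KL divergence between the
    detector's attribute prediction (from the last PAFPN output) and an AGM branch output."""
    total = 0.0
    for branch in (color_b, body_b, nature_b):
        soft = F.softmax(branch(feat), dim=-1)   # teacher-style soft labels
        total = total + F.kl_div(F.log_softmax(pafpn_logits, dim=-1), soft, reduction="batchmean")
    return total

# Example with a 96-channel PAFPN feature map and 5 attribute classes per group.
feat = torch.randn(2, 96, 7, 7)
pafpn_logits = torch.randn(2, 5)   # stand-in for the detector's attribute prediction
branches = [AttributeBranch(96, 5) for _ in range(3)]
print(agm_loss(pafpn_logits, *branches, feat))
```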
The extracted tongue color, tongue body, and tongue texture features are concatenated, and a detection head then outputs the detection results. Next, the output is matched with each annotated sample to compute a matching loss, consisting of a classification loss and a regression loss, which forms a cost matrix. Bipartite graph matching on the cost matrix determines the allocation of the output targets. Through this process, the extracted tongue image features are combined with the object detection task, and the cost matrix determines the matching result of the targets. This operation improves the perception ability of the object detection algorithm for different attributes, enabling more accurate detection and classification.
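A minimal sketch of the cost-matrix bipartite matching step is shown below, using SciPy's Hungarian solver as an illustrative stand-in for the assignment procedure; the cost weights are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_targets(cls_cost: np.ndarray, reg_cost: np.ndarray, w_cls: float = 1.0, w_reg: float = 1.0):
    """Build the cost matrix (classification cost + regression cost) and solve the
    bipartite matching between predictions (rows) and annotated tongue boxes (columns)."""
    cost = w_cls * cls_cost + w_reg * reg_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(pred_idx.tolist(), gt_idx.tolist()))

# Example: 4 predictions, 2 ground-truth boxes with random costs.
rng = np.random.default_rng(0)
matches = assign_targets(rng.random((4, 2)), rng.random((4, 2)))
print(matches)   # e.g. [(0, 1), (2, 0)]
```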
The auxiliary loss function consists of three parts: the auxiliary classification loss AuxL_class, the auxiliary regression loss AuxL_iou, and the auxiliary distance loss AuxL_dis. Their calculation methods are shown in Equations (10)–(12), respectively, where Y_soft^c, Y_soft^bbox, and x_soft are the category, bounding box, and center point predicted by the AGM.
AuxL_class = CE(P_c, Y_soft^c) × (Y_soft^c − P_c)^2
AuxL_iou = −log(IoU(P_bbox, Y_soft^bbox))
AuxL_dis = a‖P_pred − x_soft‖^β
The final loss function is shown in Equation (13), where α_detect, β_detect, and γ_detect are training hyperparameters.
L_3 = α_detect·AuxL_class + β_detect·AuxL_iou + γ_detect·AuxL_dis

2.2.4. Keyframe Detection Based on Inter Frame Difference

In the process of tongue detection, it is necessary to extract tongue image features in continuous video streams. Because the features extracted from adjacent image frames in the video stream are similar, the optical flow features between images can be used to propagate the features of adjacent frame images, improving the efficiency and performance of object detection.
To achieve the above objectives, it is necessary to complete keyframe feature extraction and non-keyframe feature propagation. Specifically, within a fixed time interval, some keyframes are selected as representative samples. For non-keyframes, the motion field information of each pixel relative to the keyframe is calculated. Using the obtained optical flow map and tongue image feature map of the keyframes, a warp operation is used to propagate the features of keyframes on the image to the corresponding non-keyframes. Finally, the detection head network is used to output non-keyframe task results based on the propagated features. The specific feature propagation process is illustrated in Figure 5.
The calculation of tongue image features in non-keyframes relies on the inference of the nearest keyframe; the calculation of non-keyframes is shown in Equation (14).
f_i = W(f_k, M_{i→k}, S_{i→k})
In Equation (14), f_k is the feature map of the keyframe, M_{i→k} is the optical flow field from the non-keyframe feature map to the keyframe feature map, S_{i→k} accounts for the spatial size and channel dimensions of the different feature maps between keyframes and non-keyframes, and W is the warp operation. This operation uses a relatively lightweight optical flow network and a warp operation to replace the full feature extraction of each tongue image frame, saving computation and improving model inference speed.
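The warp operation W of Equation (14) can be sketched with bilinear grid sampling as below; the optical flow network itself is omitted, and S_{i→k} is interpreted here as a per-channel scale map, which is an assumption in the spirit of deep feature flow.

```python
import torch
import torch.nn.functional as F

def warp_features(f_k: torch.Tensor, flow: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Propagate keyframe features f_k (N, C, H, W) to a non-keyframe.

    flow:  (N, 2, H, W) optical flow field M_{i->k} in pixels.
    scale: (N, C, 1, 1) per-channel scale map S_{i->k} (assumed interpretation).
    """
    n, _, h, w = f_k.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)           # (N, H, W, 2)
    warped = F.grid_sample(f_k, grid, mode="bilinear", align_corners=True)
    return warped * scale                                      # f_i = W(f_k, M_{i->k}, S_{i->k})

# Example shapes only: 96-channel keyframe features at 28x28 resolution with a random flow field.
f_k = torch.randn(1, 96, 28, 28)
flow = torch.randn(1, 2, 28, 28)
scale = torch.ones(1, 96, 1, 1)
print(warp_features(f_k, flow, scale).shape)   # torch.Size([1, 96, 28, 28])
```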

3. Results

3.1. Experimental Environment

In order to verify the performance of the proposed algorithm, experiments were conducted on the collected tongue image dataset using Python 3.11 on a machine configured with an Intel(R) Core(TM) i7-13700 CPU, a COLORFUL NVIDIA GTX4060 GPU, and 16 GB of Kingston memory. The experiments use the mAP and FPS on the test set as the evaluation metrics.

3.2. Experiments on Hyperparameters

Here, we discuss the hyperparameters of the model, including the learning rate, weight decay, and loss function hyperparameters. In each experiment, the other parameters are fixed at their optimal values. The learning rate experiment is shown in Figure 6a. During training, the Adam optimizer was selected. With other settings unchanged, the model achieved the best mAP value of 0.796 on the test set with a learning rate of 0.001. The weight decay experiment of the Adam optimizer is shown in Figure 6b, with the learning rate set to the optimal 0.01. From Figure 6b, it can be seen that the optimal weight decay of the proposed algorithm is 0.005; when the coefficient is greater or less than 0.005, the mAP of the model’s object detection on the test set decreases and its detection performance degrades.
Table 4 shows the model test results obtained by adjusting the weights of the loss function. When α_detect, β_detect, and γ_detect are set to 0.5, 1, and 1, respectively, the detection results are the best. It can therefore be inferred that, during training, the classification loss and IoU loss play a greater role than the center point loss. Based on the above discussion, the optimal parameter settings for the entire model are shown in Table 5.

3.3. Comparative Experiment

After determining the optimal learning rate, weight decay, and loss function weights of TD-DFP, comparative experiments were conducted between TD-DFP and several established object detection models to verify its performance. The model was compared with the following methods, and the results on the tongue image dataset are shown in Table 6.
  • SSD [20]: SSD combines multiple feature maps of different scales with default anchors of different sizes to effectively detect objects of different sizes and proportions. It utilizes convolutional neural networks (CNNs) to extract features from images and applies predefined convolutional sliding windows on each feature map to predict the position and category of objects.
  • Faster R-CNN [21]: Faster R-CNN is a two-stage detector that first generates candidate regions with a Region Proposal Network (RPN) sharing convolutional features with the detection network, and then classifies each proposal and refines its bounding box. Sharing features makes proposal generation nearly cost-free while maintaining high detection accuracy.
  • YoloV5 [22]: This model integrates various detection techniques such as an FPN and Mosaic (data augmentation method), making it more effective in learning image features and capable of detecting objects of different sizes and shapes, with strong adaptability. Meanwhile, the model training process is simple, it can be trained on a large scale, and it has high scalability. These features make YoloV5 widely applicable in practical applications.
  • FCOS [23]: Compared to traditional two-stage methods, FCOS adopts a single-stage, anchor-free detection approach that does not require candidate boxes and directly outputs the object category and bounding box information for each pixel. FCOS achieves accurate object detection by densely predicting the category, centerness, and offsets to the box boundaries at each position on the feature map. In addition, FCOS uses adaptive category branching and numerical stabilization techniques to improve performance and stability, achieving a good balance between accuracy and speed with high practicality.
  • YoloV10 [24]: This model presents consistent dual assignments for NMS-free training of Yolos, firstly to eliminate the latency caused by non-maximum suppression (NMS) for post-processing. And it performs design on detection heads, down-sampling layers, and basic building blocks, making YoloV10 achieve better efficiency–accuracy trade-offs.
According to the experimental results on the tongue image dataset, TD-DFP achieved the best mAP, while YoloV10 achieved the best FPS. TD-DFP achieved an mAP 0.50% higher than YoloV10, whereas YoloV10 achieved a much faster inference speed. This shows that the structure of YoloV10 is simple but effective; for example, its spatial-channel decoupled down-sampling greatly shortens inference time. The result also shows that the group attribute auxiliary guidance module in TD-DFP helps it recognize the tongue body better.
Compared with the third most accurate model, Faster RCNN, TD-DFP has somewhat more parameters and memory usage, but it is 2.00% higher in mAP and 44.86 higher in FPS, indicating that more effective structures and parameter settings yield better mAP performance. In terms of FPS, TD-DFP improves on the third-fastest model, FCOS, by more than 20 FPS, achieving real-time detection of image frames, and TD-DFP uses 382.49 MB less memory than FCOS.

3.4. Ablation Study

In order to verify the effectiveness of the proposed color channel addition, lightweight feature extraction network, auxiliary guidance module, and keyframe detection for tongue image detection, ablation experiments were conducted on TD-DFP, with experiments conducted separately for each module. ResNet [25], Ghost, and EfficientNetLite [26] were selected as replacements for ShuffleNetV2 in the feature extraction module. For the auxiliary guidance module, models were trained with and without the AGM. The keyframe detection module was evaluated with different frame intervals. The results are shown in Table 7, Table 8, Table 9 and Table 10.
The experiment on color channel addition was conducted with the model trained on original RGB images as the baseline. From Table 7, it can be seen that the model trained with the extra V channel achieved 2.30% higher mAP than the one without the extra color channel, but performance dropped when the H or S channel was added instead. The cause of the decline might be that the additional channels are treated the same as the original RGB channels, while the H and S channels contain less useful information about the tongue body.
To further support this view, Figure 7 shows an example of the original RGB tongue picture and the H, S, and V channels of the corresponding HSV image. It can be seen that the V channel shows the clearest tongue body region.
From Table 8, it can be seen that the detection model using ShuffleNetV2 outperforms the other feature extraction models. TD-DFP with ShuffleNetV2 achieved 6.1% higher mAP than the model with EfficientNetLite and 5.5% higher than the model with Ghost. The structures in Ghost, EfficientNetLite, and ShuffleNetV2 help TD-DFP achieve higher mAP than the model with ResNet. ShuffleNetV2 has a channel shuffle operation, which aids communication between channels; since TD-DFP introduces an extra color channel in the input, the channel shuffle operation helps ShuffleNetV2 surpass the other two backbone networks.
To study the impact of multi-scale features, we removed the FPN structure from TD-DFP with ShuffleNetV2, so that only the last output of the backbone was sent to the detection head. Abandoning multi-scale features led to a decrease of 18.7% in mAP, which shows that multi-scale features help the model recognize the tongue body better.
Table 9 shows that, although the model trained with the AGM has 5.71 M more parameters than the model without it, the AGM improves the mAP of TD-DFP by 7.4%. Because the AGM is used to generate labels that help the model learn quickly, it contains a considerable number of parameters; however, the AGM blocks are discarded during inference of the detection model.
Figure 8 shows that the AGM accelerates the training convergence of TD-DFP: at the same training epoch, the model trained with the AGM achieves a higher mAP than the one without it.
Keyframe selection and optical flow-based feature propagation are used for tongue detection in videos. Since the number of tongue videos is small, we only evaluated the processing speed of TD-DFP. The results are shown in Table 10: when the inter-frame interval is 3, the FPS and inference accuracy of the video achieve the best results.
To further validate the effectiveness of the keyframe selection strategy in tongue videos, a frame-level study was conducted. For video keyframe detection, an 11 s video stream with a total of 339 frames was selected. The keyframe selection algorithm was used for keyframe extraction, and parts of the extracted keyframes are shown in Figure 9. The pixel differences between adjacent frames were calculated as shown in Figure 10, with the horizontal axis representing the video frame number and the vertical axis representing the average pixel difference between adjacent frames. The red dots in Figure 10 are the keyframes selected by the keyframe selection strategy at an interval of 3. The selected keyframes largely coincide with the frames showing large inter-frame differences, which demonstrates the effectiveness of this simple keyframe selection strategy.
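The inter-frame difference computation and the fixed-interval keyframe policy can be sketched as follows; the video path is illustrative, and grayscale differencing is an assumption.

```python
import cv2
import numpy as np

def mean_frame_differences(video_path: str):
    """Return the mean absolute pixel difference between each pair of adjacent frames."""
    cap = cv2.VideoCapture(video_path)
    diffs, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diffs.append(float(np.mean(cv2.absdiff(gray, prev))))
        prev = gray
    cap.release()
    return diffs

def select_keyframes(num_frames: int, interval: int = 3):
    """Fixed-interval keyframe selection: every `interval`-th frame is a keyframe."""
    return list(range(0, num_frames, interval))

# Example usage (path is illustrative).
# diffs = mean_frame_differences("tongue_video.mp4")       # len(diffs) == num_frames - 1
# keyframes = select_keyframes(len(diffs) + 1, interval=3)
```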

4. Discussion

4.1. Discussion on Tongue Images

To verify the effectiveness of the model, images were selected from the tongue image dataset for case analysis, and the detection results are shown in Figure 11. It can be seen that the method can accurately identify the lingual and sublingual regions in images with different tongue postures, clarity, and backgrounds while using fewer parameters, and can effectively locate the tongue under certain occlusion conditions, demonstrating accurate extraction of tongue features by the model. However, as shown in Figure 12, because some patients have lip colors similar to the color of the lingual region, or a lingual region similar in color to the sublingual region, both the lingual and sublingual regions may be detected in the same image. Therefore, more examples of such abnormal situations need to be added in subsequent model training to improve the detection accuracy of the model.

4.2. Discussion on Tongue Videos

After identifying the keyframes in the video stream, the similar features extracted from adjacent image frames were utilized to propagate the tongue detection results of keyframes to the non-keyframes, further improving the performance of object detection. The lingual and sublingual detection results are shown in Figure 13 and Figure 14.

5. Conclusions

This article introduces the process of tongue diagnosis in traditional Chinese medicine. Clinical physicians diagnose patients’ diseases based on their tongue color, shape, texture, and other characteristics. Therefore, accurate detection of the tongue body position is necessary. Computer-based tongue detection methods currently rely on increasing the number of model parameters and more complex structures. However, these methods have a slow detection speed and poor tongue feature extraction performance on mobile devices. To address these problems, this article proposes a fast tongue detection algorithm, TD-DFP, based on lightweight object detection and deep feature propagation. The algorithm adds a lightweight tongue feature extraction module to reduce the parameters of the model, uses an auxiliary guidance module to better extract tongue features and accelerate the convergence of the training process, and finally uses keyframe detection to reduce the tongue feature extraction time in the video stream.
In the experimental verification of the tongue image dataset, the method proposed in this article achieved the best mAP value, 0.828, and FPS value, 61.88, on the tongue image dataset. In the detection process of the video stream, it can achieve a detection speed of 171.42 FPS. This result indicates that the model can perform real-time detection of the tongue body in the video stream, and the detected tongue body area is in accordance with the actual tongue body area. In some cases, accurate recognition of the lingual and sublingual regions can still be achieved with different tongue postures, clarity, and backgrounds.
In static tongue detection, there are some recognition errors, mainly due to the similarity between the color of the patient’s lips and the lingual region, or between the lingual and sublingual regions; in such cases the detector may recognize both the lingual and sublingual regions simultaneously. Therefore, in future work, we will collect more images in different environments, such as environments with stable or unstable illumination. It is also necessary to further improve the feature extraction module to enhance the differentiation of fine-grained features between the lingual region, the sublingual region, and the lips, and to add more targeted examples of abnormal situations during training to improve the accuracy of the detection model.

Author Contributions

Conceptualization, K.C., Y.Z. and Y.L.; methodology, K.C. and Y.Z.; software, K.C. and Y.Z.; validation, L.Z.; data curation, L.Z.; writing—original draft preparation, K.C.; writing—review and editing, Y.Z.; visualization, K.C.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Science & Technology Fundamental Resources Investigation Program under grant 2022FY102002, the Sichuan Science and Technology Program under grant 2023YFS0325, and the Natural Science Foundation of Sichuan under grant 2024NSFSC0717.

Institutional Review Board Statement

This study was approved by the Ethics Committee of the University of Electronic Science and Technology of China (approval code: 106142025031133093).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, C.; Fang, C. Diagnostics of Traditional Chinese Medicine; China Press of Chinese Medicine: Beijing, China, 2021; pp. 39–55. [Google Scholar]
  2. Chen, E.; Li, S.; Hu, M.; He, Q.; Bao, Z.; Yang, H. Research progress on image acquisition and color information analysis of tongue diagnosis in traditional Chinese medicine. China J. Tradit. Chin. Med. Pharm. 2024, 39, 3586–3589. [Google Scholar]
  3. Cai, Y.; Hu, S.; Guan, J.; Zhang, X. Progress on Objectification Technology of Tongue Inspection in Traditional Chinese Medicine and Discussion on its Application. Mod. Tradit. Chin. Med. Mater. Med.-World Sci. Technol. 2021, 23, 2447–2453. [Google Scholar]
  4. Bo, W.; Ni, S. Research on Multi-Position Target Detection of Electronic Components Combined with HOG and SVM. Mach. Des. Manuf. 2021, 10, 76–80. [Google Scholar]
  5. Wang, Q. Highway Parking Event Detection Based on Improved Haar-like+Adaboost. Master’s Thesis, Chongqing University, Chongqing, China, 2021. [Google Scholar]
  6. Zheng, F. Research on Tongue Detection and Tongue Segmentation in Open Environment. Master’s Thesis, Xiamen University, Xiamen, China, 2017. [Google Scholar]
  7. Fu, Z.; Li, X.; Li, F. Tongue image segmentation based on Snake model and radial edge detection. Image Vis. Comput. 2009, 14, 688–693. [Google Scholar]
  8. Tong, K.; Wu, Y.; Zhou, F. Recent Advances in Small Object Detection Based on Deep Learning: A Review. Image Vis. Comput. 2020, 97, 103910. [Google Scholar] [CrossRef]
  9. Tang, W.; Gao, Y.; Liu, L.; Xia, T.; He, L.; Zhang, S.; Guo, J.; Li, W.; Xu, Q. An Automatic Recognition of Tooth-Marked Tongue Based on Tongue Region Detection and Tongue Landmark Detection via Deep Learning. IEEE Access 2020, 8, 153470–153478. [Google Scholar] [CrossRef]
  10. Liu, B. Research on Tongue Feature Recognition Based on Image Segmentation and Detection. Master’s Thesis, University of Electronic Science and Technology of China, Chengdu, China, 2023. [Google Scholar]
  11. Zhu, L. Research on tongue image detection and segmentation method based on deep learning. Master’s Thesis, Hunan University of Traditional Chinese Medicine, Changsha, China, 2023. [Google Scholar]
  12. Zendehdel, N.; Chen, H.; Leu, M.C. Real-Time Tool Detection in Smart Manufacturing Using You-Only-Look-Once (YOLO)V5. Manuf. Lett. 2023, 35, 1052–1059. [Google Scholar] [CrossRef]
  13. Kwon, H.; Kim, D.-J. Dual-Targeted Adversarial Example in Evasion Attack on Graph Neural Networks. Sci. Rep. 2025, 15, 3912. [Google Scholar] [CrossRef] [PubMed]
  14. Wang, L.; Lin, Y.; Li, L. Application progress of intelligent diagnosis and treatment in tongue manifestation research. China J. Tradit. Chin. Med. Pharm. 2021, 36, 342–346. [Google Scholar]
  15. Wu, X.; Xu, H.; Lin, Z.; Li, S.; Liu, H.; Feng, Y. Review of Deep Learning in Classification of Tongue Image. J. Front. Comput. Sci. Technol. 2023, 17, 303–323. [Google Scholar]
  16. Nguyen, C.H.; Nguyen, T.C.; Tang, T.N.; Phan, N.L.H. Improving Object Detection by Label Assignment Distillation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; IEEE: Waikoloa, HI, USA, 2022; pp. 1322–1331. [Google Scholar]
  17. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer International Publishing: Munich, Germany, 2018; Volume 11218, pp. 122–138. [Google Scholar]
  18. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 1577–1586. [Google Scholar]
  19. Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-Free Oriented Proposal Generator for Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  22. Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 November 2021).
  23. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Seoul, Republic of Korea, 2019; pp. 9626–9635. [Google Scholar]
  24. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 107984–108011. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 770–778. [Google Scholar]
  26. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, ICML, Long Beach, CA, USA, 9–15 June 2019; Volume 36, pp. 6105–6114. [Google Scholar]
Figure 1. This is an example of a raw tongue image and its mask.
Figure 2. Fast tongue detection based on lightweight model and deep feature propagation.
Figure 3. Feature extraction block of backbone.
Figure 4. PAFPN feature pyramid network.
Figure 5. Deep feature propagation.
Figure 6. Experimental results of learning rate and weight decay; (a) represents the impact of learning rate on mAP; (b) represents the impact of weight decay on mAP.
Figure 7. Example of original RGB image and H, S, and V channels of corresponding HSV image.
Figure 8. Experimental results of the AGM. The blue line shows that the mAP varies when the model is trained with/without the AGM during training.
Figure 9. Example of keyframes.
Figure 10. Difference distribution of adjacent frames in a video result of tongue detection. The red dots stand for the keyframes.
Figure 11. Correct result of tongue detection.
Figure 12. Incorrect result of tongue detection. The red and blue boxes are results predicted by model, which represent lingual and sublingual regions. The overlap means that the model recognizes the same region as both lingual and sublingual wrongly.
Figure 13. Results of lingual detection in video stream. The red boxes are results predicted by model, which represent lingual region.
Figure 14. Results of sublingual detection in video stream. The blue boxes are results predicted by model, which represent sublingual region.
Table 1. Tongue image dataset.
Dataset | Type | Number | Resolution
Tongue image dataset | lingual image | 1452 | 312 × 415
Tongue image dataset | sublingual image | 1452 | 312 × 415
Table 2. Label of tongue images.
Label | Images | Categories | Annotations
content | height, width, id, filename | super-category, id, name | imageid, bbox, area, categoryid
Table 3. Parameter settings of ShuffleNetV2.
Stage | Output Size | Kernel Size | Stride | Number | Output Channel
image | 224 × 224 | - | - | - | 4
Stage1 | 112 × 112 | 3 × 3 | 2 | 1 | 32
Stage1 | 56 × 56 | 3 × 3 | 2 | 1 | 32
Stage2 | 28 × 28 | 3 × 3 | 2 | 1 | 64
Stage2 | 28 × 28 | 3 × 3 | 1 | 3 | 64
Stage3 | 14 × 14 | 3 × 3 | 2 | 1 | 128
Stage3 | 14 × 14 | 3 × 3 | 1 | 7 | 128
Stage4 | 7 × 7 | 3 × 3 | 2 | 1 | 256
Stage4 | 7 × 7 | 3 × 3 | 1 | 3 | 256
Table 4. Experimental results of weights of different loss functions.
α_detect | β_detect | γ_detect | mAP
0.5 | 2 | 0.25 | 0.744 (±0.031)
1 | 1 | 1 | 0.726 (±0.013)
0.5 | 1 | 1 | 0.828 (±0.016)
1.5 | 1 | 1 | 0.767 (±0.011)
1 | 1 | 0.5 | 0.776 (±0.012)
1 | 1 | 1.5 | 0.738 (±0.019)
1 | 0.5 | 1 | 0.766 (±0.013)
Table 5. Optimal hyperparameters for TD-DFP.
Hyperparameter | Value
Learning rate | 0.01
Weight decay | 0.005
Loss function hyperparameter α_detect | 0.5
Loss function hyperparameter β_detect | 1
Loss function hyperparameter γ_detect | 1
Table 6. Results of different models.
Method | mAP | Parameters | FPS | Memory Usage
SSD | 0.629 | 156.14 M | 24.72 (±0.48) | 407.48 MB
Faster RCNN | 0.805 | 10.16 M | 17.02 (±1.61) | 38.80 MB
YoloV5 | 0.694 | 36.90 M | 38.12 (±0.78) | 142.10 MB
FCOS | 0.562 | 123.12 M | 38.15 (±1.65) | 459.04 MB
YoloV10 | 0.823 | 20.45 M | 156.25 (±0.21) | 78.33 MB
TD-DFP | 0.828 | 19.50 M | 61.88 (±1.75) | 76.55 MB
Table 7. Experimental results of color channel addition.
Color Channel Addition | mAP
without color channel addition | 0.805 (±0.005)
H of HSV added | 0.761 (±0.120)
S of HSV added | 0.759 (±0.025)
V of HSV added | 0.828 (±0.018)
Table 8. Results of different feature extraction modules.
Feature Extractor | mAP
TD-DFP with ResNet | 0.504 (±0.025)
TD-DFP with Ghost | 0.773 (±0.028)
TD-DFP with EfficientNetLite | 0.767 (±0.017)
TD-DFP with ShuffleNetV2 without FPN | 0.641 (±0.162)
TD-DFP with ShuffleNetV2 | 0.828 (±0.018)
Table 9. Experimental results of different auxiliary guidance modules.
Method | Parameters | mAP
TD-DFP without AGM | 13.79 M | 0.754 (±0.012)
TD-DFP with AGM | 19.50 M | 0.828 (±0.018)
Table 10. Experimental results of keyframe selection strategies.
Keyframe Selection | Forward_Time ¹ | Decode_Time ² | Viz_Time ³ | FPS
without keyframe selection | 3.30 s (±0.25) | 0.62 s (±0.01) | 0.45 s (±0.02) | 61.88 (±0.46)
2 frames interval | 1.65 s (±0.19) | 0.30 s (±0.02) | 0.45 s (±0.03) | 125.00 (±0.50)
3 frames interval | 1.10 s (±0.17) | 0.20 s (±0.04) | 0.45 s (±0.02) | 171.42 (±0.38)
¹ Forward_time represents the video frame feature extraction time, ² Decode_time represents the detection time, and ³ Viz_time represents the time to plot the detection results on the video frames.
