Article

Multi-Modal Fusion Network with Multi-Head Self-Attention for Injection Training Evaluation in Medical Education

1 Department of Medical Engineering, Graduate School of Science and Engineering, Chiba University, Chiba 263-8522, Japan
2 Department of Orthopedic Surgery, Chiba University, Chiba 260-0856, Japan
3 Center for Frontier Medical Engineering, Chiba University, Chiba 263-8522, Japan
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(19), 3882; https://doi.org/10.3390/electronics13193882
Submission received: 21 August 2024 / Revised: 24 September 2024 / Accepted: 29 September 2024 / Published: 30 September 2024

Abstract

The COVID-19 pandemic has significantly disrupted traditional medical training, particularly in critical areas such as the injection process, which require expert supervision. To address the challenges posed by reduced face-to-face interactions, this study introduces a multi-modal fusion network designed to evaluate the timing and motion aspects of the injection training process in medical education. The proposed framework integrates 3D reconstructed data and 2D images of hand movements during the injection process. The 3D data are preprocessed and encoded by a Long Short-Term Memory (LSTM) network to extract temporal features, while a Convolutional Neural Network (CNN) processes the 2D images to capture detailed image features. These encoded features are then fused and refined through a proposed multi-head self-attention module, which enhances the model’s ability to capture and weigh important temporal and image dynamics in the injection process. The final classification of the injection process is conducted by a classifier module. The model’s performance was rigorously evaluated using video data from 255 subjects with assessments made by professional physicians according to the Objective Structured Assessment of Technical Skill—Global Rating Score (OSATS-GRS)[B] criteria for time and motion evaluation. The experimental results demonstrate that the proposed data fusion model achieves an accuracy of 0.7238, an F1-score of 0.7060, a precision of 0.7339, a recall of 0.7238, and an AUC of 0.8343. These findings highlight the model’s potential as an effective tool for providing objective feedback in medical injection training, offering a scalable solution for the post-pandemic evolution of medical education.

1. Introduction

The COVID-19 pandemic has profoundly reshaped numerous aspects of society, with medical education being one of the most impacted fields [1,2]. In particular, the training of essential clinical skills such as venipuncture, which traditionally depends on direct, hands-on guidance from experienced instructors, has encountered unprecedented challenges [3]. The stringent requirements for social distancing and the reduction of in-person interactions have made these conventional training methods difficult, if not impossible, to implement effectively. While adaptations were made to maintain the reliability of clinical training, these disruptions reduced opportunities for medical students to gain practical, in-person experience [4]. This has underscored the need for innovative solutions to ensure the continued development of clinical competency in such challenging circumstances.
To address these unprecedented challenges in medical education, there is a pressing need to adopt innovative solutions that can replace traditional methods in the post-pandemic era, ensuring that the continuity and effectiveness of medical education are maintained even under social distancing constraints [5]. Various technological advancements have been explored and implemented to facilitate remote and simulation-based learning [6,7,8], which have become pivotal in medical training during and after the COVID-19 pandemic.

1.1. Deep Learning (DL) in Medical Training

In recent years, the application of deep learning in medical education has rapidly expanded, particularly during the COVID-19 pandemic, where the need for remote and automated solutions became more urgent [9,10]. Deep learning models have been employed in various aspects of medical training, ranging from automated assessment to providing real-time feedback. These technologies offer more consistent and objective clinical support, which has been traditionally reliant on the direct supervision of instructors. For example, Convolutional Neural Networks (CNNs) have been extensively used in medical imaging, providing critical diagnostic support and significantly improving disease detection and classification in specialties such as radiology and dermatology [11,12]. In radiology, CNNs have been utilized to detect lung nodules in CT (computed tomography) scans with remarkable accuracy, aiding radiologists in making faster and more accurate diagnoses [13]. These advancements lay the foundation for incorporating deep learning into medical education.
More recently, deep learning models trained on large datasets have increasingly been applied to simulation-based medical training [14]. AI (Artificial Intelligence)-driven platforms assess procedural performance and provide personalized feedback to trainees [15]. These systems offer a standardized approach to training, ensuring that all students receive consistent, high-quality evaluations regardless of geographical or time constraints [16]. This is particularly important in the post-pandemic era, where the demand for remote and flexible training solutions has become more pronounced. AI systems, like those using CNN and LSTM networks, enable precise, automated assessments of clinical procedures, capturing both spatial and temporal dynamics to evaluate performance objectively [17,18].
Despite the progress in applying deep learning models to specific tasks, the evaluation of injection training remains an area with significant research potential. Integrating AI into the evaluation of clinical skills offers an opportunity to transform medical education, allowing for more effective and efficient skill acquisition, especially in high-stakes, procedural training such as injections.

1.2. Fusion Models in Medical Research

In recent years, the use of fusion models in medical research has garnered increasing attention as a means to enhance the accuracy and robustness of predictive models. Fusion models combine data from multiple sources or modalities—such as imaging, clinical data, and temporal sequences—to leverage the complementary strengths of each modality. This approach has been particularly successful in areas such as medical diagnosis, where integrating diverse data types can lead to more accurate and comprehensive assessments [19]. For instance, in the field of medical imaging, fusion models that combine Magnetic Resonance Imaging (MRI), CT scans, and histopathological data have been developed to improve the accuracy of tumor detection and classification [20]. These models can capture different aspects of the disease, providing a more holistic view that enhances diagnostic precision. Similarly, in surgical skill assessment, fusion models that integrate video data with sensor-based information have been used to provide more detailed and accurate evaluations of surgical performance [21]. By combining visual and kinetic data, these models offer a more nuanced understanding of the surgeon’s skill level, leading to better training outcomes.
The fusion of multi-modal data not only improves the accuracy of predictive models but also enhances their generalizability across different contexts. In the context of medical training, fusion models can provide a more comprehensive assessment of clinical skills by integrating various data sources, such as 2D imaging and 3D spatial data [22,23]. This approach addresses the limitations of single-modality models, which may not capture all the relevant aspects of a complex medical procedure.
Despite these advancements, the application of fusion models in evaluating the injection process is still in its early stages. The current study aims to fill this gap by proposing a multi-modal fusion network that integrates 3D reconstructed data and 2D images to assess the injection process in medical training. The network extracts features using LSTM and CNN, combined with a self-attention mechanism, in accordance with the OSATS-GRS[B] criteria for time and motion evaluation, to enhance the assessment of medical injection training.

2. Materials and Methods

2.1. Data Acquisition

In this study, we utilized a custom-developed multi-camera acquisition system, as depicted in Figure 1. The hardware setup consists of three industrial cameras positioned at different angles, all of which are DFK 33UX290 CMOS cameras produced by IMAGINGSOURCE (Bremen, Germany) with a resolution of 1920 × 1080 and a frame rate of 40 fps. The system also includes a standard arm model widely used in medical education to simulate the injection process (Kyoto Science Blood Sampling and Injection Simulator Shinjo-II, Kyoto, Japan) and a general-purpose event detection camera used to detect the occurrence of blood backflow during puncture [24].
In this experiment, successful injection is defined by the occurrence of blood backflow when the needle punctures the vessel. A photo sensor is employed to detect signal fluctuations and record the timing, assessing injection efficiency. The procedure requires completion of the injection within 30 s.

2.2. Data Preprocessing

2.2.1. Evaluation Criteria

In this experiment, the standard used to evaluate the injection process is the original Objective Structured Assessment of Technical Skill (OSATS) [25]. The OSATS concept consists of a three-part assessment form, including a task-specific checklist, a global rating scale, and a pass/fail judgment [26]. From this original framework, the Association of Surgeons of the Netherlands, which is responsible for the format and content of the general surgery residency training program, adopted only the Global Rating Score (GRS) in a modified form, as it has been shown to be superior to task-specific checklists in terms of reliability and validity [27], as shown in Table 1.
Given that this study focuses on the motion and time parameters of the injection process, specifically the B criterion in Table 1, the evaluation in this study will be conducted based on the experts’ GRS[B].

2.2.2. Dataset

The dataset was entirely self-collected and consists of hand injection video data from 225 fourth-year medical students at the School of Medicine, Chiba University, which were used to evaluate the proposed method.
  • Efficiency Detection (Time Parameters)
To quantitatively analyze the efficiency of the injection process, it was necessary to define different time stages, as illustrated in Figure 2.
The defined stages are as follows:
- Stage A Time = time at which the needle punctures the skin [28] − start time;
- Stage B Time = time of blood backflow − time at which the needle punctures the skin.
After defining these time stages, each participant performed the injection experiment three times, resulting in a total of 225 experimental datasets. We analyzed the video frame results for both Stage B and Stage A + B. As summarized in Table 2, we present the mean, standard deviation, median, maximum, and minimum values, and the Intraclass Correlation Coefficient (ICC) for each stage. The shortest operation times corresponded to 65 frames for Stage A + B and 31 frames for Stage B. These frame counts are critical for further data processing and provide a reliable basis for evaluating the injection process.
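As a concrete illustration, the following sketch converts detected event frames into stage durations at the 40 fps recording rate; the frame indices shown are hypothetical.

```python
# Hypothetical event frames for one trial, recorded at 40 fps.
FPS = 40
start_frame = 0          # timing button pressed
puncture_frame = 140     # needle punctures the skin (detected as in [28])
backflow_frame = 260     # blood backflow detected by the photo sensor

stage_a_frames = puncture_frame - start_frame      # Stage A
stage_b_frames = backflow_frame - puncture_frame   # Stage B

print(f"Stage A: {stage_a_frames} frames = {stage_a_frames / FPS:.2f} s")
print(f"Stage B: {stage_b_frames} frames = {stage_b_frames / FPS:.2f} s")
print(f"Stage A + B: {(stage_a_frames + stage_b_frames) / FPS:.2f} s")
```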
  • Motion Detection (3D Reconstructed Data)
In previous research [29], a multi-camera system was used to record the hand injection process from different perspectives, effectively addressing the issue of finger occlusion that can occur with single-view tracking. The Mediapipe Hand framework [30] was then employed to identify the 2D joint positions of both hands from the various camera views. By utilizing the positional relationships of the same landmarks across different cameras and applying camera calibration to establish a world coordinate system, the 2D coordinates were converted into 3D space. To further enhance accuracy, the results from different camera combinations were compared, and optimized 3D reconstructed data of both hands were generated. The method’s accuracy was validated through comparison with the VICON system. The visualization of the 3D reconstruction is shown in Figure 3.
After obtaining the 3D positions of the finger joint landmarks, we can further extract features, including the angles between the joints of the injecting fingers (thumb–index), as shown in Figure 4. Additionally, the geometric distances between joints and the Euclidean distance between the thumb and index fingertips provide more detailed information for subsequent model training.
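A minimal NumPy sketch of how such kinematic features could be computed from the reconstructed 3D landmarks is given below; the landmark indices follow the MediaPipe Hand convention, and the specific joints and function names are illustrative rather than the authors’ exact feature set.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by the 3D points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def frame_features(landmarks):
    """landmarks: (21, 3) array of one hand's 3D joints (MediaPipe indexing)."""
    thumb_tip, index_tip = landmarks[4], landmarks[8]
    return {
        # Example joint angles of the injecting fingers (thumb and index).
        "thumb_ip_angle": joint_angle(landmarks[2], landmarks[3], landmarks[4]),
        "index_pip_angle": joint_angle(landmarks[5], landmarks[6], landmarks[7]),
        # Euclidean distance between the thumb and index fingertips.
        "tip_distance": float(np.linalg.norm(thumb_tip - index_tip)),
        # Example geometric distance between neighboring joints.
        "index_mcp_pip_dist": float(np.linalg.norm(landmarks[5] - landmarks[6])),
    }
```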

2.3. Network Framework

Our framework is illustrated in Figure 5. In Figure 5A, the framework first takes as input the 3D reconstructed data of hand movements and 2D images of the injection procedure. The 3D data undergo preprocessing and are then encoded by an LSTM encoder module to extract features. Simultaneously, the 2D image data are also preprocessed and encoded using a CNN encoder module to extract features. The features extracted from both modalities are concatenated and passed into the multi-head self-attention module shown in Figure 5C, which captures the latent modality-specific relationships within each modality.
Figure 5B represents the bidirectional LSTM module used within the framework. The bidirectional LSTM processes the input sequence X_t from both directions (forward and backward), allowing the model to capture dependencies in both past and future contexts. The outputs Y_t from each LSTM unit are subsequently combined to form the final sequence output.
Figure 5C shows the structure of the multi-head self-attention module. In this module, the input features (F_τ) are processed through the scaled dot-product attention mechanism. This process starts by applying linear transformations to the input data, producing three key components: Q_τ (queries), K_τ (keys), and V_τ (values). These components are then involved in a dot-product operation, which is scaled to improve stability, followed by a softmax activation that ensures the attention scores sum to 1. After this, dropout is applied to prevent overfitting. Finally, the output of the attention mechanism (A_τ) is combined with the original input through a skip connection, allowing the model to retain the original features while incorporating the attention-enhanced information before further processing.
Finally, the processed features are passed through the classifier module, where multiple layers, including batch normalization, dropout, and linear layers, refine the features before making the final prediction about the injection process. This comprehensive approach leverages both temporal and image information, enhancing the assessment of injection training in a multi-modal context.
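To make the data flow concrete, the following PyTorch sketch composes the modules described above (LSTM encoder, precomputed CNN image features, concatenation, multi-head self-attention with a skip connection, and a classifier head). The feature dimensions and module hyperparameters are illustrative assumptions, not the authors’ exact configuration; only the use of eight attention heads is taken from the experiments reported later.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Illustrative composition of the modules in Figure 5A (all dimensions assumed)."""

    def __init__(self, motion_dim=20, lstm_hidden=128, img_feat_dim=2048, n_classes=3):
        super().__init__()
        # Bidirectional LSTM encoder for the 3D motion sequence (Figure 5B).
        self.lstm = nn.LSTM(motion_dim, lstm_hidden, batch_first=True, bidirectional=True)
        fused_dim = 2 * lstm_hidden + img_feat_dim   # concatenated temporal + image features
        # Multi-head self-attention over the fused feature vector (Figure 5C).
        self.attn = nn.MultiheadAttention(embed_dim=fused_dim, num_heads=8,
                                          dropout=0.2, batch_first=True)
        # Classifier head, simplified here; the full stack appears in Section 2.3.4.
        self.classifier = nn.Sequential(nn.Linear(fused_dim, 256), nn.ReLU(),
                                        nn.Linear(256, n_classes))

    def forward(self, motion_seq, image_feat):
        # motion_seq: (B, T, motion_dim); image_feat: (B, img_feat_dim), e.g., ResNet-50 avgpool output.
        _, (h_n, _) = self.lstm(motion_seq)
        temporal = torch.cat([h_n[-2], h_n[-1]], dim=1)      # forward and backward final states
        fused = torch.cat([temporal, image_feat], dim=1)     # multi-modal concatenation
        a, _ = self.attn(fused.unsqueeze(1), fused.unsqueeze(1), fused.unsqueeze(1))
        fused = fused + a.squeeze(1)                         # skip connection
        return self.classifier(fused)
```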

2.3.1. Bidirectional LSTM Module

The LSTM network is a type of RNN that is particularly well suited for processing sequences of data, which makes it advantageous for tasks involving temporal dependencies [31], such as the analysis of sequential hand movement data in medical injection training. Traditional RNNs often struggle with learning long-term dependencies due to the vanishing gradient problem. However, LSTM networks address this issue with a unique architecture that includes memory cells, input gates, forget gates, and output gates, allowing the network to maintain and update information over longer periods.
To effectively utilize the LSTM module, the input data must be structured as sequential data. In our study, the 3D reconstructed data of hand movements are transformed into a time-series format. To standardize the sequence length across trials, we employed three different padding methods (a minimal sketch of the three strategies follows the list):
  • Zero Padding (Fill 0): This method involves padding the data with zeros to standardize the timesteps across different sequences until it reaches the experimental setting of 30 s (at 40 fps), resulting in a total of 1200 timesteps.
  • Repeat Padding: In this method, data from different timesteps are repeated until it reaches the experimental setting of 30 s (at 40 fps), resulting in a total of 1200 timesteps.
  • Average Extraction: This method involves evenly dividing each person’s data and then extracting the first frame from each division, ensuring that all data sequences have the same number of timesteps. The number of divisions is determined based on the minimum operation times corresponding to 65 frames for Stage A + B and 31 frames for Stage B, as discussed in Section 2.2.2.
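The sketch below illustrates the three strategies, assuming each trial is stored as a (timesteps, features) NumPy array. The target lengths (1200 timesteps for 30 s at 40 fps; 65 or 31 frames for average extraction) come from the text, while the function names and the whole-sequence interpretation of repeat padding are assumptions.

```python
import numpy as np

def zero_pad(seq, target_len=1200):
    """Pad with zeros up to 30 s at 40 fps (1200 timesteps)."""
    pad = target_len - len(seq)
    if pad <= 0:
        return seq[:target_len]
    return np.vstack([seq, np.zeros((pad, seq.shape[1]))])

def repeat_pad(seq, target_len=1200):
    """Repeat the sequence until it reaches the target length (one plausible reading)."""
    reps = int(np.ceil(target_len / len(seq)))
    return np.tile(seq, (reps, 1))[:target_len]

def average_extract(seq, target_len=65):
    """Evenly divide the sequence and keep the first frame of each division."""
    idx = (np.arange(target_len) * (len(seq) / target_len)).astype(int)
    return seq[idx]
```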

2.3.2. CNN Module

The CNN module in our framework is responsible for extracting image features from the 2D images captured during the injection process using three different cameras. To ensure precise focus on the hand movements and minimize noise interference, the Regions of Interest (ROIs) were manually defined for the hand movement areas in each camera view. This manual segmentation ensures that the model focuses on the most relevant areas, keeping the hand movements consistently within the ROI while excluding unnecessary environmental noise. Once the ROIs are set, they are resized to 224 × 224 pixels to prepare for further processing.
Subsequently, we utilize a ResNet-50 model [32] to obtain feature vectors. The features are extracted from the intermediate avgpool layer, which captures high-level spatial features crucial for accurately evaluating the injection process.
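One possible way to obtain such avgpool features with torchvision is sketched below; the ImageNet-pretrained weights and normalization constants are assumptions, since the text does not specify the backbone’s initialization.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# ResNet-50 backbone truncated after the avgpool layer (the final fc layer is dropped).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # pretrained weights assumed
feature_extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                       # ROI resized to 224 x 224 as in the text
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics (assumption)
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(roi_image: Image.Image) -> torch.Tensor:
    """Return a 2048-dimensional avgpool feature vector for one ROI image."""
    x = preprocess(roi_image).unsqueeze(0)               # (1, 3, 224, 224)
    with torch.no_grad():
        feat = feature_extractor(x)                      # (1, 2048, 1, 1)
    return feat.flatten(1).squeeze(0)                    # (2048,)
```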

2.3.3. Multi-Head Self-Attention Module

The multi-head self-attention module plays a crucial role in our framework by capturing the latent relationships within each modality. This is achieved by applying the self-attention mechanism across multiple heads, which allows the model to focus on various parts of the input data simultaneously.
The detailed structure of our multi-head attention module, as illustrated in Figure 5C, begins with the input feature vector F_τ, which represents the concatenated features from previous layers. This input vector undergoes separate linear transformations to produce three distinct components: Query (Q_τ), Key (K_τ), and Value (V_τ). These are computed as follows:
Q_τ = W_Q · F_τ,
K_τ = W_K · F_τ,
V_τ = W_V · F_τ,
where W_Q, W_K, and W_V are the learned weight matrices for the Query, Key, and Value, respectively.
The Query (Q_τ) represents the input in the context of attention, and it is used to determine which parts of the input sequence are important for the current input step. The Key (K_τ) provides the criteria that the Query will use to compare and match relevant information. The Value (V_τ) contains the actual information that will be aggregated based on the matching scores between the Query and Key.
The attention mechanism works by computing a similarity score between the Query and Key, which is used to weight the Values. This weighted sum of the Values is then passed to subsequent layers. These components allow the model to focus on different parts of the input sequence, depending on the task, enabling a more effective learning of complex dependencies.
The attention mechanism computes attention scores by performing a scaled dot-product between the Query Q_τ and Key K_τ components:
Attention(Q_τ, K_τ, V_τ) = SoftMax(Q_τ · K_τ^T / √d_k) · V_τ.
Here, d_k represents the dimensionality of the Key vector, and the scaling factor 1/√d_k is applied to stabilize gradients during training.
The resulting scores from the scaled dot-product operation are passed through a SoftMax function to generate attention weights, which are then subjected to Dropout for regularization:
A_τ = Dropout(SoftMax(Q_τ · K_τ^T / √d_k)) · V_τ.
These attention weights are used to compute a weighted sum of the Value V_τ components, yielding the output vector A_τ.
Finally, the output vector A_τ is combined with the original input feature F_τ through an element-wise addition (skip connection):
F_τ = F_τ + A_τ.
This skip connection ensures that the model retains the original input features while effectively incorporating the information modulated by the attention mechanism.
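The computation described by these equations can be written as the following single-head PyTorch sketch with dropout and a skip connection; a multi-head version would run h such projections in parallel, and the feature dimension and dropout rate used here are assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One self-attention head with dropout and a skip connection (cf. Figure 5C)."""

    def __init__(self, dim, p_drop=0.2):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
        self.dropout = nn.Dropout(p_drop)

    def forward(self, f):
        # f: (batch, tokens, dim) fused features F_tau
        q, k, v = self.w_q(f), self.w_k(f), self.w_v(f)
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # scaled dot-product
        weights = self.dropout(torch.softmax(scores, dim=-1))
        a = weights @ v                                      # attention output A_tau
        return f + a                                         # skip connection: F_tau + A_tau
```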

2.3.4. Classifier Module

The classifier module in our framework is designed to map the extracted features to the final output classes through a carefully structured sequence of operations. This module consists of a series of fully connected (linear) layers, batch normalization, and dropout layers.
  • Linear Layers: The classifier begins with a linear layer that projects the concatenated features from the CNN and LSTM encoders into a 256-dimensional space. This is followed by another linear layer that reduces the dimensionality to 128. The final linear layer maps this 128-dimensional feature vector to the output classes, which are grouped into three broad classes, corresponding to GRS[B] scores of 3, 4, and 5.
  • Batch Normalization: After each linear transformation, batch normalization is applied. This technique normalizes the input to each layer, stabilizing the learning process and improving the model’s convergence speed [33]. Specifically, batch normalization is applied after both the first and second linear layers.
  • Dropout: To prevent overfitting, dropout layers are strategically placed between the linear layers with a dropout rate of 0.2. Dropout randomly sets 20% of the input units to zero during training, promoting the model’s robustness and its ability to generalize to unseen data.
The sequence of linear layers, interspersed with batch normalization and dropout, ensures that the classifier can effectively process the features extracted by the earlier stages of the model and output a prediction with high accuracy. This structured approach not only enhances the model’s performance but also contributes to its stability and generalizability.
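A minimal PyTorch sketch consistent with this description is given below. The 256- and 128-dimensional layers, the three output classes (GRS[B] scores 3, 4, and 5), and the 0.2 dropout rate come from the text; the activation function and the input dimensionality are assumptions.

```python
import torch.nn as nn

def build_classifier(fused_dim=2304, n_classes=3, p_drop=0.2):
    """Linear -> BatchNorm -> Dropout stack mapping fused features to GRS[B] classes (input width assumed)."""
    return nn.Sequential(
        nn.Linear(fused_dim, 256),
        nn.BatchNorm1d(256),
        nn.ReLU(),                 # activation assumed; not specified in the text
        nn.Dropout(p_drop),
        nn.Linear(256, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(128, n_classes),
    )
```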

2.4. Evaluation Method

2.4.1. Evaluation Metrics

The primary focus of this study is to evaluate the classification performance of our multi-modal AI model in assessing injection tasks. Several commonly used evaluation metrics were employed, including accuracy (Acc), area under the receiver operating characteristic curve (AUC), recall, precision (Pre), and F1-score. These metrics are defined as follows:
  • Accuracy (Acc): The ratio of correctly predicted instances to the total instances.
    Acc = (TP + TN) / (TP + TN + FP + FN),
    where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
  • Precision (Pre): The ratio of correctly predicted positive instances to the total predicted positive instances.
    Pre = TP / (TP + FP).
  • Recall: The ratio of correctly predicted positive instances to all instances in the actual class.
    Recall = TP / (TP + FN).
  • F1-Score: The harmonic mean of Precision and Recall.
    F1-Score = 2 × Pre × Recall / (Pre + Recall).
  • Area Under the Receiver Operating Characteristic Curve (AUC): Represents the degree or measure of separability, showing the model’s ability to distinguish between classes. It is the area under the ROC curve.
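For reference, these metrics can be computed with scikit-learn as sketched below; the weighted averaging and one-vs-rest multi-class AUC are assumptions, since the paper does not state its exact averaging scheme.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: class labels; y_prob: (n_samples, n_classes) predicted probabilities."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Pre": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1": f1_score(y_true, y_pred, average="weighted"),
        # Multi-class AUC, one-vs-rest with macro averaging (assumed scheme).
        "AUC": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```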

2.4.2. Model Training and Hyperparameters

Our model was trained in a two-stage process. First, the LSTM model was trained using the 3D motion data to capture the temporal features of hand movements during the injection process. Following this, the CNN model was trained using image data from the different camera angles to extract spatial features. Finally, the pre-trained LSTM and CNN models were used to fine-tune the entire multi-modal fusion framework, allowing for the effective integration of both spatial and temporal features. All experiments were conducted on a PC with an Intel(R) i7-14700K (3.40 GHz) CPU, an NVIDIA TITAN RTX GPU, and 32 GB of RAM.
The specific hyperparameters used in the training process were as follows:
  • For the 2D image model, the learning rate was set to 0.001, the weight decay was 5 × 10⁻⁴, and the model was trained for 100 epochs.
  • For the 3D data model, the learning rate was set to 0.001, the weight decay was 5 × 10⁻⁴, and the model was trained for 100 epochs.
  • For the fusion model, the learning rate was also set to 0.001, with a weight decay of 5 × 10⁻⁴, and the model was trained for 100 epochs.
This approach ensured that each component of the model was independently trained to capture its respective feature set before integrating them in the final fusion model for better overall performance.
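For illustration, these hyperparameters translate into a training configuration similar to the sketch below, which reuses the FusionNet class from the earlier sketch. The Adam optimizer, the cross-entropy loss, and the dummy data loader are assumptions, as the text specifies only the learning rate, weight decay, and number of epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors just to make the sketch runnable; real inputs are the 3D motion
# sequences and 2D image features described above (dimensions assumed).
motion = torch.randn(16, 65, 20)          # (samples, timesteps, motion features)
images = torch.randn(16, 2048)            # ResNet-50 avgpool features
labels = torch.randint(0, 3, (16,))       # GRS[B] classes mapped to {0, 1, 2}
train_loader = DataLoader(TensorDataset(motion, images, labels), batch_size=8)

model = FusionNet(motion_dim=20)          # fusion model from the earlier sketch
criterion = nn.CrossEntropyLoss()         # loss function assumed; not stated in the text
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

for epoch in range(100):                  # 100 epochs, as reported
    for motion_seq, image_feat, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(motion_seq, image_feat), y)
        loss.backward()
        optimizer.step()
```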

2.4.3. Assessment Criteria

We evaluated the model’s performance across four different configurations, which were each designed to assess various aspects of the training and fusion approach:
  • LSTM-based model (3D reconstructed data): This configuration evaluated the impact of different data padding methods and the effects of temporal variations (Stage A + B and Stage B) on model accuracy, focusing on the temporal aspects of the injection process.
  • CNN-based model (2D image features): This configuration assessed performance based on 2D image data collected from various camera angles and viewpoints, allowing the model to capture spatial features of the injection process.
  • Multi-modal fusion model (without adaptive module): In this setup, we fused the features extracted by the two unimodal models (LSTM and CNN) without applying the adaptive module to investigate how feature fusion alone impacts performance.
  • Proposed multi-modal fusion model (with multi-head adaptive module): The final evaluation was conducted using our proposed model, which integrates both the fused features and the multi-head self-attention mechanism to refine the fusion process for better performance.

2.4.4. Comparative Experiments

Since our approach of utilizing multi-modal data fusion for evaluating injection training is original in this field, direct comparisons with previous studies were not feasible. Therefore, we conducted comparative experiments to benchmark the performance of our fused feature vector against other well-established models, including both single-modal and multi-modal approaches.
For single-modal comparisons, we used the following:
  • CNN-based model (2D image data);
  • LSTM-based model (3D reconstructed data).
For multi-modal comparisons, we tested our approach against the following:
  • Gradient Boosting Decision Trees (GBDT) [34];
  • Random Forest (RF) [35];
  • Support Vector Machine (SVM) [36];
  • No-MHSA model (multi-modal fusion without multi-head self-attention).
By comparing our proposed model against these baseline models, we aimed to demonstrate the effectiveness of multi-modal fusion and the addition of the multi-head self-attention mechanism in improving accuracy, precision, and other performance metrics.
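As an illustration of how such baselines could be evaluated on the concatenated feature vectors, the scikit-learn sketch below uses GradientBoostingClassifier, RandomForestClassifier, and SVC as stand-ins for the GBDT, RF, and SVM baselines. The paper’s actual implementations and hyperparameters (e.g., LightGBM for GBDT [34]) are not reproduced here, and the random features are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder fused features and labels; in the study these would be the
# concatenated image + motion features and the GRS[B] scores.
rng = np.random.default_rng(0)
X_fused = rng.normal(size=(225, 128))
y = rng.integers(0, 3, size=225)

baselines = {
    "GBDT": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
for name, clf in baselines.items():
    acc = cross_val_score(clf, X_fused, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: 5-fold accuracy = {acc:.4f}")
```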

3. Experiments and Results

In the 5-fold cross-validation, we conducted an extensive comparison to validate the feasibility of the proposed hand injection simulation training model under different configurations. All the results presented are the average performance metrics across all five folds, ensuring a robust evaluation of the model’s consistency and generalizability.
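One common way to realize this protocol is sketched below; whether the folds were stratified by GRS[B] score is not stated in the paper, so the stratified split and the placeholder values are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels standing in for the GRS[B] scores of the trials.
y = np.random.randint(0, 3, size=225)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # stratification assumed

fold_accuracies = []
for train_idx, test_idx in skf.split(np.zeros_like(y), y):
    # Train the fusion model on train_idx and evaluate on test_idx (omitted here).
    fold_accuracies.append(0.0)          # replace with the fold's test accuracy

# All reported metrics are the averages over the five folds.
mean_accuracy = float(np.mean(fold_accuracies))
```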

3.1. Single-Modal Model Results

3.1.1. LSTM-Based Model for 3D Reconstructed Data

For the single-modal model using 3D reconstructed data, our feature parameters included the angles between the joints of the injecting fingers (thumb–index), the geometric distances between joints, and the Euclidean distance between the thumb and index fingertips, as shown in Figure 6.
Additionally, we aimed to determine the optimal data padding method and considered the impact of different temporal stages (Stage A + B and Stage B) on the accuracy of the assessment. The results are presented in Table 3.
The results presented in Table 3 illustrate the impact of different data padding methods and temporal stages on the performance metrics. Several key observations can be made:
  • Data-Padding Methods: Among the various padding methods, the “zero padding” approach consistently yielded the highest test accuracy across both Stage A + B and Stage B. This suggests that padding sequences with zeros might help maintain the continuity of the data sequence, leading to more reliable predictions. The “Average Extraction” method, which involves extracting specific frames, also performed relatively well, particularly in terms of mean AUC and F1-score, indicating its potential in scenarios where the regularity and consistency of data are critical. On the other hand, the “Repeat Padding” method demonstrated the lowest performance across most metrics, indicating that repeating data may introduce noise or redundancy that negatively affects the model’s accuracy.
  • Temporal Stages: When comparing the results across different temporal stages, it becomes clear that Stage B, which focuses on the critical moments immediately after needle insertion, generally outperforms Stage A + B in terms of test accuracy and recall. This suggests that emphasizing the period following needle puncture is more effective for evaluating the precision and consistency of the injection process. The higher accuracy and recall rates indicate that Stage B is particularly adept at capturing the key aspects of hand motion and needle control during the most crucial part of the injection procedure. This focus on the post-puncture phase aligns with clinical needs where the accuracy and safety of needle handling are paramount, making Stage B a more effective temporal range for assessing injection skills.
In summary, the analysis suggests that the “zero padding” method, coupled with the focused temporal range of Stage B, provides the most reliable and precise results for assessing hand injection simulation training. These findings will guide the selection of data processing techniques in future experiments to ensure optimal model performance.

3.1.2. CNN-Based Model for 2D Image

In addition, we investigated the evaluation of the hand injection process using 2D images captured from multiple viewpoints. This approach is designed to capture hand movements from various angles, ensuring that critical details are not overlooked due to occlusions between the fingers. The recorded video frames were extracted using the ’Average Extraction’ method, with 65 frames selected as the hyperparameter, which was consistent with prior experiments. Furthermore, to demonstrate the impact of different stages on the results across various modalities, we conducted additional comparative experiments. Based on the image modality data, we manually set the ROI to focus more closely on the hand movement areas, reducing the noise caused by the surrounding environment. The architecture of the CNN-based predictive model using 2D image features is illustrated in Figure 7.
The results presented in Table 4 provide a comparative analysis of the CNN-based model’s performance across different camera combinations and temporal stages (Stage A + B and Stage B).
  • Camera Combinations: The performance metrics demonstrate that as more camera views are combined, the model’s accuracy and other evaluation metrics generally improve. For instance, the combination of all three cameras (Cam 1-2-3) consistently yields higher accuracy, mean AUC, and F1-scores compared to single-camera setups. This trend suggests that incorporating multiple viewpoints helps capture a more comprehensive set of features, reducing occlusions and capturing important hand movements that may be missed by a single camera. This comprehensive capture leads to better model performance.
  • Stage A + B or Stage B: When comparing the two temporal stages, Stage B outperforms Stage A + B across most metrics. Notably, the highest test accuracy, mean AUC, and F1-score for Stage B are consistently higher than those for Stage A + B across all camera combinations. This suggests that focusing on the critical period immediately following needle insertion (Stage B) is more effective for accurately assessing the injection process. Stage B captures the most essential part of the injection, where precision and control are paramount, leading to better model performance in this temporal stage.
  • Impact of Specific Camera Combinations:
- Single-Camera Performance: Among the single-camera setups, Cam 3 tends to perform better than Cam 1 and Cam 2. This may indicate that the positioning of Cam 3 provides a more advantageous angle for capturing critical aspects of the hand movements during the injection process.
- Dual-Camera Combinations: The combination of Cam 2-3 offers the best performance among dual-camera setups, particularly in Stage B, achieving the highest accuracy of 59.13%. This suggests that these two angles complement each other well, providing a more complete view that enhances the model’s ability to accurately assess the injection process.
- All Cameras Combined (Cam 1-2-3): Combining all three cameras yields the best overall performance, indicating that the diverse perspectives provided by multiple cameras result in a more robust and accurate model. However, the slight reduction in accuracy compared to the best dual-camera setup suggests that Cam 1 might introduce additional noise, which could slightly detract from the model’s effectiveness.
The analysis indicates that the model performs best when using a combination of cameras 2 and 3 (left and right cameras) and focusing on Stage B, highlighting the critical moments following needle insertion. This approach captures the essential features needed for an accurate assessment of the injection process.
When comparing the two single-modal models, the 3D reconstructed data model outperforms the 2D image-based model, achieving higher accuracy, with an improvement of 6.6%. This demonstrates the effectiveness of using 3D reconstructed data for evaluating the hand injection process.

3.2. Multi-Modal Model Results

3.2.1. Multi-Modal Fusion Model

Despite the modest improvements achieved with the 3D reconstructed data over the 2D image data, the assessment accuracy for hand injection remained below the desired threshold. The results suggest that neither modality alone sufficiently captures the complexity of hand injection movements. To address this limitation, we explored a multi-modal fusion approach, since integrating both 2D image data and 3D reconstructed data could lead to more accurate predictions.
The basic framework of the proposed multi-modal fusion model is illustrated in Figure 8. This model integrates features extracted from the 2D images and the 3D reconstructed data. Specifically, the CNN encoder module is employed to extract image features from the 2D images captured from multiple viewpoints. The images undergo preprocessing, including region of interest (ROI) extraction, followed by the average extraction method to obtain 65 frames, which serve as input to the CNN. Simultaneously, an LSTM encoder module is used to capture the temporal dynamics of the hand movements, focusing on joint angles, joint distances, and fingertip distances from the 3D reconstructed data. The image and temporal features obtained from the CNN and LSTM modules are concatenated to form a fused feature vector. This vector is then processed by a fully connected classifier network, which consists of several layers including linear layers, batch normalization, and dropout layers. This classifier refines the fused features, ultimately producing the final prediction for the hand injection process.
The classification performance of the proposed multi-modal fusion model is illustrated in Table 5. These results indicate that the model performs relatively well across various metrics, with particularly strong performance in terms of AUC, suggesting that the model is effective in distinguishing between classes. However, while the accuracy and recall are satisfactory, there is room for improvement in the model’s overall performance, particularly in terms of consistency across different metrics.

3.2.2. Multi-Head Self-Attention Module

To further enhance the predictive accuracy and robustness of the model, we introduce a multi-head self-attention mechanism in this section. This mechanism aims to better capture the complex interactions between the image and temporal features by allowing the model to focus on multiple aspects of the input data simultaneously.
The overall structure of the model is illustrated in Figure 5A. A key component of this structure is the “Scaled Dot Product Attention” module, shown in Figure 5C, where the number of attention heads h plays a crucial role in the performance of the model. To determine the optimal number of heads for the attention mechanism, we conducted a series of comparative experiments. The results of these experiments are presented in Table 6.
Based on the performance metrics presented in Table 6, the configuration with h = 8 attention heads yields the highest accuracy. With this configuration, we proceeded to evaluate the multi-modal fusion and multi-head self-attention model on the 5-fold cross-validation to assess its overall effectiveness and robustness in predicting the hand injection process. The summarized results are shown in Figure 9.
The results from the 5-fold cross-validation, as illustrated by the confusion matrix in Figure 9 (left), show an average accuracy of 0.7238 across the folds. While the model consistently performs well in predicting class “5”, there is noticeable variability in distinguishing between classes “3” and “4”. In particular, instances of class “4” were frequently misclassified as class “5”. To further assess the model’s performance, we examined the ROC curves for the three-class classification, as shown in Figure 9 (right), which provide a detailed view of the model’s discriminative ability.
  • Class 3 Performance: The model demonstrates strong performance in predicting Class 3 with an average AUC of 0.93. This high AUC indicates strong sensitivity and specificity in identifying lower skill levels, suggesting that the model effectively distinguishes Class 3 from the other classes.
  • Class 4 Performance: The model’s performance for Class 4 shows more variability, with an average AUC of 0.74. This variability reflects the challenges the model faces in distinguishing between intermediate skill levels, which may overlap with the characteristics of both Class 3 and Class 5. As observed in the confusion matrix, misclassification between Class 4 and the other classes occurs more frequently.
  • Class 5 Performance: The model performs well in predicting Class 5 with an average AUC of 0.83. Although the model generally performs better at identifying high skill levels, the slightly lower AUC compared to Class 3 suggests that at times, high-performing subjects are not as clearly differentiated from those in the intermediate class.
  • Overall Mean ROC: The mean ROC across all classes is 0.84, indicating solid overall performance. The model shows robustness across different folds, although its ability to predict the intermediate class (Class 4) is less consistent, highlighting areas for improvement.

3.3. Comparative Experiments Result

The comparative analysis presented in Table 7 demonstrates the clear advantages of our proposed multi-modal fusion model with multi-head self-attention (labeled “Proposed”) over other approaches across all evaluation metrics, including test accuracy, mean AUC, F1-score, precision, and recall. The integration of multi-modal data fusion and the multi-head self-attention mechanism provides a notable improvement in performance compared to both traditional machine learning models and deep learning models that do not utilize these advanced features.
Specifically, our proposed model achieved the highest test accuracy (0.7238), mean AUC (0.8343), F1-score (0.7060), and precision (0.7339). This indicates that the combination of spatial and temporal features, alongside the ability of the self-attention mechanism to focus on multiple relevant aspects of the data, contributes significantly to the model’s robustness and overall effectiveness in predicting injection process outcomes.
In contrast, the No-MHSA model, which omits the multi-head self-attention mechanism, still performs reasonably well but falls short of the performance levels achieved by the full version of our model. The difference in performance, particularly in precision and F1-score, highlights the crucial role that the self-attention mechanism plays in effectively capturing and leveraging the complex relationships within the fused feature vectors.
Traditional machine learning models such as Gradient Boosting Decision Trees (GBDT), Random Forest (RF), and Support Vector Machine (SVM) exhibit lower performance across all metrics, with mean AUC and F1-scores notably lagging behind those of our proposed model. This outcome underscores the limitations of traditional models in handling the intricacies of multi-modal data, particularly in complex tasks like the evaluation of medical training processes.
Overall, these results validate the effectiveness of our proposed approach, demonstrating its superiority over traditional methods and the importance of incorporating advanced deep learning mechanisms like multi-modal fusion and self-attention for accurate and reliable assessment in medical training contexts.

4. Discussion and Conclusions

4.1. Discussion

The results of this study underscore the critical importance of integrating multi-modal data with advanced attention mechanisms to enhance the accuracy of automated assessment systems in medical training. By combining 2D image data with 3D reconstructed data, the proposed multi-modal fusion model with multi-head self-attention achieved a significant improvement in performance metrics compared to single-modal approaches. Specifically, while the accuracy for the 2D image-based single-modal model was 0.5913, and the 3D reconstructed data model achieved 0.6573, the multi-modal approach elevated this to 0.7238. This improvement in accuracy and other metrics, including AUC (0.8343), precision (0.7339), and F1-score (0.7060), demonstrates the model’s ability to differentiate between skill levels with higher sensitivity and reliability.
Our findings align with previous research on AI applications in medical education, where the fusion of multiple data modalities has shown promise in enhancing prediction accuracy and robustness. For example, studies in the field of surgical skill assessment have demonstrated that integrating video data with sensor-based metrics leads to more precise and objective evaluations of performance compared to single-modality systems. These studies [37,38] highlight the importance of capturing both spatial and temporal features when evaluating complex medical procedures, which is an approach that our multi-modal fusion model also leverages.
Moreover, the use of multi-head self-attention mechanisms further distinguishes our approach from existing methods by enabling the model to focus on relevant aspects of the fused data. This attention mechanism allows the model to better capture nuanced differences in hand movements and task progression, leading to an improved classification of skill levels. This is particularly critical in medical education, where subtle variations in performance can significantly impact clinical outcomes.
This study contributes to advancing the field of AI-driven solutions for medical education, particularly in the area of skill assessment, by proposing a novel multi-modal fusion network that integrates 3D reconstructed data and 2D images. The network utilizes LSTM and CNN to extract temporal and spatial features, and the introduction of a self-attention mechanism allows the model to efficiently capture critical moments during the injection process. This novel architecture provides a comprehensive evaluation of hand movements in medical training, as shown in our improvement over single-modal models (e.g., 0.5913 and 0.6573 accuracy for CNN and LSTM, respectively). Furthermore, extensive experiments validated the effectiveness of this model, demonstrating superior accuracy over existing methods while enabling objective, quantitative assessments of the injection process.
In comparison to existing literature on AI applications in medical education, our model presents several key advantages. First, the ability to process both 2D and 3D data allows for a more comprehensive analysis of the injection process, which is not limited to surface-level observations but also incorporates the depth and temporal flow of hand movements. Second, the use of attention mechanisms facilitates a more targeted evaluation, ensuring that the model focuses on critical stages of the injection task, such as needle insertion and blood backflow, which are vital for accurate skill assessment. Other studies, such as those in surgical training [39,40], have employed similar attention-based models, further validating the importance of this approach in medical skill assessment.
However, while our model outperforms traditional machine learning methods like Gradient Boosting Decision Trees (GBDT) and Random Forest, and even other deep learning models such as CNN and LSTM-based approaches, there are still areas for improvement. The results from the confusion matrices and ROC curves indicate that the model performs well in identifying high skill levels (Class 5) but struggles with distinguishing intermediate skill levels (Class 4) from lower levels (Class 3). This suggests that further refinement of the model’s sensitivity to subtle differences between intermediate and high skill levels is necessary.
Additionally, our study contributes to the growing body of research advocating for AI-driven solutions in the post-pandemic era of medical education. The disruption caused by COVID-19 has accelerated the need for remote and scalable training tools, and our proposed multi-modal fusion model represents a step toward fulfilling that need. By providing a data-driven, objective method of skill assessment, this model could enhance the quality and consistency of medical education, particularly in settings where traditional face-to-face instruction is not feasible.

4.2. Limitations and Future Work

While the results of this study are promising, several limitations warrant further exploration.
  • Classification of Intermediate Skill Levels: The variability in accurately classifying intermediate skill levels suggests that additional features or more advanced models may be necessary to capture the nuanced distinctions between these levels.
  • Dataset Imbalance: A notable limitation is the imbalance in the dataset, particularly within the GRS[B] scores, where there is a lack of representation for lower ratings (e.g., scores of 1 and 2). This imbalance restricts the model’s ability to generalize effectively across the entire spectrum of skill levels.
Future work should aim to address these limitations by obtaining a more balanced dataset and exploring alternative model architectures. Specifically, investigating the use of a 3D Convolutional Neural Network (3D CNN) suited for point cloud data or a Graph Neural Network (GNN) may enable the model to capture both spatial and temporal features more effectively. Furthermore, incorporating novel attention mechanisms or other advanced neural network architectures could significantly improve the model’s performance and its applicability in medical training environments. These future directions could enhance the model’s sensitivity in distinguishing between skill levels and improve generalizability across varied clinical scenarios.

4.3. Conclusions

In conclusion, this study introduces a novel multi-modal fusion model with multi-head self-attention for the automated assessment of hand injection training in medical education. By combining 2D image data with 3D reconstructed motion data, the model achieves a notable improvement over single-modal approaches with an accuracy of 0.7238 and an AUC of 0.8343. These findings underscore the importance of integrating multi-modal data to provide a more accurate and comprehensive evaluation of medical skills.
Our model demonstrates strong performance in distinguishing between different skill levels, particularly in predicting high and low skill levels. However, challenges remain in consistently classifying intermediate skill levels, which could be addressed by refining the attention mechanisms or incorporating additional data.
This study holds significant implications for the future of medical education, even though it is still in its early stages. The post-pandemic era has underscored the need for innovation, particularly in remote training and automated assessment tools, which are becoming increasingly vital. Our proposed approach contributes to the growing body of AI-based tools that aim to enhance clinical education by providing scalable, reliable, and objective data-driven evaluations. Future work will focus on further enhancing the model’s capabilities by integrating more diverse datasets and exploring the introduction of advanced deep learning techniques, such as 3D-CNN, to improve performance and adaptability in real-world training environments.

Author Contributions

Conceptualization, Z.L. and T.N.; methodology, A.K.; software, Z.L. and T.N.; validation, A.K., A.H. and Y.N.; formal analysis, Z.L. and A.H.; investigation, A.K.; resources, A.K.; data curation, T.N.; writing—original draft preparation, Z.L.; writing—review and editing, T.N. and Y.N.; visualization, Z.L.; supervision, Y.N.; project administration, T.N.; funding acquisition, A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Review Committee of the Graduate School of Medicine, Chiba University (protocol code 3425).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Papapanou, M.; Routsi, E.; Tsamakis, K.; Fotis, L.; Marinos, G.; Lidoriki, I.; Karamanou, M.; Papaioannou, T.G.; Tsiptsios, D.; Smyrnis, N.; et al. Medical education challenges and innovations during COVID-19 pandemic. Postgrad. Med. J. 2022, 98, 321–327. [Google Scholar] [CrossRef] [PubMed]
  2. Walters, M.; Alonge, T.; Zeller, M. Impact of COVID-19 on medical education: Perspectives from students. Acad. Med. 2022, 97, S40–S48. [Google Scholar] [CrossRef]
  3. Alsoufi, A.; Alsuyihili, A.; Msherghi, A.; Elhadi, A.; Atiyah, H.; Ashini, A.; Ashwieb, A.; Ghula, M.; Ben Hasan, H.; Abudabuos, S.; et al. Impact of the COVID-19 pandemic on medical education: Medical students’ knowledge, attitudes, and practices regarding electronic learning. PLoS ONE 2020, 15, e0242905. [Google Scholar] [CrossRef]
  4. De Souza-Junior, V.D.; Mendes, I.A.C.; Marchi-Alves, L.M.; Jackman, D.; Wilson-Keates, B.; de Godoy, S. Peripheral venipuncture education strategies for nursing students: An integrative literature review. J. Infus. Nurs. 2020, 43, 24–32. [Google Scholar] [CrossRef]
  5. Rose, S. Medical student education in the time of COVID-19. JAMA 2020, 323, 2131–2132. [Google Scholar] [CrossRef]
  6. Boffelli, A.; Kalchschmidt, M.; Shtub, A. Simulation-Based Training: From a Traditional Course to Remote Learning–The COVID-19 Effect. High. Educ. Stud. 2021, 11, 8–17. [Google Scholar] [CrossRef]
  7. Major, S.; Krage, R.; Lazarovici, M. SimUniversity at a distance: A descriptive account of a team-based remote simulation competition for health professions students. Adv. Simul. 2022, 7, 6. [Google Scholar] [CrossRef]
  8. Reece, S.; Johnson, M.; Simard, K.; Mundell, A.; Terpstra, N.; Cronin, T.; Dubé, M.; Kaba, A.; Grant, V. Use of virtually facilitated simulation to improve COVID-19 preparedness in rural and remote Canada. Clin. Simul. Nurs. 2021, 57, 3–13. [Google Scholar] [CrossRef] [PubMed]
  9. Naidoo, N.; Azar, A.J.; Khamis, A.H.; Gholami, M.; Lindsbro, M.; Alsheikh-Ali, A.; Banerjee, Y. Design, implementation, and evaluation of a distance learning framework to adapt to the changing landscape of anatomy instruction in medical education during COVID-19 pandemic: A proof-of-concept study. Front. Public Health 2021, 9, 726814. [Google Scholar] [CrossRef] [PubMed]
  10. Blandford, R.D. Post-pandemic science and education. Am. J. Phys. 2020, 88, 518. [Google Scholar]
  11. Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. Chexnet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv 2017, arXiv:1711.05225.
  12. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118.
  13. Ardila, D.; Kiraly, A.P.; Bharadwaj, S.; Choi, B.; Reicher, J.J.; Peng, L.; Tse, D.; Etemadi, M.; Ye, W.; Corrado, G.; et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 2019, 25, 954–961.
  14. Elendu, C.; Amaechi, D.C.; Okatta, A.U.; Amaechi, E.C.; Elendu, T.C.; Ezeh, C.P.; Elendu, I.D. The impact of simulation-based training in medical education: A review. Medicine 2024, 103, e38813.
  15. Mirchi, N.; Bissonnette, V.; Yilmaz, R.; Ledwos, N.; Winkler-Schwartz, A.; Del Maestro, R.F. The Virtual Operative Assistant: An explainable artificial intelligence tool for simulation-based training in surgery and medicine. PLoS ONE 2020, 15, e0229596.
  16. Pantelimon, F.V.; Bologa, R.; Toma, A.; Posedaru, B.S. The evolution of AI-driven educational systems during the COVID-19 pandemic. Sustainability 2021, 13, 13501.
  17. Urban, G.; Tripathi, P.; Alkayali, T.; Mittal, M.; Jalali, F.; Karnes, W.; Baldi, P. Deep learning localizes and identifies polyps in real time with 96% accuracy in screening colonoscopy. Gastroenterology 2018, 155, 1069–1078.
  18. Islam, M.Z.; Islam, M.M.; Asraf, A. A combined deep CNN-LSTM network for the detection of novel coronavirus (COVID-19) using X-ray images. Inform. Med. Unlocked 2020, 20, 100412.
  19. Basu, S.; Singhal, S.; Singh, D. A systematic literature review on multimodal medical image fusion. Multimed. Tools Appl. 2024, 83, 15845–15913.
  20. Hou, R.; Zhou, D.; Nie, R.; Liu, D.; Ruan, X. Brain CT and MRI medical image fusion using convolutional neural networks and a dual-channel spiking cortical model. Med. Biol. Eng. Comput. 2019, 57, 887–900.
  21. Hashimoto, D.A.; Rosman, G.; Rus, D.; Meireles, O.R. Artificial intelligence in surgery: Promises and perils. Ann. Surg. 2018, 268, 70–76.
  22. Steyaert, S.; Pizurica, M.; Nagaraj, D.; Khandelwal, P.; Hernandez-Boussard, T.; Gentles, A.J.; Gevaert, O. Multimodal data fusion for cancer biomarker discovery with deep learning. Nat. Mach. Intell. 2023, 5, 351–362.
  23. Ziani, S. Enhancing fetal electrocardiogram classification: A hybrid approach incorporating multimodal data fusion and advanced deep learning models. Multimed. Tools Appl. 2024, 83, 55011–55051.
  24. Fujii, C. Vacuum-venipuncture skills: Time required and importance of tube order. Vasc. Health Risk Manag. 2013, 9, 457–464.
  25. Martin, J.; Regehr, G.; Reznick, R.; Macrae, H.; Murnaghan, J.; Hutchison, C.; Brown, M. Objective structured assessment of technical skill (OSATS) for surgical residents. Br. J. Surg. 1997, 84, 273–278.
  26. Aggarwal, R.; Grantcharov, T.; Moorthy, K.; Milland, T.; Darzi, A. Toward feasible, valid, and reliable video-based assessments of technical surgical skills in the operating room. Ann. Surg. 2008, 247, 372–379.
  27. Hopmans, C.J.; den Hoed, P.T.; van der Laan, L.; van der Harst, E.; van der Elst, M.; Mannaerts, G.H.; Dawson, I.; Timman, R.; Wijnhoven, B.P.; IJzermans, J.N. Assessment of surgery residents' operative skills in the operating theater using a modified Objective Structured Assessment of Technical Skills (OSATS): A prospective multicenter study. Surgery 2014, 156, 1078–1088.
  28. Li, Z.; Kanazuka, A.; Hojo, A.; Suzuki, T.; Yamauchi, K.; Ito, S.; Nomura, Y.; Nakaguchi, T. Automatic Puncture Timing Detection for Multi-Camera Injection Motion Analysis. Appl. Sci. 2023, 13, 7120.
  29. Li, Z.; Kanazuka, A.; Hojo, A.; Hara, Y.; Nomura, Y.; Nakaguchi, T. Multi-Camera Hand Motion Analysis for Puncture Technique Training. In Proceedings of the 2024 46th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Orlando, FL, USA, 15–19 July 2024; pp. 1–5.
  30. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. Mediapipe hands: On-device real-time hand tracking. arXiv 2020, arXiv:2006.10214.
  31. Sun, L.; Jia, K.; Chen, K.; Yeung, D.Y.; Shi, B.E.; Savarese, S. Lattice long short-term memory for human action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2147–2156.
  32. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 63–72.
  33. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456.
  34. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  35. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  36. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28.
  37. Kasa, K.; Burns, D.; Goldenberg, M.G.; Selim, O.; Whyne, C.; Hardisty, M. Multi-modal deep learning for assessing surgeon technical skill. Sensors 2022, 22, 7328.
  38. Zhang, Y.; Weng, Y.; Wang, B. CWT-ViT: A time-frequency representation and vision transformer-based framework for automated robotic surgical skill assessment. Expert Syst. Appl. 2024, 258, 125064.
  39. Nwoye, C.I.; Yu, T.; Gonzalez, C.; Seeliger, B.; Mascagni, P.; Mutter, D.; Marescaux, J.; Padoy, N. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Med. Image Anal. 2022, 78, 102433.
  40. Liu, D.; Li, Q.; Jiang, T.; Wang, Y.; Miao, R.; Shan, F.; Li, Z. Towards unified surgical skill assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9522–9531.
Figure 1. The multi-camera injection acquisition system. It includes three sets of industrial cameras, which are positioned at the upper left, upper right, and directly above the front of the arm model to clearly capture the hand movements. A general-purpose camera is used for event detection, and an arm model is employed to simulate puncture. A timing button is used to control the start, while a photo sensor detects blood backflow.
Figure 2. Definition of the different stages in time detection. The time parameters are divided into three parts: the start time (when the operator presses the button to start the injection); the needle puncture time (when the tip of the needle contacts the skin); and the blood backflow time (when the needle continues to puncture the vessel, causing blood backflow).
Figure 3. Mediapipe Hand used to recognize hand injection motions from different camera perspectives and convert the data into 3D space.
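For readers who want a concrete picture of the landmark extraction summarized in Figure 3, the following is a minimal Python sketch that detects 21 hand landmarks per frame with MediaPipe Hands [30] and triangulates two calibrated views into 3D with OpenCV. The camera pairing, projection matrices, and lack of filtering are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): per-frame hand landmarks
# from MediaPipe Hands, then two-view triangulation into 3D.
import cv2
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1)

def landmarks_2d(frame_bgr):
    """Return a (21, 2) array of pixel coordinates, or None if no hand is found."""
    h, w = frame_bgr.shape[:2]
    res = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not res.multi_hand_landmarks:
        return None
    lm = res.multi_hand_landmarks[0].landmark
    return np.array([[p.x * w, p.y * h] for p in lm], dtype=np.float32)

def triangulate(P1, P2, pts1, pts2):
    """Triangulate matching 2D landmarks from two calibrated views into 3D.

    P1, P2 are 3x4 projection matrices from camera calibration (assumed known).
    """
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4 x 21, homogeneous
    return (X_h[:3] / X_h[3]).T                           # 21 x 3
```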
Figure 4. Labeling of the thumb–index finger joints. The information that can be extracted includes the angles between the joints, distances between the joints, and the fingertip distance (distance between points 4 and 8).
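The geometric quantities named in the Figure 4 caption can be computed directly from the 3D landmarks. The sketch below is one plausible formulation; the exact joint subset, feature ordering, and feature count used in the paper are assumptions made here for illustration (MediaPipe indices 4 and 8 are the thumb and index fingertips).

```python
# Sketch of the hand-geometry features named in the Figure 4 caption:
# joint angles, inter-joint distances, and the thumb-index fingertip distance.
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def hand_features(pts3d):
    """pts3d: (21, 3) array of 3D landmarks for one frame."""
    fingertip_dist = np.linalg.norm(pts3d[4] - pts3d[8])   # thumb tip to index tip
    # Example joint angles along the thumb (1-2-3-4) and index finger (5-6-7-8).
    angles = [joint_angle(pts3d[i - 1], pts3d[i], pts3d[i + 1]) for i in (2, 3, 6, 7)]
    # Distances between consecutive thumb/index joints.
    dists = [np.linalg.norm(pts3d[i] - pts3d[i + 1]) for i in range(1, 8) if i != 4]
    return np.array([fingertip_dist, *angles, *dists], dtype=np.float32)  # 11 values
```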
Figure 5. Overall framework of the proposed multi-modal fusion network for evaluating medical injection training. The framework is divided into three main components: (A) the overall process flow, (B) the bidirectional LSTM module, and (C) the multi-head self-attention module.
Figure 6. Architecture of the LSTM-based predictive model using 3D features.
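The temporal branch in Figure 6 can be approximated by a bidirectional LSTM encoder such as the PyTorch sketch below. The per-frame feature dimension (11, matching the hand-feature sketch above), hidden size, layer count, and the use of the final time step as the sequence embedding are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged PyTorch sketch of a bidirectional LSTM temporal encoder for the
# per-frame 3D hand features; all sizes are illustrative.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, feat_dim: int = 11, hidden: int = 128, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)             # (batch, frames, 2 * hidden)
        return out[:, -1]                 # last time step as the sequence embedding

# Example: a batch of 4 sequences, 300 frames, 11 features per frame.
emb = TemporalEncoder()(torch.randn(4, 300, 11))   # -> (4, 256)
```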
Figure 7. Architecture of the CNN-based predictive model using 2D image features.
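One plausible reading of the image branch in Figure 7 is a ResNet-50 [32] backbone whose classification head is replaced by a projection layer, as in the hedged sketch below; the use of ImageNet pretraining, the 224 x 224 input size, and the 256-dimensional output are assumptions.

```python
# Sketch of a 2D image encoder built on ResNet-50 [32]; configuration details
# are assumptions, not the paper's verified setup.
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    def __init__(self, out_dim: int = 256):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()            # keep the 2048-d pooled feature
        self.backbone = backbone
        self.proj = nn.Linear(2048, out_dim)

    def forward(self, img):                    # img: (batch, 3, 224, 224)
        return self.proj(self.backbone(img))   # (batch, out_dim)

feat = ImageEncoder()(torch.randn(2, 3, 224, 224))   # -> (2, 256)
```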
Figure 8. Framework of the multi-modal data fusion model.
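The fusion and refinement stage shown in Figures 5C and 8 can be sketched as multi-head self-attention over a two-token sequence, one token per modality, followed by a classifier. The head count h = 8 follows the best setting in Table 6; the residual connection, layer normalization, classifier head, and five-class output (one class per GRS level) are assumptions of this sketch.

```python
# Hedged sketch of the multi-modal fusion idea: self-attention over the
# temporal (LSTM) and image (CNN) embeddings, then a small classifier head.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, num_classes: int = 5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                  nn.Linear(128, num_classes))

    def forward(self, temporal_emb, image_emb):                  # each: (batch, dim)
        tokens = torch.stack([temporal_emb, image_emb], dim=1)   # (batch, 2, dim)
        refined, _ = self.attn(tokens, tokens, tokens)           # self-attention
        refined = self.norm(refined + tokens)                    # residual + norm
        return self.head(refined.flatten(1))                     # (batch, num_classes)

logits = FusionClassifier()(torch.randn(4, 256), torch.randn(4, 256))
```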
Figure 9. Confusion matrix and ROC curve results from the 5-fold cross-validation for the multi-modal fusion and multi-head self-attention model.
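Diagnostics of the kind shown in Figure 9 can be produced from out-of-fold predictions of a stratified 5-fold split, for example with scikit-learn as below. Here `train_and_predict` is a hypothetical placeholder for training the fusion model on one fold and returning class probabilities, and labels are assumed to be encoded as 0..K-1.

```python
# Sketch: confusion matrix and macro one-vs-rest AUC from 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score

def cross_validate(X, y, train_and_predict, n_splits=5, seed=0):
    """X, y: numpy arrays; train_and_predict(X_tr, y_tr, X_te) -> class probabilities."""
    oof_prob = np.zeros((len(y), len(np.unique(y))))
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        oof_prob[te] = train_and_predict(X[tr], y[tr], X[te])
    y_pred = oof_prob.argmax(axis=1)
    cm = confusion_matrix(y, y_pred)
    auc = roc_auc_score(y, oof_prob, multi_class="ovr", average="macro")
    return cm, auc
```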
Table 1. Objective Structured Assessment of Technical Skill—Global Rating Score (OSATS-GRS) [27].
GRS | 1: Poor | 2: Below Average | 3: Average | 4: Above Average | 5: Excellent
A. Respect for Tissue | Frequently used unnecessary force on tissue or caused damage by inappropriate handling of instruments | Handled tissue carefully with some minor damage due to inattention | Handled tissue appropriately with minimal damage | Handled tissue appropriately with no damage | Consistently handled tissue with care, causing no damage
B. Time and Motion | Many unnecessary moves | Efficient time/motion but some unnecessary moves | Efficient time/motion with few unnecessary moves | Efficient time/motion with no unnecessary moves | Fluid motion and maximum efficiency
C. Instrument Handling | Repeatedly makes tentative or awkward moves with instruments | Competent use of instruments, but occasionally appeared stiff or awkward | Used instruments competently, with some minor errors | Fluid and competent use of instruments with minor, if any, errors | Fluid and efficient use of instruments, with no errors
Table 2. Statistical summary of time stages across all subjects, corresponding to video frame counts for Stage A + B and Stage B.
Stage | Mean | Standard Deviation | Median | Maximum | Minimum | ICC (95% CI)
Stage A + B | 390 | 311 | 270 | 1200 | 65 | 0.703 (0.62, 0.78)
Stage B | 299 | 304 | 170 | 1135 | 31 | 0.629 (0.53, 0.72)
Table 3. Comparison of LSTM model performance metrics for different data padding techniques and temporal features.
Metric | Stage A + B: Zero Padding | Stage A + B: Repeat Padding | Stage A + B: Average Extraction | Stage B: Zero Padding | Stage B: Repeat Padding | Stage B: Average Extraction
Test Accuracy | 0.6401 | 0.5690 | 0.6261 | 0.6573 | 0.5824 | 0.6224
Mean AUC | 0.7375 | 0.7060 | 0.7592 | 0.7528 | 0.7259 | 0.7298
F1-score | 0.5871 | 0.5694 | 0.6250 | 0.5953 | 0.5861 | 0.6007
Precision | 0.5908 | 0.5833 | 0.6342 | 0.5739 | 0.5966 | 0.6038
Recall | 0.6401 | 0.5690 | 0.6261 | 0.6573 | 0.5824 | 0.6224
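The three sequence-length normalization schemes compared in Table 3 can be sketched as follows; interpreting "average extraction" as chunk-averaged resampling to a fixed frame count is an assumption of this sketch.

```python
# Sketch of three ways to bring variable-length (frames, feat) sequences to a
# common length before the LSTM; the target length is a free parameter.
import numpy as np

def zero_pad(seq, target_len):
    """Truncate or pad with zero rows up to target_len."""
    seq = seq[:target_len]
    pad = np.zeros((target_len - len(seq), seq.shape[1]), dtype=seq.dtype)
    return np.vstack([seq, pad])

def repeat_pad(seq, target_len):
    """Repeat the sequence cyclically until it reaches target_len."""
    reps = int(np.ceil(target_len / len(seq)))
    return np.tile(seq, (reps, 1))[:target_len]

def average_extract(seq, target_len):
    """Average consecutive chunks so every sequence is resampled to target_len frames."""
    idx = np.linspace(0, len(seq), target_len + 1).astype(int)
    return np.vstack([seq[idx[i]:max(idx[i] + 1, idx[i + 1])].mean(axis=0)
                      for i in range(target_len)])
```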
Table 4. Comparison of CNN-based model performance metrics across different camera combinations (average results from 5-fold cross-validation).
Stage | Metric | Cam 1 | Cam 2 | Cam 3 | Cam 1-2 | Cam 1-3 | Cam 2-3 | Cam 1-2-3
Stage A + B | Test Accuracy | 0.3647 | 0.4040 | 0.4942 | 0.4723 | 0.5075 | 0.5335 | 0.5387
Stage A + B | Mean AUC | 0.5832 | 0.6336 | 0.6412 | 0.6769 | 0.6473 | 0.7237 | 0.7114
Stage A + B | F1-Score | 0.2820 | 0.2675 | 0.3900 | 0.4334 | 0.4600 | 0.5003 | 0.5317
Stage A + B | Precision | 0.3256 | 0.2119 | 0.4563 | 0.4556 | 0.5271 | 0.5087 | 0.5549
Stage A + B | Recall | 0.3647 | 0.4040 | 0.4942 | 0.4723 | 0.5075 | 0.5335 | 0.5387
Stage B | Test Accuracy | 0.5188 | 0.4886 | 0.5025 | 0.5254 | 0.5120 | 0.5913 | 0.5642
Stage B | Mean AUC | 0.6808 | 0.6636 | 0.6468 | 0.7111 | 0.6917 | 0.7325 | 0.7430
Stage B | F1-Score | 0.4795 | 0.4146 | 0.3862 | 0.5229 | 0.4941 | 0.5699 | 0.5495
Stage B | Precision | 0.4943 | 0.4389 | 0.3275 | 0.5249 | 0.5321 | 0.5793 | 0.5508
Stage B | Recall | 0.5188 | 0.4886 | 0.5025 | 0.5254 | 0.5120 | 0.5913 | 0.5642
Table 5. Classification metrics for the proposed multi-modal fusion model without multi-head self-attention module.
Method | Accuracy | F1-Score | Precision | Recall | AUC
Proposed Method | 0.6975 | 0.6836 | 0.7008 | 0.6975 | 0.8348
Table 6. Performance comparison across different numbers of attention heads (h) in the multi-head self-attention mechanism.
Attention Heads (h) | Test Accuracy | Mean AUC | F1-Score | Precision
h = 2 | 0.6881 | 0.8344 | 0.6622 | 0.6769
h = 4 | 0.6971 | 0.8391 | 0.6713 | 0.6781
h = 8 | 0.7238 | 0.8343 | 0.7060 | 0.7339
h = 16 | 0.7103 | 0.8449 | 0.6884 | 0.7154
Table 7. Performance comparison of different models for injection process evaluation (CNN: Convolutional Neural Network, LSTM: Long Short-Term Memory, GBDT: Gradient Boosting Decision Trees, RF: Random Forest, SVM: Support Vector Machine, No-MHSA: no self-attention, Proposed: multi-modal fusion model with multi-head self-attention module.)
Method | Model | Test Accuracy | Mean AUC | F1-Score | Precision | Recall
Single-Modal Model | CNN (2D) | 0.5913 | 0.7325 | 0.5699 | 0.5793 | 0.5913
Single-Modal Model | LSTM (3D) | 0.6573 | 0.7528 | 0.5953 | 0.5739 | 0.6573
Multi-Modal Model | GBDT [34] | 0.6761 | 0.7900 | 0.6554 | 0.6786 | 0.6761
Multi-Modal Model | RF [35] | 0.6534 | 0.7740 | 0.6357 | 0.6411 | 0.6534
Multi-Modal Model | SVM [36] | 0.6755 | 0.8230 | 0.6500 | 0.6642 | 0.6755
Multi-Modal Model | No-MHSA | 0.6975 | 0.8348 | 0.6836 | 0.7008 | 0.6975
Multi-Modal Model | Proposed | 0.7238 | 0.8343 | 0.7060 | 0.7339 | 0.7238
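The classical multi-modal baselines in Table 7 (GBDT [34], RF [35], SVM [36]) amount to feeding fused features to off-the-shelf classifiers. The sketch below assumes a simple concatenation of the temporal and image embeddings and near-default hyperparameters, which may differ from the paper's actual setup.

```python
# Hedged sketch of the classical baselines on feature-level fusion.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier   # GBDT baseline [34]

def fit_baselines(temporal_emb, image_emb, y):
    """temporal_emb, image_emb: (n_samples, dim) arrays; y: integer labels."""
    X = np.hstack([temporal_emb, image_emb])     # concatenate the two modalities
    models = {
        "GBDT": LGBMClassifier(),
        "RF": RandomForestClassifier(n_estimators=300, random_state=0),
        "SVM": SVC(kernel="rbf", probability=True),
    }
    for name, model in models.items():
        model.fit(X, y)
    return models
```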
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
