RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning

Shan, Yixin; Lu, Zikun; Sun, Zhe; Liu, Hao; Xu, Jiangchang; Sun, Yixing; Chen, Xiaojun

doi:10.3390/electronics14071441

Open AccessArticle

RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning

by

Yixin Shan

,

Zikun Lu

,

Zhe Sun

,

Hao Liu

,

Jiangchang Xu

,

Yixing Sun

and

Xiaojun Chen

^*

Institute of Biomedical Manufacturing and Life Quality Engineering, State Key Laboratory of Mechanical System and Vibration, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1441; https://doi.org/10.3390/electronics14071441

Submission received: 6 March 2025 / Revised: 25 March 2025 / Accepted: 1 April 2025 / Published: 2 April 2025

(This article belongs to the Special Issue Medical Robots: Safety, Performance and Improvement)

Download

Browse Figures

Versions Notes

Abstract

:

Positron emission tomography (PET) is one of the most advanced imaging diagnostic devices in the medical field, playing a crucial role in tumor diagnosis and treatment. However, patient motion during scanning can lead to motion artifacts, which affect diagnostic accuracy. This study aims to develop a head motion monitoring system to identify and select images with excessive motion and corresponding periods. The system, based on an RGB-D structured-light camera, implements facial feature point detection, 3D information acquisition, and head motion monitoring, along with a user interaction software. Through phantom experiments and volunteer experiments, the system’s performance was tested under various conditions, including stillness, pitch movement, yaw movement, and comprehensive movement. Experimental results show that the system’s translational error is less than 2.5 mm, rotational error is less than 2.0°, and it can output motion monitoring results within 10 s after the PET scanning, meeting clinical accuracy requirements and showing significant potential for clinical application.

Keywords:

PET scanning; feature point recognition; three-dimensional reconstruction; motion monitoring; computer-aided surgery

1. Introduction

Positron emission tomography (PET) is a leading technology in the field of nuclear medicine and is widely recognized as one of the most advanced large-scale medical diagnostic imaging devices [1]. PET imaging plays an irreplaceable role in the diagnosis and pathological research of tumors, cardiovascular diseases, and brain disorders, significantly improving the diagnostic accuracy of various diseases [2]. However, PET imaging still faces several challenges, such as lower spatial resolution, longer image acquisition times, complex operations, and difficulties in image interpretation [3]. Typically, a PET scan takes 10 to 15 min to complete [4]. Since PET imaging relies on the distribution of radioactive tracers within the body [5], patients are required to remain as still as possible during the examination. This requirement poses a significant challenge, especially for patients with low pain tolerance, such as children or other populations prone to movement. During image acquisition, bodily motion (including overall body movement and the physiological movement of internal organs) can cause artifacts, severely affecting image fusion quality and diagnostic accuracy [6]. Among the various body regions, the head and neck are some of the most commonly imaged areas in PET scans. However, compared to torso motion, head movement is more difficult to control and has a more significant impact on image quality [7].

Currently, to address the issue of artifacts in PET head and neck imaging, the primary solution relies on manual screening of the imaging results by doctors to eliminate image segments with significant artifacts that are unsuitable for diagnosis [6]. This process is not only time-consuming (usually taking 5 to 10 min) but also demands a high level of expertise from the doctors. Therefore, real-time detection of head and neck movement in patients and the automatic filtering of PET imaging results to assist doctors have become key research directions for improving clinical diagnostic efficiency and imaging quality.

To address the issue of artifacts caused by head and neck movement, it is necessary to effectively monitor and detect the subject’s movement during the scanning process, thereby enabling the automated screening of PET scan images. One approach is to attach a large, curved marker to the patient’s forehead and use an external optical camera to track the movement. By recognizing encoded symbols on the marker, six-degree-of-freedom motion data can be recorded in real time [8]. However, to reduce the discomfort caused by the marker, some studies use ink stamps with rich features as markers, combined with a stereoscopic optical camera system and feature detection algorithms, to achieve close-range head movement tracking [9]. Additionally, in the field of assistive devices and human–computer interaction, some research has fixed an Inertial Measurement Unit (IMU) to the subject’s head to track real-time six-degree-of-freedom head movement [10]. Nevertheless, methods based on external markers still have numerous limitations. For the subject, fixing the marker may cause discomfort and even induce involuntary movements. Furthermore, the process of affixing the marker is time-consuming and labor-intensive, and once the marker shifts, it becomes difficult to accurately estimate the movement [11].

To address this issue, we propose a marker-free PET scan motion detection and recognition system, which implements motion monitoring and detection based on natural image capture by depth cameras from multiple engineering aspects, including structure, hardware, software, and algorithms. The system is equipped with functions such as image acquisition, facial landmark analysis, head pose estimation, post-data processing, and motion intensity evaluation. Specifically, the system uses a depth structured-light camera deployed within the PET system to detect the patient’s motion in real time during the scan. The depth and RGB images collected by the system are registered, and the registration results are output to the host system. The software on the host system decodes, stores, and processes the real-time acquired natural images, and by analyzing and detecting facial RGB images, it extracts robust facial landmarks. By combining the depth registration results, the system obtains the three-dimensional coordinate information of the landmarks in space. Through coordinate transformation and local coordinate system establishment, the system calculates the translational and rotational amplitudes of the head, generating a comprehensive metric for assessing head and neck motion intensity. Based on these metrics, the system can identify periods prone to motion artifacts and output the detection results, assisting doctors in quickly screening PET scan data.

The structure of this study is as follows: Section 1 introduces the principles and applications of PET imaging, as well as the existing challenges. It further elaborates on the motivation and objectives of this study, emphasizing the research approach and content in relation to the issues addressed in this field. Section 2 introduces the principles of PET imaging, reinforcing the project’s background information. This section also provides a detailed literature review, exploring the current research status and existing methods in this area, and compares and selects methods in the context of this research. Section 3 presents the technical roadmap of the proposed system, explaining each technical module, conducting feasibility analyses, and providing corresponding mathematical derivations or performance demonstrations. It integrates RGB-D structured-light camera registration and image fusion techniques, facial landmark detection technology, and head pose estimation techniques. Moreover, it introduces the motion intensity evaluation metric in the context of PET scanning, building an integrated system tailored for the target environment. Section 4 explains the data source of the validation experiments, introduces the developed software framework, and presents a detailed analysis of experimental results from both phantom-based and volunteer-based experiments. This section also explores the accuracy and efficiency of the system in line with clinical requirements. Section 5 discusses the main advantages of the system, areas for improvement, and prospects for future research. Finally, Section 6 analyzes the findings of this study, summarizes the innovative points and major contributions, and discusses the potential for clinical application. In conclusion, this study aims to validate the feasibility and clinical value of the proposed motion monitoring-assisted image selection system through literature review, system development, and experimental investigations. The subsequent sections will provide detailed explanations.

2. Literature Review

This section is divided into subsections to provide a clearer analysis and discussion of the relevant literature. Section 2.1 introduces the principle of PET scan imaging, laying the foundation for the system’s research background; Section 2.2 focuses on image information acquisition, comparing different camera technologies and discussing the requirements and selection of hardware based on the target context; Section 2.3 discusses head feature point detection and motion tracking techniques, analyzing the strengths and weaknesses of existing methods; Section 2.4 introduces the main methods for spatial motion monitoring of rigid bodies.

2.1. Principle of PET Scan Imaging

Positron emission tomography (PET) is a highly specific molecular imaging technique that provides functional information about organs and their lesions, primarily used in molecular-level medical imaging [1]. PET employs the short-lived 18F-FDG positron-emitting radionuclide as the main tracer, which allows for high-precision, quantitative detection of abnormal increases in metabolic processes, producing clear images [12]. Therefore, PET provides crucial early insights into disease progression, particularly in the early detection of tumors. During PET imaging, a positron-emitting radioactive isotope-labeled molecular probe is injected into the body. After the unstable atoms decay and release positrons, these positrons encounter electrons in the tissue and annihilate, generating two oppositely flying 511 keV gamma photons [13]. The PET scanner detects these annihilation photons using a ring of photon detectors and reconstructs a three-dimensional image of the distribution of the molecular probe within the body based on the path of the photon pairs.

2.2. Image Information Acquisition

To achieve high-quality and stable image acquisition, selecting the appropriate camera is crucial. Common camera types include monocular cameras, binocular cameras, Time-of-Flight (ToF) cameras, and RGB-D structured-light cameras [14], each with its specific application scenarios and advantages and disadvantages.

Monocular cameras are the most commonly used type due to their low cost and ease of operation. However, due to scale uncertainty, a single-frame image cannot directly recover the three-dimensional information of objects. To improve accuracy, multiple monocular cameras are typically required to capture images from different viewpoints, and multi-view fusion is used to estimate the spatial pose of the object [15]. Additionally, in recent years, deep learning methods have been widely applied to monocular camera pose estimation [16], where neural networks are trained to predict the three-dimensional pose. However, the robustness of this method in complex environments still needs improvement, and the accuracy remains relatively low.

Binocular stereo cameras obtain depth information through the disparity between two cameras, enabling relatively accurate object pose estimation [17]. While binocular cameras provide high-precision pose estimation in regular environments, their accuracy in image matching and pose estimation may degrade in low-texture, uneven-lighting, or occluded environments [18].

Time-of-Flight (ToF) cameras calculate object depth information by emitting pulsed light and measuring the reflection time. They can maintain high accuracy over long distances, making them suitable for pose estimation in dynamic scenes [19]. However, the high cost of ToF cameras may increase the overall system cost.

RGB-D structured-light cameras acquire object depth information by actively projecting structured light and capturing images with a camera. These cameras achieve high accuracy over short distances and are particularly suited for pose estimation in confined spaces [20]. However, the accuracy of depth information deteriorates over long distances or under strong lighting conditions. To address these limitations, researchers often integrate depth learning techniques, combining image features and depth information to enhance the stability and robustness of pose estimation [21].

Despite the low cost and ease of operation of monocular cameras, they cannot accurately recover three-dimensional information, typically requiring multiple cameras to capture images from different viewpoints in order to improve precision. In contrast, binocular cameras, ToF cameras, and RGB-D structured-light cameras can achieve higher precision in three-dimensional pose estimation by directly or indirectly acquiring depth information. Given the specific conditions of a PET scanning environment, such as complex indoor settings, uneven lighting, and the proximity between the camera and the subject, RGB-D structured-light cameras are more suitable for this system after considering factors such as detection accuracy, hardware deployment complexity, and cost-effectiveness.

2.3. Detection and Recognition of Head Feature Points

During the PET scanning process, to detect the motion of the patient’s head in real time, it is necessary to perform feature point recognition and motion tracking over a certain period on the frame-by-frame images transmitted to the terminal from the communication equipment. Currently, the algorithms addressing this issue can be categorized into the following types based on their underlying principles and hardware devices: traditional vision-based methods, tracking-based methods, multimodal information fusion-based methods, and deep learning-based methods.

Traditional vision-based methods mainly rely on manually designed facial features and techniques from image processing and geometry, such as Haar cascade classifiers [22], feature point matching algorithms (e.g., SIFT [23], SURF [24]), and optical flow methods [25]. The advantages of these methods lie in their fast processing speed and low computational requirements. However, their performance tends to degrade in complex environments, under significant pose changes, or in the presence of occlusions [26].

Tracking-based methods include both traditional and deep learning-based target tracking algorithms, with representative algorithms such as the Kalman filter [27], particle filter [28], and Siamese network [29]. These tracking-based algorithms are suitable for real-time scenarios but are not robust enough in the presence of complex occlusions or rapid movements, and they generally require substantial computational overhead.

Multimodal information fusion-based methods refer to approaches that combine information from multiple sensors, such as RGB, depth, and thermal infrared, for feature point recognition. The advantage of these methods lies in the complementary information provided by different sensors, which enhances robustness in complex environments [30]. However, the use of various types of sensors requires complex hardware support, resulting in higher system costs and significant challenges in sensor calibration [31].

The mainstream deep learning algorithms for motion tracking utilize convolutional neural networks (CNNs) [32] and recurrent neural networks (RNNs), such as LSTM [33], for feature learning and motion tracking. The development of such methods includes CNN-based face recognition, keypoint detection, and motion tracking using RNN/LSTM. There are many mature network architectures in deep learning, such as the open-source DLIB library developed in C++, which can achieve stable face detection, feature localization, and landmark tracking [34]. Prados et al. [35] proposed the SPIGA network, a combination of CNNs and graph attention network regression sub-level cascades, which performs well in identifying blurry contours and edge points of faces. These methods typically offer more accurate detection of human body keypoints and motion tracking, but larger models may have certain hardware requirements when deployed.

Considering the need for a robust and real-time feature point recognition algorithm in the PET scanning environment, which must be adaptable to complex environments, capable of being deployed on medium-to-small hardware systems (such as integration into PET scanning devices), and economically feasible, a lightweight deep learning-based feature point recognition algorithm, such as the DLIB algorithm, is ultimately selected.

2.4. Space Motion Monitoring of Rigid-like Objects

Monitoring the spatial motion of rigid bodies requires the use of spatial feature point information to estimate the translational and rotational movements of the object in different directions. The main methods currently employed include Euler angle-based methods [36], quaternion-based methods [37], Denavit–Hartenberg (D-H) matrix methods [38], and rotation matrix-based methods [39].

The Euler angle-based method estimates rotation by describing the rotation angles of an object around the X, Y, and Z axes in three-dimensional space, and uses changes in these angles to estimate the rotational motion of the object. This method has the simplest computational principle. However, it suffers from the gimbal lock problem when two rotation axes approach parallel alignment, and its computational load is relatively high, making it difficult to convert angles into distance metrics [36]. The quaternion-based method describes spatial rotation using a quaternion expression consisting of a scalar and three vectors, which avoids the gimbal lock problem found in Euler angles. However, the selection of the rotation axis in this method is challenging, and the mathematical transformations involved are complex [37]. The Denavit–Hartenberg (D-H) matrix method represents the relative position and orientation of the object with respect to the reference coordinate system using Denavit–Hartenberg (D-H) parameterization. It is effective for estimating the spatial pose of pure rigid bodies, but it is less robust when feature points are lost or experience jitter [38]. The spatial rotation matrix-based method uses a rotation matrix to represent both the translation and rotation of the object, with the matrix elements indicating the spatial translation–rotation relationship. Its advantages include high accuracy in pose estimation and computational stability. However, it suffers from a relatively high computational load during complex movements [39].

Considering that the motion of the head and neck during PET scanning is primarily rotational, involving mainly pitch and yaw movements, and that the scanning process takes a relatively long time, the algorithm for estimating spatial pose must exhibit high stability and accuracy. After considering factors such as computational resources, accuracy, and stability, the spatial rotation matrix-based method is more optimal.

3. Research Methods

In this study, we developed a software system for human head motion monitoring during PET scans, with its architecture illustrated in Figure 1. The system is primarily composed of three components: image processing algorithm, motion detection algorithm, and visualization software.

The image processing algorithm serves as the foundation of the system. Its core task is to register depth images and RGB images captured by the RGB-D camera, enabling precise recognition of facial feature points and calculation of local coordinate systems. This component provides accurate image data and positioning information, which are essential for subsequent motion monitoring. The image processing algorithm is depicted in Figure 1a and corresponds to Section 3.1, Section 3.2 and Section 3.3, which detail the stages involved in processing the image data.

The motion detection algorithm acts as the central part of the system, directly determining the accuracy and robustness of motion monitoring. This component comprises three key stages: spatial point monitoring, spatial pose estimation, and motion intensity evaluation. By accurately tracking the trajectory and intensity of head motion, the system effectively assesses the impact of motion artifacts on PET scan results. The motion detection algorithm is illustrated in Figure 1b and corresponds to Section 3.4 and Section 3.5, where the process of motion detection and evaluation is thoroughly explained.

The visualization software, implemented using a Qt-based interactive interface, aims to provide an intuitive and user-friendly operating platform for medical professionals and non-engineering personnel. It facilitates real-time evaluation and analysis of motion artifacts. The software supports two working modes: real-time image acquisition mode and image loading mode, catering to different application scenarios. The visualization software is shown in Figure 1c and corresponds to Section 4.1.2, which provides a detailed description of the user interface and its functionality.

In terms of hardware configuration, the system utilizes an Orbbec Astra Pro Plus RGB-D monocular structured-light camera for image acquisition and operates on a Windows x64 platform. The processing system is equipped with an Intel i7-14700HX CPU and an NVIDIA RTX 4060 GPU to ensure efficient data processing. The specific layout of the test environment is illustrated in Figure 1d.

3.1. Camera Registration and Image Acquisition

To acquire head motion data from PET subjects, the first step is to obtain depth and RGB images of the subject during scanning. This study employs the Astra Pro Plus camera module from Orbbec, a high-precision, low-power 3D camera based on structured-light technology. The effective working distance of the camera ranges from 0.6 m to 8 m. The module consists of an infrared camera, an infrared projector, and a depth computation processor. The infrared projector projects a structured-light pattern (speckle pattern) onto the target scene, while the infrared camera captures the reflected infrared structured-light image. The depth computation processor processes the captured infrared image using depth calculation algorithms to generate depth images of the target scene.

For motion estimation, the camera principle is described as a mathematical model that maps 3D coordinates to a 2D pixel plane. This study adopts the pinhole camera model. The mathematical representation of the pinhole model can be expressed in matrix form as follows:

Z (\begin{matrix} u \\ v \\ 1 \end{matrix}) = (\begin{matrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{matrix}) (\begin{matrix} X \\ Y \\ Z \end{matrix}) ≜ K P

(1)

where u and v represent the coordinates of the target point in the pixel coordinate system, while

P = {(X, Y, Z)}^{T}

denotes the coordinates of the target point in the camera coordinate system.

f_{x}

and

f_{y}

are the focal lengths of the camera, and

c_{x}

and

c_{y}

are the coordinates of the principal point. In this equation, the matrix K, formed by these intermediate variables, is referred to as the camera intrinsic matrix and is considered a constant property of the camera.

Based on the pinhole camera model, Zhang’s calibration method [40] is employed for camera calibration. This method requires only a flat checkerboard calibration board to complete the entire calibration process. In this study, we calibrated the structured-light depth camera using the MATLAB R2023b Camera Calibrator, and the calibration interface is shown in Figure 2. Figure 2a shows the interface of the calibration software, where the green circles represent detected points, the orange circles represent the pattern origin, and the red dots represent the reprojected points. Figure 2b,c are automatically generated by the calibration software. The former presents the error analysis based on 20 calibration images, while the latter shows the variation in different calibration images relative to the camera’s initial pose.

Generally, if the reprojection error of the camera calibration is less than 0.5 pixels, the calibration result is considered accurate. In this study, the maximum reprojection errors for both RGB and depth images did not exceed 0.25 pixels, with an average reprojection error of 0.09 pixels, which meets the accuracy criteria, indicating high calibration precision.

3.2. Head Feature Point Recognition

This project employs the DLIB facial landmark detection model [41] for identifying and tracking human head landmarks, consisting of two main components: face detection and face alignment. The face detection algorithm leverages Histogram of Oriented Gradients (HOG) for feature extraction combined with a Support Vector Machine (SVM) for classification. The face alignment algorithm is based on the Ensemble of Regression Trees (ERT) method, which optimizes the process through gradient boosting to iteratively fit the facial shape.

The face detection and alignment algorithm used in this study is based on an open-source pre-trained model that has been trained and validated on a large-scale facial database. This model is capable of meeting the requirements for facial feature point detection in most scenarios. Although the target application scenario of this study is the PET scanning environment, the RGB images collected and processed are still facial images of human subjects. Therefore, the characteristics of the dataset remain largely unchanged in this scenario. Through extensive phantom and volunteer experiments conducted on various phantoms and different volunteers, the pre-trained model has been demonstrated to achieve satisfactory performance in the target environment, including high accuracy and stability of feature point recognition. Additionally, the model is highly efficient, with a size of only 85 MB and an average processing time of 0.017 ms per frame. This makes it particularly suitable for scenarios with limited hardware resources and stringent real-time requirements.

The main implementation steps of the facial detection and landmark recognition algorithm are as follows:

Use the HOG-based cascaded classifier to extract all feature vectors, including HOG features, from patient images;
Input the extracted feature vectors into the SVM model inherited from the CPP-DLIB library [41] to classify and extract features around the facial region, thereby identifying and annotating the location of the face in the image;
Pass the annotated facial region as input to the 68-point alignment model to achieve real-time detection of 68 facial landmarks. The alignment standard, shown in Figure 3, includes key facial regions such as the contours, eyes, eyebrows, nasal triangle, and mouth. Among them, the 8 red dots and 4 star points represent the 12 feature points used for pose calculation, with the 4 star points specifically serving as the initial selection points.
In the detected video stream, the landmark information of each frame is recorded in real time. A filtering algorithm is applied to extract 12 robust landmarks, typically located in regions such as the nasal triangle and eye corners. These selected landmarks are used as input data for the spatial pose estimation algorithm to achieve precise motion state evaluation.

The feasibility of the proposed algorithm was validated on both the constructed phantom experimental platform and the collected facial dataset. Experimental results demonstrated that the algorithm effectively tracks and monitors facial landmarks, outputting the position of each landmark for every frame in the video stream. Furthermore, the results were synthesized into motion detection videos using video processing techniques.

3.3. Fusion Registration of RGB Images and Depth Images

After calibrating the structured-light RGB and depth cameras, discrepancies in their intrinsic and extrinsic parameters can lead to significant errors when directly overlaying the captured RGB and depth images. Therefore, image registration between the two views is essential to ensure accurate alignment of images captured at the same moment. Based on the known depth map, RGB image, and camera intrinsic parameters, depth and RGB images can be registered and fused.

The registration process is based on known depth images, RGB images, and camera intrinsic parameters. Using the intrinsic matrix of the depth camera, the 3D coordinates of a point in the depth camera coordinate system can be obtained from the depth map. Subsequently, the point is transformed from the depth camera coordinate system to the RGB camera coordinate system using the rotation matrix R and translation vector t. Finally, the 3D coordinates are converted into pixel coordinates in the RGB image through the intrinsic matrix of the RGB camera. The critical step in this process is solving for the rotation matrix R and translation vector t. By obtaining the extrinsic matrices of a checkerboard pattern in both the depth camera and RGB camera coordinate systems, the transformation matrix linking the two camera coordinate systems can be computed. The computation of the rotation matrix R and translation vector t is as follows:

\{\begin{matrix} R = R_{rgb} R_{dep}^{- 1} = R_{rgb} R_{dep}^{T} \\ t = t_{rgb} - R_{rgb} R_{dep}^{- 1} t_{dep} = t_{rgb} - R t_{dep} \end{matrix}

(2)

where

R_{rgb}

and

t_{rgb}

represent the rotation matrix and translation vector transforming point coordinates from the world coordinate system to the RGB camera coordinate system, respectively. Similarly,

R_{dep}

and

t_{dep}

represent the rotation matrix and translation vector transforming point coordinates from the world coordinate system to the depth camera coordinate system. The registration effect is shown in Figure 4.

Therefore, in the same scene, it is sufficient to obtain the extrinsic parameter matrices of the chessboard in both the depth camera and RGB camera coordinate systems in order to compute the transformation matrix that links the two camera coordinate systems. Although the extrinsic matrices obtained in different scenes may vary, using a front-facing chessboard calibration image typically yields satisfactory results.

3.4. Calculation of Head Space Pose

After obtaining the precise 3D spatial coordinates of the feature points, these data are utilized to compute and track the head motion of the subject. The method employed in this study monitors the rotational displacement of the head using a rotation matrix about a fixed coordinate system, while independently tracking its translational displacement. The core of the pose detection algorithm lies in selecting an appropriate rigid-body coordinate system on the subject’s head to derive a set of suitable orthogonal vectors.

Although head feature point detection algorithms can stably identify 68 facial landmarks, significant variations in recognition accuracy across different facial regions occur under extreme conditions, such as when the yaw angle exceeds 60°. Therefore, to establish the head coordinate system, it is essential to select feature points with high recognition accuracy and robust performance. Specifically, priority is given to points that are distant from facial contour edges, exhibit significant depth variations, and demonstrate strong geometric invariance. The feature points selected in this study are shown in Figure 3b, the eight red dots and the four yellow stars, corresponding to the numbers 22, 23, 30, 31, 37, 40, 43, 46, 49, 52, 55, and 58.

Among these 12 points, the left outer canthus, right outer canthus, center of the upper lip, and the tip of the nose (marked as yellow stars in Figure 3b) are used to describe the derivation of the formulas in this subsection, denoted as

P_{1}

,

P_{2}

,

P_{3}

, and

P_{4}

, respectively. For most individuals, the plane defined by the two outer canthi and the center of the upper lip is generally parallel to the face. Therefore, the vector perpendicular to this plane can be used to estimate the position of the head’s center of mass by integrating the spatial pose data of the nose tip and the head dimensions. Consequently, an orthogonal coordinate system for rigid-body motion can be constructed using these three points. The process of establishing the coordinate system is described by the following formula:

\{\begin{matrix} \vec{s_{1}} = \vec{P_{3} P_{1}} + \vec{P_{3} P_{2}} \\ \vec{s_{2}} = \vec{P_{3} P_{1}} \times \vec{P_{3} P_{2}} \end{matrix}

(3)

\{\begin{matrix} \vec{e_{1}} = \frac{\vec{s_{1}}}{| s_{1} |} \\ \vec{e_{2}} = \frac{\vec{s_{2}}}{| s_{2} |} \\ \vec{e_{3}} = \vec{e_{1}} \times \vec{e_{2}} \end{matrix}

(4)

Equations (3) and (4) describe the fundamental constraints of rigid-body rotation: all vectors are three-dimensional, where

\vec{e_{1}}

,

\vec{e_{2}}

, and

\vec{e_{3}}

are unit vectors that are mutually orthogonal, forming a right-handed coordinate system

(\vec{e_{1}}, \vec{e_{2}}, \vec{e_{3}})

. Since the spatial coordinates of feature points on the rigid body are represented relative to the camera coordinate system,

(\vec{e_{1}}, \vec{e_{2}}, \vec{e_{3}})

effectively correspond to the spatial rotation matrix of the rigid body around the fixed axis

E_{3}

of the camera coordinate system.

Moreover, considering the spatial rotation matrix

R_{z y x}

, which represents a rigid body rotating by

ψ

radians around the x-axis of the fixed coordinate system, then by

θ

radians around the y-axis, and finally by

ϕ

radians around the z-axis, the definitions of directional rotation matrices and the physical interpretation of

R_{z y x}

yield Equations (5) through (7).

\{\begin{matrix} R_{x} (ψ) = (\begin{matrix} 1 & 0 & 0 \\ 0 & cos (ψ) & - sin (ψ) \\ 0 & sin (ψ) & cos (ψ) \end{matrix}) \\ R_{y} (θ) = (\begin{matrix} cos (θ) & 0 & sin (θ) \\ 0 & 1 & 0 \\ - sin (θ) & 0 & cos (θ) \end{matrix}) \\ R_{z} (ϕ) = (\begin{matrix} cos (ϕ) & - sin (ϕ) & 0 \\ sin (ϕ) & cos (ϕ) & 0 \\ 0 & 0 & 1 \end{matrix}) \end{matrix}

(5)

R_{z y x} = R_{z} (ϕ) R_{y} (θ) R_{x} (ψ) = (\begin{matrix} cos (θ) cos (ϕ) & \dots & \dots \\ cos (θ) sin (ϕ) & \dots & \dots \\ - sin (θ) & sin (ψ) cos (θ) & cos (ψ) cos (θ) \end{matrix})

(6)

R_{z y x} : = R_{z} (ϕ) R_{y} (θ) R_{x} (ψ) = (\begin{matrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{matrix})

(7)

By comparing Formula (6) with Formula (7), it can be concluded that

\{\begin{matrix} θ & = - arcsin R_{31} \\ ψ & = - arctan \frac{R_{32}}{R_{33}} \\ ϕ & = - arctan \frac{R_{21}}{R_{11}} \end{matrix}

(8)

By continuously monitoring

θ

,

ψ

, and

ϕ

, the rotational amplitudes of the human head in three directions can be accurately obtained. The translation of the head is determined by estimating the spatial position of the head’s centroid. The fundamental principle involves subtracting a normal vector perpendicular to the facial plane from the spatial pose of the nasal tip. The magnitude of this normal vector represents the average human head radius, which is set to 80 mm based on national standards obtained by National Bureau of Statistics of China. The mathematical representation is given in Equation (8), where

∥ \vec{r} ∥

denotes the average head radius, and

\vec{p}

is the estimated centroid coordinate of the rigid body, represented as a three-dimensional vector.

\vec{p} : = (\begin{matrix} x \\ y \\ z \end{matrix}) = \vec{P_{4}} + \vec{r} \cdot \vec{e_{2}}

(9)

By utilizing the three-dimensional spatial pose data of facial feature points, head rotation and translation can be monitored. However, the complexity of motion in real clinical environments may lead to certain errors in the algorithm. In particular, two prominent issues are feature point occlusion when the yaw angle is large and the presence of outliers during motion synthesis. To address these issues, the following solutions are proposed in this study:

Feature Point Occlusion: During head rotation, when the yaw angle becomes large, some facial feature points may move out of the depth camera’s view. To solve this problem, a feature point compensation method is proposed. When a feature point (such as the outer corner of the eye) experiences significant fluctuations in its spatial pose across consecutive frames, it is determined that the feature point is no longer suitable for input into the spatial pose estimation algorithm. Other stable feature points are then used to supplement the calculation, ensuring the continuity and accuracy of the spatial pose estimation.
Outliers in Motion Synthesis: When synthesizing the motion intensity curve from consecutive frames, outliers may occur, causing the curve to exhibit abnormal fluctuations. To address this issue, a low-pass filtering method is applied to smooth the calculated motion intensity curve, eliminating noise interference. This results in a stable and continuous motion intensity curve, which is then used for motion pattern classification.

3.5. Definition and Calculation of Head Motion Intensity

The primary objective of this study is to determine whether the subject’s head motion during PET scanning exceeds a specific intensity threshold that may compromise imaging quality, thereby enabling the selection of valid video segments. Due to the complexity of head motion, it is challenging to characterize motion intensity using either translation or rotation alone. To quantify the overall head motion, this study introduces a dimensionless motion intensity metric, Amplitude, validated through theoretical analysis and simulation testing. Amplitude is defined as follows:

Amplitude = λ \cdot Rot + (1 - λ) \cdot Trans

(10)

where Rot represents the total rigid-body rotational displacement (unit: °), Trans represents the total rigid-body translational displacement (unit: mm), and

λ

is a weighting factor ranging from 0 to 1.

Since the comprehensive head movement of the subject cannot be effectively evaluated by considering either translational or rotational motion alone, it is necessary to carefully define the translational component (Trans) and rotational component (Rot) to accurately characterize the overall motion. For the translational component, Trans, the displacements along the X, Y, and Z axes are combined to represent the overall translation. According to the principles of spatial vector composition, Trans is defined as follows:

Trans = \sqrt{x_{t}^{2} + y_{t}^{2} + z_{t}^{2}}

where

x_{t}

,

y_{t}

, and

z_{t}

denote the displacements along the X, Y, and Z axes, respectively.

For rotational motion, which involves rotations around the X, Y, and Z axes, the rotational component, Rot, is derived using the properties of spatial rotation matrices. Assuming small-angle approximations, the total rotation matrix

R_{zyx}

can be expressed as follows:

R_{zyx} \approx I + ψ K_{x} + θ K_{y} + ϕ K_{z}

where I is the identity matrix,

ψ

,

θ

, and

ϕ

are the rotation angles around the X, Y, and Z axes, and

K_{x}

,

K_{y}

, and

K_{z}

are the corresponding skew-symmetric matrices:

K_{x} = [\begin{matrix} 0 & 0 & 0 \\ 0 & 0 & - 1 \\ 0 & 1 & 0 \end{matrix}], K_{y} = [\begin{matrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ - 1 & 0 & 0 \end{matrix}], K_{z} = [\begin{matrix} 0 & - 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{matrix}] .

By substituting these approximations into the overall rotation matrix and neglecting higher-order terms, the matrix simplifies to the following:

R_{zyx} \approx I + ψ K_{x} + θ K_{y} + ϕ K_{z} .

According to the axis–angle representation of small-angle rotations, the rotation matrix can also be expressed as follows:

R \approx I + Rot \cdot K

where Rot is the equivalent rotation angle, and K is the skew-symmetric matrix corresponding to the unit rotation axis

{[k_{x}, k_{y}, k_{z}]}^{T}

. By comparing the two forms, the equivalent rotation angle is derived as follows:

{Rot}^{2} = ϕ^{2} + θ^{2} + ψ^{2}

Thus, when the rotation angles (

ϕ

,

θ

, and

ψ

) about the fixed coordinate axes are small, the equivalent rotation angle can be calculated using the above formula. The translational and rotational components, Trans and Rot, are defined as follows:

\{\begin{matrix} Trans = \sqrt{\sum_{a \in Ω} {(a - a_{i n i t})}^{2}}, where Ω = {x, y, z} \\ Rot = \sqrt{\sum_{α \in Γ} {(α - α_{i n i t})}^{2}}, where Γ = {roll, pitch, yaw} \end{matrix}

(11)

where

a_{i n i t}

and

α_{i n i t}

are derived by averaging the elements corresponding to the first 15 sampling points of the sequence, thereby minimizing the error in the reference initial values.

Due to the differing units of rotation and motion, this formula serves only as a numerical operation, and Amplitude is expressed as a dimensionless quantity. Theoretically, the subject’s head can be approximated as a sphere with a radius of 80 mm, rolling and sliding on the bed surface of the PET scanner. Given that rolling is the predominant motion, the value of

λ

is set between 0.7 and 1, reflecting a primary focus on rotation with supplementary consideration of translation [42]. Specifically, not only is this metric sensitive to the rotation of the head, but it also effectively monitors the motion changes caused by the head’s translation. In the case of a two-dimensional scenario, the specific meaning of this monitoring metric can be referenced in Figure 5. In this figure,

Rotation

represents the rotation angle synthesized by the pose monitoring algorithm,

{Motion}_{R}

denotes the displacement caused by rotation, and

{Motion}_{T}

refers to the displacement induced by translation.

Experimental and clinical tests indicate that

λ = 0.9

achieves optimal motion representation. Based on this, Amplitude thresholds are established: 10 as the warning value and 15 as the critical threshold. PET scans with Amplitude values exceeding 15 are excluded from imaging analysis.

4. Experiments

To validate the performance of the proposed motion monitoring system, comprehensive experiments are conducted using self-collected data from phantoms and volunteers. The system’s performance is assessed across multiple dimensions, including recognition accuracy, computational speed, and overall system robustness, to align with clinical requirements.

4.1. Experiment Setup

4.1.1. Datasets

The experimental data for this study were derived from self-collected phantom datasets and volunteer datasets obtained in clinical environments. The phantom dataset was collected using a custom-built rotational platform (as shown in Figure 6a), consisting of a background frame, a high-precision rotatable gimbal, and a high-fidelity phantom. A total of 23 video cases were collected, covering scenarios such as static state, spatial translation, multi-angle rotation, and combined movements. The volunteer dataset was collected in a 1:1 real PET scanning room provided by Shanghai United Imaging Healthcare Co., Ltd., Shanghai, China (as shown in Figure 6b), including 17 video cases covering static state, single-angle and multi-angle rotations, spatial translation, and arbitrary combined movements. In total, the two datasets comprise 40 video cases and approximately 48,000 images.

4.1.2. Interaction Software Design

Given that this study is intended for deployment in PET operating systems and primarily targets non-technical users such as doctors, designing user-friendly and interactive software is particularly crucial. On the one hand, software integration serves as a key method for technical consolidation, enabling the seamless connection of individual modules into a fully functional system. This not only facilitates the organic combination of functionalities but also provides a testing platform for experimental validation and system optimization, allowing for the evaluation of accuracy and real-time performance. On the other hand, considering that the primary users are doctors and other non-engineering professionals, an intuitive and user-friendly interface can significantly reduce the barriers to understanding and operation, thereby enhancing the practicality and adoption of the system. The modular software system architecture designed in this study is shown in Figure 7.

The software system is integrated and developed through the decomposition of functionality and interaction deployment into the following four main modules:

Image Registration Module: This module receives RGB and depth images, fuses them using visualization methods, and outputs the fused image to the visualization module. The depth image is stored on the local server, and the RGB image is sent to the feature point recognition module. The module returns the 2D coordinates of facial feature points, which are then used for 3D reconstruction, and the 3D information is input into the pose estimation module.
Feature Point Recognition Module: This module processes the RGB images provided by the image registration module using deep learning networks to detect facial feature points and return their 2D coordinates.
Pose Estimation Module: This module receives the 3D feature point data, applies rigid-body motion algorithms to calculate the head movement amplitude, and outputs the result to the local server. It also plots the corresponding movement amplitude curve, which is displayed in the visualization module.
Visualization Module: This module receives the RGB and fused images from the image registration module, as well as the movement amplitude curve from the pose estimation module. It displays them on the interface and receives user input to control the operation of the system.

4.2. Performance

4.2.1. Phantom Experiments

The phantom experiments were tested on a self-constructed experimental platform. This platform consists of a background frame, a rotating gimbal, and a phantom. The background frame is composed of three mutually perpendicular blue background panels, which serve to limit the maximum measurement depth, thereby ensuring more accurate depth measurements. Based on this experimental platform, various motion tests were conducted, including stationary, spatial translation, multi-angle rotation, and composite motion, to comprehensively evaluate the system’s performance and applicability.

Taking the pitch motion of the phantom as an example, the tracking results of the spatial coordinates for the 12 selected feature points used in the spatial pose estimation are shown in Figure 8a. The pitch motion of the head can be decomposed into the superposition of feature point movements in the y and z directions. Therefore, during periodic pitch motion of the phantom, the y and z components of the facial feature points’ spatial coordinates exhibit synchronized periodic variations, while the x component remains nearly constant.

Furthermore, based on the real-time recognition and collection of facial feature points, spatial motion estimation of the phantom was performed, with the results shown in Figure 8b,c. The motion of the local facial coordinate system can be decomposed into six types of movements: translations in the X, Y, and Z directions, and rotations around the X-axis (pitch movement), Y-axis (roll movement), and Z-axis (yaw movement). During periodic pitch motion of the phantom, rotation occurs around the X-axis, leading to periodic changes in the angles of the Y and Z axes, while the angle of the X-axis remains nearly constant. It is important to note that the rotation of the phantom is facilitated by the rotating gimbal, so the intersection of the rotation axes is located at the lower end of the neck, rather than at the center of mass of the phantom. As a result, in Figure 8b,c, the translations in the Y and Z directions are not zero. In an ideal scenario, where the intersection of the three orthogonal rotation axes coincides with the center of mass of the phantom, this issue could be avoided.

Based on the calculated translational and rotational amplitudes of the head phantom in different directions, the total translational amplitude and rotational amplitude were quantified, as shown in Figure 8d. Subsequently, the overall motion intensity of the head phantom was derived using Equation 11, as shown in Figure 8e. It is important to note that the warning threshold of 10 and the exclusion threshold of 15 were established based on clinical experimental studies and must adhere to specific experimental conditions.

Overall, the proposed system was thoroughly tested and analyzed for errors through experiments involving different types of head models. As shown in Figure 6a, the motion of the head model was controlled by a turntable at its base, and detailed records were maintained for various motions, such as pitch and yaw. These records included the peak and valley values of motion amplitude for each cycle, the range of translational movement, and the fluctuations in motion amplitude during stationary states. A comprehensive analysis was conducted to evaluate the system’s error performance under different motion conditions, including the deviation of peaks (or valleys) from the true values and the minor fluctuations observed during stationary states. The experimental results demonstrate that the system achieves an error within 2.5 mm for translational movements and within 2.0° for rotational movements.

4.2.2. Volunteer Experiments

The volunteer experiments were conducted in a 1:1 scale real PET scanning room provided by Shanghai United Imaging Healthcare Co., Ltd. In terms of environmental setup, the volunteer lay flat on the PET scanning bed, with the RGB-D camera positioned 60 cm above the volunteer’s face and kept parallel to the facial plane. Various motion scenarios were tested, including the volunteer remaining stationary, performing multi-angle rotations, and engaging in composite random movements. For each specific motion pattern, such as yaw or pitch movements, comprehensive communication and practice sessions were conducted with the volunteer prior to the experiments. These sessions covered detailed instructions on the movement patterns, amplitude, and number of cycles to ensure the volunteer could accurately control the prescribed motions before the experiments began. For natural or composite movements, the volunteer was instructed to move freely to simulate realistic motion scenarios during PET scanning. The software operation interface of the system, in addition to interactive function buttons, simultaneously records and displays the feature point recognition interface and the depth registration interface, as shown in Figure 7b. The feature point recognition interface displays the detected facial bounding box, feature points, and the local head coordinate system.

Spatial pose tracking was performed on the 12 facial feature points of the volunteer during stationary, pitch movement, yaw movement, and multi-type movement, with the results shown in Figure 9. It can be observed that when the volunteer remained stationary, and the spatial coordinates in the X, Y, and Z directions remained constant. During pitch or roll motion, which involves single-axis rotation, the coordinates in two directions exhibit periodic variations, while the coordinate along the axis remains almost unchanged. In the case of composite motion, the spatial coordinates in all three directions change in a highly synchronized manner, either remaining unchanged or varying simultaneously.

Based on the recognition results of the facial feature points, the translational and rotational amplitudes for each type of motion were synthesized, and the overall motion amplitude was calculated using Equation (5). The results are shown in Figure 10. According to the experimental setup, when the total motion amplitude exceeds 10, it indicates that the motion during this period is relatively large, potentially affecting the imaging results, and a warning should be triggered. When the total motion amplitude exceeds 15, it suggests that the motion during this period is excessive, with a high likelihood of causing artifacts in the imaging results, necessitating the exclusion of the corresponding data.

4.3. Cost Analysis

The system is designed to enable real-time monitoring and tracking of head movements during PET scanning, allowing the identification and exclusion of imaging data acquired during periods of excessive motion. Consequently, the system’s real-time performance and algorithmic processing efficiency must be thoroughly evaluated and validated. To assess the system’s modular integrity and operational efficiency, the processing time for single-frame images in each module was tested in C++ environments. Additionally, the real-time operational mode and the image-reading mode were analyzed, with the time consumption of each module under these two modes summarized in Table 1.

As shown in Table 1, in the real-time processing mode of the system, the modules for detecting facial feature points from RGB images and calculating the spatial pose of the head using facial feature points exhibit the lowest time consumption, with average processing times below 0.1 ms, posing negligible performance impact. However, the most time-consuming parts of the system are the depth registration module and the image acquisition module. Specifically, the registration of depth and RGB images, as well as the acquisition of depth and RGB images by the camera, are the primary contributors. The average processing times for the former module are 32 ms under C++ environments, while the latter consumes an average of 21 ms. Overall, the total time required for processing each frame in this mode ranges between 50 ms and 55 ms, enabling the system to operate at an average frame rate exceeding 20 frames per second.

In contrast, under the image-reading mode, the total time consumption per frame is approximately 35 ms to 40 ms, which is 10 ms to 15 ms faster than the real-time processing mode. This improvement is primarily attributed to the ability of the system to perform subsequent operations after completing the acquisition of all RGB and depth images, thereby eliminating the time overhead caused by function calls during real-time execution.

5. Discussion

In this study, we validated the proposed motion monitoring system through testing on a precisely constructed phantom platform and in a realistically replicated PET scanning room. The experimental results demonstrate that this system effectively addresses the significant problem of artifacts caused by head motion during PET scans. By evaluating the intensity of head motion, the system enables artifact screening and post-processing of PET imaging results, significantly improving identification efficiency and saving considerable time for physicians. This improvement enhances the comfort of medical services and contributes to the advancement of healthcare systems, highlighting the system’s potential for clinical application.

One notable advantage of the system is its reliance on facial feature recognition and tracking to monitor head motion. This approach simplifies project complexity, avoids the need for external markers or complex hardware setups, and ensures low computational requirements, making it adaptable for deployment on various processors. Furthermore, the system achieves angular displacement accuracy and translational displacement accuracy of less than 2.0° and 2.5 mm, respectively, well within the clinical thresholds of 5.0° and 5.0 mm. These results validate the feasibility and reliability of the system for practical use. Despite these strengths, the system faces limitations when handling large head motion amplitudes (e.g., yaw angles exceeding 60°), which may reduce the stability of facial feature point recognition. While this instability slightly affects motion amplitude estimation, its impact on overall motion intensity remains manageable due to the clear distinctions between periods of large motion and stationary intervals.

The system’s real-time responsiveness further supports its applicability, as it generates rigid-body motion monitoring images and comprehensive motion intensity metrics within approximately 10 s post-scan. This efficiency surpasses manual image selection speeds, significantly improving diagnostic workflows. Additionally, the developed visualization software enhances user experience by supporting features such as real-time data acquisition, image loading, and intuitive motion artifact evaluations through RGB and fusion images.

Future research will focus on expanding this system to monitor full-body 3D motion by incorporating advanced three-dimensional reconstruction methods for facial feature points. This would enable comprehensive elimination of motion artifact periods in PET imaging and further support clinical applications. The use of higher-precision depth cameras will be explored to enhance detection accuracy, targeting motion amplitude errors within 2 mm. Collaboration with industry partners, such as Shanghai United Imaging Healthcare Co., Ltd., will also be prioritized to accelerate the clinical translation of this technology.

6. Conclusions

This study addressed the challenges of prolonged PET scanning durations and motion artifacts by designing a motion detection and recognition system based on natural images. The system achieves contactless head motion monitoring and intensity estimation without relying on external markers, providing reliable criteria for artifact screening. This innovation simplifies the manual selection of imaging results, saving valuable time for physicians. The system employs an RGB-D monocular structured-light camera, avoiding complex hardware setups while balancing accuracy and real-time performance. Experiments conducted on both phantom models and human volunteers validated the system’s capability across various motion scenarios, achieving clinically acceptable displacement accuracy. This balance of performance and feasibility underscores the system’s potential for practical deployment.

In summary, the proposed motion monitoring system bridges a critical gap in contactless, marker-free PET motion monitoring using low-cost and non-invasive RGB-D cameras. By combining accuracy, real-time responsiveness, and user-friendly visualization tools, the system significantly enhances artifact identification efficiency, benefiting both physicians and patients. This framework lays a foundation for future developments in PET imaging and broader motion monitoring applications.

Author Contributions

Conceptualization, Y.S. (Yixin Shan); data curation, Y.S. (Yixin Shan); formal analysis, Y.S. (Yixin Shan) and H.L.; investigation, Y.S. (Yixin Shan), Z.L., Z.S. and H.L.; methodology, Y.S. (Yixin Shan), Z.L. and Z.S.; software, Z.L.; validation, Y.S. (Yixin Shan); visualization, Z.L. and Z.S.; writing—original draft, Y.S. (Yixin Shan); writing—review and editing, Y.S. (Yixin Shan), J.X. and X.C.; funding acquisition, Y.S. (Yixing Sun) and X.C.; project administration, J.X. and X.C.; supervision, Y.S. (Yixing Sun) and X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grants from National Natural Science Foundation of China (82330063), Shanghai Jiao Tong University Foundation on Medical and Technological Joint Science Research (YG2023ZD19; YG2023ZD15; YG2024QNA23; YG2024QNB26), the Foundation of Science and Technology Commission of Shanghai Municipality (24TS1413000, 24490710300, 22Y11911700), and Shanghai Leading Talent Program of Eastern Talent Plan (BJKJ2024003).

Informed Consent Statement

All individuals participating in the volunteer experiment or whose images are presented in the paper have been informed and have given their consent.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the need to protect the portrait rights of the volunteers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

van der Meulen, N.P.; Strobel, K.; Lima, T.V.M. New radionuclides and technological advances in SPECT and PET scanners. Cancers 2021, 13, 6183. [Google Scholar] [CrossRef] [PubMed]
Filippi, L.; Dimitrakopoulou-Strauss, A.; Evangelista, L.; Schillaci, O. Long axial field-of-view PET/CT devices: Are we ready for the technological revolution? Expert Rev. Med. Devices 2022, 19, 739–743. [Google Scholar] [CrossRef]
Berger-Tal, O.; Blumstein, D.T.; Swaisgood, R.R. Conservation translocations: A review of common difficulties and promising directions. Anim. Conserv. 2020, 23, 121–131. [Google Scholar]
Surti, S.; Pantel, A.R.; Karp, J.S. Total body PET: Why, how, what for? IEEE Trans. Radiat. Plasma Med. Sci. 2020, 4, 283–292. [Google Scholar]
Zhang, S.; Wang, X.; Gao, X.; Chen, X.; Li, L.; Li, G.; Liu, C.; Miao, Y.; Wang, R.; Hu, K. Radiopharmaceuticals and their applications in medicine. Signal Transduct. Target. Ther. 2025, 10, 1. [Google Scholar] [PubMed]
Kyme, A.Z.; Fulton, R.R. Motion estimation and correction in SPECT, PET and CT. Phys. Med. Biol. 2021, 66, 18TR02. [Google Scholar]
Zeng, T.; Zhang, J.; Lieffrig, E.V.; Cai, Z.; Chen, F.; You, C.; Naganawa, M.; Lu, Y.; Onofrey, J.A. Fast Reconstruction for Deep Learning PET Head Motion Correction. Int. Conf. Med. Image Comput. Comput.-Assisted Interv. 2023, 14229, 710–719. [Google Scholar]
Spangler-Bickell, M.G.; Khalighi, M.M.; Hoo, C.; DiGiacomo, P.S.; Maclaren, J.; Aksoy, M.; Rettmann, D.; Bammer, R.; Zaharchuk, G.; Zeineh, M.; et al. Rigid motion correction for brain PET/MR imaging using optical tracking. IEEE Trans. Radiat. Plasma Med. Sci. 2018, 3, 498–503. [Google Scholar]
Henry, D.; Fulton, R.; Maclaren, J.; Aksoy, M.; Bammer, R.; Kyme, A. Close-range feature-based head motion tracking for MRI and PET-MRI. In Proceedings of the 2018 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), Sydney, NSW, Australia, 10–17 November 2018; pp. 1–3. [Google Scholar]
Borowska-Terka, A.; Strumillo, P. Person independent recognition of head gestures from parametrised and raw signals recorded from inertial measurement unit. Appl. Sci. 2020, 10, 4213. [Google Scholar] [CrossRef]
Chatzidimitriadis, S.; Bafti, S.M.; Sirlantzis, K. Non-intrusive head movement control for powered wheelchairs: A vision-based approach. IEEE Access 2023, 11, 65663–65674. [Google Scholar] [CrossRef]
Elmoujarkach, E.; Seeger, S.; Möller, N.; Schmidt, C.; Rafecas, M. Development and characterization of 3D printed radioactive phantoms for high resolution PET. In Proceedings of the 2022 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), Milano, Italy, 5–12 November 2022; pp. 1–2. [Google Scholar]
Pratt, E.C.; Lopez-Montes, A.; Volpe, A.; Crowley, M.J.; Carter, L.M.; Mittal, V.; Pillarsetty, N.; Ponomarev, V.; Udías, J.M.; Grimm, J.; et al. Simultaneous quantitative imaging of two PET radiotracers via the detection of positron–electron annihilation and prompt gamma emissions. Nat. Biomed. Eng. 2023, 7, 1028–1039. [Google Scholar]
Brenner, M.; Reyes, N.H.; Susnjak, T.; Barczak, A.L.C. RGB-D and thermal sensor fusion: A systematic literature review. IEEE Access 2023, 11, 82410–82442. [Google Scholar] [CrossRef]
Zanfir, A.; Marinoiu, E.; Sminchisescu, C. Monocular 3D pose and shape estimation of multiple people in natural scenes—The importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2148–2157. [Google Scholar]
Sun, C.; Gu, D.; Lu, X. Three-dimensional structural displacement measurement using monocular vision and deep learning based pose estimation. Mech. Syst. Signal Process. 2023, 190, 110141. [Google Scholar]
Mansour, M.; Davidson, P.; Stepanov, O.; Piché, R. Relative importance of binocular disparity and motion parallax for depth estimation: A computer vision approach. Remote Sens. 2019, 11, 1990. [Google Scholar] [CrossRef]
Liu, L.; Liu, Y.; Lv, Y.; Li, X. A novel approach for simultaneous localization and dense mapping based on binocular vision in forest ecological environment. Forests 2024, 15, 147. [Google Scholar] [CrossRef]
Paredes, A.L.; Song, Q.; Heredia Conde, M. Performance evaluation of state-of-the-art high-resolution time-of-flight cameras. IEEE Sens. J. 2023, 23, 13711–13727. [Google Scholar] [CrossRef]
Zhu, J.; Gao, C.; Sun, Q.; Wang, M.; Deng, Z. A Survey of Indoor 3D Reconstruction Based on RGB-D Cameras. IEEE Access 2024, 12, 112742–112766. [Google Scholar]
Nguyen, A.-H.; Ly, K.L.; Lam, V.K.; Wang, Z. Generalized fringe-to-phase framework for single-shot 3D reconstruction integrating structured light with deep learning. Sensors 2023, 23, 7284. [Google Scholar] [CrossRef]
Kareem, O.S. Face mask detection using haar cascades classifier to reduce the risk of COVID-19. Int. J. Math. Stat. Comput. Sci. 2024, 2, 19–27. [Google Scholar]
Arooj, S.; Altaf, S.; Ahmad, S.; Mahmoud, H.; Mohamed, A.S.N. Enhancing sign language recognition using CNN and SIFT: A case study on Pakistan sign language. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 101934. [Google Scholar]
Bakheet, S.; Al-Hamadi, A.; Youssef, R. A fingerprint-based verification framework using Harris and SURF feature detection algorithms. Appl. Sci. 2022, 12, 2028. [Google Scholar] [CrossRef]
Meng, Z.; Kong, X.; Meng, L.; Tomiyama, H. Lucas-Kanade Optical Flow Based Camera Motion Estimation Approach. In Proceedings of the 2019 International SoC Design Conference (ISOCC), Jeju, Republic of Korea, 6–9 October 2019; pp. 77–78. [Google Scholar] [CrossRef]
Shakir, S.; Rambli, D.R.A.; Mirjalili, S. Vision-based human detection techniques: A descriptive review. IEEE Access 2021, 9, 42724–42761. [Google Scholar]
Khodarahmi, M.; Maihami, V. A review on Kalman filter models. Arch. Comput. Methods Eng. 2023, 30, 727–747. [Google Scholar] [CrossRef]
Chicco, D. Siamese neural networks: An overview. In Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2021; pp. 73–94. [Google Scholar]
Fang, H.; Liao, G.; Liu, Y.; Zeng, C. Siam-sort: Multi-target tracking in video SAR based on tracking by detection and Siamese network. Remote Sens. 2023, 15, 146. [Google Scholar] [CrossRef]
Wang, Y.; Kuang, B.; Durazo, I.; Zhao, Y. 3D Reconstruction of Rail Tracks based on Fusion of RGB and Infrared Sensors. In Proceedings of the 2024 29th International Conference on Automation and Computing (ICAC), Sunderland, UK, 28–30 August 2024; pp. 1–6. [Google Scholar]
Kalenberg, K.; Müller, H.; Polonelli, T.; Schiaffino, A.; Niculescu, V.; Cioflan, C.; Magno, M.; Benini, L. Stargate: Multimodal sensor fusion for autonomous navigation on miniaturized UAVs. IEEE Internet Things J. 2024, 11, 21372–21390. [Google Scholar] [CrossRef]
Ganga, B.; Lata, B.T.; Venugopal, K.R. Object detection and crowd analysis using deep learning techniques: Comprehensive review and future directions. Neurocomputing 2024, 597, 127932. [Google Scholar] [CrossRef]
Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
Aydın, M.T.; Menemencioğlu, O.; Orak, İ.M. Face Recognition Approach by Using Dlib and K-NN. Curr. Trends Comput. 2024, 1, 93–103. [Google Scholar]
Prados-Torreblanca, A.; Buenaposada, J.M.; Baumela, L. Shape preserving facial landmarks with graph attention networks. arXiv Preprint 2022, arXiv:2210.07233. [Google Scholar]
Bai, Q.; Shehata, M.; Nada, A. Review study of using Euler angles and Euler parameters in multibody modeling of spatial holonomic and non-holonomic systems. Int. J. Dyn. Control 2022, 10, 1707–1725. [Google Scholar] [CrossRef]
Kimathi, S.; Lantos, B. Simultaneous attitude and position tracking using dual quaternion parameterized dynamics. In Proceedings of the 2024 IEEE 22nd World Symposium on Applied Machine Intelligence and Informatics (SAMI), Stará Lesná, Slovakia, 25–27 January 2024; pp. 000309–000314. [Google Scholar]
Zhong, F.; Liu, G.; Lu, Z.; Han, Y.; Liu, F.; Ye, T. Inverse Kinematics Analysis of Humanoid Robot Arm by Fusing Denavit–Hartenberg and Screw Theory to Imitate Human Motion With Kinect. IEEE Access 2023, 11, 67126–67139. [Google Scholar]
Zingoni, A.; Diani, M.; Corsini, G. Tutorial: Dealing with rotation matrices and translation vectors in image-based applications: A tutorial. IEEE Aerosp. Electron. Syst. Mag. 2019, 34, 38–53. [Google Scholar]
Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 22, 1330–1334. [Google Scholar]
King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
Aksoy, M.; Forman, C.; Straka, M.; Skare, S.; Holdsworth, S.; Hornegger, J.; Bammer, R. Real-time optical motion correction for diffusion tensor imaging. Magn. Reson. Med. 2011, 66, 366–378. [Google Scholar]

Figure 1. System workflow diagram. (a) Image processing algorithms, including image registration, feature point recognition, etc. (b) Motion monitoring algorithms, including spatial pose monitoring, motion intensity estimation, etc. (c) Visualization software and test interface. (d) Experimental test environment and hardware equipment.

Figure 2. Camera calibration interface. (a) Checkerboard for calibration. (b) Reprojection error of checkerboard images. (c) Images of checkerboard from different perspectives.

Figure 3. Recognition effect of 68 feature points. (a) Standard diagram of 68 points. (b) Recognition effect on volunteer.

Figure 4. The registration of depth images and RGB images. (a) Images before registration. (b) Images after registration.

Figure 5. The geometric significance of the “Amplitude” in the two-dimensional case.

Figure 6. Dataset composition and collection environment. (a) Phantom dataset collection environment. (b) Volunteer dataset collection environment.

Figure 7. Interactive software. (a) Software design architecture. (b) Main interface.

Figure 8. Motion analysis of pitch movement. (a) The spatial position of 12 feature points. (b) The decomposition and estimation of translation. (c) The decomposition and estimation of rotation. (d) The estimation of translational amplitude and rotational amplitude. (e) The estimation of exercise intensity amplitude.

Figure 9. Spatial position tracking of 12 feature motion points for different motions.

Figure 10. Motion amplitude analysis of volunteer experiments. (a) Static state. (b) Yaw movement. (c) Pitch movement. (d) Multi-type movement.

Table 1. The time consumption of each module in the system.

System Mode	Module	Mean (ms)	Standard Deviation (ms)
Real-Time Mode	Image Acquisition Module	21.3082	6.6335
	Depth Registration Module	32.0709	2.0099
	Feature Point Recognition Module	0.0171	0.0074
	Pose Calculation Module	0.0274	0.0177
	Total Time	53.4236	6.9313
Image-Reading Mode	Total Time	35.1104	3.0520

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Shan, Y.; Lu, Z.; Sun, Z.; Liu, H.; Xu, J.; Sun, Y.; Chen, X. RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning. Electronics 2025, 14, 1441. https://doi.org/10.3390/electronics14071441

AMA Style

Shan Y, Lu Z, Sun Z, Liu H, Xu J, Sun Y, Chen X. RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning. Electronics. 2025; 14(7):1441. https://doi.org/10.3390/electronics14071441

Chicago/Turabian Style

Shan, Yixin, Zikun Lu, Zhe Sun, Hao Liu, Jiangchang Xu, Yixing Sun, and Xiaojun Chen. 2025. "RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning" Electronics 14, no. 7: 1441. https://doi.org/10.3390/electronics14071441

APA Style

Shan, Y., Lu, Z., Sun, Z., Liu, H., Xu, J., Sun, Y., & Chen, X. (2025). RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning. Electronics, 14(7), 1441. https://doi.org/10.3390/electronics14071441

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

RGB-D Camera-Based Human Head Motion Detection and Recognition System for Positron Emission Tomography Scanning

Abstract

1. Introduction

2. Literature Review

2.1. Principle of PET Scan Imaging

2.2. Image Information Acquisition

2.3. Detection and Recognition of Head Feature Points

2.4. Space Motion Monitoring of Rigid-like Objects

3. Research Methods

3.1. Camera Registration and Image Acquisition

3.2. Head Feature Point Recognition

3.3. Fusion Registration of RGB Images and Depth Images

3.4. Calculation of Head Space Pose

3.5. Definition and Calculation of Head Motion Intensity

4. Experiments

4.1. Experiment Setup

4.1.1. Datasets

4.1.2. Interaction Software Design

4.2. Performance

4.2.1. Phantom Experiments

4.2.2. Volunteer Experiments

4.3. Cost Analysis

5. Discussion

6. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI