*Article* **An Experimental Study on the Validity and Reliability of a Smartphone Application to Acquire Temporal Variables during the Single Sit-to-Stand Test with Older Adults**

**Diogo Luís Marques 1, Henrique Pereira Neiva 1,2, Ivan Miguel Pires 3,4,5, Eftim Zdravevski 6, Martin Mihajlov 7, Nuno M. Garcia 3, Juan Diego Ruiz-Cárdenas 8, Daniel Almeida Marinho 1,2 and Mário Cardoso Marques 1,2,\***


**Abstract:** Smartphone sensors have often been proposed as pervasive measurement systems to assess mobility in older adults due to their ease of use and low cost. This study analyzes the validity and reliability of a smartphone-based application to quantify temporal variables during the single sit-to-stand test with institutionalized older adults. Forty older adults (20 women and 20 men; 78.9 ± 8.6 years) volunteered to participate in this study. All participants performed the single sit-to-stand test. Each sit-to-stand repetition was performed after an acoustic signal emitted by the smartphone app. All data were acquired simultaneously with a smartphone and a digital video camera. The measured temporal variables were stand-up time and total time. The relative reliability and systematic bias between devices were assessed using the intraclass correlation coefficient (ICC) and Bland-Altman plots, while absolute reliability was assessed using the standard error of measurement and coefficient of variation (CV). Inter-device concurrent validity was assessed through correlation analysis. The absolute percent error (APE) and the accuracy were also calculated. The results showed excellent reliability (ICC = 0.92–0.97; CV = 1.85–3.03%) and very strong relationships between devices for the stand-up time (*r* = 0.94) and the total time (*r* = 0.98). The APE was lower than 6%, and the accuracy was higher than 94%. Based on our data, the smartphone application is valid and reliable for collecting the stand-up time and total time during the single sit-to-stand test with older adults.

**Keywords:** mobile application; accelerometer sensor; stand-up time; total time; aging

#### **1. Introduction**

As populations in industrialized countries age [1], the care of older adults is critical for their well-being. Consequently, the evaluation and quantification of daily activities are essential for determining health status changes and, subsequently, detecting early signs of loss of autonomy [2]. Standing up from a sitting position and its counterpart transition, sitting down from a standing position, are the two most common daily motor
activities that could be critical indicators of older adults' functional autonomy [3,4]. The sit-to-stand movement is one of the most challenging activities in terms of mechanics [5]. It requires the optimization of several kinematic tasks, including coordination, balance, mobility, muscular strength, and power output [6]. Within the population of older adults, increased sit-to-stand time is associated with a high risk of falls [7–9], decreased leg muscle power and strength [10,11], slow walking speed [12–14], and mobility disability [15]. Usually, the sit-to-stand test time, combined with the subject's age and previous medical history (e.g., recovery from an injury or surgery), helps identify fall risk and assess functional lower extremity strength, transitional movements, and balance.

Recently, approaches for quantifying mobility have emerged that rely on inexpensive sensor technologies [16]. Notably, the smartphone has been suggested as a useful tool to objectively monitor and improve patients' health and fitness [17], which has been verified in research [18] and clinical practice [19]. Smartphone sensor technology has become sufficiently reliable and accurate to substitute specific biomechanics laboratory equipment and portable devices used in functional mobility research. Some authors showed that a smartphone with a motion sensor could be used as a low-cost integration device to evaluate a patient's balance and mobility [20]. Other authors demonstrated that a smartphone's accelerometer could measure kinematic tremor frequencies equivalent to those obtained with electromyography [21]. Wile et al. [22] utilized a smartwatch to differentiate the symptoms in patients with Parkinson's disease and related tremor diseases by calculating the signal power of the first four harmonics.

Several clinical tests, such as the timed-up and go test and the sit-to-stand test, have been developed to evaluate physical performance and mobility related to everyday tasks [8,23]. The sit-to-stand test is a widely adopted clinical test used to evaluate older adults' functionality [24]. Initially designed to measure the lower extremities' functional capacity [25], it has been applied and investigated in different populations to assess the rehabilitation process and functional performance in older people with varied medical conditions [26]. During the sit-to-stand test, researchers and clinicians commonly measure the time spent performing a fixed number of repetitions or the number of repetitions performed during a specific time [4,14]. Commonly, evaluators use a chronometer to measure the time during the sit-to-stand test [27,28]. Despite its low cost and ease of use, the chronometer presents some limitations, mainly associated with human error (e.g., reaction time delay and position judgment) [29–31]. Therefore, to overcome the limitations mentioned above, clinicians and researchers may opt for alternative and reliable technologies to measure biomechanical parameters during the sit-to-stand test with aged populations [32,33]. Several authors have analyzed smartphone applications to quantify kinematic variables during the sit-to-stand test using high-speed video recordings [34,35]. Other studies utilized the triaxial accelerometer sensor embedded in the smartphone [30,31,36]. For example, González-Rojas et al. [37] characterized the time measurement of sit-to-stand transitions by transforming the relative acceleration signal recorded by a triaxial accelerometer. Cerrito et al. [30] validated a smartphone-based app using the accelerometer sensor to quantify the sit-to-stand movement in older adults. These authors captured vertical ground reaction forces and vertical acceleration simultaneously using two force plates (reference standard) and a smartphone. The total movement duration, peak force, rate of force development, and peak power were measured. Chan et al. [31] also developed a mobile application to calculate the time during the five-repetition sit-to-stand and timed-up and go tests in older women. That mobile application also includes a beep sound to cue the participants to initiate the test, which aims to eliminate potential human errors when using a chronometer, including the reaction time delay.

As frailty is viewed as a transitional state from robustness to functional decline, identifying a pre-frailty state can alleviate or postpone the consequences of this syndrome [38]. Within this context, temporal variables have been used as predictors of frailty in several studies with older adults. For example, Hausdorff et al. [39] measured the stride time, swing time, stance time, and percentage stance time. Fallers, compared with non-fallers, showed higher standard deviations and coefficients of variation across all variables. In the study by Zhou et al. [40], gait was deconstructed into clinically observable spatial-temporal variables to establish a quantitative model to classify fallers and non-fallers.

Considering that the sit-to-stand test is strongly recommended in clinical and research settings to assess functional independence and detect frailty and sarcopenia in aged populations [15,41], using smartphones to assess the sit-to-stand test in older adults is a logical advancement. Therefore, as stated above, reliably and accurately measuring temporal variables during the sit-to-stand test with older adults, including the stand-up time and the total time, is crucial for identifying those who present functional impairments and for designing individual clinical interventions to improve functionality [42,43]. Given the difficulties of performing different measurements with older adults, developing an easy-to-use solution is essential for taking preventive actions. Therefore, this study aimed to analyze the validity and reliability of a smartphone application to acquire temporal descriptors, including the stand-up time and total time, during the single sit-to-stand test with institutionalized older adults. We hypothesized that the smartphone application would be valid and reliable for measuring the temporal variables during the single sit-to-stand test with older adults. This study's novelty consists of creating a scientifically valid and reliable mobile application for the independent and automatic measurement of temporal variables during the single sit-to-stand test with older adults. To our best knowledge, only one study using a mobile app to quantify the sit-to-stand test with older adults displayed the data in real time through graphics [30]. However, analyzing the data through graphics might not be a practical approach in clinical contexts because it is complex and time-consuming. An essential factor to bear in mind is that clinicians and researchers want to access the results immediately when the test ends to provide real-time feedback to the participants. The automatic method will also ensure that the data were properly collected. Therefore, with our mobile app, the possibility of automatically recording, processing, and presenting the results on the smartphone screen when the test ends entails a new clinical approach to assess single sit-to-stand performance with older adults.

#### **2. Methods**

As part of the research on developing solutions for Ambient Assisted Living (AAL) [44–46], the scope of this study consists of using technological equipment embedding inertial sensors that acquire different data types to measure and identify human movements [47–50].

#### *2.1. Study Design*

This study used a cross-sectional design to analyze the validity and reliability of a smartphone application to capture temporal descriptors during the single sit-to-stand test with institutionalized older adults. A digital video camera was further used to validate the correct execution of the sit-to-stand movement, and only valid repetitions were considered in the results presented in this study. The sit-to-stand test is quick and easy to administer and has practical utility in clinical and research settings to evaluate functional independence in older adults [7,8]. The experimental procedures were carried out over ten weeks. Each session was performed between 10:00 and 11:00 a.m. in the same location (room temperature 22–24 °C). In the first week, we familiarized the participants with the testing procedures. We also measured body mass (TANITA BC-601, Tokyo, Japan) and height (Portable Stadiometer SECA, Hamburg, Germany). Then, from the second to the tenth week, the participants performed one testing session per week. In each session, we assessed a group of four to five participants in the sit-to-stand test.

#### *2.2. Participants*

Forty institutionalized older adults (20 men and 20 women) volunteered to participate in this study. Inclusion criteria were age ≥ 65 years, men and women, able to stand up from a chair independently with the arms crossed over the chest, and willingness to participate in the experimental procedures. Exclusion criteria were severe physical and cognitive impairment (i.e., Barthel index score < 60 and mini-mental state examination score < 20), deafness, musculoskeletal injuries in the previous three months, and terminal illness (life expectancy < 6 months). Table 1 presents the characteristics of the participants. All participants gave their informed consent for inclusion before they participated in the study. The study was conducted following the Declaration of Helsinki, and the Ethics Committee of the University of Beira Interior approved the protocol (code: CE-UBI-Pj-2019-019).



**Table 1.** Characteristics of the participants.

Data are mean ± standard deviation.

#### *2.3. Sit-to-Stand Test*

For this study, we used the single sit-to-stand test, starting with a 10 min general warm-up consisting of light walking and mobility exercises, as described in Marques et al. [51]. The participants were equipped with a smartphone placed inside a waistband (Sports Waistband Universal Phone Holder), which was in turn attached to the waist. We placed the mobile phone on the waist because the center of gravity is located around the abdomen [52]. The waistband was tightened to avoid slight movements of the mobile phone that would adversely affect data capture. The participants sat on an armless chair (height = 0.49 m) with the back straight and the arms crossed over the chest. We did not allow the participants to lean back on the chair nor to assume a perched position. All participants were instructed to maintain 90° hip and knee flexion, which the operator closely monitored during the test. Ten seconds after the smartphone application was activated, an acoustic signal cued the participants to stand up and sit down on the chair while maintaining the arms crossed over the chest, thus performing a single sit-to-stand movement. When they finished the movement, the participants rested on the chair with the arms crossed over the chest for 15 s. The subsequent single sit-to-stand movement was performed after hearing another acoustic signal. After six single sit-to-stand repetitions, the participants had a 3 min rest before repeating the test four more times. This procedure was necessary to ensure that enough correct data (i.e., with sufficient rest and without body movement before the beep) were collected for post-analysis. Before each sit-to-stand transition, we instructed the participants to perform the repetitions as fast as possible immediately after hearing the acoustic signal. In all trials, a researcher stood next to the participants to ensure safety during the movement transitions. Figure 1 illustrates the sit-to-stand testing procedure.

**Figure 1.** Illustration of the single sit-to-stand test.

#### *2.4. Data Acquisition*

A smartphone application and a digital video camera acquired the data simultaneously. The latter device was considered the reference criterion [53–55]. The smartphone model was the Xiaomi Mi A1. This device embeds a triaxial accelerometer (Bosch BMI120), which acquires data at a sampling frequency of 200 Hz. As described before, we placed the smartphone inside a waistband, which was then attached to the participants' waist. The digital video camera (Canon LEGRIA HF R46, Tokyo, Japan) was positioned perpendicular to the field of view (distance = 3 m) and attached to a stationary tripod (height = 1.2 m). We recorded the participants in the sagittal plane at a sampling frequency of 25 frames per second. Although the sampling frequency differed between devices (200 vs. 25 Hz), this was not considered a significant limitation [53]. In fact, previous studies with older adults comparing accelerometer data vs. video camera data during postural transitions (e.g., sit-to-stand or gait analysis) used different sampling frequencies between devices [54–56]. Moreover, according to the scientific literature, when comparing handheld devices (e.g., mobile phones) with machines such as video cameras, it is impossible to achieve synchronization or sampling frequency equality [53].

Regarding the use of a 25 Hz video camera for the analysis, it is essential to note that an error of one frame corresponds to an error of 0.04 s at this sampling frequency. Table 2 shows that the stand-up time and total time values are around 1.68 and 2.75 s, respectively. These values indicate that, in the worst-case scenario, the video camera's error is around 2.5% for the stand-up time (i.e., 0.04/1.68) and 1.5% for the total time (i.e., 0.04/2.75), which represents a minor measurement error. Therefore, as stated by Winter [57], except for high-speed running and athletic movements, slower movements (e.g., walking) can be analyzed reliably with minor errors using a 25 Hz video camera. Previous studies with older adults used a sampling frequency of 25 Hz to analyze kinematic data during movement transitions, such as sit-to-stand, stand-to-sit, or sit-to-walk [54–56,58,59], which reinforces the validity and reliability of using this frequency for movement analysis.

**Table 2.** Relative reliability and relationship inter-devices.


CI: confidence interval; ICC: intra-class correlation coefficient; ρ: Spearman's rank correlation coefficient; \*\*\* *p*-value < 0.001.

#### *2.5. Data Analysis*

#### 2.5.1. Mobile Application

The accelerometer data were acquired with a mobile application, which automatically pre-processes the raw data and measures the stand-up time and total time (Figure 2). These measures are related to the occurrence of events, such as when standing up starts or ends. The mobile application was developed in Android Studio 4.1 using Java SE 12 and is used for the automatic detection of different events during the sit-to-stand test. It was developed and adjusted by this study's research team, considering a previous study [60], and will be made available on the market after validation. Contrary to other studies, and to avoid issues related to the mobile device's incorrect positioning on the waist, we used the Euclidean norm of the accelerometer's outputs. Afterwards, the data were filtered, and the different calculations were applied. All measurements can be performed locally, without an Internet connection, and the results are presented as soon as the test ends. The stand-up time runs from the acoustic signal until the minimum (negative) acceleration value reached before the maximum (positive) acceleration value. The time frame between the acoustic signal and the maximum (positive) acceleration value is defined as the total time (Figure 2). For the measurement of the stand-up time and total time, we reviewed the literature on how these variables can be automatically detected from accelerometer data [41,42].

**Figure 2.** Signal plot of the acceleration during one sit-to-stand repetition (i.e., the complete cycle of stand-up and sit down on the chair); ST: stand-up time; TT: total time.
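To make the event-detection logic above concrete, the following is a minimal sketch (not the app's actual Java implementation) of how the two temporal variables could be derived from the accelerometer signal, assuming a NumPy array of triaxial samples at 200 Hz, a known sample index for the acoustic signal, and a simple moving-average filter as a stand-in for the app's filtering step.

```python
import numpy as np

FS = 200  # accelerometer sampling frequency (Hz)

def temporal_variables(acc_xyz, beep_idx, win=25):
    """Estimate the stand-up time and total time (s) for one repetition.

    acc_xyz  : (N, 3) array of raw triaxial accelerometer samples at 200 Hz
    beep_idx : sample index of the acoustic signal
    win      : moving-average window in samples (illustrative smoothing only)
    """
    # Euclidean norm of the three axes, reducing sensitivity to device orientation
    norm = np.linalg.norm(acc_xyz, axis=1)
    # Simple moving-average filter (placeholder for the app's actual filter)
    smooth = np.convolve(norm, np.ones(win) / win, mode="same")
    # Remove the static (gravity) component estimated before the beep
    signal = smooth - smooth[:beep_idx].mean()

    segment = signal[beep_idx:]
    peak_idx = int(np.argmax(segment))                   # maximum (positive) acceleration
    valley_idx = int(np.argmin(segment[:peak_idx + 1]))  # minimum before that maximum

    stand_up_time = valley_idx / FS   # acoustic signal -> minimum acceleration
    total_time = peak_idx / FS        # acoustic signal -> maximum acceleration
    return stand_up_time, total_time
```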

#### 2.5.2. Video-Camera Recordings

The video recording files were transferred to a personal laptop and then analyzed frame by frame using Adobe Premiere Pro (version 14.4.0, Adobe Systems, San Jose, CA, USA). The first frame was considered the start of the acoustic signal. After that, we calculated the stand-up time and total time. The stand-up time was defined as the interval from the acoustic signal until the participant was standing with the legs fully extended and an upright torso. The total time was measured from the acoustic signal until the participant returned to the seated position, defined as the moment of contact with the chair with the vertical velocity decreased to zero. We identified that the person was fully seated by monitoring the subsequent frames and ensuring that the selected frame corresponded to the moment they were fully seated. We converted the data to seconds by dividing the frame number by 25 frames per second. Repetitions were invalid if the participants moved any segment of the body immediately before the acoustic signal or did not complete the sit-to-stand cycle. Therefore, we only selected valid repetitions for further analysis.
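As a worked example of the frame-based timing (25 fps, so each frame spans 0.04 s), the conversion can be written as follows; the frame indices used in the call are hypothetical.

```python
FPS = 25  # video sampling frequency (frames per second)

def frames_to_times(beep_frame, upright_frame, seated_frame):
    """Convert annotated frame indices into stand-up time and total time (s)."""
    stand_up_time = (upright_frame - beep_frame) / FPS
    total_time = (seated_frame - beep_frame) / FPS
    return stand_up_time, total_time

# Hypothetical example: beep at frame 0, fully upright at frame 42 (1.68 s),
# seated again at frame 70 (2.80 s)
print(frames_to_times(0, 42, 70))
```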

#### *2.6. Statistical Analysis*

Outliers were detected by computing the interquartile range (IQR) of the differences between devices for each temporal variable: differences greater than 1.5 × IQR or smaller than −1.5 × IQR were removed [53]. The intraclass correlation coefficient (ICC, with 95% confidence intervals [CI]) analyzed the level of agreement or relative reliability between devices [61]. The ICC model was the two-way random-effects, absolute agreement, single rater/measurement model [ICC(2,1)] [61]. Cronbach's alpha analyzed internal consistency. ICC values were interpreted as: <0.50, poor; 0.50–0.75, moderate; 0.75–0.90, good; >0.90, excellent [61]. Bland-Altman plots with 95% limits of agreement (LOA) (mean difference ± 1.96 × standard deviation [SD] of the differences) analyzed the systematic bias/differences between devices [62]. The Kendall rank correlation coefficient (τ) between the absolute differences and the mean of both devices analyzed the degree of heteroscedasticity. If τ > 0.1, the data were considered heteroscedastic and were transformed by logarithms to the base 10 (log10) [63]. Linear regressions and Spearman's rank correlation coefficients (ρ) analyzed the concurrent validity between devices. The magnitude of correlation was interpreted as: 0.00–0.10, negligible; 0.10–0.39, weak; 0.40–0.69, moderate; 0.70–0.89, strong; 0.90–1.00, very strong [64]. The assumption of homoscedasticity was analyzed by inspecting the scatter plots of the standardized residuals against the standardized predicted values. The absolute reliability was analyzed by estimating the standard error of measurement (SEM = SD of the differences between the smartphone application and video camera scores divided by √2), the coefficient of variation (CV = (SEM/mean of both devices) × 100), and the minimal detectable change (MDC = √2 × SEM × 1.96) [65]. CV values < 5% were considered acceptable [66]. The absolute percent error of the measurements (APE = (|smartphone application − video camera|/video camera) × 100) [67] and the accuracy (((video camera − |video camera − smartphone application|)/video camera) × 100) were also calculated. An APE < 10% was considered acceptable [67]. We conducted a sample size calculation based on an expected reliability level of 0.90 and a minimum acceptable reliability level of 0.80. With an alpha value of 0.05 and 6 repetitions per participant, a minimum sample size of 32 was required to obtain a power of 80% [68]. The significance level was set at *p* < 0.05. All data were analyzed using Microsoft Office Excel 2016 and SPSS version 27 (SPSS Inc., Chicago, IL, USA). Figures were designed using GraphPad Prism version 7.0 (GraphPad Software Inc., San Diego, CA, USA).
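The absolute reliability and accuracy statistics defined above can be sketched in a few lines of Python; the formulas follow the definitions in this section, the array names are illustrative, and ICC(2,1) is omitted because it is normally obtained from a dedicated statistics package (e.g., SPSS).

```python
import numpy as np

def agreement_stats(app, camera):
    """Absolute reliability and accuracy between the app and the video camera.

    app, camera : 1-D arrays of paired measurements (s) for one temporal variable.
    """
    app, camera = np.asarray(app, float), np.asarray(camera, float)
    diff = app - camera
    sd_diff = np.std(diff, ddof=1)
    grand_mean = np.mean((app + camera) / 2)

    sem = sd_diff / np.sqrt(2)                    # standard error of measurement
    cv = sem / grand_mean * 100                   # coefficient of variation (%)
    mdc = np.sqrt(2) * sem * 1.96                 # minimal detectable change
    bias = np.mean(diff)                          # Bland-Altman systematic bias
    loa = (bias - 1.96 * sd_diff, bias + 1.96 * sd_diff)    # 95% limits of agreement
    ape = np.mean(np.abs(app - camera) / camera) * 100      # absolute percent error
    accuracy = np.mean((camera - np.abs(camera - app)) / camera) * 100
    return {"SEM": sem, "CV": cv, "MDC": mdc, "bias": bias,
            "LOA": loa, "APE": ape, "accuracy": accuracy}
```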

#### **3. Results**

Table 2 presents the relative reliability and the relationship between devices. The results obtained through manual analysis of the video camera recordings correspond to the traditional approach in which an operator observes the patient performing the test and times his/her movements with a chronometer; a careful frame-by-frame analysis is expected to be more accurate than an operator timing the test with a chronometer. Therefore, the results provided in Tables 2 and 3 and Figures 3 and 4, comparing the video camera results and the smartphone-based approach, also correspond to a comparison between a traditional and a smartphone-based approach. The stand-up time and total time showed excellent relative reliability and very strong significant relationships (*p* < 0.001) between devices.

**Table 3.** Absolute reliability and accuracy inter-devices.


APE: absolute percent error, calculated as (|smartphone application − video camera|/video camera) × 100; Accuracy, calculated as ((video camera − |video camera − smartphone application|)/video camera) × 100; CV: coefficient of variation; MDC: minimal detectable change; SEM: standard error of measurement.

Figure 3 shows the Bland-Altman plots of agreement between the mobile application and video-camera for the stand-up time (A) and total time (B).

Figure 4 shows the linear regression between the mobile application and the video camera for all variables. The 45° line indicates the amount of inter-device difference in the measurement of the variables. Both the stand-up time and total time fall near the 45° line. The resulting linear regression equation is provided for both variables.

Table 3 presents the absolute reliability and accuracy between devices. The stand-up time and total time showed CV values lower than 4%, revealing excellent absolute reliability. In both variables, the APE values were lower than 6%, and the accuracy was higher than 94%, thus revealing a high accuracy level.

**Figure 3.** Bland-Altman plots with 95% limits of agreement (mean difference ± 1.96 × standard deviation [SD] of the differences) between the mobile application and video-camera for the stand-up time (**A**) and total time (**B**); the solid lines in the middle of the plots represent the mean difference/bias, while the upper and lower dotted lines represent the upper and lower LOA.

**Figure 4.** Linear regression between the mobile application and video camera for the stand-up time (**A**) and total time (**B**); *r*²: coefficient of determination; the black lines indicate the regression line, while the red lines indicate the 45° line; dotted lines indicate 95% confidence intervals.

#### **4. Discussion**

In this study, we analyzed the validity and reliability of a smartphone application to quantify the stand-up time and total time during the single sit-to-stand test with institutionalized older adults. The results revealed excellent reliability, high accuracy, and very strong relationships between devices for both temporal variables. These results agree with our central hypothesis, meaning that the mobile application is valid and reliable for measuring temporal variables during the single sit-to-stand test with institutionalized older adults.

Regarding stand-up time, only two studies reported acquiring this temporal variable with a mobile application during the sit-to-stand test with older adults. In a study with stroke patients [36] of both sexes (67.50 ± 13.18 years), the authors observed a mean stand-up time of 1.95 s (SD = 0.08), which is 16% longer than the value observed in our study (1.68 ± 0.29 s). As our study focused on older adults without any medical conditions that would affect mobility during the sit-to-stand test, these differences are expected. In the second study [30], which included community-dwelling older adults (73.5 ± 10.4 years), the reported mean stand-up time of 1.66 s (SD = 0.42) is close to our results. Possible reasons for these similarities might be the use of the triaxial accelerometer embedded in the smartphone, a similar system to develop the mobile application, the participants' age, and the maximal intended speed during the sit-to-stand transfer. Another study that captured the stand-up time through a mobile application [35] reported substantially lower times (0.47 ± 0.09 s). However, the results are not comparable, as the participant sample included adults of a much wider age range (21–87 years), and the stand-up time in that study was defined as the rising phase of the sit-to-stand movement without taking into account the preparatory phase, i.e., when the trunk is shifted forward prior to seat-off. Additionally, the mobile application developed in that study quantified the sit-to-stand test based on high-speed video recordings and not through the triaxial accelerometer incorporated in the smartphone. That only three studies have analyzed the sit-to-stand test with mobile devices suggests that this is still an emerging area of research. Furthermore, the inconsistent results reported by different studies, due to the reasons mentioned above, justify the work performed in our research, which addresses a more controlled and homogeneous age group of older adults, lacking in other studies.

Studies that used body-fixed sensors instead of a mobile device to record the stand-up time with older adults observed that the time ranged between 1.81 and 2.17 s [69–72]. Furthermore, when the participants were instructed to perform the movement as fast as possible, the time decreased to 1.74 s (SD = 0.33) [69] and 1.7 s (SD = 0.80) [42]. These results are similar to the stand-up time values presented in our study, mainly when the sit-to-stand transfer is performed at the maximal intended velocity. Considering the findings above, accelerometer data acquired with a hybrid sensor or a mobile application seem to yield similar results in the sit-to-stand test with older adults.

Regarding the total time (i.e., the complete measurement of the stand-up/sit-down cycle), to our best knowledge, only one study measured this variable with a mobile application [36]. Merchán-Baeza et al. [36], in a study with stroke patients, reported a total time of 4.09 ± 0.07 s. This time is 45% longer than the total time found in our study (2.81 ± 0.50 s). However, this result is expected, as none of our participants had suffered a stroke; hence, faster sit-to-stand transitions can be anticipated in older adults without mobility disability compared with stroke patients.

Our mobile application demonstrated high accuracy and minor errors in capturing temporal variables during the single sit-to-stand test with older adults, reinforcing its validity and reliability. Although several related studies described accurate mobile apps to measure temporal variables during the sit-to-stand test with older adults [31,35], none reported accuracy values as we did, which precludes direct comparisons with our results.

We want to note that the paper discusses a mobile app designed primarily for use in clinical settings or in controlled settings when the patient has help from caregivers, medical personnel, or family members to set up the application and place the mobile device properly. If used in an uncontrolled environment, then the test results can be invalid. However, even in such limiting circumstances, not having to visit a medical center to perform the test is quite valuable for older adults with mobility and dexterity problems.

Future studies should consider the following limitations and perform new analyses in the sit-to-stand test with aged populations to strengthen this field's knowledge. Firstly, the sit-to-stand test analysis can be strengthened by capturing other biomechanical variables. For example, developing an algorithm to calculate the velocity, force, and power generated during the sit-to-stand transitions will provide insightful information for researchers and clinicians. Secondly, determining the mobile application's intra-device reliability by repeating the experiment over different trials will help analyze the results' consistency. Thirdly, analyzing the mobile app's validity and reliability considering smartphones with different sampling frequencies will help understand if its use can be generalized among several devices. Finally, applying other field-based tests such as upper and lower body strength tests, the timed-up and go, or walking speed tests will enable researchers to analyze their relationship with the temporal variables collected during the sit-to-stand test.

Another promising avenue for research is to include capabilities in the mobile application to identify whether the test was performed incorrectly, for example, due to incorrect body positioning or incomplete movements. For the current research, we assume that the participants were trained to perform the test when asked to install the mobile application. Additionally, the mobile app will also include informative videos and tutorials about the test's procedures.

#### **5. Conclusions**

For older adults, sit-to-stand tasks are an essential facet of independence and well-being. Therefore, improved quantification of the sit-to-stand test is warranted, as it can provide important information that can help improve the quality of life of older adults. The smartphone application presented in this study is suitable for valid and reliable measurements of temporal variables during the single sit-to-stand test with institutionalized older adults, specifically the stand-up time and total time. Researchers and clinicians commonly use these variables for different purposes, such as identifying frailty and analyzing the effects of different training interventions on these variables (i.e., do they improve the time to stand up from the chair after a training program?). Therefore, having a valid and reliable instrument like our mobile application to measure these variables is clinically essential for capturing accurate data. The smartphone application can also be used in contexts where budget, space, time, and equipment are limited. It is also essential to note that, as the test ends, the results are presented on the smartphone screen in real time, meaning that the evaluators can immediately access the data and provide reliable feedback regarding the test's performance. As a result, there is no need to use instruments that are more sensitive to human error, such as chronometers, to capture the data. Finally, the data are also stored in the mobile phone and the cloud, enabling follow-up analysis.

In the future, these tests should be evaluated by multidisciplinary teams comprised of coaches, physiotherapists, physicians, nurses, and technicians to identify potential issues that might have been neglected during this study. A pilot test in a broader population performed for a prolonged period should be conducted to evaluate the long-term effects of exercise or rehabilitation on older adults' sit-to-stand performance. As a result, it would help determine whether the current practice should be modified or updated and under which conditions.

**Author Contributions:** Conceptualization, methodology, software, validation, formal analysis, investigation, writing—original draft preparation, writing—review and editing; D.L.M., H.P.N., I.M.P., E.Z., M.M., N.M.G., J.D.R.-C., D.A.M. and M.C.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Portuguese Foundation for Science and Technology, I.P., grant number **SFRH/BD/147608/2019** and project number **UIDB/04045/2020**. This work is also funded by FCT/MEC through national funds and co-funded by FEDER—PT2020 partnership agreement under the project **UIDB/50008/2020** (*Este trabalho é financiado pela FCT/MEC através de fundos nacionais e cofinanciado pelo FEDER, no âmbito do Acordo de Parceria PT2020 no âmbito do projeto UIDB/50008/2020*), as well as by National Funds through the FCT—Foundation for Science and Technology, I.P., within the scope of the project **UIDB/00742/2020**.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of University of Beira Interior (protocol code CE-UBI-Pj-2019-019; 27/05/2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Data are available in Mendeley Data at http://dx.doi.org/10.17632/335rmgrfw2.4 (accessed on 29 September 2020).

**Acknowledgments:** We wish to express our sincere gratitude to the care staff of the Santa Casa da Misericórdia do Fundão and thank all the participants involved in this study. We also would like to thank João Leal, full professor in the University of Agder in Norway, for his contribution to programming the mobile application. This research was funded by Portuguese Foundation for Science and Technology, I.P., grant number **SFRH/BD/147608/2019** and project number **UIDB/04045/2020**. This work is funded by FCT/MEC through national funds and co-funded by FEDER—PT2020 partnership agreement under the project **UIDB/50008/2020** (*Este trabalho é financiado pela FCT/MEC através de fundos nacionais e cofinanciado pelo FEDER, no âmbito do Acordo de Parceria PT2020 no âmbito do projeto UIDB/50008/2020*). This work is also funded by National Funds through the FCT—Foundation for Science and Technology, I.P., within the scope of the project **UIDB/00742/2020.** This article is based upon work from COST Action IC1303–AAPELE–Architectures, Algorithms and Protocols for Enhanced Living Environments and COST Action CA16226–SHELD-ON–Indoor living space improvement: Smart Habitat for the Elderly, supported by COST (European Cooperation in Science and Technology). More information in www.cost.eu. Furthermore, we would like to thank the Politécnico de Viseu for their support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Using Direct Acyclic Graphs to Enhance Skeleton-Based Action Recognition with a Linear-Map Convolution Neural Network**

**Tan-Hsu Tan 1, Jin-Hao Hus 1, Shing-Hong Liu 2,\*, Yung-Fa Huang <sup>3</sup> and Munkhjargal Gochoo <sup>4</sup>**


**Abstract:** Research on human activity recognition can support the monitoring of elderly people living alone and thereby reduce the cost of home care. Video sensors can be easily deployed in different zones of a house to achieve such monitoring. The goal of this study is to employ a linear-map convolutional neural network (CNN) to perform action recognition on RGB videos. To reduce the amount of training data, the posture information is represented by skeleton data extracted from the 300 frames of one film. The two-stream method was applied to increase the accuracy of recognition by using the spatial and motion features of skeleton sequences. The relations of adjacent skeletal joints were employed to build the direct acyclic graph (DAG) matrices: the source matrix and the target matrix. The two features were transformed by the DAG matrices and expanded as color texture images. The linear-map CNN has a two-dimensional linear map at the beginning of each layer to adjust the number of channels, and a two-dimensional CNN is then used to recognize the actions. We applied the RGB videos from the action recognition datasets of the NTU RGB+D database, established by the Rapid-Rich Object Search Lab, to perform model training and performance evaluation. The experimental results show that the obtained precision, recall, specificity, F1-score, and accuracy were 86.9%, 86.1%, 99.9%, 86.3%, and 99.5%, respectively, for the cross-subject source, and 94.8%, 94.7%, 99.9%, 94.7%, and 99.9%, respectively, for the cross-view source. An important contribution of this work is that, by using the skeleton sequences to produce the spatial and motion features and the DAG matrices to enhance the relations of adjacent skeletal joints, the computation speed was faster than traditional schemes that utilize single-frame image convolution. Therefore, this work exhibits the practical potential of real-life action recognition.

**Keywords:** linear-map convolutional neural network; direct acyclic graph; action recognition; spatial feature; temporal feature

#### **1. Introduction**

Lifespans around the world are increasing, and society is gradually aging. According to a report of the United Nations [1], the number of elderly people (over 65) in the world in 2019 was 703 million, and this number is estimated to double to 1.5 billion by 2050. From 1990 to 2019, the proportion of the global population over 65 years old increased from 6% to 9%, and the proportion of the elderly population is expected to further increase to 16% by 2050. In Taiwan, a report of the National Development Council indicated that the population aged over 65 will exceed 20% of the national population by 2026 [2].
Taiwan will enter a super-aged society in 2026. This means that the labor force will gradually decrease in the future, and the cost of home care for elders will significantly increase. In home care, the monitoring of elderly people living alone is a major issue, as their activity patterns are strongly related to their physical and mental health [3,4]. Therefore, how to use artificial intelligence (AI) techniques to reduce the cost of home care is an important challenge.

The recognition of body activities involves two major techniques. One is physical sensors, such as accelerometers [5,6], gyroscopes [7], and strain gauges [8], which have the advantage that they can be worn on the body to monitor dangerous activities throughout the day, and the disadvantage that only a few activities can be recognized. Therefore, physical sensors are typically not used to identify daily activities. The other technique is the charge-coupled device (CCD) camera [9,10], which has the advantage of being able to recognize many daily activities and the disadvantage that it can only monitor the activities of people in a local area. It is thus suitable for use in a home environment.

Many previous studies have used deep learning techniques to recognize daily activities, including two-stream convolutional neural networks (CNNs), long short-term memory networks (LSTMNs), and three-dimensional CNNs (3D CNNs). For the two-stream CNN, Karpathy et al. used context and fovea streams to train a CNN [11]. The two streams proposed by Simonyan et al. were the spatial and temporal streams, which represent the static and dynamic frames of each action's film [12]; however, the duration of each action differed.

Thus, Wang et al. proposed a temporal segment network to normalize the spatial and temporal streams [13]. Jiang et al. used the two streams as the input of a model combining a CNN and an LSTMN [14]. Ji et al. proposed a 3D CNN to obtain the features of the spatial and temporal streams [15]. The two-stream methods using images and optical flow to represent the spatial and temporal streams performed better at recognizing activities than the one-stream methods. However, their weakness is that the doubled amount of data requires more time to train the model.

Studies have used skeletal data as the common input feature for human action recognition [16–18], where the 3D skeletal data were typically obtained by use of the depth camera. In these studies, the number of recognized actions was less than 20 [17], and the skeletal data had to be processed to extract the features. Machine learning methods were used in these studies. The spatiotemporal information of skeleton sequences was exploited using recurrent neural networks (RNNs) [19,20].

Both the amount of data and the number of recognized actions were smaller than in the video datasets [16–20]. An RNN tends to overemphasize the temporal information and ignore the spatial information, leading to low accuracy. However, the advantage of methods employing skeletal data is that they require less training data and training time compared to those using image data. Hou et al. used a CNN to recognize actions with skeletal features [21]. Therefore, an effective method to encode the spatiotemporal information of a skeleton sequence into color texture images that can be recognized by a CNN is a relevant issue.

A directed acyclic graph (DAG) consists of a combination of nodes and edges, in which each node points to another node by a directed edge and the directed edges never form a cycle, so every path ends at a terminal node. DAGs are usually used to represent causal relations amongst variables, and they are also used extensively to determine which variables need to be controlled for confounding in order to estimate causal effects [22]. The physical posture of a person can be described by the positions of the skeletal joints, and adjacent joints have a causal relation when the body is moving. Thus, we can define a DAG over the skeletal joints to describe the relations of the physical skeleton.

This study aims to recognize daily activities from films recorded by CCD cameras. To reduce the large amount of data needed for model training, we converted body images into physical postures with an open-source system, AlphaPose [23]. The posture information consists of the skeleton sequences captured from the films of actions, which are used to build the spatial and motion features. These features all include both the spatial and temporal characteristics of actions. The relations of adjacent joints were used to build the direct acyclic graph (DAG) matrices: the source matrix and the target matrix.

These features are expanded by the DAG matrices into color texture images. The linear-map CNN has a two-dimensional linear map at the beginning of each layer to adjust the number of channels. Then, a two-dimensional CNN is used to recognize the actions. A structure with two streams was used to increase the accuracy of the action recognition. The dataset used in this study (NTU RGB+D) is an open dataset provided by the Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore [18].

In our work, a total of 49 actions, including daily actions, medical conditions, and mutual actions, were considered for action recognition. The total number of films was 46,452 for the cross-subject and cross-view sources. Of the cross-subject sources, 32,928 films were used for training, and 13,524 films were used for testing. Of the cross-view sources, 30,968 films were used for training, and 15,484 were used for testing. The experimental results show that the performance of our method was better than those in the previous studies.

#### **2. Methods**

Figure 1 shows the flowchart of action recognition in this study, which has three phases. In the feature phase, RGB images were processed by AlphaPose [23] to obtain the coordinate values of the skeletal joints of the subject in an image as a vector. A film contained 300 images that were used to build a posture matrix as the feature. We defined the spatial features and motion features by the coordinate values of the skeletal joints for each film. Spatial features are the position information of skeletons and joints, and motion features are the optical-flow information of skeletons and joints.

Each feature was expanded into two features, source and target, by the DAGs. In the recognition phase, a 10-layer linear-map CNN was used to recognize the activities. The cross-subject and cross-view evaluations were used to test the performance of this linear-map CNN. In the output phase, the results of the spatial and motion features were fused to show the recognized actions, as sketched in the example below.

**Figure 1.** The flowchart of the action recognition in this study.
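The fusion step of the output phase is not detailed here; a common choice for class score fusion in two-stream networks, shown below purely as an assumed sketch in PyTorch, is to average the softmax scores of the spatial and motion streams before taking the arg-max over the 49 classes.

```python
import torch

def fuse_scores(spatial_logits, motion_logits, w_spatial=0.5):
    """Fuse the class scores of the spatial and motion streams (assumed averaging).

    spatial_logits, motion_logits : (batch, 49) tensors from the two networks.
    w_spatial : assumed fusion weight; equal weighting by default.
    """
    spatial_scores = torch.softmax(spatial_logits, dim=1)
    motion_scores = torch.softmax(motion_logits, dim=1)
    fused = w_spatial * spatial_scores + (1.0 - w_spatial) * motion_scores
    return fused.argmax(dim=1)  # predicted action class per sample
```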

#### *2.1. NTU RGB+D Dataset*

The action recognition dataset supported by the Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore [24] was used in this study. There were 56,880 files, including 60 action classes. Each file consists of RGB, depth, and skeleton data of human actions. All actions were recorded by three Kinect V2 cameras. The size of the RGB images was 1920 × 1080. There were 40 classes for daily actions, 9 classes for medical conditions, and 11 classes for mutual actions. Forty distinct subjects were invited for this data collection.

Only the physical activities of a single person were recognized, and a sample of 46,452 files covering 49 physical activities was used in this study. To ensure standard evaluations for all the reported results on the benchmark, two types of action classification evaluation (cross-subject and cross-view) were used [24]. In the cross-subject evaluation, the sample sizes of the training and testing sets were 40,320 and 16,560, respectively. In the cross-view evaluation, the sample sizes of the training and testing sets were 37,920 and 18,960, respectively.

#### *2.2. Spatial and Motion Features*

The RGB images were processed by the AlphaPose [23] to obtain the coordinate values of skeleton joints of people in the image, and the format is shown in Equation (1).

$$\text{pose} = \{ (x_0, y_0, c_0, \dots, x_M, y_M, c_M) \mid M = 17 \} \tag{1}$$

where *xi* and *yi* are the coordinate values of the *i*th joint, *ci* is the confidence score, and *M* is the maximum joint index. According to the coordinate values of the joints, the spatial and motion variables were defined as shown in Table 1, where *n* is the index of the frames. The spatial variables are the joint data (*vi*) and the skeleton data (*si,j*), and the motion variables are the motion data of the joints and skeletons (*mvi* and *msi*). Thus, there are four features in this study: *Fv*, *Fs*, *Fmv*, and *Fms*. Table 2 shows the indexes and definitions of the 18 joints and 17 edges of the body. The 17 edges, *ei*, are defined as the relations between two adjacent joints, the (*i*−1)th and the *i*th.

**Table 1.** The indexes of the joints and relations between every two adjacent joints at the 17 edges.


#### *2.3. Directed Acyclic Graph*

A DAG was used to describe the relations of the 18 joints. The nodes of the DAG represent the joints, and the directed links represent the edges. Each edge has a source joint and a target joint. Thus, two DAG matrices, the source matrix and the target matrix, can be defined. If the *i*th joint is the source point of the *j*th edge, the element (*j*, *i*) of the source matrix is set to 1; otherwise, the element (*j*, *i*) is set to 0.

The target matrix is defined in the same way, using the target joint of each edge. Then, each row of the source and target matrices is normalized to avoid over-weighted values. To match the size of the features, a virtual edge e(0, 0) = 1 is defined. The sizes of the source and target matrices are 18 × 18. Figure 2a shows the source matrix, S, and Figure 2b shows the target matrix, T; color-none is 0, color-black is 1, and color-orange is 0.25.
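A small NumPy sketch of how the two DAG matrices could be assembled under the rules just described (a 1 at element (*j*, *i*) when joint *i* is the source, respectively target, of edge *j*; a virtual edge at (0, 0); and row normalization). The edge list below follows the stated convention that edge *i* connects the (*i*−1)th and *i*th joints, but the actual joint and edge indexing, and which endpoint is the source, are given in Table 2, so this is only illustrative.

```python
import numpy as np

N_JOINTS = 18  # joints 0-17, as produced by AlphaPose

def dag_matrices(edges):
    """Build the normalized 18 x 18 source (S) and target (T) matrices.

    edges : list of (source_joint, target_joint) pairs for edges 1..17.
    """
    S = np.zeros((N_JOINTS, N_JOINTS))
    T = np.zeros((N_JOINTS, N_JOINTS))
    S[0, 0] = T[0, 0] = 1.0                 # virtual edge e(0, 0) to match the feature size
    for j, (src, tgt) in enumerate(edges, start=1):
        S[j, src] = 1.0                     # joint `src` is the source of edge j
        T[j, tgt] = 1.0                     # joint `tgt` is the target of edge j
    # Row normalization, as described above, to avoid over-weighted values
    S /= S.sum(axis=1, keepdims=True)
    T /= T.sum(axis=1, keepdims=True)
    return S, T

# Illustrative edge list: edge i links joint i-1 (source) to joint i (target)
S, T = dag_matrices([(i - 1, i) for i in range(1, N_JOINTS)])
```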


**Table 2.** The indexes and positions of the joints and edges for the skeletal data.

**Figure 2.** (**a**) Source matrix, (**b**) Target matrix. Color-none is 0, color-black is 1, and color-orange is 0.25.

#### *2.4. Input Features*

We used 300 frames for every film. The joint data built the joint feature (*Fv*), the skeleton data built the skeleton feature (*Fs*), the joint-motion data built the joint-motion feature (*Fmv*), and the skeleton-motion data built the skeleton-motion feature (*Fms*). Figure 3 shows the contents of one feature, with the x and y values of the data. Thus, the information of a film was reduced to four 600 × 18 matrices for the four features *Fv*, *Fs*, *Fmv*, and *Fms*. *Fv* was expanded by the DAG matrices into two features, *Fvin* and *Fvout*:

$$F_{vin} = F_v \times S^T \tag{2}$$

$$F_{vout} = F_v \times T^T \tag{3}$$

*Fs* was expanded by the DAG matrices into *Fsin* and *Fsout*; *Fmv* was expanded into *Fmvin* and *Fmvout*; and *Fms* was expanded into *Fmsin* and *Fmsout*. Table 3 shows the contents of the spatial feature (*Fspatial*) and the motion feature (*Fmotion*). *Fspatial* is the combination of *Fspatial-joint* and *Fspatial\_skeleton*, and *Fmotion* is the combination of *Fmotion-joint* and *Fmotion\_skeleton*. We used these two features (*Fspatial* and *Fmotion*) to evaluate the performance of the linear-map CNN.

**Table 3.** The channel contents of the input features.


**Figure 3.** One feature built by the joint data, skeleton data, joint-motion data, or skeleton-motion data, with the size of 600 × 18.
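Following Equations (2) and (3), each 600 × 18 feature matrix is multiplied by the transposed DAG matrices. A minimal NumPy sketch is shown below; the feature values and the identity stand-ins for S and T are placeholders, the real matrices being built as described in Section 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder joint feature: 600 rows (x and y values over 300 frames) x 18 joints
F_v = rng.standard_normal((600, 18))
S = np.eye(18)   # stand-in; the real source matrix is built as in Section 2.3
T = np.eye(18)   # stand-in; the real target matrix is built as in Section 2.3

F_vin = F_v @ S.T    # Equation (2): expansion by the source matrix
F_vout = F_v @ T.T   # Equation (3): expansion by the target matrix

# The same expansion is applied to F_s, F_mv, and F_ms; the expanded matrices are
# then stacked as the channels of F_spatial and F_motion (Table 3).
print(F_vin.shape, F_vout.shape)  # (600, 18) (600, 18)
```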

#### *2.5. Linear-Map CNN*

Figure 4 shows the structure of the 10-layer linear-map CNN. The linear map was used to adjust the number of channels at the beginning of each layer. Batch normalization (BN) can overcome the vanishing of the learning gradient and, thus, allows a larger learning rate. In the CNN, the kernel size is a 9 × 1 matrix, the stride is (1, 1), and the padding is (4, 0). In the input feature, columns represent the different joints, and rows represent the time sequence of the actions. The relation of the adjacent joints was enhanced by the DAG matrices; thus, the kernel of the convolution is a 9 × 1 matrix. Table 4 shows the detailed information of the linear-map CNN. The output layer has 49 nodes representing the 49 action classes. The optimization method was momentum, and the batch size was 32.


**Table 4.** The parameters of linear-map CNN in each layer.

**Figure 4.** The structure of the 10-layer linear-map CNN.
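One possible PyTorch rendering of a linear-map CNN layer is sketched below, interpreting the "linear map" as a 1 × 1 convolution that adjusts the channel count before batch normalization and the 9 × 1 convolution with stride (1, 1) and padding (4, 0). The channel widths and the two-layer stack are placeholders; the actual per-layer parameters are those listed in Table 4.

```python
import torch
import torch.nn as nn

class LinearMapBlock(nn.Module):
    """One layer of the linear-map CNN (illustrative sketch, not the authors' code)."""

    def __init__(self, in_ch, map_ch, out_ch):
        super().__init__()
        self.linear_map = nn.Conv2d(in_ch, map_ch, kernel_size=1)     # adjusts channel count
        self.bn = nn.BatchNorm2d(map_ch)                              # batch normalization
        self.conv = nn.Conv2d(map_ch, out_ch, kernel_size=(9, 1),     # temporal 9 x 1 kernel
                              stride=(1, 1), padding=(4, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):          # x: (batch, channels, 600 time rows, 18 joint columns)
        return self.act(self.conv(self.bn(self.linear_map(x))))

# Placeholder stack: the input channel count and hidden widths are assumptions,
# since the channel contents and per-layer parameters are given in Tables 3 and 4.
model = nn.Sequential(
    LinearMapBlock(4, 16, 32),
    LinearMapBlock(32, 32, 64),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 49),             # 49 action classes
)
out = model(torch.randn(2, 4, 600, 18))   # -> shape (2, 49)
```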

#### *2.6. Statistical Analysis*

According to our proposed method, a film is considered a true positive (TP) when the classification action is correctly identified, a false positive (FP) when the classification action is incorrectly identified, a true negative (TN) when the action classification is correctly rejected, and a false negative (FN) when the action classification is incorrectly rejected. The performance of the proposed method was evaluated using the following measures:

$$Precision(\%) = \frac{TP}{TP + FP} \times 100\% \tag{4}$$

$$Recall(\%) = \frac{TP}{TP + FN} \times 100\% \tag{5}$$

$$Specificity(\%) = \frac{TN}{TN + FP} \times 100\% \tag{6}$$

$$F_1 score = \frac{2 \times Precision \times Recall}{Precision + Recall} \times 100\%, \tag{7}$$

$$Accuracy(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%. \tag{8}$$
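For completeness, Equations (4)–(8) translate directly into a small helper; the TP/FP/TN/FN counts are assumed to have been tallied per class from the confusion matrix beforehand.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, F1-score, and accuracy (%), per Equations (4)-(8)."""
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    specificity = tn / (tn + fp) * 100
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    return precision, recall, specificity, f1, accuracy

# Example with hypothetical counts for one action class
print(classification_metrics(tp=120, fp=15, tn=13000, fn=20))
```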

#### **3. Results**

In this study, the hardware employed was an Intel Core i7-8700 CPU and a GeForce GTX 1080 GPU. The operating system was Ubuntu 16.04 LTS, the development environment was Anaconda 3 with Python 3.7, the deep learning framework was PyTorch 1.10, and the code was run in Jupyter Notebook. We evaluated the performance of the DAG with the cross-subject and cross-view sources and the four features (*Fspatial\_joint*, *Fspatial-skeleton*, *Fmotion\_joint*, and *Fmotion\_skeleton*). Finally, we used the two-stream concept, class score fusion of *Fspatial* and *Fmotion*, to evaluate the performance of the proposed method with the cross-subject and cross-view sources.

Table 5 shows the results without the DAG transfer. The best feature was *Fspatial* under the cross-subject source, which resulted in an accuracy and F1-score of 99.3% and 82.8%, respectively. There were 10 actions with recall rates below 70%: 0, 4, 9, 10, 11, 16, 28, 29, 43, and 48. The worst feature was *Fmotion* under the cross-subject source, with an accuracy and F1-score of 99.2% and 79.6%, respectively. There were 11 actions with recall rates below 70%: 3, 10, 11, 16, 24, 28, 29, 31, 36, 43, and 45. Table 6 shows the results with the DAG transfer. The best feature was *Fspatial* under the cross-view source; its accuracy and F1-score were 99.9% and 96.2%, respectively.

Only four actions, 10, 11, 28, and 29, had recall rates below 70%. The worst feature was *Fmotion* under the cross-subject source, which obtained an accuracy and F1-score of 99.1% and 79.1%, respectively. There were 10 actions with recall rates below 70%: 2, 10, 11, 16, 28, 29, 31, 43, 44, and 45. We found that the DAG transfer could significantly improve the recognition rate of different actions, not only for spatial features but also for motion features. Table 7 shows the results of class score fusion with and without DAG transfer. We found that the performance of DAG transfer used in the cross-view source was better than used in the cross-subject source, with an accuracy of 99.9% vs. 99.5% and F1-score of 94.7% vs. 86.3%. The recall rates for all 49 actions were not below 70%.

We used the two-dimensional joint and skeleton features to perform the training and testing of the linear-map CNN, which could reduce the running time compared to approaches using two- or three-dimensional joint and skeleton images. Table 8 shows the training and testing times with and without the DAG transfer. We found that the GPU could process about 30 frames per second (fps) in the training phase and about 125 fps in the testing phase. Although the DAG transfer required additional processing time, the added delay was about 30 min in the training phase, and the maximum testing time was 141 s.


**Table 5.** The results without the DAG transfer.

**Table 6.** The results with the DAG transfer.


**Table 7.** The results of class score fusion with and without DAG transfer.


**Table 8.** The training and testing time with and without DAG transfer.




#### **4. Discussion**

In this study, we used the DAG transfer and the two-stream method to improve the accuracy of action recognition. When the input features were transferred with the DAG matrices, the precision, recall, specificity, F1-score, and accuracy improved by 1.2%, 1.1%, 0.1%, 1.3%, and 0.1%, respectively, for the cross-subject source, and by 9.1%, 7.4%, 0.1%, 7.4%, and 0.5%, respectively, for the cross-view source. In the two-stream method, previous studies have typically used the spatial and temporal, or optical flow, features to perform the class score fusion [11–14]. They also showed that the performance of two streams was better than that of one stream.

We used the joint and skeleton sequences as the spatial and motion features, which also carry temporal characteristics, and we utilized 300 frames to describe an action. Thus, the spatial feature of one action includes both spatial and temporal characteristics. However, the motion relations of the joints and skeletons differ from one action to another. Therefore, we defined the motion variables, *mvi* and *msi*, shown in Table 1, to establish the motion features.

Our results also show that the precision, recall, specificity, F1-score, and accuracy of the two streams were improved by 0% vs. 6.3%, 0.1% vs. 7.4%, 0% vs. 0.2%, 0.1% vs. 7.1%, and 0.1% vs. 0.4%, over spatial and motion streams with the DAG transfer in the cross-subject source, and improved by 0.5% vs. 3.9%, 0.4% vs. 4.5%, 0% vs. 0%, 0.5% vs. 4.3%, and 0% vs. 0.2% in the cross-view source.

The comparison of our results with previous studies in terms of recall rate is shown in Table 9. These studies all used the cross-subject and cross-view sources of the NTU RGB+D database to recognize actions and used the three-dimensional characteristics of each posture as the input features [19,25–31]. Our method had the best recall rates, 86.1% and 94.7%, in the cross-subject and cross-view sources, respectively.

We analyzed the actions with lower recall rates in the cross-subject and cross-view sources in Tables 5–7. The four actions that most often had lower recall rates were A10 (reading), A11 (writing), A28 (phone call), and A29 (playing with laptop). Figure 5a shows the posture of reading, and Figure 5b shows the posture of writing. The subject is standing up, looking down, and holding something; the difference between the two images is only in the gestures of the two hands. However, according to the description of the body posture in Table 2, only the right and left wrist joints are marked, which cannot capture the gestures of the two hands.

The postures of the subject making a phone call (Figure 5c) and using the laptop (Figure 5d) had the same problem: the difference between the two images was also in the gestures of the two hands. These actions were difficult to recognize using spatial features, such as the movement trajectories of the arms, elbows, and wrists. Even so, the results of our method with the DAG transfer and the two-stream method in Table 7 show that no action had a recall rate below 70% in the cross-view source.

**Table 9.** Comparison of recall rates with previous studies that used the cross-subject and cross-view sources of the NTU RGB+D database and the three-dimensional characteristics of each posture as the input features [19,25–31].


**Figure 5.** (**a**) the posture of reading, (**b**) the posture of writing, (**c**) the posture of making a phone call, (**d**) the posture of using the laptop.

#### **5. Conclusions**

The large scale of the collected data in the NTU RGB+D database enabled us to apply the posture-driven learning method for action recognition. The posture information represented by the skeleton data was obtained from the 300 frames of the film. The joint and skeleton sequences were used to build spatial and motion features that included the spatial and temporal characteristics of the actions. The relations of the adjacent skeletal joints were used to build the DAG matrices.

The spatial and motion features were expanded by the DAG matrices into color texture images, and in the expanded features the relations between adjacent joints were enhanced. Our method effectively reduced the amount of data needed to train the linear-map CNN, and its performance was superior to previous schemes using deep learning methods. Notably, since the computation speed can reach around 125 fps in the testing phase with a GPU, our scheme could be used to monitor the daily activities of elders in real time in home care applications.

**Author Contributions:** Conceptualization, T.-H.T. and S.-H.L.; Data curation, J.-H.H.; Investigation, T.-H.T.; Methodology, J.-H.H.; Project administration, T.-H.T.; Software, J.-H.H.; Supervision, Y.-F.H. and M.G.; Validation, Y.-F.H. and M.G.; Writing original draft, S.-H.L.; Writing review and editing, S.-H.L., Y.-F.H., and M.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Science and Technology, Taiwan, under grants MOST 109-2221-E-324-002-MY2 and MOST 109-2221-E-027-97.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

