*Article* **Usability, User Experience, and Acceptance Evaluation of CAPACITY: A Technological Ecosystem for Remote Follow-Up of Frailty**

**Rodrigo Pérez-Rodríguez 1,2,3,\*, Elena Villalba-Mora 2,4, Myriam Valdés-Aragonés 2,5, Xavier Ferre 2, Cristian Moral 2, Marta Mas-Romero 6, Pedro Abizanda-Soler 3,6,7 and Leocadio Rodríguez-Mañas 2,3,5**


**Citation:** Pérez-Rodríguez, R.; Villalba-Mora, E.; Valdés-Aragonés, M.; Ferre, X.; Moral, C.; Mas-Romero, M.; Abizanda-Soler, P.; Rodríguez-Mañas, L. Usability, User Experience, and Acceptance Evaluation of CAPACITY: A Technological Ecosystem for Remote Follow-Up of Frailty. *Sensors* **2021**, *21*, 6458. https://doi.org/10.3390/s21196458

Academic Editor: Ivan Miguel Serrano Pires

Received: 7 September 2021 Accepted: 26 September 2021 Published: 27 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Abstract:** Frailty predisposes older persons to adverse events, and information and communication technologies can play a crucial role in preventing them. CAPACITY provides a means to remotely monitor variables with high predictive power for adverse events, enabling preventative personalized early interventions. This study aims at evaluating the usability, user experience, and acceptance of this novel mobile system to prevent disability. Usability was assessed using the system usability scale (SUS); user experience using the user experience questionnaire (UEQ); and acceptance with the technology acceptance model (TAM) and a customized quantitative questionnaire. Data were collected at baseline (recruitment) and after three and six months of use. Forty-six participants used CAPACITY for six months; nine dropped out, leaving a final sample of 37 subjects. SUS reached a maximum average value of 83.68 after six months of use; no statistically significant differences were found to demonstrate that usability improves with use, probably because of a ceiling effect. The UEQ obtained average scores higher than or very close to 2 in all categories. TAM reached a maximum of 51.54 points, showing an improvement trend. Results indicate the success of the participatory methodology and support user-centered design as a key methodology to design technologies for frail older persons. Involving potential end users and giving them voice during the design stage maximizes usability and acceptance.

**Keywords:** frailty; home monitoring; user-centered design; usability; user experience; acceptance

#### **1. Introduction**

#### *1.1. Research Context*

Intrinsic capacity, according to the World Health Organization (WHO), is the combination of the physical and mental (including psychological) capacities of an individual. Intrinsic capacity is thus part of functional ability, together with the environment and the interactions with it. The concept of frailty is closely related and complementary to intrinsic capacity. Frailty can be defined as a stage of age-related decline that reduces the intrinsic capacity and functional reserve of older persons, thus predisposing them to adverse events (mortality and disability, among others). These days, there is a pressing need to develop comprehensive community-based approaches and to introduce interventions to prevent functional decline [1].

The risk of developing chronic conditions, including disability and dependency, increases with age [2,3], and this is changing the classical approach to managing functionally declining older persons. Considering that functional decline is accompanied by a loss of functional reserve, it is very unlikely that disability can be reversed. Healthcare systems therefore need to move towards person-centered approaches that anticipate the earliest stages of functional decline (i.e., frailty) to prevent disability, since becoming frail can be delayed, slowed, or even reversed.

The estimated prevalence of frailty is 18% (95% CI: 15–21%), and it seems to be correlated with age, gender (female), and socio-economic factors such as lower education and wealth [4]. The good news is that frailty is reversible, but to achieve this it is of paramount importance to fight inactivity and sedentariness [5]. The scientific literature supports activity-centered interventions to delay and even reverse frailty and disability [6–11]. Furthermore, interventions on nutrition, such as modifying habits and increasing protein and micronutrient intake, are also recommended [12,13], as well as interventions on inadequate drug prescriptions [14–16]. Finally, it is also important that the physiological and social aspects are not neglected [11].

A frail older person usually shows decreased neurological and muscle function [17], normally accompanied by accelerated involuntary weight loss and a decline in skeletal muscle [18]. Moreover, according to a relatively recent systematic review, 30.6% of the analyzed studies found associations between poor performance in quantitative gait parameters and gait speed, disability, frailty, sedentary lifestyle, falls, muscular weakness, diseases, body fat, cognitive impairment, mortality, stress, lower life satisfaction, and lower quality of life [19].

Ageing in Place pursues that older persons continue living at home as they age [20], which brings important economic benefits given the reduction in institutionalized care [21]; information and communication technologies (ICTs) can play a crucial role in promoting it [22]. For instance, having fresh and periodic information on variables associated with poor health outcomes (e.g., gait speed, muscle power, and involuntary weight loss) can be a great asset to trigger early interventions that prevent disability and dependency. Smart home technologies [23–27], wearable sensors [28,29], or mHealth technology [30] may enable continuous, ubiquitous, and transparent monitoring of the independent older adult, supporting the traditional geriatric approach to identify older people at risk of disability. Notwithstanding, more effort is still needed not only to assess how reliable and valid ICT-based approaches to measuring frailty are, but also to study in depth the associated ethical, technical, and economic issues [31].

Nevertheless, the lack of consensus regarding technology acceptance by older persons must be considered. Several authors have concluded that older persons are not interested in innovative technologies [32,33], while others state that older people have already accepted new technologies, mainly because these have proven useful in meeting their information needs, especially in health [34]. Yet it seems that the use of ICTs by the older population is strongly linked to physical limitations (e.g., abilities, chronic illnesses, etc.), mental limitations (e.g., fear of damaging the technology, electric shocks, making mistakes, etc.), educational limitations (e.g., low levels of literacy, limited electronic literacy, learning barriers, etc.), structural limitations (e.g., design of the appliance), instructional limitations (e.g., instructions on how to use a technology that are hard to follow), and limited access to the technology (e.g., financial costs) [35,36].

The design of technologies to be used by older persons must be done according to their characteristics. Methodologies such as user-centered design (UCD) [37] and participatory design (PD) are a good alternative to develop the right solutions for a specific audience [33,38], since they help designers better understand the environment of use. Older people are usually excluded from product design activities since they are stigmatized as people reluctant to engage with technology, and this is probably one of the primary causes preventing older people from becoming loyal users of technological solutions [38,39].

The current demographic challenge is forcing researchers to focus on discovering feasible alternative ways of providing healthcare to the older population, who are at an increased risk of suffering adverse events [40]. As mentioned earlier, ICTs may help identify early risk indicators for adverse events, providing a means for self-managing them. Even so, their use in the context of frailty is still at a very early stage [41].

#### *1.2. Objective*

The main objective of this work is to evaluate the usability, user experience (UX), and acceptance of the older persons' interaction layer of CAPACITY, a frailty home-monitoring system aimed at the prevention of disability.

The manuscript is structured as follows. First, the CAPACITY ecosystem is presented as a modular infrastructure to monitor frailty and prevent disability. Second, the specific methodology followed in this work is described, to later present and discuss the obtained results. Finally, conclusions are extracted and future work is proposed.

#### **2. Materials and Methods**

#### *2.1. Overview of the CAPACITY Technological Ecosystem*

CAPACITY is a technological ecosystem aimed at preventing disability among the older population by detecting and intervening on frailty; it also provides a substrate to connect all relevant people in the care process (see Figure 1). Using CAPACITY, the older population can be remotely supervised by community care professionals, so that, in case worrying declines are detected, specialists (i.e., geriatricians) can be included in the loop. The intervention provided to older persons is grounded in three main pillars: the VIVIFRAIL physical activity program (declared a success story by the European Commission) [42], personalized nutritional recommendations, and a program to detect risk of polypharmacy.

**Figure 1.** CAPACITY providers and interactions.

A Randomized Clinical Trial (RCT; ClinicalTrials.gov Identifier: NCT03707145) has demonstrated that CAPACITY is an effective tool for rapidly improving frailty status as well as for reducing the use of healthcare resources [41].

#### *2.2. CAPACITY Interaction System*

CAPACITY services are offered through a set of user-adapted mobile applications. The functionalities offered to older persons are:


Apart from helping older persons, CAPACITY also offers different functionalities to other relevant stakeholders, namely primary and specialized care professionals and informal caregivers, as shown in Figure 2. The work published in [43] contains a wider description of all functionalities and services offered by the CAPACITY technological ecosystem to all involved people. In any case, this work focuses solely on the older persons, as they are the center of care.

**Figure 2.** Conceptual architecture of CAPACITY.

Older persons being followed by CAPACITY need to use a home monitoring kit aimed at measuring variables with high predictive value for adverse events. This monitoring system consists of a gait-speed sensor [44], a sensor to indirectly measure power in the lower limbs (through the chair stand test) [45], and a wireless commercial weight scale to measure involuntary weight loss. Figure 3 illustrates the prototypes of the sensors originally designed for CAPACITY.

The interaction with the home monitoring kit is handled by a mobile application that acts as a guiding element for the older person, as a data concentrator (Bluetooth connection with the monitoring kit; see Figure 3 and the Supplementary Material for details), and as a data input point, enabling the older adult not only to use the sensors but also to complete a set of questionnaires that enrich the information handled by the clinical professionals. These questionnaires are adapted versions of the Frailty Phenotype criteria [46], Mini Nutritional Assessment (MNA) [47], Barthel Index [48], FRAIL Scale [49], and the Functional Activities Questionnaire (FAQ) [50].

**Figure 3.** Home-monitoring kit.

This interaction system was iteratively designed under a user-centered approach. Different prototypes were created and tested, first in a laboratory environment, later in a clinical environment, and, finally, at the final users' dwellings. In the last two cases, the system was evaluated by older users. The outcomes of each iteration allowed designers to improve and adapt the interaction system to the needs, preferences, and context of use of the older adults.

Figure 4 shows how the interaction system evolved during the process. An in-depth description of this iterative process and the resulting interaction system can be found in [51].

**Figure 4.** (**a**) First and (**b**) second (final) prototypes.

Figure 5 shows the workflow that needs to be followed to interact with any of the components of the monitoring kit. The process starts with the app notifying a pending measurement (prescribed by the physician as part of a personalized follow-up) and the user pushing the corresponding button to start it. Then, the older person is shown a short explanatory video on how the measurement will be performed. Once the user is ready, which they indicate by pressing a specific button, the actual measurement takes place; this part is fully guided by voice commands and accompanying pictures (e.g., 'please switch the sensor on', 'please sit on a chair', 'the process will start after the countdown', etc.). Transparently to the user, the app and the sensor establish a Bluetooth communication channel used to register the datum. Once the process is over, the older person receives a confirmation with some feedback related to the measurement.
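To make the sequence concrete, the sketch below walks through the same steps as pseudo-interactive Python. It is purely illustrative: all function names and the stubbed Bluetooth reading are our own stand-ins, not CAPACITY code.

```python
"""Illustrative walk-through of the guided-measurement workflow in Figure 5.

All names here are hypothetical stand-ins: the real app drives a tablet UI,
plays videos, and reads the value over Bluetooth. This stub only mirrors the
order of the steps described in the text."""

import time


def notify(message: str) -> None:
    print(f"[notification] {message}")


def voice_prompt(message: str) -> None:
    print(f"[voice] {message}")


def run_guided_measurement(sensor_name: str) -> float:
    notify(f"Pending measurement: {sensor_name}")      # 1. push notification
    input("Press Enter to start the measurement... ")  # 2. user presses 'start'
    print("[video] Short explanatory video plays")     # 3. tutorial video
    input("Press Enter when you are ready... ")        # 4. user confirms readiness
    voice_prompt("Please switch the sensor on")        # 5. voice/picture guidance
    voice_prompt("The process will start after the countdown")
    for n in (3, 2, 1):
        print(n)
        time.sleep(1)
    datum = 0.93                                       # 6. value read over Bluetooth (stubbed)
    notify(f"Done. Registered value: {datum} m/s")     # 7. confirmation and feedback
    return datum


if __name__ == "__main__":
    run_guided_measurement("gait-speed sensor")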

**Figure 5.** CAPACITY's workflow to collect data from the home monitoring kit.

#### *2.3. Assessment Tools*

The usability, UX, and user acceptance related to the CAPACITY technological ecosystem have been assessed.

Usability, defined as the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use [52], has been assessed using the system usability scale (SUS) [53,54]. SUS is a short 10-item Likert questionnaire that provides a measure of people's subjective perception of the usability of a system; concretely, SUS focuses on learnability and usability, which are indeed correlated [55]. Each of the 10 items is rated from '1—fully disagree' to '5—fully agree', and the total score ranges from 0 to 100. Although SUS is a simple tool, a study carried out by Tullis [56], who compared the effectiveness and accuracy of five questionnaires for assessing usability across different sample sizes, concluded that it is a reliable scale, especially when the sample exceeds 12 users.
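For reference, the standard SUS scoring procedure (a published convention, not specific to this study) converts the ten 1–5 responses into a 0–100 score; a minimal Python sketch:

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS scoring: ten 1-5 Likert responses -> 0-100 score.

    Odd-numbered items are positively worded (contribution = response - 1);
    even-numbered items are negatively worded (contribution = 5 - response).
    The summed contributions (0-40) are scaled by 2.5.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based: even index = odd item
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)


# Example: a fairly positive respondent scores 87.5
print(sus_score([5, 1, 4, 2, 5, 1, 4, 2, 5, 2]))
```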

UX has been assessed with the user experience questionnaire (UEQ) [57]. UEQ does not provide an overall score but a score for each of six categories: attractiveness, perspicuity, efficiency, dependability, stimulation, and novelty. The score for each category is calculated by averaging the items within it; each item's value ranges from −3 to 3, where the extreme values represent two opposite concepts (e.g., attractive vs. unattractive). UEQ was included as an evaluation tool to complement the domains SUS addresses.
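A minimal sketch of the standard UEQ aggregation may help: the questionnaire comprises 26 semantic-differential items whose answers are transformed to −3..+3 and averaged per scale (the official UEQ analysis sheet defines the exact item-to-scale assignment; the example values below are invented):

```python
from statistics import mean

# Standard UEQ scale composition: 26 items in total, answered on a 7-point
# scale and transformed to the -3..+3 range before averaging per scale.
UEQ_SCALES = {
    "attractiveness": 6,
    "perspicuity": 4,
    "efficiency": 4,
    "dependability": 4,
    "stimulation": 4,
    "novelty": 4,
}


def ueq_scale_scores(items_by_scale: dict[str, list[int]]) -> dict[str, float]:
    """Average the transformed (-3..+3) item values within each UEQ scale."""
    scores = {}
    for scale, values in items_by_scale.items():
        assert len(values) == UEQ_SCALES[scale]
        assert all(-3 <= v <= 3 for v in values)
        scores[scale] = mean(values)
    return scores


print(ueq_scale_scores({
    "attractiveness": [2, 3, 2, 2, 3, 2],
    "perspicuity": [3, 2, 2, 2],
    "efficiency": [2, 2, 1, 3],
    "dependability": [2, 3, 2, 2],
    "stimulation": [2, 2, 2, 2],
    "novelty": [3, 2, 1, 2],
}))
```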

Acceptance has been evaluated using the technology acceptance model (TAM) [58], adapted to the use case (see the Supplementary Material for details), and a customized short quantitative questionnaire. TAM evaluates, through 12 items (answers range from '1—fully disagree' to '5—fully agree'), two different categories: perceived usefulness and perceived ease of use. The maximum score is 60 (30 points in each category); final scores are calculated by averaging. To further investigate acceptance, a customized acceptance interview consisting of a Likert-type scale (from 1 to 5, the same as SUS) assessing all three components of the home monitoring kit was used. This interview had the following structure:


For all the scales and questionnaires, Spanish versions were used (local language).

#### *2.4. Recruitment and Data Collection*

The impact of administering a multicomponent intervention partially supported by the CAPACITY ecosystem was assessed by conducting a pilot, prospective, randomized, and blind study. The pilot study lasted 12 months: 6 months were dedicated to recruitment and 6 months to intervention. Within this wider experiment, whose primary endpoint was to investigate whether the proposed technology helped prevent or reverse frailty, usability, UX, and user acceptance were evaluated as secondary endpoints (along with others). The pilot study was carried out simultaneously in two institutions: Getafe University Hospital and Albacete University Hospital.

Participation criteria were as follows. Inclusion criteria:

- 70 years old or older;
- Living at home;
- Barthel index [48] ≥ 90; and
- Being pre-frail or frail.

Exclusion criteria:

- Inadequate home infrastructure impeding the installation of the technology;
- Inability to understand how to use CAPACITY; and
- Medical condition incompatible with the VIVIFRAIL physical activity program.
Pre-frail participants were those meeting two Frailty Phenotype criteria [46] and suffering from at least four comorbidities, since they are the ones with the highest risk for developing frailty. Frail individuals were those meeting at least three Frailty Phenotype criteria and having at least four comorbidities.

Two research groups (arms) were defined: a control group, receiving usual geriatric care, and an intervention group, receiving the same multicomponent intervention but partially supported by the CAPACITY system. Stratified randomization by age (70–85, 85+), sex (male, female), diagnosis (pre-frail, frail), and educational level (non-formal education, higher education, others) was applied to ensure the groups were balanced.
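A minimal sketch of such a stratified allocation is shown below. It is illustrative only: the field names and the alternate-within-stratum assignment are our assumptions, not the study's actual allocation software.

```python
import random
from collections import defaultdict

# Participants are grouped by (age band, sex, diagnosis, educational level) and
# allocated alternately within each stratum so both arms stay balanced.


def stratum(p: dict) -> tuple:
    age_band = "70-85" if p["age"] <= 85 else "85+"
    return (age_band, p["sex"], p["diagnosis"], p["education"])


def stratified_randomize(participants: list[dict], seed: int = 42) -> dict:
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in participants:
        strata[stratum(p)].append(p)
    arms = {"control": [], "intervention": []}
    for members in strata.values():
        rng.shuffle(members)                # random order within the stratum
        for i, p in enumerate(members):     # alternate allocation
            arms["control" if i % 2 == 0 else "intervention"].append(p)
    return arms


demo = [{"age": 78, "sex": "female", "diagnosis": "frail", "education": "non-formal"},
        {"age": 79, "sex": "female", "diagnosis": "frail", "education": "non-formal"}]
print({arm: len(v) for arm, v in stratified_randomize(demo).items()})
```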

The sample size could not be empirically calculated due to the lack of similar studies aiming at the same primary endpoint, so it was set to 90. The reasons behind this decision were:


Data reported in this manuscript are restricted to those participants who were randomly allocated to the intervention group (*n* = 46), since they were the only ones who used the CAPACITY system during the six months of intervention. The modules that supported the intervention were: (1) the monitoring system, (2) the evolution of the older person (e.g., access to follow-up information collected by the home-monitoring kit), and (3) basic asynchronous communication. All technological components were preconfigured prior to delivery to the participating older persons (i.e., a tablet was delivered with the app already installed and configured to receive data from the home monitoring kit), so they only had to follow notifications and instructions. In addition, older participants received initial training during the installation of the technology in their homes. This face-to-face training was delivered once and lasted approximately one hour. During the session, a user manual was provided and used as a reference to show all functionalities to the older person, who had to repeat what was learnt (e.g., how to measure gait speed or complete a questionnaire). After this session, a telephone line remained open on weekdays at working hours to attend to any consultation or issue coming from the older participants.

Data related to usability and acceptance were collected at baseline and after three and six months of intervention. SUS and TAM were registered at all three sampling points, while the UEQ and the ad hoc acceptance questionnaires were only administered at the last data collection point, to enrich the collected data with UX information and prospective acceptance.

#### **3. Results**

A total of 46 older persons used the CAPACITY technological solution to undergo an intervention aimed at preventing/reversing frailty; 14 were male (30.43%) and 32 female (69.57%); the mean age was 82.11 (SD = 5.42) years. Regarding educational level, 20 participants did not have formal education (43.48%), 20 had primary studies (43.48%), 5 received secondary education (10.87%), and 1 received higher education (2.17%). Finally, most of the participants (30 persons, 65.22%) did not have any previous experience with technology (i.e., smartphones and the internet), while 9 of them (19.56%) used it on a daily basis; the remaining participants made occasional use of technology (3 subjects, 6.52%) or had used it once or twice before this study (4 subjects, 8.70%). Table 1 describes the population that participated in the study.


**Table 1.** Description of the older population that participated in the study.

Nine participants dropped out of the study during the follow-up period, leaving a final sample of 37 subjects for analysis. All subjects completed the questionnaires on usability, UX, and acceptance at the second visit (three months) and at the end of the follow-up; 25 participants completed them at baseline.

Table 2 depicts the adherence to the monitoring plan, calculated as the average commitment to the measurements that the users needed to perform as part of their treatment. Full adherence (100%) means that all participants performed all prescribed measurements. Table 2 also shows the default periodicity of the different measurements, but it must be taken into consideration that additional measurements could be requested. For all measurements, a push notification was sent to the user through the app.
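The adherence metric lends itself to a one-line definition; the sketch below (with hypothetical data structures) illustrates how such a figure can be computed:

```python
# Per-participant adherence = performed / prescribed measurements, averaged over
# all participants (100% = everyone performed every prescribed measurement).


def adherence(prescribed: dict[str, int], performed: dict[str, int]) -> float:
    """Average adherence (%) across participants, each ratio capped at 1."""
    ratios = [min(performed.get(pid, 0) / n, 1.0)
              for pid, n in prescribed.items() if n > 0]
    return 100 * sum(ratios) / len(ratios)


# Two participants, 24 gait-speed measurements prescribed each over six months:
print(adherence({"p01": 24, "p02": 24}, {"p01": 24, "p02": 22}))  # -> 95.83...
```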



Table 3 shows the usability results. Average SUS values were 80.11/100 (SD = 13.66) at baseline, 83.31/100 (SD = 15.07) at month 3, and 83.68/100 (SD = 1.62) at the end of the study.

Table 4 depicts the results regarding UX, which was only assessed at the end of the intervention. Average values of 2.20/3 (SD = 0.64) were obtained for attractiveness, 2.30/3 (SD = 0.73) for perspicuity, 1.99/3 (SD = 0.75) for efficiency, 2.16/3 (SD = 0.66) for dependability, 2.05/3 (SD = 0.72) for stimulation, and, finally, 2.09/3 (SD = 0.98) for novelty.

Table 5 contains the results corresponding to the acceptance of the CAPACITY solution in terms of TAM, which show an improving trend (*p* = 0.15), starting at 49.00/60 (SD = 8.24) at baseline, rising to 50.68/60 (SD = 6.68) at month 3, and reaching 51.54/60 (SD = 6.97) at month 6. On the other hand, Table 6 presents the results of the ad-hoc quantitative questionnaires, which evaluate each component of the home-monitoring kit individually.


**Table 3.** SUS results.

**Table 4.** Categorized UEQ results.


**Table 5.** TAM results.




**Table 6.** Acceptance results (ad-hoc questionnaires).


SUS and TAM data were analyzed according to the educational level (i.e., non-formal education, primary education, secondary education, or higher education), living conditions (i.e., alone, with younger relatives, or with other older person), daily help received (i.e., from nobody, from a younger relative, from other older person, or from social services), previous experience with technology (i.e., no experience, used it once or twice before, occasionally used, or daily use), and frailty diagnosis (i.e., pre-frail, or frail). Table 7 shows the evolution of the reported SUS and TAM according to the category labels.

**Table 7.** SUS and TAM evolution per category label.




The statistical significance of the evolution of the reported SUS and TAM within the categories described above has been analyzed. Only those older persons living with a younger relative showed a marginal but significant improvement in the reported SUS between baseline and month 3 (*p* = 0.049).

#### **4. Discussion**

This research study shows that the CAPACITY technological ecosystem performs very well in terms of usability, UX, and acceptance. The results were obtained in a real-world scenario, where pre-frail and frail older persons used CAPACITY as the main vehicle to avoid transitioning to disability.

Usage information demonstrates a high adoption rate, with an average adherence to the monitoring plan very close to or matching 100% for all components of the monitoring plan (i.e., use of sensors and completion of questionnaires). This endorses the validity of the data collected on usability, UX, and acceptance, since they are based on intensive use of the system under assessment. However, the high usage of the system did not fully match the expected use; for instance, during the experimentation the physicians detected that some users were not complying with the timing of the monitoring plan (i.e., some measurements were missing, and they had to reach out to the older person with a reminder). This has a twofold interpretation: on the one hand, notifications are sometimes neglected by the users, implying that new strategies should be found to promote prompt responses; on the other hand, the information that is constantly being provided to the clinical team allows a closer follow-up of the older persons, enabling early action on potentially worrying situations.

Based on the data collected by Sauro [59], the mean SUS score across a large number of usability studies is 68. If that value is used as a reference, the mean SUS obtained at all sampling points is well above average. Furthermore, according to the Sauro–Lewis SUS grading curve [60], the obtained scores qualify as an A, with the last measurement very close to reaching A+, set at 84.1. We can therefore state that the evaluated user interaction is perceived as very good, almost excellent [61]. However, although usability seems to improve with use, a paired Student's t-test does not demonstrate that this improvement is statistically significant; a plausible explanation for this non-significant result is a ceiling effect, probably combined with an insufficient sample size. A further analysis by category showed that those older users living with a younger person marginally but significantly improved their reported SUS between baseline and month 3 (*p* = 0.049), but this isolated result does not allow drawing any solid conclusion, since no other significant differences were observed.
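For readers who want to reproduce this kind of comparison, the paired test mentioned above can be run as follows (the scores below are invented, not the study data):

```python
# Paired Student's t-test on SUS scores of the same respondents at two time
# points; illustrative data only.
import numpy as np
from scipy import stats

baseline = np.array([80, 75, 85, 90, 78, 82, 88, 70, 95, 77.5])
month6 = np.array([82, 80, 85, 92, 80, 85, 90, 72, 95, 80.0])

t, p = stats.ttest_rel(month6, baseline)
print(f"t = {t:.2f}, p = {p:.3f}")  # a positive trend is not necessarily significant
```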

Although usability is very high, which implies the UCD process was highly successful, it is not the highest possible, so there is still room for improvement. Most of the averaged items scored very close to the edges of the scale, which is good for the evaluation of the system, but some others deviate from the expected value and are the ones susceptible to improvement. SUS items Q4, Q7, and Q10 are the ones lowering the overall score (without significant changes along the follow-up, except for Q4, which seems to show an improvement trend). In the standard SUS formulation, these items read:

- Q4: 'I think that I would need the support of a technical person to be able to use this system.'
- Q7: 'I would imagine that most people would learn to use this system very quickly.'
- Q10: 'I needed to learn a lot of things before I could get going with this system.'
The relatively low evaluations of these items could be linked to the unfamiliarity or insecurity of the older adults who used the technology during the intervention (65.22% of the sample did not have any experience with technology). Indeed, the obtained results indicate that, after six months of use, there are statistically significant differences in the reported SUS depending on previous experience with technology (*p* = 0.017), which implies that its relationship with reported SUS needs to be further investigated.

UEQ does not provide an overall score for the UX but an individual score for each category. Scores between −0.8 and 0.8 usually represent a neutral evaluation, values over 0.8 a positive evaluation, and values below −0.8 a negative evaluation [62]. The obtained results are exceptional, since all categories received average scores higher than or very close to 2, and extreme UEQ values are very uncommon [63]. Furthermore, the lower bounds of all confidence intervals per category are significantly above the minimum threshold established for a positive evaluation (*p* = 0.05).

The UX results in terms of UEQ have been benchmarked against a dataset of 9905 responses corresponding to 246 studies [64]; however, given that product categories have not been considered, this benchmarking can only be used as a first indicator to assess the UX of the system under study. Figure 6 represents the result of the benchmarking; in all six categories CAPACITY ranks well above average (i.e., in the top 10%).

To endorse the validity of the obtained results, it is important to analyze whether the UEQ respondents provided random answers. Given the specific characteristics of the population that participated in the study (i.e., older persons with poor educational background and digital literacy), some inconsistencies in the provided answers were found for several respondents (all items in a scale should measure a similar UX quality aspect; a big difference, greater than 3, is an indicator of a problematic data pattern). These inconsistent answers can be due to misunderstanding of one or several items. One respondent was inconsistent in three categories, six in two categories, and eight in one category. According to Schrepp [62], answers to UEQ with two or more inconsistencies should be considered suspicious. No significant changes are observed when the doubtful information is removed from the analysis: all six categories stay with average values above two, and all categories remain qualified as excellent in the benchmarking.
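The consistency heuristic described above is straightforward to implement; a sketch with invented answers:

```python
# Consistency check after Schrepp [62]: within each UEQ scale, a spread of more
# than 3 between the highest- and lowest-rated items flags the scale as
# inconsistent; respondents with two or more flagged scales are treated as
# suspicious and can be excluded in a sensitivity analysis.


def inconsistent_scales(answers_by_scale: dict[str, list[int]]) -> list[str]:
    return [scale for scale, vals in answers_by_scale.items()
            if max(vals) - min(vals) > 3]


respondent = {
    "attractiveness": [2, 3, 2, 2, 3, 2],
    "perspicuity": [3, -2, 2, 2],   # spread of 5 -> inconsistent
    "efficiency": [2, 2, 1, 3],
    "dependability": [2, 3, 2, 2],
    "stimulation": [2, 2, 2, 2],
    "novelty": [3, 2, 1, 2],
}
flags = inconsistent_scales(respondent)
print(flags, "suspicious" if len(flags) >= 2 else "kept")
```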

Acceptance in terms of TAM reached a maximum score of 51.54 at the last sampling point, also showing an increasing trend across data collection points (*p* = 0.15 from baseline to month 6); given the small sample size of this research work, and the fact that statistical significance is a function of both the sample size and the magnitude of the estimated effect, *p*-values lower than 0.2 could be considered statistically significant [65,66]. Furthermore, a Student's *t*-test was significant (*p* = 0.02) for the positive evolution of the perceived ease of use between baseline and month 6. All individual items obtained average values above 4 at the last sampling point. The item that took the longest to reach a value of 4 was Q2, under the category of 'perceived usefulness'; this item relates to whether users perceived that CAPACITY contributed to their independence in daily life. A possible explanation is that the results of a physical intervention are not perceived immediately. In any case, this seems to be related to the clinical aspects of the project rather than to the technological ones. On the other hand, the evaluation of each individual device (i.e., each component of the home monitoring kit) indicates very high acceptance: all questions where 5 was the target value averaged over 4, while those targeting 1 averaged below 1.5. The acceptance results not only suggest that the older population would accept using and having CAPACITY devices at home as a way of being constantly monitored in terms of function, but also that all components of the monitoring kit are perceived as empowerment tools that motivate a healthier lifestyle and control over one's own health.

Not many RCTs exist in the field of ICTs applied to frailty management, which limits the number of works focused on assessing the usability of technologies for treating the frail population [41]. Works analyzing usability-related aspects of technology during real interventions also report satisfactory results [67,68]; however, since the design procedures they followed are scarcely described, their sample sizes are significantly smaller than the one presented in this paper, and their adherence data are not optimal (i.e., far from the 100% adherence reported in this paper), those results should be interpreted with caution. On the other hand, the majority of the research addressing how older persons interact with technology is done in controlled environments and under the supervision of domain experts [69–73]. Most of the published related research uses standardized tools such as those used in this work, which is aligned with the methods followed in the current approach. Moreover, despite the heterogeneous target applications, ranging from rehabilitation [67–69] to exergames [70], monitoring cognitive impairment [71], fall risk [72], or evaluating available health apps [73]; the varied interaction instruments, including mobile devices [67,69,72,73], personal computers [68,71], or custom prototypes [70]; and the diverse characteristics of the target populations, which in some cases have previous experience with the technology to be used [68] and in others cannot use it without help [71], the UCD approach is a commonality backing almost all approaches from a methodological perspective.

#### **5. Conclusions**

The objective of this research work was to investigate the usability, UX, and acceptance of CAPACITY, a technological ecosystem to prevent disability. This objective has been achieved, obtaining very satisfactory results in all domains under study. The usability of CAPACITY (in terms of SUS) was rated as almost excellent, and the UX (in terms of UEQ) as excellent; finally, the proposed technology, both from a software and a hardware perspective, seems to be highly accepted by the target population (in terms of TAM and ad-hoc questionnaires). Moreover, adherence to using CAPACITY was found to be optimal, which implies both that these results are correlated with intensive actual use of the proposed solution in a real environment and that the data supporting the conclusions are reliable and solid.

The main contribution of this paper is thus the demonstration that following an iterative UCD approach, starting in a controlled laboratory environment to come up with a pre-validated interaction system and later upscaling it to a real uncontrolled environment, is a valid strategy to maximize the usability, UX, acceptance, and actual adoption of a system. Furthermore, this research work adds a new experience to the scant number of RCTs studying how pre-frail and frail older persons interact with technology.

Findings support UCD as a key methodology. Involving potential end users and giving them voice during the design stage maximizes usability, UX, acceptance, and usage. In this research, older persons were involved from the very beginning: first, older people's opinions were captured in a laboratory environment, to later move towards clinical and home environments. Insights collected during this process enabled obtaining these excellent data within an RCT. The results indicate a potential high adoption in a wider deployment scenario (i.e., a production phase). Some limitations must be taken into consideration when interpreting the results presented in this manuscript. First, the relatively small sample implies that findings need to be interpreted with prudence. Second, the external validity of the findings is not clear (i.e., whether the tested interaction system would obtain equivalent results in a population with different characteristics, such as culture, education, experience with technology, etc.); moreover, the assessment tools used to measure usability, UX, and acceptance, although conceived to provide objective measurements, are highly dependent on the subjectivity of the respondents, so what is really measured is a perception of the different explored domains, given that humans are prone to bias when rating their experiences after interacting with a system. Third, no data related to the use of the technical assistance telephone line available to participants were collected, which prevented integrating that information into the interpretation of the results. Finally, no information on patient–physician communication through the platform was registered, limiting the extent of the presented usage analysis.

The CAPACITY technological ecosystem is constantly being improved, and new services are being added. From a service perspective, the current version of the solution incorporates functionalities to support a novel organizational model that interconnects all relevant people in the care process: the older person, the informal caregiver, and the primary and specialized care professionals. This evolved version of CAPACITY also integrates mechanisms (algorithms) to automatically detect functional decline and alert professionals, as well as means to provide a multicomponent intervention. Future work includes carrying out a new multicentric field experimentation (RCTs in Spain, Sweden, and Poland) with a higher sample size (ClinicalTrials.gov Identifier: NCT04592146), in which the usability, UX, and acceptance will be assessed further, including extended work aimed at identifying ways of improving specific usability issues related to the individual answers to SUS, further exploring the relationship between usability and external factors (e.g., previous experience with technology, living conditions, etc.), and finding efficient ways to promote prompt responses to notifications. In addition, the home monitoring kit is being shifted towards a ubiquitous and transparent paradigm, which will probably maximize acceptance. These new devices will be based on IoT technologies, easing their configuration, replacement, and scalability.

#### **Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/s21196458/s1.

**Author Contributions:** L.R.-M. coordinated the research line. R.P.-R., E.V.-M., and X.F. conceptualized and materialized the technological approach. M.V.-A., R.P.-R., and L.R.-M. designed the experimental setup. R.P.-R. led and supervised the correct execution of the experiment, with support from M.V.-A., E.V.-M., C.M., M.M.-R., and P.A.-S. R.P.-R. and E.V.-M. analyzed the results. All authors participated in writing the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research work was supported by the FACET (Integrated supportive services/products to promote FrAilty Care and wEll function; PGA 16003) and POSITIVE (maintaining and imPrOving the intrinSIc capaciTy involving primary care and caregIVErs; PGA 109091) projects, both funded by EIT-Health.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Getafe University Hospital (protocol code 17/85 approved 14 December 2018).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** No new data were created or analyzed in this study. Data sharing is not applicable to this article.

**Acknowledgments:** The authors would like to thank all other partners participating in the FACET (ABBOTT, ATOS, Genesis Biomed, GMV Quirónprevención, and University of Aberystwyth) and POSITIVE (ATOS, Karolinska Institutet, KTH Royal Institute of Technology, Medical University of Lodz, and IDNEO) projects.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


### *Article* **A CSI-Based Human Activity Recognition Using Deep Learning**

**Parisa Fard Moshiri 1, Reza Shahbazian 2, Mohammad Nabati 1 and Seyed Ali Ghorashi 3,\***


**Abstract:** The Internet of Things (IoT) has become quite popular due to advancements in information and communications technologies and has revolutionized the entire research area of Human Activity Recognition (HAR). For the HAR task, vision-based and sensor-based methods can present better data, but at the cost of users' inconvenience and social constraints such as privacy issues. Due to the ubiquity of WiFi devices, the use of WiFi for intelligent daily activity monitoring of elderly persons has gained popularity in modern healthcare applications. Channel State Information (CSI), one of the characteristics of WiFi signals, can be utilized to recognize different human activities. We have employed a Raspberry Pi 4 to collect CSI data for seven different human daily activities, converted the CSI data into images, and then used these images as inputs to a 2D Convolutional Neural Network (CNN) classifier. Our experiments have shown that the proposed CSI-based HAR outperforms competitor methods, including 1D-CNN, Long Short-Term Memory (LSTM), and Bi-directional LSTM, and achieves an accuracy of around 95% for seven activities.

**Keywords:** activity recognition; Internet of Things; smart house; deep learning; channel state information

#### **1. Introduction**

The Internet of Things (IoT) is a dynamic global information network consisting of internet-connected devices [1]. Due to the recent advancements in communication systems and wireless technology over the last decade, IoT has become a vibrant research field [2]. The concept is straightforward: things or objects are connected to the internet and exchange data or information with each other over the network. Applications of IoT improve the quality of life [3]. As one of the main IoT applications, smart houses allow homeowners to monitor everything, including health, especially for people with disabilities and elderly people, by exploiting Human Activity Recognition (HAR) techniques [4]. Additionally, the joint task of HAR and indoor localization can be exploited in smart house automation [4]. A user's location can change how IoT devices respond to identical gesture commands. For instance, users can use the "hand down" gesture to reduce the temperature of the air conditioner, but they can also use the same gesture to turn down the television in front of them [4]. HAR has emerged as one of the most prominent and influential research topics in several fields, including context awareness [5], fall detection [6], elderly monitoring [7], and age and gender estimation [8].

HAR techniques can be categorized into three groups: vision-based, sensor-based, and WiFi-based [7]. Existing sensor-based and vision-based methods for HAR tasks have achieved acceptable results. However, these methods still have limitations in terms of environmental requirements. Strictly speaking, camera-based recognition algorithms are susceptible to environmental factors such as background, lighting, and occlusion, and to social constraints such as privacy issues. Additionally, in sensor-based methods, people often object to these sensor modalities because they are bothersome or cumbersome. Although the underlying technology employed in these sensors is frequently inexpensive, IoT-connected versions of these sensors can be significantly more expensive due to added wireless hardware and branding. WiFi devices, which are less expensive and more power-efficient than the aforementioned technologies, invariant to light, easier to implement, and have fewer privacy concerns than cameras, have recently attracted much interest in various applications [4].

The purpose of WiFi-based activity recognition is to distinguish executed actions by analyzing the specific effects of each activity on the surrounding WiFi signals. In other words, an individual's movement affects the signal propagated from WiFi access points and can be used to recognize activities. WiFi signals can be described by two characteristics: Received Signal Strength (RSS) and Channel State Information (CSI) [4]. RSS is an estimated measure of the received signal's power, which has mainly been used in indoor positioning [9]. As RSS is not stable compared with CSI, it cannot properly capture dynamic changes in the signal while an activity is performed [10]. As a more informative specification of WiFi signals for HAR tasks, CSI has drawn more attention than RSS over recent years [10]. CSI captures physical-layer information for each subcarrier of the channel. When a person performs a particular activity between the transmitter and receiver, the wireless signals reflected from the body generate a unique pattern [11]. Furthermore, human body shape, the speed of performing an activity, environmental obstacles, and the path along which an activity is performed can cause different changes in the received CSI signals. For instance, a person walking in a straight line has a different effect on the CSI signal than a person walking around a square path. Many WiFi devices use CSI internally to assess the quality of their connection. The device collects the experimental phase and strength of the signal at each antenna for each channel in the provided spectrum, allowing signal disruptions to be identified. The WiFi-based method takes advantage of the ubiquitous nature of radio frequency transmissions while also potentially allowing for a system that exploits the existing WiFi infrastructure in smart houses [4].

Although business applications of HAR are at an early stage, many studies in this field introduce issues that must be addressed before any practical deployment. One of the main issues is the specific hardware/software combination required to extract CSI data. After choosing the proper hardware, the collected CSI data can be used as inputs to Deep Learning (DL) algorithms for the HAR task. The effects of each activity on the characteristics of the collected CSI can be used by different DL algorithms to distinguish activities and finally classify them [11].

Since CSI is time-series data with temporal dependency, Recurrent Neural Networks (RNNs) and their subsets have been used more than other DL algorithms for the HAR task. Long Short-Term Memory (LSTM) and RNN apply sequential processing to long-term information, meaning that these data pass through all cells in the network before reaching the present cell. The RNN structure cannot perform efficiently when we need to analyze long sequences, resulting in vanishing gradients. The vanishing gradient problem persists even with the switch gates and long memory maintained in the LSTM network [11]. Furthermore, this module requires a significant amount of memory bandwidth due to the complexity of the sequential path and the Multi-Layer Perceptron (MLP) layers in each cell. Despite the proficiency of LSTMs for prediction and classification tasks on time series, they are incapable of learning dependencies spanning more than 100 terms [12]. Additionally, LSTMs analyze sequential data in one direction, meaning that only past CSI data will be considered [11]. Accordingly, they cannot distinguish between two similar activities, such as lie down and sit down, which have the same start position but different final positions.

In real-time activity monitoring, especially for elderly people, each activity's duration and further information are essential. Therefore, we consider two approaches: 2D-CNN and attention-based Bi-directional LSTM (BLSTM). Unlike RNNs and LSTMs, where long-term data are analyzed sequentially, convolutions analyze the data in parallel. Furthermore, the training time of LSTMs is somewhat longer than that of CNNs, and as a result they require greater memory bandwidth for processing. The shorter training time and lower computational complexity, along with the problems mentioned above, encouraged us to use 2D-CNN. Since 2D-CNN has high potential in image processing, we convert the CSI data into RGB images. In order to generate the RGB images, we make a pseudocolor plot from the CSI matrices: each element of a matrix is linearly mapped to the RGB colormap (a sketch of this conversion is shown below). Furthermore, we apply BLSTM to the raw CSI data. HAR performance can be improved by using attention-based BLSTM, which concentrates on regions of greater relevance and assigns them higher weights. The main contributions of this research are as follows:


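The CSI-to-image conversion described above can be sketched as follows; this is our minimal reading of the described pseudocolor mapping (the colormap choice and matrix shape are assumptions), not the authors' exact code:

```python
# Each element of a CSI amplitude matrix is linearly normalized and mapped
# through a colormap to produce an RGB image for the 2D-CNN.
import numpy as np
import matplotlib.pyplot as plt


def csi_to_rgb(csi: np.ndarray) -> np.ndarray:
    """Map a (time, subcarriers) CSI matrix to a (time, subcarriers, 3) RGB image."""
    lo, hi = csi.min(), csi.max()
    normalized = (csi - lo) / (hi - lo + 1e-12)    # linear map to [0, 1]
    rgba = plt.get_cmap("jet")(normalized)         # colormap lookup -> RGBA in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)  # drop alpha, scale to 8-bit RGB


rng = np.random.default_rng(0)
fake_csi = np.abs(rng.normal(size=(600, 52)))      # e.g., 600 packets x 52 subcarriers
print(csi_to_rgb(fake_csi).shape)                  # (600, 52, 3)
```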
The rest of this paper is organized as follows: Section 2 reviews HAR studies. In Section 3, we provide a brief explanation of CSI and the required information on hardware, software, and firmware; the structures of the four neural networks used are also discussed in this section, and we briefly discuss other datasets and their public accessibility. The main contributions of this research are summarized in Section 4, where we discuss the device configuration to collect CSI, image generation from CSI, and feeding the data to the neural networks. In Section 5, measurement setups and experimental results are reported, and finally, conclusions are drawn in Section 6.

#### **2. Related Works**

HAR techniques can be divided into three groups: vision-based, sensor-based, and WiFi-based. Several image-based methods for HAR have been published in recent years, using datasets such as RGB (red, green, and blue) [18], depth [19], and skeleton images [20]. The RGB dataset may not be qualified and robust enough when the video contains considerable sudden camera movements and a cluttered background. To this end, Anitha et al. [21] propose a shot boundary detection method in which the features and edges of videos are extracted; the features are then extracted as images, merged with the video feature, and fed into the classifier. The Kernel Principal Component Analysis (KPCA) technique is applied to locate image features and joint features. The preparation process thus becomes gradually more proficient, making independent vector analysis increasingly realistic for real-life applications. The human activity videos are classified by K-Nearest Neighbor, obtaining better results than other cutting-edge activity methods. Capturing an image or video of an activity in RGB format generates many pixel values, making it more difficult to distinguish the subject from the surrounding background and resulting in computational complexity. These obstacles, together with view dependency, background, and light sensitivity, impair RGB video-based HAR performance and persuade researchers to use depth images and other formats of images or videos to improve HAR performance. Most of the methods introduced for HAR utilizing skeleton datasets are limited in various ways, including feature representation, complexity, and performance [22]. In [22], the authors propose a 3D skeleton joint mapping technique that maps the skeleton joints into a spatio-temporal image by joining a line across the same joints in two adjacent frames, which is then used to recognize the person's activities. The 3D skeleton joint coordinates were mapped along the XY, YZ, and ZX planes to address the view dependency problem. They exploit transfer learning models, including MobileNetV2, DenseNet121, and ResNet18, to extract features from the images [22].

In sensor-based methods, wearable sensors capture activities, causing inconvenience and unavailability for long-time monitoring [23]. In the past decade, smartphones have become more powerful, with many built-in sensors, including accelerometers and gyroscopes. The main impediments to using smartphones for HAR tasks are their higher noise ratio compared with wearable sensors and fast battery drain [23]. Several researchers have used Radio Frequency Identification (RFID) tags to recognize human activities [24]. The authors in [24] present a framework for HAR and activity prediction using RFID tags. They utilize RFID tags to detect high-level activity and object usage. Additionally, they employ weighted usage data and obtain activity logs. Since human activities are time-series data and the next activity is related to the current and previous ones, they use LSTM to predict activities with an accuracy of 78.3%. Although RFID tags are cheap, RFID-based systems cannot achieve high accuracy in crowded environments. Additionally, as mentioned above, vision-based HAR needs cameras installed in the environment, depends highly on the consistency of the light source, and is unable to pass through physical obstacles such as walls. Since indoor spaces such as smart houses, malls, and nursing homes are filled with wireless signals, WiFi-based systems have been exploited more than other approaches in recent years [25].

Due to the growing interest in sensor-less activity detection, the research and industry communities have converged on CSI analytics with the help of neural networks. Common CSI-based applications range from broad activity detection scenarios such as WiTraffic [26] to delicate activity recognition systems like Wifinger [27] and breathtrack [28]. In [29], the authors utilize CSI to sense distinct hand movements. They use predefined windows to monitor activity continuously; this method is time-consuming and yields lower accuracy. To overcome this problem, Wi-Chase [30] does not apply predetermined time windows. Due to the detailed correlated information in different subcarriers, Wi-Chase also employs all available subcarriers, unlike Wi-Sleep [31], which uses only a subset of them. The extracted features were trained using machine learning algorithms, including KNN and Support Vector Machine (SVM) [30]. Although different WiFi-based HAR systems have been proposed, one of the major challenges has not been addressed properly: WiFi signal changes are due to the various movement speeds and body types of people. Human activity is made up of many limb movements, such as lifting an arm or leg. The speed and scale of an activity can naturally alter according to the scenario or period. Furthermore, physical traits such as body form and height are unique to each person. Therefore, human activity patterns can vary greatly amongst people. To address this problem, a WiFi-based HAR proposed in [15] incorporates synthesized activity data that reduces the influence of activity inconsistency, such as varied motion speed. They collect CSI for 10 different activities (making phone calls, jumping, checking a wristwatch, lying down, walking, playing guitar, fast walking, playing piano, running, and playing basketball) with an Atheros AR9590 WiFi chipset. The combination of the CSI spectrograms of all subcarriers is fed into the network as image inputs. Four dense layers are used to extract spatial features of the activities, and these features are fed into a convolutional layer. Then, a BLSTM is used to extract temporal features, and a linear layer is applied to predict the activities. Three data synthesis methods are combined with eight types of transformation methods, including dropout, Gaussian noise, time-stretching, spectrum shifting, spectrum scaling, frequency filtering, sample mixture, and principal component coefficient. A dense LSTM with a consistent accuracy of 90% is applied to efficiently optimize the system for the small-size dataset while keeping the model compact to minimize overfitting.

For multi-class classification based on extracted features, such as HAR, a variety of machine learning algorithms such as RF, SVM, and HMM, as well as DL algorithms such as CNN, RNN, and LSTM, can be applied. In [14], the authors apply RF, HMM, and LSTM to their public dataset, which was collected with a NIC 5300 with three antennas for six different activities: sit down, stand up, fall, walk, run, and bed. A 90-dimensional vector of CSI amplitudes (3 antennas and 30 subcarriers) is used as the input feature vector. They apply PCA to the CSI amplitude for denoising, and a Short-Time Fourier Transform (STFT) for feature extraction. First, they use an RF with 100 trees for classification, which has unacceptable accuracy for the bed, sit down, and stand up activities. They also apply an HMM to the features extracted with the STFT and DWT techniques. The accuracy improves compared with RF, but at the cost of a higher training time. Although the HMM obtains good results for the walk and run activities, it cannot distinguish between the stand up, sit down, and bed activities. They also apply an LSTM to the activities [14]. The LSTM extracts the features automatically and directly from raw CSI without any pre-processing; in other words, in contrast to the other methods, the LSTM approach does not need PCA and STFT, but it has a longer training time [14]. The accuracy of the LSTM is reported to be over 75% for all activities in [15].
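The PCA-plus-STFT pre-processing described for [14] can be sketched as follows, using synthetic data in place of the real CSI stream:

```python
# PCA on the 90-dimensional CSI amplitude stream (3 antennas x 30 subcarriers)
# for denoising, then a Short-Time Fourier Transform on the leading component
# for time-frequency features. Illustrative data and parameters only.
import numpy as np
from sklearn.decomposition import PCA
from scipy.signal import stft

rng = np.random.default_rng(0)
amplitudes = rng.normal(size=(2000, 90))             # 2000 packets x 90 amplitudes

pca = PCA(n_components=5)                            # keep the strongest components
denoised = pca.fit_transform(amplitudes)             # (2000, 5)

f, t, Z = stft(denoised[:, 0], fs=100, nperseg=128)  # spectrogram of first component
features = np.abs(Z)                                 # magnitude as a feature map
print(denoised.shape, features.shape)
```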

Since static objects in an environment can also affect wireless signals and, consequently, the HAR model, the authors in [17] propose a deep neural network as a baseline classifier for four simple activities (standing up, sitting down, pushing, and picking) performed in two different complex environments. More precisely, they propose a network with shared weights that forms a similarity network across the two environments. They use one transmit antenna and two receive antennas and create four grayscale images from the CSI amplitude and phase. In the feature extraction stage, a Gabor filter is applied to the grayscale images; the Gabor filter extracts the spatial information of an image by convolving the transformed image with a filter at a specific wavelength *λ* and orientation θ [17]. For each grayscale image, the final output has 5 (the number of *λ* values) × 8 (the number of θ values) × 2 (mean and standard deviation) = 80 elements, and a vector of dimension 320 = 4 (the number of grayscale images) × 80 is fed into the neural network as input. They use three fully connected (FC) hidden layers as the baseline network and two identical branches that share the same weight values as the similarity network. A pair of randomly selected samples is fed into the two identical branches simultaneously, and each branch feeds into the fully connected network. If the two samples belong to the same category of activity, they are labeled as "similar"; otherwise, "non-similar". Their model obtains an overall accuracy of around 84% across the two environment scenarios.

One of the main issues in WiFi-based HAR is the specific hardware/software combination required for CSI data collection. In particular, the Linux 802.11n CSI Tool is limited to older Linux kernel versions, and the required hardware is no longer easy to find on the market. Following the release of the Nexmon CSI Tool [13], it is now possible to extract CSI from the BCM43455C0 wireless chipset used in the Raspberry Pi 3B+ and 4B. As this is a recent release, Ref. [16] examines the performance of the Raspberry Pi 4 in CSI-based HAR. The authors collect CSI signals for everyday activities: stand up, sit down, go to bed, cook, wash dishes, brush teeth, drink, pet a cat, sleep, and walk. They do not apply any denoising filter, since their results are acceptable compared to other available datasets and additional filtering may remove important information from the data. They pack the CSI vectors collected by the Raspberry Pi 4 into windows to train their classification model. As LSTMs and their extensions are well suited to the HAR task, they use a deep convolutional variant of the LSTM model, applying two 1D-convolutional layers along with four BLSTM layers, which increases training time and computational complexity. Their model achieves 92% accuracy, demonstrating the Raspberry Pi 4's capability for HAR in smart houses and showing that it can supersede the Linux 802.11n CSI Tool.

#### **3. System Model**

#### *3.1. Preliminary*

When a signal travels from the transmitter to the receiver, it is deflected, reflected, and scattered when it comes into contact with obstacles and objects, resulting in multipath overlaid signals at the receiver [7]. Fine-grained CSI can be used to characterize this process. Orthogonal Frequency-Division Multiplexing (OFDM) modulation is utilized in IEEE 802.11 and distributes the available bandwidth across several orthogonal subcarriers [14]. Because each subcarrier occupies only a narrow portion of the band, the fading it experiences can be represented as flat fading [31]; the small-scale fading aspect of the channel can therefore be minimized by employing OFDM techniques. Narrow-band fading per subcarrier causes a considerable variation in the measured channel dynamics. The greatest advantage of employing CSI is that, unlike RSS, it can capture changes occurring at a single frequency rather than averaging out changes across the whole WiFi bandwidth.

Several subcarriers can be present in the physical link between each pair of transmit and receive antennas. As each subcarrier might serve many data streams, the CSI obtained from each subcarrier is unique [14]. For *t* transmit and *r* receive antennas, the CSI of a given packet transmission *n* can be represented as the channel matrix:

$$CSI_n = \begin{pmatrix} H_{1,1} & \cdots & H_{1,r} \\ \vdots & \ddots & \vdots \\ H_{t,1} & \cdots & H_{t,r} \end{pmatrix} \tag{1}$$

*Ht,r* represents a vector that includes a complex pair for each subcarrier. The number of available subcarriers differs depending on the hardware used and the channel bandwidth [16]; a Raspberry Pi 4 and a TP-Link Archer C20 paired over 5 GHz at 20 MHz bandwidth can access 56 data subcarriers. *Ht,r* can be expressed as below for *m* data subcarriers, in which *hm* is a complex number containing both the amplitude and the phase of the CSI:

$$H_{t,r} = \begin{bmatrix} h_{t,r,1}, \dots, h_{t,r,m} \end{bmatrix} \tag{2}$$
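For illustration, the complex entries in Equation (2) can be separated into the amplitude and phase features used by the models in Section 3.3. The following minimal NumPy sketch assumes the CSI has already been loaded as a complex matrix; the array name and the random placeholder data are illustrative, not part of any tool's API.

```python
import numpy as np

# Illustrative CSI matrix: one row per received packet, one column per
# data subcarrier (e.g., 52 columns after removing null/pilot subcarriers).
csi = np.random.randn(4000, 52) + 1j * np.random.randn(4000, 52)

amplitude = np.abs(csi)    # |h| for every h_{t,r,m} in Equation (2)
phase = np.angle(csi)      # arg(h), in radians

print(amplitude.shape, phase.shape)  # (4000, 52) (4000, 52)
```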

#### *3.2. Hardware and Firmware*

To the best of our knowledge, the specialized hardware/software combinations required to extract CSI data are the Intel 5300 WiFi Network Interface Card (NIC) (Linux 802.11n CSI Tool) [32]; the Atheros AR9580, AR9590, AR9344, and QCA9558 chipsets (Atheros CSI Tool) [33]; and the Raspberry Pi (Nexmon CSI Tool) [13]. The Intel 5300 NIC has been used for CSI collection since 2011 [32]. Although many researchers have used the 5300 NIC, as in [14], this hardware configuration has become less relevant over time, since most laptops with this wireless card are no longer available on the market and third-party tools are required to collect CSI; more precisely, some type of Mini PCIe to PCI-Express adapter with three antennas is required. The Atheros CSI Tool, another open-source 802.11n experimental tool for CSI collection, allows the extraction of physical-layer wireless communication information, including the CSI, RSS, received payload packet, timestamp, data rate, etc. [33]. The ath9k open-source kernel driver supports Atheros 802.11n PCI or PCI-E chips; thus, this tool supports any sort of Atheros 802.11n WiFi chipset. This tool was released in 2015, and more hardware with built-in Atheros 802.11n PCI or PCI-E chips is available than for the Intel 5300 NIC, although at a higher price.

The release of the Nexmon CSI Tool [13] has enabled CSI extraction from the Raspberry Pi 3B+ and 4B, the Google Nexus 5, and some routers. One benefit of the Nexmon tool is that it permits several transmit-receive antenna configurations (up to 4 × 4 MIMO). Additionally, it includes customizable CSI collection filters that extract only the relevant CSI from selected transmitters, so the complete CSI stream does not need to be captured. Although the Raspberry Pi uses a single transmit/receive antenna pair, its price and prospective capabilities make it a suitable tool for WiFi-based healthcare monitoring in smart houses. Once configured for monitoring on the Raspberry Pi, Nexmon [13] provides a configuration option to assign a separate interface that carries only the monitored frames. The tool can use up to 80 MHz of bandwidth and 242 subcarriers. There are three types of subcarriers in OFDM technology: null subcarriers, pilot subcarriers, and data subcarriers. Null subcarriers (also called zero subcarriers) are unused subcarriers mainly employed as a guard against interference from adjacent channels. Pilot subcarriers do not convey modulated data; they are utilized for channel measurements and synchronization between the transmitter and receiver. Pilot subcarriers are broadcast using a predetermined data sequence and therefore represent overhead for the channel. The remaining subcarriers are the data subcarriers, which use the same modulation format as 802.11ac [34]. As mentioned in Table 1, the number of subcarriers differs depending on the PHY standard and bandwidth.



#### *3.3. Neural Network*

Once an activity is performed between the transmitter and receiver, it affects the CSI characteristics: when a person performs a particular activity, the received CSI signals generate a unique pattern [7]. Recently, DL algorithms have been widely used to automatically learn features from the effects of activities on CSI. While stacking many layers in these algorithms offers improved classification ability, overfitting and performance deterioration become significant when the network is trained on a limited amount of data. Traditional strategies such as weight decay, a small batch size, and learning-rate tuning might not be enough to avoid this problem. Accordingly, all of the pre-existing WiFi-based systems, such as those in Section 2, require dedicated numbers of particular neural layers to provide the desired performance. In this research, we present custom deep learning models that are best suited to situations with a small dataset and have lower computational complexity and training time than other methods.

#### 3.3.1. CNN

A CNN is a feed-forward neural network that extracts features from data with convolution operations. It contains several layer types, including convolution, pooling, dense, and flatten layers. This classification network requires less pre-processing than other classification techniques; additionally, a CNN can learn the required filters or characteristics without user assistance. CNNs use filters (also known as kernels or feature detectors) to extract features via the convolution operation [35]. The initial convolution layer (ConvLayer) is designed to handle lower-level features, such as edges and color. When several ConvLayers are employed in the network topology, the network can achieve high recognition accuracy, since it can also capture high-level features.

After each of the two 2D-ConvLayers, we use the LeakyReLU activation function, an upgraded variant of the ReLU (Rectified Linear Unit). With a standard ReLU, every input value less than zero produces a zero gradient; the neurons in that region are deactivated and may suffer from the dying ReLU problem. To address this problem, instead of mapping negative input values to zero, a small linear component of the input S is kept. LeakyReLU can be formulated as f(S) = max(0.01 × S, S), meaning that if the input is positive, the function returns S, and if the input is negative, it returns 0.01 × S. This minor alteration yields a non-zero gradient for negative values; thus, no dead neurons appear in that region. Since the feature map output of a ConvLayer records the specific position of features in the input, a slight shift in the location of a feature in the input data creates a significant difference in the feature map. To address this problem, we use a downsampling strategy, the most widespread of which is a pooling layer. After feature detection in the ConvLayer, a max pooling layer is applied to down-sample the feature maps and helps in extracting low-level features. After the first ConvLayer with the LeakyReLU activation function and max pooling, Batch Normalization (B.N.) is applied to stabilize the network during training and speed up training; B.N. makes the estimates of the variable mean and standard deviation more stable across mini-batches and, respectively, closer to 0 and 1. Dropout layers are applied between the convolutional layers, decreasing overfitting while improving the network's generalization capability. The pooled features (the max pooling output) are then flattened: flattening concatenates the feature map matrices into a single-column matrix. This matrix is passed through a dense layer that produces the predicted classes. The proposed 2D-CNN structure is depicted in Figure 1.

**Figure 1.** 2D-CNN structure used in this paper.
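A minimal Keras sketch of the 2D-CNN described above is given below. The filter counts, kernel sizes, and dropout rates are illustrative assumptions; the layer ordering (two Conv2D blocks with LeakyReLU, max pooling, batch normalization after the first block, dropout, and a flattened dense classifier for the seven activities) follows the structure in Figure 1.

```python
from tensorflow.keras import layers, models

def build_2d_cnn(num_classes=7, input_shape=(64, 64, 3)):
    """2D-CNN over 64x64 RGB CSI images; hyper-parameters are illustrative."""
    m = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3)),
        layers.LeakyReLU(0.01),          # f(S) = max(0.01*S, S)
        layers.MaxPooling2D((2, 2)),     # down-sample the feature maps
        layers.BatchNormalization(),     # stabilize and speed up training
        layers.Dropout(0.3),
        layers.Conv2D(64, (3, 3)),
        layers.LeakyReLU(0.01),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.3),
        layers.Flatten(),                # single feature vector
        layers.Dense(num_classes, activation='softmax'),
    ])
    m.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
    return m
```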

In addition to the 2D-CNN applied to the converted RGB images, we also apply a 1D-CNN to the CSI data, as depicted in Figure 2, whose filters convolve along a single dimension. Whether the input is 1D, 2D, or 3D, CNNs all share the same properties and use the same process; the crucial distinction is the dimensionality of the input data and the way the filter slides across it. The 1D-CNN is trained to identify different activities based on sequential observations and to map the internal features to the different activities. It is particularly good at learning from time-series data such as CSI, as it can leverage the raw time series and requires no domain expertise to hand-engineer input features. We use two ConvLayers with ReLU as the activation function. As in the 2D-CNN, after each ConvLayer we apply a max pooling layer, B.N., and dropout.

**Figure 2.** 1D-CNN structure used in this paper.
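An analogous sketch for the 1D-CNN follows, again with illustrative filter counts, kernel sizes, and sequence length; as described above, each Conv1D layer uses ReLU and is followed by max pooling, batch normalization, and dropout, and the input is the sequence of 52-dimensional CSI amplitude vectors.

```python
from tensorflow.keras import layers, models

def build_1d_cnn(num_classes=7, timesteps=600, subcarriers=52):
    """1D-CNN convolving along the time axis of raw CSI amplitudes."""
    m = models.Sequential([
        layers.Input(shape=(timesteps, subcarriers)),
        layers.Conv1D(64, 5, activation='relu'),
        layers.MaxPooling1D(2),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Conv1D(128, 5, activation='relu'),
        layers.MaxPooling1D(2),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(num_classes, activation='softmax'),
    ])
    m.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
    return m
```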

#### 3.3.2. LSTM

RNNs have been successfully applied to sequential modeling applications such as language understanding [36] and HAR [37]. Nevertheless, when the learning sequence is long, the standard RNN frequently encounters the problems of vanishing and exploding gradients. To address this issue, Hochreiter and Schmidhuber [38] designed a new RNN structure named the LSTM. The LSTM network seeks to overcome vanishing and exploding gradients by utilizing memory cells with a few gates that retain essential information with long-term dependencies. The memory block comprises three gates: the forget gate, the input gate, and the output gate. Together they decide the block's state and produce its output. The information to be eliminated from the unit is determined by the forget gate; the input gate handles which input values cause the memory state to be updated; and the output gate determines the output of the block according to the input and the unit memory.

Since CSI signals are time series and the LSTM can learn complicated temporal dynamics, this network has achieved remarkable performance for CSI-based HAR. In the HAR task, the LSTM has two advantages. First, it can extract features automatically without pre-processing. On top of that, it can hold temporal state information about the activity, resulting in better performance for similar activities such as lie down and sit down compared to the 1D-CNN, RF, and HMM. In this paper, we apply a simple LSTM with one hidden layer and 128 hidden units, in which the feature vector is a 52-dimensional vector of CSI amplitudes. The proposed LSTM structure is depicted in Figure 3.

**Figure 3.** LSTM structure used in this paper.
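This baseline is small enough to state directly in Keras: one LSTM layer with 128 hidden units over the 52-dimensional amplitude vectors, as described above (the sequence length and optimizer below are assumptions).

```python
from tensorflow.keras import layers, models

def build_lstm(num_classes=7, timesteps=600, subcarriers=52):
    """Single-layer LSTM with 128 hidden units on raw CSI amplitudes."""
    m = models.Sequential([
        layers.Input(shape=(timesteps, subcarriers)),
        layers.LSTM(128),
        layers.Dense(num_classes, activation='softmax'),
    ])
    m.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
    return m
```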

The traditional LSTM network only analyzes the CSI data in one direction, meaning that the present hidden state only considers past CSI information; however, future CSI information is also important for HAR. In this paper, an attention-based BLSTM is utilized to analyze both past and future information and overcome the long-term dependency problem. It contains a forward and a backward layer for extracting information in the two directions; in other words, it is a two-layer LSTM sequence-processing paradigm in which one layer processes the input forward and the other backward. As the name suggests, attention is a technique that allows input sequences of arbitrary length to pay attention to specified timesteps [11]. The concept is based on studies of the human visual system, which indicate that humans consistently focus on a certain region of an image while identifying it and then shift their focus over time; in image recognition, having the machine focus on the region of interest while concealing the rest of the image has been found effective. Because the sequential features learned by a BLSTM network for WiFi-based HAR are known to have high dimensionality, and feature contributions and timesteps may vary from case to case, we exploit the attention model to automatically learn the significance of features and adjust the feature weights based on activity recognition performance. In this paper, as depicted in Figure 4, a BLSTM with one attention layer with 400 units is used to learn the relative importance of features and timesteps, and more important characteristics are given higher weights to obtain better performance.

The comparison between these four networks and five other networks from the HAR literature, i.e., RF [14], HMM [14], DenseLSTM [15], ConvLSTM [16], and the FC network [17], is discussed in Section 5. Note that the proposed networks significantly outperform the other techniques on our public dataset in terms of accuracy, computational and structural complexity, and consumed time.

**Figure 4.** BLSTM structure used in this paper.
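Below is one common way to realize an attention-based BLSTM in Keras: a bidirectional LSTM with 200 hidden nodes per direction returns the full sequence, a small scoring network (here with 400 tanh units, matching the attention layer size quoted above) produces per-timestep weights via a softmax, and the weighted sum of the hidden states feeds the classifier. The paper does not spell out its exact attention formulation, so treat this as an assumed, representative variant rather than the definitive implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_attention_blstm(num_classes=7, timesteps=600, subcarriers=52):
    """BLSTM (200 units per direction) with soft attention over timesteps."""
    inp = layers.Input(shape=(timesteps, subcarriers))
    h = layers.Bidirectional(layers.LSTM(200, return_sequences=True))(inp)
    # Score each timestep, then normalize the scores across the time axis.
    score = layers.Dense(400, activation='tanh')(h)
    score = layers.Dense(1)(score)                   # (batch, T, 1)
    weights = layers.Softmax(axis=1)(score)
    # Context vector: attention-weighted sum of the hidden states.
    context = layers.Lambda(
        lambda z: tf.reduce_sum(z[0] * z[1], axis=1))([h, weights])
    out = layers.Dense(num_classes, activation='softmax')(context)
    m = models.Model(inp, out)
    m.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
    return m
```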

#### *3.4. Human Activity Recognition Datasets*

The amount of data needed for the HAR task depends on the complexity of the task and the chosen algorithm; hence, there is no specific rule about the number of samples needed to train a neural network, and determining it is a process of trial and error. For the vision-based HAR task, [39] used 320 samples for 16 activities and [40] used 567 samples for 20 activities. We investigated the number of samples used in several CSI-based HAR studies. In ConvLSTM [16], CSI data were collected for 11 activities performed 100 times in a home environment (1100 samples). In [41], 600 samples were collected from 3 volunteers for 8 activities. In [30], 720 samples of activities were collected (12 volunteers × 20 samples × 3 activities). The authors in [42] collected 50 to 100 samples for 4 actions (approximately 200 to 400 samples). In [43], 1400 samples were collected from 25 volunteers. The authors in [44] collected 50 samples for each of 10 activities (500 samples). Yousefi et al. [14], in one of the most cited articles in WiFi-based human activity recognition, provided a public dataset of 6 different activities performed by 6 users 20 times each (720 samples). In line with these studies, we asked 3 volunteers to perform 7 different activities 20 times each, resulting in 420 samples. To the best of our knowledge, the data accessibility and number of samples of the WiFi-based studies are listed in Table 2. Furthermore, we plan to increase the number of samples and to perform the activities in different scenarios.


**Table 2.** Number of samples and data accessibility in different CSI-based HAR researches.

#### **4. Proposed Method**

Despite the numerous advantages that access to CSI would provide to users, chip manufacturers continue to treat CSI as a private feature. Only a few devices still using the 802.11g and 802.11n technologies are capable of dumping CSI, and they do so with a number of restrictions. Additionally, the Linux 802.11n CSI Tool is only compatible with older Linux kernel versions, which can cause significant inconvenience. In IoT, wireless connectivity is critical for monitoring and control purposes such as HAR, and for experimentation the Raspberry Pi can be considered a cheap and readily available WiFi-enabled platform. We employ the Nexmon Tool [13] and collect CSI data for seven daily human activities: walk, run, fall, lie down, sit down, stand up, and bend. We use a Raspberry Pi 4 and a TP-Link Archer C20 as the Access Point (AP) at 20 MHz bandwidth on channel 36 in the IEEE 802.11ac standard. As depicted in Figure 5, we use a Personal Computer (PC) for traffic generation by pinging or watching a movie on the internet. The AP replies with pong packets to the pings sent from the PC. The Pi is in monitor mode, sniffs this connection, and collects CSI for each sent-out pong packet. The CSI is saved as a pcap file, which can be analyzed in many software packages, including MATLAB. The CSI complex numbers are extracted and, after removing the null and pilot subcarriers (a sketch of this selection follows Figure 5), we export the activity rows according to the period of each activity, delimited using a video of the activity performed by the users and a stopwatch. Due to the reflections induced by human activity, each subcarrier of any given link experiences a variation [11]; therefore, each subcarrier includes critical information that increases recognition accuracy. A higher proportion of subcarriers supports precise feature detection, since it provides additional information and improves the identification of features that are challenging to analyze with only a subset of subcarriers. The CSI matrices have 52 columns (the available data subcarriers) and 600 to 1100 rows, depending on the period of each activity. The dataset is available on GitHub: https://github.com/parisafm/CSI-HAR-Dataset (accessed on 27 October 2021).

**Figure 5.** Configuration for CSI collection.
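As a concrete example of the null/pilot removal step, the following sketch assumes the 64-point FFT layout of a 20 MHz 802.11n/ac channel, in which the DC and guard subcarriers are null and the pilots sit at indices ±7 and ±21, leaving 52 data subcarriers. These index sets are the standard ones for this configuration, but they should be checked against the tool's actual output ordering.

```python
import numpy as np

# 64 subcarriers for a 20 MHz channel, FFT-shifted to indices -32..31.
ALL = np.arange(-32, 32)
NULL = {-32, -31, -30, -29, 0, 29, 30, 31}   # guard bands + DC
PILOT = {-21, -7, 7, 21}
DATA_IDX = np.array([i for i in ALL if i not in NULL and i not in PILOT])

def keep_data_subcarriers(csi_64):
    """csi_64: complex array (packets, 64) -> (packets, 52) data subcarriers."""
    return csi_64[:, DATA_IDX + 32]          # shift indices to 0..63

print(len(DATA_IDX))  # 52
```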

No pre-processing is applied to the CSI amplitude, since any additional filtering can remove important information and affect the system's performance. If the simulation results or generated images were disappointing, a low-pass filter could be used for high-frequency reduction, as mentioned in [16]. To create the RGB images, the data values are normalized between 0 and 255 for all activities. We make a pseudocolor plot from the matrices, representing them as an array of colored faces in the x-y plane: cells are arranged in a rectangular array with colors specified by the values in C, the normalized CSI input matrix. MATLAB creates this plot by using the four points near each corner of C to describe each cell, and each element of C is linearly mapped into the RGB colormap. The generated RGB images are resized to the desired size (64 × 64). Examples of these images for each activity class are depicted in Figure 6. Since the images are not noisy, we do not need to apply denoising filters; moreover, additional denoising could cause information loss.
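The image-generation step can be approximated in Python as follows. The original plots were produced with MATLAB's pseudocolor plot, so the colormap chosen here (viridis) is an assumption standing in for MATLAB's default; only the normalize-map-resize pipeline mirrors the description above.

```python
import numpy as np
from matplotlib import cm
from PIL import Image

def csi_to_rgb(csi_amp, size=(64, 64)):
    """Map a CSI amplitude matrix to a resized RGB image via a colormap."""
    rng = csi_amp.max() - csi_amp.min()
    a = (csi_amp - csi_amp.min()) / (rng + 1e-9)         # scale to [0, 1]
    rgb = (cm.viridis(a)[..., :3] * 255).astype(np.uint8)  # drop alpha channel
    return Image.fromarray(rgb).resize(size)
```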

These images and the CSI data are then fed into the neural networks. Since CSI signals are typical time series with temporal dependency, future information at each step is crucial for HAR, and LSTMs cannot effectively analyze sequences longer than about 100 steps, we consider two methods. First, we convert the CSI signals to RGB images using pseudocolor plots and feed them into the 2D-CNN. By converting CSI to RGB images, the signal pattern for each activity can be seen at a glance, meaning that the pattern changes caused by human movements are depicted in the image.

**Figure 6.** Generated RGB images: (**a**) walk; (**b**) run; (**c**) fall; (**d**) lie down; (**e**) sit down; (**f**) stand up; (**g**) bend.

Therefore, in contrast to the LSTM, which has no information about future steps, the CNN can analyze the whole signal's alteration. Additionally, the CNN processes information in parallel, resulting in faster training than LSTMs with better accuracy. Another method to address the aforementioned LSTM problems is to apply a BLSTM to the CSI data. The BLSTM contains a forward and a backward layer and can analyze both past and future information by extracting information in the two directions. Since the sequential features learned by the BLSTM network have high dimensionality, and feature contributions and timesteps may vary for each activity, we exploit the attention layer to learn the relative importance of the features. Although the BLSTM has high potential for recognizing human activities, it needs greater memory bandwidth for processing and thus has a longer training time than the proposed 2D-CNN. The lower training time and computational complexity, along with the ability to observe the whole pattern alteration at a glance, make the novel image conversion idea and the 2D-CNN implementation the best choice among the methods discussed.

#### **5. Evaluation**

#### *5.1. Measurement Setup*

Raspbian Buster Lite 4.19.97 and the main branch of nexmon-csi [45] were installed on the Raspberry Pi 4. The Nexmon tool was configured as follows: channel 36, bandwidth 20 MHz, core 1, NSS mask 1, 4000 samples, 20 s. The AP's MAC address filter was set to make sure the Raspberry Pi would not connect to another AP on channel 36. The data collection was conducted from another device linked to the Pi over SSH, communicating over a separate 2.4 GHz network to avoid interference. The AP used is a TP-Link Archer C20 wireless router operating a 5 GHz WiFi network on channel 36 at 20 MHz. A PC is paired with the AP to generate traffic by watching a video on the internet or pinging, for which the Pi can capture CSI. We put the Raspberry Pi in monitor mode and collected the CSI data by sniffing. We collect 4000 samples in around 20 s, which results in a 200 Hz sample rate. The AP and the Pi were both 1 m above the ground and 3 m apart, to ensure an unobstructed signal path. The experimental environment is depicted in Figure 7. Each activity in the dataset was performed 20 times by three users of different ages. The activities are: fall, stand up, sit down, lie down, run, walk, and bend. CSI data were captured over 20 s, with the activity performed in the middle of this period; more precisely, the users remain mostly still at the start and the end of the capture. As the experiment was managed by the users, the time taken for the activity to begin and end may vary slightly, around 3 to 6 s (around 600 to 1100 of the 4000 total rows). The activity period is extracted according to the video of the activity and a stopwatch.

**Figure 7.** Experimental environment.

#### *5.2. Simulations Results*

The proposed deep learning architectures can discover more complex patterns in time-series data than hand-crafted feature techniques such as RF [14] and HMM [14]. As shown in Figure 8, the ConvLSTM [16] model slightly outperforms the FC network in [17] and DenseLSTM [15]. Our proposed models achieve better results than all of them without any extra data augmentation [15] or complex structures like ConvLSTM [16] and the FC network [17]; detailed information about these methods is available in Section 2. The dataset was split into training and test sets in a 75%/25% ratio. We implemented the four neural networks for classification in Keras, accelerated by a GeForce RTX 2060. The raw CSI amplitude data, a 52-dimensional vector, is fed into the 1D-CNN, LSTM, and attention-based BLSTM. The 1D-CNN model has two Conv1D layers with ReLU as the activation function, each followed by a MaxPooling layer. The LSTM network contains one LSTM hidden layer with 128 hidden units. The BLSTM model uses one BLSTM layer with 200 hidden nodes and one attention layer with 400 units. The converted RGB images are fed into the 2D-CNN with two Conv2D layers (with LeakyReLU), each followed by a MaxPooling layer. The structures of these networks are depicted in Figures 1–4.

**Figure 8.** Accuracy of different methods implemented on the dataset.

A CNN can detect simple patterns in the data, which are subsequently utilized to create more complex patterns within higher layers. A 1D-CNN is highly effective when features are derived from fixed-length segments of the dataset and the feature's position within the segment is not crucial, as in the analysis of time-sequence data (such as gyroscope, accelerometer, or CSI data). Since the LSTM network analyzes temporal dependencies in sequential data, it outperforms the 1D-CNN technique. As mentioned in Sections 1 and 2, LSTMs suffer from vanishing gradients and cannot access next-step information. For activities like sit down and lie down, which differ only in the final body movements, knowledge of the next steps is necessary. To address these problems, we converted the CSI data into RGB images for each activity and used them as inputs for the 2D-CNN; thus, we can access all the information about past and future steps with one look at the images. Additionally, we used the BLSTM with an attention layer to consider both past and next-step information and automatically learn the features' significance, assigning higher weights based on HAR performance. The attention-based BLSTM approach and the 2D-CNN achieve the best performance for the recognition of all activities, with an accuracy of around 95%. All of these comparisons are depicted in Figure 8.

Different activities produce different CSI values, resulting in different recognition accuracies [7]. We use a confusion matrix (or error matrix) to describe the performance of our proposed classifiers for each activity, in which the rows represent predicted classes and the columns represent actual classes. The activities with more significant body movement, i.e., fall, walk, and run, have higher recognition accuracy (see Figure 9), since they have more influence on the CSI characteristics. Furthermore, the fall activity is crucial, particularly for elderly healthcare services; our proposed 2D-CNN and BLSTM networks achieve 98% and 96% accuracy for this activity, making these models suitable for elderly care systems. Another observation is that the "lie down" activity has a recognition accuracy similar to "sit down" for most methods. The probable explanation is that these activities have a similar impact on the CSI values, since the start position is the same and only the final positions differ. With the attention-based BLSTM and the 2D-CNN, the system is less confused between these two activities: as shown in Figure 9, the model confuses them around 3% of the time for the BLSTM and 2% for the 2D-CNN, which is acceptable compared to the LSTM with 8% and the 1D-CNN with 9% confusion.

**Figure 9.** Confusion matrices of proposed methods: (**a**) LSTM; (**b**) 1D-CNN; (**c**) BLSTM; (**d**) 2D-CNN.
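To reproduce matrices in the orientation used here (rows = predicted, columns = actual), note that scikit-learn's `confusion_matrix` returns rows = actual, so a transpose is needed; the snippet below is a small illustrative sketch with toy labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 2, 2, 2])      # actual activity labels
y_pred = np.array([0, 1, 2, 2, 2, 1])      # classifier predictions

cm = confusion_matrix(y_true, y_pred).T    # rows = predicted, cols = actual
per_class_acc = np.diag(cm) / cm.sum(axis=0)   # recall per actual class
print(cm)
print(per_class_acc)
```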

Consumed time is another critical performance indicator, representing how much time the model spends in training and testing. Table 3 compares the time consumption (milliseconds per step) of six DL approaches: ConvLSTM [16], DenseLSTM [15], LSTM, BLSTM, 1D-CNN, and 2D-CNN. The proposed 2D-CNN has the shortest time and the highest accuracy (Figure 8), making it a better choice than the BLSTM, ConvLSTM [16], and DenseLSTM [15] in a fraction of the time. More precisely, a long-term input is processed sequentially through the LSTM gates, making LSTMs less hardware-friendly, as they require greater memory bandwidth to compute their parameters, in addition to time-consuming training. In contrast, the CNN extracts features using the convolution operation, which is easier to compute and faster to train. Furthermore, the CNN accuracy improved rapidly, while the BLSTM accuracy improved slowly over a longer training time.

**Table 3.** Consumed-time (milliseconds per step) comparison for different models.


#### **6. Conclusions**

Due to the ubiquity of WiFi devices, HAR based on wireless signals, including CSI, has attracted growing interest for smart house health monitoring systems. Only a few CSI datasets for the HAR task, collected with the Intel 5300 NIC or Atheros PCI chips, are currently available. This paper presented a CSI dataset for indoor HAR collected with a Raspberry Pi, one of the most accessible embedded boards. In this work, we designed four neural networks to conduct WiFi-based HAR with more than 87% accuracy on our dataset. We used a BLSTM network with an attention layer to address the LSTM's problems with future information. We also converted the CSI data to images using pseudocolor plots and fed them into a 2D-CNN to overcome the mentioned limitations of the LSTM. We showed that the idea of converting CSI to images can obtain a high accuracy of 95%, close to the BLSTM, which is one of the most successful DL algorithms for time-sequential analysis. Additionally, as the CNN processes different features in parallel, it is faster than the other methods and less computationally complex. The strong performance of the proposed methods indicates that the data collected by the Raspberry Pi can be employed effectively in smart house HAR. The proposed methods can boost elderly health monitoring systems, since they meet the requirements for acceptable recognition accuracy and speed for the most commonly performed actions in this task.

Nevertheless, this is the first version of our public dataset, and we plan to improve it by investigating different environments and scenarios. In the future, we will study human-to-human interactions and the CSI changes in multi-user, multi-environment scenarios. Since people of different ages may perform activities differently according to their physical ability, we collected CSI data from three different ages (an adult, a middle-aged person, and an elderly person) and will try to study other ages, including children and teenagers. Additionally, we will investigate activities with different initial movements, such as standing + walking and running + walking.

**Author Contributions:** Conceptualization, P.F.M. and S.A.G.; methodology, P.F.M. and S.A.G.; software, P.F.M.; validation, P.F.M. and M.N.; formal analysis, P.F.M. and R.S.; investigation, P.F.M.; resources, S.A.G. and P.F.M.; data curation, P.F.M.; writing—original draft preparation, P.F.M.; writing—review and editing, S.A.G., R.S. and M.N.; visualization, P.F.M.; supervision, S.A.G. and R.S.; project administration, S.A.G. and R.S.; funding acquisition, S.A.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available in GitHub: https: //github.com/parisafm/CSI-HAR-Dataset (accessed on 27 October 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Evaluations of Deep Learning Approaches for Glaucoma Screening Using Retinal Images from Mobile Device**

**Alexandre Neto 1,2, José Camara 2,3 and António Cunha 1,2,\***


**Abstract:** Glaucoma is a silent disease that leads to vision loss or irreversible blindness. Current deep learning methods can help glaucoma screening by extending it to larger populations using retinal images. Low-cost lenses attached to mobile devices can increase the frequency of screening and alert patients earlier for a more thorough evaluation. This work explored and compared the performance of classification and segmentation methods for glaucoma screening with retinal images acquired by both retinography and mobile devices. The goal was to verify the results of these methods and see whether similar results could be achieved using images captured by mobile devices. The classification methods used were the Xception, ResNet152 V2 and Inception ResNet V2 models. The models' activation maps were produced and analysed to support the glaucoma classifier predictions. In clinical practice, glaucoma assessment is commonly based on the cup-to-disc ratio (CDR) criterion, a frequent indicator used by specialists. For this reason, additionally, the U-Net architecture was used with the Inception ResNet V2 and Inception V3 models as the backbone to segment the OD and cup and estimate the CDR. For both tasks, the performance of the models reached close to that of state-of-the-art methods, and the classification method applied to a low-quality private dataset illustrates the advantage of using cheaper lenses.

**Keywords:** deep learning; glaucoma screening; retinal images; segmentation; classification

#### **1. Introduction**

Glaucoma is one of the main causes of vision loss, mainly due to increased fluid pressure and improper drainage of fluid in the eye. In 2013, it was estimated that 64.3 million people aged 40–80 years were diagnosed with glaucoma worldwide. This disease is expected to reach nearly 76 million by 2020 and 111.8 million by 2040. The prevalence of glaucoma is 2.5% for people of all ages and 4.8% for those above 75 years of age [1]. Glaucoma is an asymptomatic condition, and patients do not seek medical assistance until a late stage, making the diagnosis frequently too late to prevent blindness. Population-level surveys suggest that only 10–50% of people with glaucoma are aware that they have the disease. As early diagnosis and treatment of the condition can prevent vision loss, glaucoma screening has been tested in numerous studies worldwide [2]. An ophthalmologist can directly examine the eye with an ophthalmoscope or can examine a fundus image captured with a fundus camera, as can be seen in Figure 1. The examination of these fundus images is important because the ophthalmologist can record indicators and parameters related to cupping to detect glaucoma, such as disc diameter, the thickness of the neuroretinal rim (decreasing in the order inferior (I) > superior (S) > nasal (N) > temporal (T) (ISNT rule)), peripapillary atrophy, notching and cup-to-disc ratio (CDR), with this last indicator being the most used measurement by specialists [3–5].

**Citation:** Neto, A.; Camara, J.; Cunha, A. Evaluations of Deep Learning Approaches for Glaucoma Screening Using Retinal Images from Mobile Device. *Sensors* **2022**, *22*, 1449. https://doi.org/10.3390/s22041449

Academic Editor: Ivan Miguel Serrano Pires

Received: 29 December 2021 Accepted: 10 February 2022 Published: 14 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Figure 1.** Representation of capturing an image of the interior surface of the eye (retina).

Usually, glaucoma is diagnosed on the basis of the patient's medical history, measures of intraocular pressure (IOP), a visual field loss test and manual evaluation of the optic disc (OD) using ophthalmoscopy to examine the shape and colour of the optic nerve. The examination of the OD is important since glaucoma begins to form a cavity and develops an abnormal depression/excavation at the front of the nerve head, called the optic cup, which, in advanced stages, facilitates the progression of glaucoma, blocking the OD (Figure 2) [6–8].

**Figure 2.** Retinal image from the normal and glaucomatous eye. **Green line:** OD boundary; **red line:** cup boundary.

After gathering the retinal images, they must be inspected and analysed to look for indicators of ophthalmologic pathologies. These diagnostic systems offer the potential to be used on a large scale for the early diagnosis and treatment of glaucoma. However, they require subjective evaluation by qualified experts, and it is time-consuming and costly to inspect each retinal image manually. In this regard, deep learning (DL) algorithms help in the automatic screening of glaucoma and assist ophthalmologists in achieving higher accuracy in glaucoma screening, especially in repetitive tasks [3,6,9].

The main objective is to develop a system that allows the screening of low-resolution retinal images captured by a low-cost lens attached to a smartphone. To accomplish this, secondary objectives must be achieved. In this study, state-of-the-art DL methods were explored, tested and applied to high-resolution public databases and then applied to a private database containing low-quality images captured through a low-cost lens attached to a mobile device. These classification methods provide activation maps that allow the model's decision to be analysed and discussed. Segmentation methods were applied as well, using the CDR to classify images after OD and cup segmentation. These segmentation methods can help an ophthalmologist in a subjective and difficult task, enabling more consistent results that are similar to a clinician's segmentations. For this purpose, state-of-the-art works on classification and segmentation methods for glaucoma screening were reviewed.

#### **2. Literature Review**

DL techniques have yielded good results in research on glaucoma screening due to the development of technologies to detect, diagnose and treat glaucoma. The main approach is to conduct screening through computer-aided diagnosis (CAD) systems that use DL to learn and train models through previously labelled available data, identifying patterns and making decisions with minimal human intervention [10]. This section surveys key works with methods using automatic classification models and classification methods using segmentation models.

#### *2.1. Classification Methods*

The use of classification methods for screening glaucoma lesions in retinal images is another well-established approach. An overview of the best methods recently published is provided in the following.

The study of Gómez-Valverde [2] used the VGG19, GoogLeNet (also known as Inception V1), ResNet50 and DENet models. With these models, Valverde compared the performance between transfer learning and training from scratch. To confirm the performance of VGG19, 10-fold cross-validation (CV) was applied. Valverde used three different databases: RIM-ONE and DRISHTI-GS (public) and Esperanza (private dataset). In the RIM-ONE database, the images classified as suspect were considered to be glaucomatous for the study. The best result was obtained with the VGG19 model using transfer learning.

Diaz-Pinto [11] applied five different ImageNet pre-trained models (VGG16, VGG19, InceptionV3, ResNet50 and Xception) for glaucoma classification and used a 10-fold CV strategy to validate the results. Five databases were used for this work: ACRIMA, HRF, DRISHTI-GS, RIM-ONE and Sjchoi86-HRF. The images were cropped around the OD using a bounding box of 1.5 times the OD radius. All models passed the AUC threshold of 0.96, indicating excellent results.

Serener et al. [12] selected the ResNet50 and GoogLeNet models and trained them with two public databases: a database from Kim's Eye Hospital (total of 1542 images, including 786 photos from normal patients and 756 from glaucoma patients) and RIM-ONE r3. The database from Kim's Eye Hospital was used to train the two models, and for the performance evaluation, the models were tested with the RIM-ONE r3 database. With GoogLeNet, Serener obtained better results for early-stage glaucoma than for the advanced glaucoma stage.

The work performed by Norouzifard [13] used two DL models, namely, VGG19 and Inception ResNet V2. These two models were pre-trained and then fine-tuned. For this work, two databases were used: one from the University of California Los Angeles (UCLA) and another publicly available one called high-resolution fundus (HRF). From the UCLA database, they randomly selected 70% of the images for training, 25% for validation and the remaining 5% for testing. To solidify the work, the models were then re-tested with the HRF database. The Inception ResNet V2 model with the UCLA database obtained a specificity and sensitivity above 0.9, even when re-tested with the HRF database.

The study by Sreng [5] was performed in two stages: first, DeepLabv3+ detected and extracted the OD from the entire image, and then three types of convolutional neural networks (CNNs) were used to classify the segmented OD region as glaucomatous or normal. After the image was cropped around the OD, 11 ImageNet pre-trained models were used: AlexNet, GoogLeNet, InceptionV3, Xception, ResNet-50, SqueezeNet, ShuffleNet, MobileNet, DenseNet, InceptionResNet and NasNet-Large. This method was trained with five public databases: REFUGE, ACRIMA, ORIGA, RIM-ONE and DRISHTI-GS. The results showed that DenseNet with the ACRIMA database had the best performance, followed by MobileNet with the REFUGE database.

#### *2.2. Segmentation Methods*

Several methods have been published in the literature on segmenting the OD and the cup disc, mostly using adaptations of U-Net. The following presents an overview of the best methods recently published.

Al-Bander [14] proposed a method with a DenseNet incorporated into an FCN with a U-shaped architecture. Al-Bander's approach used five databases of colour fundus images: ORIGA, DRIONS-DB, DRISHTI-GS, ONHSD and RIM-ONE. For pre-processing, only the green channel of the colour images was considered, since the other colour channels contain less useful information. The images were then cropped to isolate the ROI. For OD segmentation, the model achieved better Dice and intersection-over-union (IoU) results with the DRISHTI-GS database than with RIM-ONE; the same held for cup segmentation, but with lower Dice and IoU values than for OD segmentation.

In the work of Singh [15], a conditional generative adversarial network (cGAN) model was proposed to segment the OD. The cGAN is composed of a generator and a discriminator and can learn statistically invariant features, such as the colour and texture of an input image, and segment the region of interest. For this method, skip connections were used for concatenating the feature maps of a convolutional layer with those resulting from the corresponding deconvolutional layer. To train and evaluate the model, the DRISHTI-GS and RIM-ONE databases were used, with the size of the images reduced to 256 × 256 and the value of each pixel normalised between 0 and 1. For OD segmentation, the model for both databases achieved values above 0.9 for accuracy, Dice and IoU.

Qin [16] proposed neural network constructs utilising the FCN and the Inception building blocks of GoogLeNet. The FCN is the main body of the deep neural network architecture, to which several convolution kernels for feature extraction after deconvolution were added, based on the Inception structure in GoogLeNet. Qin's experiments used two databases: REFUGE and one from the Second Affiliated Hospital of Zhejiang University School of Medicine. For this technique, the authors used a fully automatic method based on the Hough circle transform that recognises and cuts the image to obtain the ROI. In the segmentation of the OD and the cup, the model obtained values above 0.9 for Dice and IoU.

In the work by Yu and others [17], a modified U-Net with a pre-trained ResNet-34 model was developed. This work comprised two steps: first, a single-label modified U-Net model was applied to segment an ROI around the OD, and then the cropped image was used in a multi-label model whose objective was to segment the OD and cup simultaneously. In Yu's study, the RIGA database was used to train and evaluate the CNN; then, to verify robust performance, the model trained on RIGA was applied to the DRISHTI-GS and RIM-ONE r3 databases. All of the database images were pre-processed with contrast enhancement, followed by resizing to 512 × 512. In this method, the segmentation of the OD and the cup produced better results with DRISHTI-GS than with RIM-ONE r3.

#### Cup-to-Disc Ratio

Glaucoma progression is assessed based on the ratio between OD and cup measurements. The cup-to-disc ratio (CDR) is a clinical measure that compares the cup to the disc and is currently determined manually, which limits its potential in mass screening. Manual segmentation depends on the experience and expertise of the ophthalmologist, so it ends up being subjective and differing between observers [18]. The CDR is commonly used in clinics to classify glaucoma, and specific patterns of change in the region of the OD and cup are used as evidence of glaucoma or glaucoma progression, along with other clinical tests such as intraocular pressure and visual field acuity [18,19].

Accurate segmentation of the OD and cup is essential to a reliable CDR measurement, and reliance on manual effort restricts the deployment of CDR for mass screening, which is fundamental in the detection of early glaucoma for effective medical intervention [18]. Machine learning approaches automatically segment the OD and cup regions and then measure the CDR or extract features that may help to determine whether or not the images contain glaucoma, as can be seen in Figure 3. A higher CDR indicates a higher risk of glaucoma [5,20].

**Figure 3.** CDR: (**a**) VCDR; (**b**) HCDR; (**c**) ACDR.

Different parameters can be measured for the CDR to determine the cupping and assess the eye for the presence of glaucoma, such as the ratio of the horizontal cup diameter to the horizontal OD diameter, of the vertical cup diameter to the vertical OD diameter, and of the area of the cup to the area of the OD [19]. If the vertical CDR (VCDR) and horizontal CDR (HCDR) are more than 0.5, the eye is considered to be at risk of abnormality; otherwise, it is considered a normal eye [21]. The VCDR and HCDR equations are presented in Equations (1) and (2):

$$\text{VCDR} = \frac{V_{\text{cup}}}{V_{\text{disc}}}; \tag{1}$$

$$\text{HCDR} = \frac{H_{\text{cup}}}{H_{\text{disc}}}. \tag{2}$$

Alternatively, considering the criteria by Diaz [21], the assessment can be performed through the area CDR (ACDR) using a threshold of 0.3, as presented in Equation (3):

$$\text{ACDR} = \frac{A_{\text{cup}}}{A_{\text{disc}}}. \tag{3}$$
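Given binary masks for the cup and the disc, the three ratios in Equations (1)–(3) reduce to a few lines of code; the sketch below assumes the masks are aligned boolean arrays of the same shape (function and variable names are illustrative).

```python
import numpy as np

def vh_area(mask):
    """Vertical extent, horizontal extent, and area of a boolean mask."""
    v = np.any(mask, axis=1).sum()     # rows containing the structure
    h = np.any(mask, axis=0).sum()     # columns containing the structure
    return v, h, mask.sum()

def cdr(cup_mask, disc_mask):
    v_c, h_c, a_c = vh_area(cup_mask)
    v_d, h_d, a_d = vh_area(disc_mask)
    return v_c / v_d, h_c / h_d, a_c / a_d   # VCDR, HCDR, ACDR

# A VCDR/HCDR above 0.5 (or an ACDR above 0.3) flags the eye as at risk [21].
```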

Diaz [21] presented an automatic algorithm that uses several colour spaces and the stochastic watershed transformation to segment the cup and then obtain handcrafted features, such as the VCDR, HCDR and ACDR. Diaz's method was evaluated on 53 images, obtaining a specificity and sensitivity of 0.81 and 0.87, respectively.

After segmentation, Al-Bander [14] calculated the VCDR with varying thresholds and compared the results with an expert's glaucoma diagnosis, achieving an AUC of 0.74, very close to the 0.79 achieved using ground-truth segmentation. After that, the same approach was used, but with HCDR achieving an AUC of 0.78, close to the 0.77 achieved by the expert's annotation and higher than the results obtained with the VCDR.

#### **3. Materials and Methods**

The model pipeline in this work is illustrated in Figure 4. In the first task (Task 1: Data preparation), the data pre-processing and organisation processes are described. In the second task (Task 2: Glaucoma screening), the different glaucoma classification methods are explained, based on classification models alongside the respective activation maps and on OD and cup segmentation models for CDR calculation; the models and hyper-parameters used for each approach are described in the model setups. In the third and last task (Task 3: Evaluation), the models are evaluated based on each approach's glaucoma classification.

**Figure 4.** The model pipeline for glaucoma screening.

#### *3.1. Data Preparation*

Three public databases were used: RIM-ONE r3, DRISHTI-GS and REFUGE. The RIM-ONE r3 database has a balanced proportion of normal and glaucomatous samples, with 85 healthy images and 74 glaucomatous images at a resolution of 2144 × 1424 pixels. The images in this database vary significantly in illumination and contrast quality: some are low-light images, making it difficult to identify the OD and cup, while others have good illumination and contrast, helping to identify the retinal components. DRISHTI-GS has a larger representation of glaucoma samples (70 images) than healthy samples (31 images), and the images have a resolution of 2896 × 1944 pixels. Compared to RIM-ONE r3, DRISHTI-GS images have more homogeneous illumination and contrast, which helps to identify and segment the OD and the cup. The REFUGE database is composed of 400 images with a resolution of 2124 × 2056 pixels, but we only had access to the validation set, which has a lower representation of glaucoma samples compared to healthy samples (40 glaucomatous images and 360 normal images).

For each dataset, the retina images were divided into a training set (70%), a validation set (15%) and a test set (15%). The models were trained with each database separately for the segmentation and classification approaches. The respective OD and cup masks are available in all of these databases. In the RIM-ONE r3 database, the images classified as suspect were considered glaucomatous, as was also the case in the work by Gómez-Valverde [2]. Since we had little data, the three databases were merged into a larger database (called K-Fold CVDB, standing for K-Fold Cross-Validation DataBase) to perform K-fold cross-validation (CV). The K-Fold CVDB was divided into 5 similar folds for the cross-validation, and one set was left out to test and validate each model and verify its robustness after training as the final step. The data organisation process is explained in Figure 5.

All images used to train and test the different models were normalised and centralised on the OD and then cropped to focus the CNNs on the ROI. The cropped images have 512 × 512 resolution and did not suffer from changes in illumination or contrast. Augmentation processes were applied to the databases to avoid overfitting the model, such as rotations (range = 0.2), zooms (range = 0.05), shifts (width and height shift range = 0.05) and horizontal flips, as sketched below.
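These augmentation settings map onto Keras' `ImageDataGenerator` roughly as follows; the rotation value is taken verbatim from the text (range = 0.2 is reported without units, whereas `rotation_range` expects degrees), so treat the exact numbers as reported rather than tuned here.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=0.2,        # rotations (range = 0.2, as reported)
    zoom_range=0.05,           # zooms
    width_shift_range=0.05,    # horizontal shifts
    height_shift_range=0.05,   # vertical shifts
    horizontal_flip=True,      # horizontal flips
)
# Typical use: augmenter.flow(train_images, train_labels, batch_size=2)
```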

**Figure 5.** Data organisation for training the models.

#### *3.2. Glaucoma Screening*

For both approaches, different models were trained with each database separately, and then CV was performed (more precisely, leave-one-out K-fold CV). For this step, the data were partitioned into K equal-sized subsets. K-1 subsets were used to train the CNN, and the remaining set was used for testing. Additionally, the leave-one-out dataset was used for testing the model at the end, giving a more thorough evaluation of each model's performance since these data were not used to train or test any model. All models were fine-tuned either for image classification or for OD and cup segmentation. Fine-tuning is a procedure based on transfer learning to optimise and minimise the error through the weight initialisation of the convolutional layers using pre-trained CNN weights with the same architecture. The exception is the layer whose number of nodes depends on the number of classes. After the weight initialisation, in the last fully connected layer, the network can be fine-tuned, starting with tuning only the last layer and then tuning the remaining layers, incrementally including more layers in the update process until achieving the desired performance. The early layers learn low-level features, and the late layers learn high-level features specific to the problem in the study [22,23]. For all of the classification and segmentation models used to detect glaucoma, ImageNet pre-trained weights were used. All models selected were based on the best results reported in the reviewed literature.

#### 3.2.1. Classification Methods

Classification used the same principles as segmentation, using pre-trained models with good results inspired by state-of-the-art works. These models were trained with transfer learning using ImageNet weights. First, the four additional layers were pre-trained, freezing the remaining layers before the new ones; after that, the models were fine-tuned, unfreezing the first layers and training all layers present in the models. We selected the Xception (C1), ResNet 152 V2 (C2) and Inception ResNet V2 (C3) models.

Xception is an extension of the Inception architecture and stands for Extreme Inception. It replaces the standard Inception modules with depthwise separable convolutions called "separable convolution" in frameworks such as TensorFlow and Keras. In the Inception module, filters of different sizes and dimensions are concatenated into a single new filter, acting as a "multi-level feature extractor" by computing 1 × 1, 3 × 3 and 5 × 5 convolutions within the same module of the network. Based on these modules, a more complex and deeper architecture compared to all previous CNN architectures was developed [24]. Depthwise convolution is a spatial convolution performed independently over each channel, followed by a pointwise convolution, i.e., a 1 × 1 convolution. This architecture's premise is that cross-channel correlations and spatial correlations are sufficiently decoupled to be mapped separately [25].

ResNet is a deep residual network developed with the idea that identity shortcut connections allow for increasing the depth of convolutional networks while avoiding the gradient degradation problem. These shortcut connections help gradients flow easily in the backpropagation step, which leads to increased accuracy during the training phase. ResNet is composed of 4 blocks, each containing many convolutional blocks. Each convolutional operation has the same format in all versions of ResNet (50, 101 and 152), with the only difference being the number of subsequent convolutional blocks. This deep residual network exploits residual blocks to overcome gradient degradation [23,26].

Inspired by the performance of ResNet, hybrids of the Inception and ResNet models were developed. There are two sub-versions of Inception ResNet, i.e., V1 and V2. Inception ResNet V1 has a computational cost similar to that of Inception V3, and Inception ResNet V2 is similar in cost to Inception V4, with the only difference being in the hyper-parameter settings. These hybrids introduce residual connections that use the output of the inception module's convolution operation as the input of the next module. Therefore, the input and output after convolution must have the same dimensions. To match the dimensionality after convolution, 1 × 1 convolutions were used after the original convolutions [24].

To train all of these models, images and their respective labels (normal or glaucoma) were used as inputs, and the output was the probability of each class, normal or glaucoma.

#### 3.2.2. Segmentation Methods

The availability of a huge dataset such as ImageNet, with a high capacity to train models, has led to a large variety of pre-trained models that can serve as the feature encoder in a CNN. The encoder in a U-Net model is a stack of convolution layers combined with activation functions and pooling layers, an architecture frequently employed for feature extraction, so it can adopt a pre-trained model. For the segmentation approach in glaucoma screening, the pre-trained models selected were Inception ResNet V2 and Inception V3 (for simplicity, called S1 and S2, respectively). These pre-trained models are used as feature encoders in a modified U-Net that takes the retina image as input and the respective masks of the OD and cup for training. The output prediction is a mask of the OD or cup segmentation, which is then used to measure indicators of glaucoma presence, such as the CDR. Morphological post-processing is applied to the predicted mask to remove holes and anomalies in the prediction, if present.
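The paper does not name the implementation it used; as an illustration, the third-party `segmentation_models` package provides exactly this kind of U-Net with a pretrained encoder:

```python
# Assumes the third-party `segmentation_models` package; depending on the
# version, sm.set_framework("tf.keras") may be required first.
import segmentation_models as sm

model = sm.Unet(
    backbone_name="inceptionresnetv2",   # S1; use "inceptionv3" for S2
    encoder_weights="imagenet",          # ImageNet transfer learning
    encoder_freeze=True,                 # frozen encoder for pre-training
    classes=1,
    activation="sigmoid",                # binary mask: OD (or cup) vs background
)
model.compile("adam", loss="binary_crossentropy")
```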

#### 3.2.3. Model Setups

**Segmentation models:** The segmentation models were pre-trained for 20 epochs and fine-tuned for 100 epochs with a batch size of 2 for the training and validation sets. The encoder weights were frozen for the pre-training step; for fine-tuning, the encoder layers were unfrozen, and the model was trained again to update all weights. The learning rate started at 10<sup>−4</sup> with the Adam optimiser, and binary cross-entropy was used as the loss function. To prevent training from stalling, a callback reduces the learning rate by a factor of 0.90 when it plateaus, and only the best training weights are saved.
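In Keras terms, this setup corresponds roughly to the following (`model` and the data arrays are placeholders, and the callback patience is an assumption; the paper states only the factor of 0.90 and the saving of the best weights):

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy")

callbacks = [
    # Reduce the learning rate by a factor of 0.90 when validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.90, patience=5),
    # Keep only the best weights seen during training.
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", save_best_only=True),
]

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=2, callbacks=callbacks)
```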

**Classification models:** The classification models were pre-trained for 20 epochs and fine-tuned for 200 epochs with a batch size of 2 for the training and validation sets. The learning rate started at 10<sup>−4</sup> with the Adam optimiser, and binary cross-entropy was used as the loss function; to prevent training from stalling, a callback reduces the learning rate by a factor of 0.90 when it plateaus. All of these models are available in TensorFlow Core and were loaded from there. The classification layer (last/dense layer) was removed, and 4 new layers were added: a global average pooling 2D layer, a dropout layer (dropout = 0.5), a batch normalisation layer and, finally, a dense layer with 2 outputs with softmax as the activation function (2 outputs for the 2 classes, glaucoma and normal).
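A sketch of this head, with C1 (Xception) as an example (the 512 × 512 input shape is an assumption, and the dataset variables are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.Xception(include_top=False, weights="imagenet",
                                      input_shape=(512, 512, 3))
base.trainable = False          # pre-training phase: only the new head learns

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.BatchNormalization(),
    layers.Dense(2, activation="softmax"),   # glaucoma / normal
])
model.compile(tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
# After the 20 pre-training epochs: set base.trainable = True, re-compile,
# and fine-tune all layers for the remaining epochs.
```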

#### *3.3. Model Evaluation*

The metrics for the evaluation of the segmentation model were the intersection over union (IoU) and the Dice coefficient.

The IoU metric measures the accuracy of an object detector applied to a particular database. It measures the common area between the predicted (P) and expected (E) regions, divided by the total area of the two regions, as presented in Equation (4):

$$\text{IoU} = \frac{\text{Area}(\text{P} \cap \text{E})}{\text{Area}(\text{P} \cup \text{E})} \tag{4}$$

The Dice coefficient is a statistic used to gauge the similarity between two samples (in this case, between predicted and reference (Ref) segmentation). TP is true positives, FP is false positives and FN is false negatives, as can be seen in Equation (5):

$$\text{Dice} = \frac{\text{2TP}}{\text{2TP} + \text{FP} + \text{FN}} \tag{5}$$

The CDR equations are described in the previous section in Equations (1)–(3). For the evaluation of the classification models, other metrics were used. The accuracy (Acc) (6) is the fraction of correct predictions by the model.

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{6}$$

where TP is true positives, TN is true negatives, FP is false positives and FN is false negatives. Sensitivity (Sen) (7) measures the proportion of positives that are correctly identified, and specificity (Sep) (8) measures the proportion of negatives that are correctly identified.

$$\text{Sen} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{7}$$

$$\text{Sep} = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{8}$$

The F1-score (F1) (9) indicates the balance between precision and recall, where precision is the number of TP divided by the number of all positives, and recall is the number of TP divided by the number of all samples that should have been identified as positive. The F1-score is the harmonic mean of the two.

$$\text{F1 Score} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}} \tag{9}$$
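For reference, Equations (4)–(9) translate directly into code over binary masks or label vectors; note that, for binary problems, Equations (5) and (9) coincide. A sketch (zero-denominator guards omitted for brevity):

```python
import numpy as np

def metrics(y_true, y_pred):
    """y_true, y_pred: binary arrays (masks or labels, 1 = positive class)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "IoU":  tp / (tp + fp + fn),              # Equation (4) on binary masks
        "Dice": 2 * tp / (2 * tp + fp + fn),      # Equation (5)
        "Acc":  (tp + tn) / (tp + tn + fp + fn),  # Equation (6)
        "Sen":  tp / (tp + fn),                   # Equation (7)
        "Sep":  tn / (tn + fp),                   # Equation (8)
        "F1":   2 * tp / (2 * tp + fp + fn),      # Equation (9)
    }
```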

#### **4. Results and Discussion**

The results are organised in the same way that the methodology is presented in the workflow. For both methods, glaucoma screening was performed by training the models with each database separately and with merged data with the K-Fold CVDB. The results are discussed and compared with the results published by the scientific community. In the end, both methods are compared to assess their capability for glaucoma screening and determine how much they can contribute to supporting this important and challenging task.

#### *4.1. Glaucoma Screening Based on Classification Methods*

First, for the classification approach, each database was used to train each model separately to determine which model performs best and which database has the quality to produce the best model results. The drawback of this methodology is that training the models on each database separately reduces the amount of data the models can learn from, since fewer data are available for training and validation. The results are presented in Table 1.


**Table 1.** Results for the models trained separately on each database.

Overall, the database that showed the best results was REFUGE for the C1 and C3 models, with an AUC close to one and with sensitivity and specificity close to the results presented in Table 2. The C1 and C3 models outperformed those reported by Sreng [5], who also used the REFUGE database with an ensemble of pre-trained transfer-learning networks.


**Table 2.** Results of state-of-the-art glaucoma classification methods.

The RIM-ONE and DRISHTI-GS databases showed lower AUC values than REFUGE and than the results reported by Sreng [5], who also evaluated each database separately. The significantly better results suggest that the quality of the samples in the REFUGE database is superior to that of the others, with homogeneous contrast, illumination and resolution across all samples, in contrast to RIM-ONE and DRISHTI-GS, whose images vary considerably in these same factors.

As mentioned in the methodology, the K-fold CV technique was used to train the models on K − 1 folds, with the remaining fold used for testing. The K-Fold CVDB was divided into five folds, with an extra set left out to test at the end (leave-one-out method) in all iterations of each model. Of these five folds, four were used to train the model and the other to evaluate it, rotating the test and training folds in each iteration. The results of the classification in each fold are presented in Table 3.


**Table 3.** Results for the models with K-fold CV for each test set of each fold with the mean results of the models for the 5 folds and the standard deviation.

Compared to the results presented above, when the models were trained with each database separately, the K-fold CV technique showed an immediate improvement, indicating a direct correlation between the amount of training data and the quality of classification. All of the models showed similar results, with slightly better performance for the C1 model. These models outperformed most of the state-of-the-art works mentioned previously, with some exceptions (Diaz-Pinto [11] and some results of Sreng [5]), likely owing to the smaller amount of training data available to our models. To evaluate the robustness of the models, they were tested with the test set omitted from training, and the results are shown in Table 4.

**Table 4.** Results for the models in K-fold CV for the leave-one-out test set with the respective mean and standard deviation for the 5 folds of each model.


The results of the classification of the leave-one-out set decreased compared to those discussed above. Nevertheless, most of the models yielded better results than most of the state-of-the-art works, with the same exceptions as those noted previously. The most significant decrease was in the sensitivity, showing a lack of representation and a high rate of false negatives for glaucoma samples. The evaluation with the leave-one-out dataset demonstrated that the best of the three models is C1, as mentioned previously, with the smallest decrease in every metric among all of the models.

The classification models can be a "black-box": extremely hard to explain and hard for non-experts to understand. Explainable artificial intelligence (AI) approaches are methods and techniques that can explain to humans why DL models arrived at a specific decision, creating transparency, interpretability and explainability for the output of the neural networks. For a visual interpretation of the output to supplement the results of the classification models, activation maps (Figure 6) were created that show the regions of the input images that cause the CNNs to classify the samples as glaucomatous or normal, thus helping clinicians to understand the reason for the output classification.

Gradient-weighted class activation mapping (Grad-CAM) uses the gradients of any target concept flowing into the final convolutional layer to produce a coarse localisation map highlighting the regions of the image that are important for predicting the concept. These heatmaps can reveal important indicators or factors behind the classification. The models used here focussed mostly on the centre of the OD, where the cupping zone, which is highly correlated with glaucoma, is located. The larger the cup area, the more suspicious the sample, and the more probable that the patient has glaucoma. This type of indicator can help ophthalmologists to make a better and more reliable decision, one such indicator being the CDR. To calculate this ratio, the OD and cup must first be segmented, and the trustworthiness of the screening depends on how well they are segmented. The segmentation procedure is time-consuming and inconsistent when performed manually, so to facilitate a more consistent segmentation, we present models for segmentation with a consequent glaucoma classification based on CDR calculation.
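A compact TensorFlow sketch of Grad-CAM follows (the layer name and preprocessing are placeholders; this mirrors the standard formulation, not necessarily the authors' exact code):

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index):
    """Return a coarse heatmap in [0, 1] for one preprocessed image."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])   # batch of one
        score = preds[:, class_index]                    # target class score
    grads = tape.gradient(score, conv_out)               # d(score)/d(features)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # pool grads per channel
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                             # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # normalise to [0, 1]
```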

**Figure 6.** Activation maps of the classification models of the K-fold CV (left, original; right, heatmap). The importance is indicated by the emphasized shades in the following order, from the most important to the least important: red, orange, yellow, green and blue.

#### *4.2. Glaucoma Screening Based on Segmentation Methods*

The OD and cup were segmented by two different CNNs, the different CDRs were calculated, and glaucoma was then classified based on the CDR. This requires the reference (Ref) masks of each database, with segmentation annotations made by clinicians, which were available in the databases selected for this work. Finally, the segmentation and glaucoma screening were compared with the reference masks using the same CDR-based classification criteria. To perform the segmentation, two different models, S1 and S2, were used. First, the OD segmentation results are presented, followed by the cup segmentation results, and finally, the glaucoma classification based on the CDR calculation with the segmentation masks is provided.

#### 4.2.1. OD Segmentation

The procedure in the segmentation methods is the same as the one presented for the classification approach, with the segmentation performed on each database separately and on the K-Fold CVDB. For the K-fold CV, the means of the IoU and Dice of the five folds in each model were obtained. The final mask is the intersection (agreement) of at least four of the five masks from the iterations of each model, used to compute the final CDRs. The results for OD segmentation are presented in Table 5.
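The agreement rule can be expressed in a few lines (a numpy sketch):

```python
import numpy as np

def consensus_mask(fold_masks, min_agreement=4):
    """fold_masks: five binary masks predicted by one model across the folds.
    Keep the pixels where at least `min_agreement` of the masks agree."""
    votes = np.stack(fold_masks).astype(np.uint8).sum(axis=0)  # (H, W) vote map
    return (votes >= min_agreement).astype(np.uint8)
```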

At first view, the segmentation results on every dataset are very similar to those of all of the compared state-of-the-art methods, with slight but non-significant differences that do not change the outcome of the CDR calculation. This can be explained by the fact that OD segmentation is a relatively easy task because of the visible contrast and outline between the OD and the retina, which facilitate identification and segmentation by the neural network. The K-fold CV showed decreases in the IoU and Dice in both models compared to the other results, since they represent the mean of five iterations in each model. This can affect the final results, with divergence in the agreement of OD segmentation. However, this difference was not large enough to jeopardise the CDR calculation, at least in most of the samples. The two models had similar results, with slightly better performance for S1. After OD segmentation, the procedure was repeated but with different CNN models, this time training the model to segment the cup.


**Table 5.** Results for OD segmentation for each model and the K-Fold CVDB. For comparison, the results from the literature review are presented as well. S1 is Inception ResNet V2, and S2 is Inception V3.

#### 4.2.2. Cup Segmentation

For cup segmentation, the same models were used, but this time, the network was trained to localise and segment the excavation region inside the OD. Contrary to the previous task, cup segmentation is much harder, since the contrast between the excavation zone and the OD is low (at least not as high as the contrast between the OD and the retina). The results from the two models are presented in Table 6 with the same structure as the one presented for OD segmentation.

**Table 6.** Results for cup segmentation for each model and the K-Fold CVDB. For comparison, the results from the literature review are represented as well.


Overall, the results reached the same baseline as the state-of-the-art methods. When directly compared on the RIM-ONE database, the S1 model had better results than S2, with better Dice than Al-Bander [14]; its IoU and Dice were only worse than those of Yu's [17] work. With DRISHTI-GS, the two models had better IoU and Dice than Al-Bander [14] and only a slight difference in IoU and Dice compared to the remaining works, with an overall better performance observed for the S1 model. In the REFUGE database, the results from our models and Qin's [16] work are very similar, with a minor difference in the Dice, and as observed in the segmentation of the other databases, the S1 model again had better results.

In the K-fold CV, both models had a major decrease in performance compared to the other works and the performance of the same models using each database separately. As in the previous verification, the S1 model continued to produce better results. Compared to OD segmentation, the IoU and Dice were much lower, which is a consequence of these coefficients being too sensitive to small errors when the segmented object is small and not sensitive enough to large errors when the segmented object is larger.

The results of OD and cup segmentation were used to calculate the CDRs as an indicator of glaucoma presence. Reference segmentation by clinicians was used as the ground truth, but it is not an absolute truth, since the segmentation process can be subjective, and the results can differ between clinicians. Thus, the segmentation predicted by the CNNs can sometimes cause the misclassification of images but can also be considered another opinion, especially in cup segmentation, since the perimeter of the cup is not as well delimited and visible as that of the OD. After the segmentation of the OD and cup, the CDRs were calculated to obtain the glaucoma classification.

#### 4.2.3. Glaucoma Screening Based on Estimated CDR

The segmentation masks of the OD and cup from both models were computed and used to calculate the ratios between them. In this work, all CDRs were calculated, including the vertical and horizontal CDRs and the ratio between the areas of the OD and cup. For the VCDR and HCDR, the criteria used were CDR < 0.5 for normal and CDR ≥ 0.5 for glaucomatous, and the ACDR was normal if <0.3 and glaucomatous if ≥0.3, as described in Diaz's work [21]. The same criteria were applied to the Ref masks to allow a direct comparison between the results of our models and the segmentation performed by ophthalmologists, gauging the reliability of segmentation by the S1 and S2 models. The results are expressed in Table 7.
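Under these definitions, the CDRs and the decision rule can be sketched as follows (assuming binary numpy masks with rows as the vertical axis):

```python
import numpy as np

def cdrs(disc_mask, cup_mask):
    """VCDR, HCDR and ACDR from binary masks (rows = vertical axis)."""
    vcdr = np.any(cup_mask, axis=1).sum() / np.any(disc_mask, axis=1).sum()
    hcdr = np.any(cup_mask, axis=0).sum() / np.any(disc_mask, axis=0).sum()
    acdr = cup_mask.sum() / disc_mask.sum()
    return vcdr, hcdr, acdr

def screen(disc_mask, cup_mask):
    vcdr, hcdr, acdr = cdrs(disc_mask, cup_mask)
    # Decision rule stated above: VCDR/HCDR >= 0.5, ACDR >= 0.3 => glaucomatous.
    return {"VCDR": vcdr >= 0.5, "HCDR": hcdr >= 0.5, "ACDR": acdr >= 0.3}
```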


**Table 7.** Results of glaucoma classification with CDR calculations for S1, S2 and Ref masks.

The results for both models were similar to the results using the Ref masks, which indicates that they produced similar segmentation results or at least provided similar CDRs. Overall, the results from CDRs based on the Ref masks were better than the results from the two models, but the difference between the models' classification and the classification in the Ref masks, in a lot of cases, was not significant.

With RIM-ONE, the Ref had a better F1-score for the VCDR and HCDR, but the difference in the F1-scores between the two models was very small. S2 achieved better sensitivity and specificity, but this difference was also small, which may indicate that the masks were very close to each other or had similar forms that led to the computation of similar CDR values. DRISHTI-GS and the K-Fold CVDB were the two datasets with the worst results for both models in comparison with the Ref results, showing a greater difference, but the AUC indicated that the difference was not that large. The results from REFUGE were better for the S2 model compared to S1 and Ref for sensitivity, specificity and F1-score, but in all models and the Ref, the values were very low, which may suggest that, in this case, the CDRs are not a sufficient indicator to produce a classification of glaucoma or normal; thus, for a better decision, complementary information is needed to support the final call. All of the ROC curves from the different databases for all CDRs of the models and Ref masks are presented in Figure 7.

**Figure 7.** ROC curves for the glaucoma classification through CDR calculations for each database separately with the S1 and S2 models and the respective CDR calculations with the Ref masks.

Of all CDRs, the VCDR and ACDR had the best results. The HCDR had the worst results for the two models and the Ref, and the model with the overall best results was S2. This is also shown in the ROC curves, with the models and the Ref having very similar AUCs for the glaucoma classification based on the different CDRs. The difference between the AUCs of the models and the Ref was not significant and was generally very small, with the Ref showing slightly better performance than the S2 model. This reinforces the notion that the masks originating from the S1 and S2 models are very close to the Ref masks or compute similar CDRs that lead to a similar CDR-based glaucoma classification. In the work by Diaz [21], the model obtained a specificity of 0.81 and a sensitivity of 0.87, and Al-Bander [14] achieved an AUC of 0.74 using the VCDR and 0.78 using the HCDR. The majority of our results surpass those of the state-of-the-art glaucoma classification methods based on CDRs.

For the K-fold CV, the results of the ROC curves for both models were very similar to those of the Ref and had close AUC values, except for the HCDR. As mentioned previously, the HCDR was the CDR that differed the most, as can be seen in Figure 8.

**Figure 8.** ROC curves for the glaucoma classification through CDR calculation for K-Fold CVDB with the S1 and S2 models and with the respective CDR calculations with the Ref masks.

For visual comparison, in the following images, the masks and outlines of both models are drawn in red, and those of the Ref masks are drawn in green. The intersection between the masks predicted by the models and the Ref masks is indicated by the overlap of green and red (true positive). A green-only area represents a false negative, since the predicted mask does not cover it, and a red-only area represents a false positive, since the model's prediction does not correspond to the Ref mask.

In Figure 9, the CDR values for the higher Dice cases are extremely close to the Ref CDR values. In the K-fold setting, the resulting masks are the intersection of at least four agreeing masks from the different folds of the model. The predicted masks of the OD and cup are very close to the Ref masks, reflecting high IoU and Dice values. In the lower Dice cases, the CDRs differ significantly from the Ref CDRs; despite this, there is complete agreement in the final CDR-based classification, since all apply the same threshold values, although these cases differ more than the higher Dice cases.

The two models achieved state-of-the-art results for the segmentation, and the outcome was similar to the glaucoma classification based on the CDR with the Ref masks, indicating that these types of models can mitigate labour-intensive and subjective tasks, namely the segmentation of the OD and cup, providing a more consistent final result. To complement the CDR indicator, additional examination must be performed to make the final diagnosis of the patient using, for example, IOP values, anamnesis data and medical records. Another problem is the thin margin around the CDR thresholds, which can result in a somewhat arbitrary classification; to address this, more diagnostic classes could be added based on CDRs, such as a suspicious case of glaucoma for samples whose CDR value barely reaches or passes the threshold.

**Figure 9.** Higher Dice cases and lower Dice cases for each model for OD and cup segmentation, with the final outlines of the OD and cup, for the K-Fold CVDB. Green represents the masks and outlines of the Ref, and red represents the masks and outlines of the predictions from the models. The lower right corner shows the mean IoU/Dice for OD and cup segmentation for every fold in each model with the respective standard deviation; the "Outlines" columns at the top give the Ref values of the CDRs (VCDR/HCDR/ACDR), and the bottom of the images shows the CDR results of the models in the same order as that described for the Ref.

#### *4.3. Classification Methods on a Private Dataset*

For the DL-based glaucoma screening, only classification models were applied to the private dataset, since it did not have ground-truth masks for the application of segmentation techniques. For the classification, the same K-fold CV approach was applied to a private dataset of D-EYE (Portable Retinal Imaging System) images with lower resolution. The goal was to see whether applying the classification methods to images acquired by mobile devices could achieve results similar to those obtained using high-resolution images captured by clinical equipment. This would impart some portability to the glaucoma screening process, expanding it to more people and preventing more glaucoma cases. This dataset was approved by the Ethical Committee of the Universidade Aberta of Lisbon and by the Ministry of Health of Brazil following the dispatch of the information DW/2018 of 02-21-2019 provided by the Brazilian Research Ethics Committee. The dataset consists of D-EYE images collected between October 2018 and March 2020 from patients aged above 40, either treated or untreated for glaucoma; subjects accepted the research protocols and allowed the use of their data for studies on applications of automatised methods of glaucoma screening.

The images were obtained using a D-EYE lens coupled to the camera of an iPhone 6S, which allows photographing the patient's optical papilla through 75 and 90 diopters and recording a short video, stored in .mp4 format and collected in a low-light environment. From the videos, images with a resolution of 1080 × 1920 pixels were selected and underwent the same pre-processing treatment as described for the glaucoma screening with the public databases, with the OD cropped and centred to obtain dimensions of 512 × 512. From the database, a total of 347 images were selected, of which 293 were classified as normal and 54 as glaucomatous.

For the classification, since this is a small database, K-fold CV with a leave-one-out set was applied to the classification neural networks. The database was divided into five folds, each with 49 normal samples and 9 glaucomatous samples. The leave-one-out set had 48 normal samples and 9 glaucomatous ones and was used to validate the models after training. To train the CNNs, the same pre-trained classification models were used, namely, C1, C2 and C3. The results are presented in Table 8.


**Table 8.** Results for the models with K-fold CV for each test set of each fold with the mean of results of the models for the 5 folds and the standard deviation.

The models obtained high AUC and specificity values but low sensitivity and F1-score results, showing that they had difficulty classifying the glaucomatous samples, since the database lacks sufficient representation of glaucoma samples, and most of these images were labelled based on the clinical record, family history and IOP values. The glaucoma images do not show consistent patterns that indicate glaucoma incidence directly in the image. To validate each model's performance, it was tested with the leave-one-out dataset, and the results are presented in Table 9.

In Figure 10, Figure 10a illustrates a case of glaucoma with cataract opacity that worsens the overall quality of the image. Nevertheless, the C3 model predicted the sample correctly. The activation map points to a peripheral region with the presence of vessels, whereas the spot that indicates the incidence of glaucoma is located in the centre of the excavation zone. Other examples of poor focus on the region of interest are Figure 10b,c, where the output prediction was correctly classified, but the activation map points to an eccentric zone instead of focussing on the cup area.


**Table 9.** Results for the models in K-fold CV for the leave-one-out test set with the respective mean and standard deviation for the 5 folds of each model for the private dataset.

**Figure 10.** Activation maps for D-EYE images of the CV models. The importance is indicated by the emphasized shades in the following order, from the most important to the least important: red, orange, yellow, green, blue. All images report glaucoma cases (**a**–**d**).

Figure 10c shows slightly better recognition of the centre zone, but the models focus on the vessels and not on the optic cup. The CNNs provide additional information for diagnosis by an ophthalmologist but, in most cases, have an eccentric focus (on vessels) instead of focussing on the central area (the cupping of the OD).

The CNNs can help clinicians to expand glaucoma screening and accelerate early screening. This database lacks visual representation of glaucoma samples, since most of the glaucoma images collected do not show visual signs of glaucoma, and the diagnoses were made based on other indications, such as IOP, clinical records and family history, as mentioned previously. To improve the results, a more balanced database is needed, with more glaucoma samples showing visual patterns and indicators that evidence the presence of glaucoma. Another way to obtain better results would be to feed other types of clinical data into the neural networks to complement the image data.

#### **5. Conclusions**

Glaucoma has a high incidence around the world, largely because of the lack of tools for, and accessibility to, early screening that could prevent the evolution of the disease. Since it is a disorder that is usually asymptomatic, it is frequently detected at a late stage; by this time, medical treatment cannot reverse the injuries and vision loss but can only prevent the progression of glaucoma. Glaucoma screening is carried out in clinical centres by specialised clinicians with expensive tools. Mass screening is time-consuming and, most of the time, subjective, especially in the early stage, depending on the expertise of the ophthalmologist. For this reason, different approaches using CNNs can help to expand mass glaucoma screening, save time and money and help medical staff to perform more reliable screening with more consistent decisions, speeding up the process and relieving hard and repetitive work.

All classification models achieved results similar to those of state-of-the-art methods, with the Xception model showing overall better performance. The CNN models for classification, unlike the CDR and segmentation method, are "black-boxes"; they do not provide a visual representation of their decisions. Thus, in this work, the models' activation maps are presented to provide visual interpretations and analyse the models' classifications, helping medical experts to understand the CNNs' decisions. A careful analysis reveals that, in this case, the CNNs focus on the cup at the centre of the OD, reinforcing the significance of the cupping area as an indicator of glaucoma presence. The OD and cup provide the CDR, which is commonly used as an indicator for glaucoma screening. In other cases, the activation maps focus on peripheral vessels that, in most cases, do not correlate with the incidence of glaucoma.

Since the ratio between the OD and cup is the most used indicator in the ophthalmology field, segmentation methods were applied to segment both structures, after which the samples were classified based on CDRs. Segmenting the cup is more difficult than segmenting the OD, as the cup does not usually have a well-defined boundary to guide the segmentation, making this task difficult even for clinicians. For this reason, the CNNs proved helpful in facilitating a subjective and hard task that depends highly on the experience of the ophthalmologist. The CDRs computed from the segmented masks were very close to the Ref CDRs, reinforcing that the CNNs can conduct an evaluation similar to that performed by a clinician. The model that produced the best overall results for these tasks was Inception V3 as the backbone of U-Net, with slightly better performance for the different CDRs. A way to improve classification based on CDR calculations is to use an additional class instead of binary classification, providing an extra margin around the threshold.

The classification methods were applied to a private database with images collected through a lens attached to a mobile device, and the results are promising, since this lens is cheaper and can expand accessibility and accelerate mass glaucoma screening. The model with the best results on the private database was Inception ResNet V2, which had higher sensitivity than the remaining models. The Xception model achieved similar AUC results but lower sensitivity than Inception ResNet V2. The classification results for images in this private database are promising but did not reach the sensitivity of the models trained with the public databases. The models' classifications can facilitate mass screening with images collected by lenses attached to mobile devices, serving as an extra opinion and providing activation maps to explain the models' decisions. These new approaches of collecting retinal images with subsequent CNN classification can accelerate and contribute to mass screening, mostly in remote areas, helping to redirect people to medical centres to prevent glaucoma as early as possible.

**Author Contributions:** Conceptualisation, A.N. and J.C.; methodology A.N. and J.C.; validation, A.C.; formal analysis, A.N. and J.C.; investigation, A.N. and J.C.; writing—original draft preparation, A.N. and J.C.; writing—review and editing, A.N., J.C. and A.C.; supervision, A.C.; project administration, A.C.; funding acquisition, A.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is financed by National Funds through the Portuguese funding agency, FCT—Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Multi-Device Nutrition Control**

**Carlos A. S. Cunha \*,† and Rui P. Duarte †**

CISeD—Research Centre in Digital Services, Polytechnic Institute of Viseu, 3504-510 Viseu, Portugal; pduarte@estgv.ipv.pt

**Citation:** Cunha, C.A.S.; Duarte, R.P. Multi-Device Nutrition Control. *Sensors* **2022**, *22*, 2617. https://doi.org/10.3390/s22072617

Academic Editor: Ivan Miguel Serrano Pires

Received: 25 February 2022 Accepted: 21 March 2022 Published: 29 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


**Abstract:** Precision nutrition is a popular eHealth topic among several groups, such as athletes and people with dementia, rare diseases, diabetes, or overweight. Its implementation demands tight nutrition control, starting with nutritionists, who build food plans for specific groups or individuals. Each person then follows the food plan by preparing meals and logging all food and water intake. However, the discipline demanded to follow food plans and log food intake results in high dropout rates. This article presents the concepts, requirements, and architecture of a solution that assists the nutritionist in building and revising food plans and the user in following them. It does so by minimizing human–computer interaction, integrating the nutritionist and user systems, and introducing off-the-shelf IoT devices into the system, such as temperature sensors, smartwatches, smartphones, and smart bottles. An interaction time analysis using the keystroke-level model provides a baseline for comparison in future work addressing both the use of machine learning and IoT devices to reduce the interaction effort of users.

**Keywords:** precision nutrition; food plans; IoT; machine learning; food logging

#### **1. Introduction**

Diseases caused by inappropriate diets are responsible for 11 million deaths and hundreds of millions of disability-adjusted life years [1]. The use of technology to support health (eHealth) opens an expansive landscape of opportunities. The emergence of a large set of smart devices capable of facilitating physiological data recording and other forms of recording health status has enabled many new eHealth applications. Mobile phones and smartwatches are among the devices with the most potential because of their ubiquity and built-in sensing capabilities [2–4].

The importance of nutrition to health is unquestionable. However, the specificity of a person's nutritional requirements demands personalized nutrition control. Nutritional requirements depend on body parameters, genetic and epigenetic makeup, daily routines, and history of disease or allergies. Thus, health professionals (e.g., doctors and nutritionists) must intervene to keep food plans adequate for the target person. Nonetheless, the biggest challenge is not elaborating the food plan but rather the follow-up. That includes keeping the food plan always present to the user, replacing unavailable or undesired foods, adjusting food quantities to exceptional energy consumption, and using logged food intake data to readjust future food plan revisions. Food intake logging, in particular, benefits from automation, since it is time-consuming, and the discipline demanded by its operationalization leads to high dropout rates in food plan execution.

State-of-the-art approaches for automating food intake logging exploit the recognition of food and quantities in images [5,6] taken using the phone camera, as well as unconventional intrusive devices that detect swallowing patterns associated with calorie intake [7]. Notwithstanding the innovation inherent to these approaches, they suffer from measurement errors that add to the error introduced by the food tables used to quantify nutrients. Plus, these solutions still require some interaction (e.g., opening the application and taking pictures). A more realistic solution to reduce human interaction costs is integrating the nutritionist and user systems and resorting to off-the-shelf smart devices.

Smart devices are essential tools to enable the ubiquity of food plans by allowing their visualization anywhere. Plus, they act as a data-gathering mechanism for logging macronutrients, micronutrients, and hydration levels. These data feed into a nutritional model that can support the nutritionist (or other health professionals) in adjusting the next food plan iteration.

This article presents the requirements and concepts of a solution covering the food plan life-cycle from its creation by the nutritionist to its visualization, adaptation, and logging of food intake by the person. It also discusses the system architecture and design by focusing on


The rest of this article is organized as follows. Section 2 presents the related work. Section 3 defines the problem addressed in this article and enumerates the requirements of a possible solution. Section 4 presents the concepts and formulas used in food plan creation. Section 5 describes the system architecture and implementation. Sections 6 and 7 describe the scenarios where the system will be tested. Finally, Section 9 presents the conclusions.

#### **2. Related Work**

This paper addresses a multidisciplinary problem connecting several research areas, such as precision nutrition, Internet of Things (IoT), web technologies, and machine learning.

*Precision nutrition* is an eHealth research area that tailors nutritional advice to the person's characteristics [8]. One prominent research topic in this area is advice supported by machine learning models built from several sources of data—e.g., dietary intake (content and time), personal, genetic, nutrigenomic, activity tracking, metabolomic, and anthropometric data. Food intake monitoring, in particular, provides a fundamental source of data to machine learning algorithms for creating adequate diet models. However, traditional food logging systems are intrusive, forcing users to change their routines. Hence, the user interaction the system demands makes this activity one of the main contributors to food plan execution dropouts.

Several approaches for automatic food intake logging have been proposed. Wearables are devices with high potential in healthcare [9], since they could automate the process of food intake logging. The results of their exploratory use in nutrition to reduce the burden of manual food intake logging are presented in [7]. The authors explored using a smart necklace that monitors vibrations in the neck and a throat microphone to classify eaten food into three food categories. The resulting models, trained with data produced by these wearables, revealed higher accuracy for the microphone than for the vibration sensor. Notwithstanding the potential of wearables for automatic logging of food intake, they are still in their infancy, requiring development to reduce intrusiveness and achieve close-to-perfect accuracy.

Visual-based dietary assessment approaches represent another type of appealing solution, resorting to pictures to determine the intake of food nutrients. Lo et al. [5] explore deep learning view synthesis for dietary assessment using images from any viewing angle and position. An unsupervised segmentation method identifies the food item, and a 3D image reconstruction estimates the portion size of food items. Despite the high accuracy of the approach, the results depend on depth images with separable and straightforward objects, whereas typical dishes may contain several overlapping food items. Another work estimates food energy from images using the generative adversarial network (GAN) architecture [6]. It resorts to a training-based system, which contrasts with approaches based on predefined geometric models that bound the models to food with known shapes. The authors' approach provides a visualization of how food energy estimation is spatially distributed across the image, enabling spatial error evaluation.

While visual food inference represents a promising research topic for the automatic logging of food intake, its accuracy is still unacceptable for most applications. An alternative method for food logging is using speech-to-text conversion to reduce the user's interaction effort required to introduce nutrient information into the software application. Speech2Health [10] allows the recording of food intake through natural language. A user-acceptance study using Speech2Health has shown several advantages of a speech-based approach over text-based or image-based food intake recording. Nevertheless, even minor errors in identifying food names and portion sizes from voice excerpts are unacceptable for generic use. Privacy represents another issue that speech-to-text introduces in public environments.

Most related work addresses the problem of automatic food intake monitoring. Instead of explicitly addressing that problem, we devised a holistic approach that relies on food plans created by nutritionists and followed by target users. By confirming meals or logging changes, these users produce data that feed the feedback loop, progressively bringing the food plan closer to the user's actual needs. The availability of a baseline plan and the use of intelligent devices to record hydration, temperature, and energy expenditure reduce the user interaction effort. Additionally, machine learning is applied to model user preferences, helping nutritionists choose the best food for the plan.

#### **3. Problem Definition**

Nutrition is a topic that has received increasing attention in the last decades due to its potential for benefiting from advances in technology. The ubiquity of smartphones and the emergence of wearable devices have created the opportunity to gather data automatically and support the user in deciding the best food to eat at each meal.

Many smartphone apps provide features to log meals and present nutritional statistics. However, choosing the best food plan for an individual requires a professional analysis that considers their physical condition (e.g., fat mass, lean mass, and weight), clinical condition, and goals. Excluding the health practitioner from the process may lead to inadequate food plans and be dangerous for individuals with health issues. Fortunately, it is possible to use technology to reduce the manual effort needed to manage the food plan life-cycle. The problems solved by a holistic solution span the nutritionist and user (the person following the food plan) domains.

We specified the requirements for the user and nutritionist domains with the support of several experts, such as nutritionists and doctors from a private hospital. We scheduled several meetings with these experts in two different phases: (1) requirement analysis, with the support of high-definition interface prototypes, and (2) deliverable analysis, where we tested software increments within a limited group of people by creating appointments, food plans, and performing food logging. Appendices A.1 and A.2 in Appendix A present the use cases for each of these domains.

#### *3.1. Nutritionist Domain*

We identified the following requirements for the nutritionist domain:


Nutritionists gather several types of data in the course of the appointment, which allows determining the person's energy expenditure (Section 4) and other metrics and goals that can further support decisions during food plan making.

Energy expenditure is the core metric for devising the food plan. It provides the calories further distributed between macronutrients (i.e., proteins, carbs, and lipids) as follows:

$$\text{energy} = \alpha \cdot \text{protein} + \gamma \cdot \text{carbs} + \beta \cdot \text{lipids} \tag{1}$$

After providing the data required to calculate the energy expenditure to the system, the nutritionist defines values for *α*, *γ*, and *β*. These values represent the contribution ratio of each macronutrient to the energy expenditure, which is fixed to a specific day and distributed between meals.
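As a worked example of Equation (1), the ratios can be converted into gram targets (the Atwater factors of 4/4/9 kcal per gram are standard nutritional values, not taken from the paper):

```python
# Assumed Atwater factors: 4 kcal/g for protein and carbs, 9 kcal/g for lipids.
KCAL_PER_GRAM = {"protein": 4.0, "carbs": 4.0, "lipids": 9.0}

def macro_grams(energy_kcal, ratios):
    """ratios maps each macronutrient to its contribution ratio
    (alpha, gamma, beta in Equation (1)); the ratios must sum to 1."""
    assert abs(sum(ratios.values()) - 1.0) < 1e-6
    return {m: energy_kcal * r / KCAL_PER_GRAM[m] for m, r in ratios.items()}

# Example: 2500 kcal split 30% protein, 50% carbs, 20% lipids
# -> 187.5 g protein, 312.5 g carbs, ~55.6 g lipids.
print(macro_grams(2500, {"protein": 0.30, "carbs": 0.50, "lipids": 0.20}))
```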

The energy expenditure and its distribution between macronutrients and meals are dependent on the person's profile. For example, athletes have an increased demand for energy compared with sedentary people, and distribution of nutrients needs to be adapted to specific days (e.g., carbohydrate intake before and after exercise to help restore suboptimal glycogen reserves).

Fiber, water, and micronutrients are essential food plan elements unrelated to energy expenditure. The nutritionist adjusts the quantity of each nutrient to the person's goals and condition. For instance, during demanding physical activity, the person may need drinks with added sodium to replace electrolyte losses. On the other hand, a person at risk of high blood pressure would benefit from lowering sodium intake.

Food plan creation is time-consuming because it involves combining different types of food adequate for the person. That combination should fulfill the target energy expenditure and its distribution between macronutrients, and approximate the micronutrients specified for the food plan. As when selecting alternative food while the user follows the plan (user domain), the user preferences model also supports the nutritionist in choosing the food to be added to the plan. Here, the contribution of each food to the goals established for energy, macronutrients, and micronutrients represents a crucial input for the classifier.

The nutritionist needs to revise the food plan to adjust the energy and nutrients to the user's goals at subsequent appointments. For example, suppose the user's goal is no longer to reduce fat mass but to increase muscle instead. In that case, the total energy intake specified for the plan must be revised and, consequently, the proportions of the macronutrients contributing to that energy. Since energy expenditure occupies the top of the energy breakdown hierarchy, it drives food plan adaptation according to data gathered during previous food plan executions. Smart devices may improve the accuracy of energy expenditure in further food plan revisions. The *physical activity energy expenditure* (Section 4.2) represents one component of energy expenditure that can be easily captured with acceptable accuracy by smartwatches (or fitness bands), alone or combined with heart rate straps. These data, combined with food and water intake logs—registered through the system interface or obtained through intelligent bottles—provide the elements required to tune successive food plan revisions.

#### *3.2. User Domain*

We identified the following requirements for the user domain:


Nutritionists must design food plans aligned with user conditions and preferences. Further, users demand ubiquitous food plan visualization and logging mechanisms with small interaction costs. While interaction effort depends heavily on user interface design, off-the-shelf IoT devices can be valuable tools to reduce human interaction with the system. These devices may be balanced with efficient user interfaces to reduce food plan execution abandonment.

As food plans are fixed to days of the week, repeating for several weeks, users may often lack some ingredients when executing the plan. Hence, the system may suggest alternative food according to the nutritional equivalence and user preferences—using historical data for similar meals, days of the week, months, or even weather contexts.

#### *3.3. Automation Limits*

The number of interactions with the system and the individual interaction cost determine the total user interaction effort. Logging meals eaten as specified in the food plan requires a small interaction effort, since the only input is the user's confirmation on either the smartphone or the smartwatch. Sometimes that happens in batches (e.g., at the end of the day), resulting in low interaction costs and a small number of interactions (one per meal), as presented in Table 1. In this scenario, the user domain can benefit from integrating the food plan built by the nutritionist with the smartphone application that allows its visualization and the confirmation of eaten meals.

Water intake logging demands a higher number of user interactions than meal confirmation. The user may take a sip of water dozens or hundreds of times a day to stay hydrated. Consequently, water intake logging is more complex unless the user sticks to a standard behavior, such as drinking from the same bottle and logging the bottle's storage capacity upon finishing it. However, even that standard method has flaws, because the user may never finish the last bottle refill during the day or may replace it with new water. Smart bottles can potentially reduce the number of user interactions for water intake logging, since all the logged water intake is sent to the cloud service and made accessible to our system without user interaction.

While the previous scenarios offer an automation opportunity, some actions are difficult to automate, such as logging food not registered in the food plan. As shown in Table 1, notwithstanding the small number of interactions during the day, the interaction cost of these individual actions is high—justified mainly by the search for additional food and the introduction of the respective quantities. In addition, their automation is complex, and the closest state-of-the-art approaches rely on machine learning to identify food in pictures taken using the phone. However, these approaches are still far from one hundred percent accuracy, which leads to large errors summed from image recognition and the food tables used to quantify nutrients.


The reduction of interaction costs for activities with low automation potential needs to be handled at the interface design level. The user application interface should be optimized to reduce the food-searching effort of the *changing meal* and *add extra food* actions.


**Table 1.** Interaction effort of main actions for each device.

#### **4. Energy Expenditure**

The user's energy expenditure drives the creation of food plans. Adequate diets approximate the calories taken in to the total energy expenditure, which includes the resting energy expenditure (REE), physical activity energy expenditure (PEE), and thermic effect of food (TEF).

The CB (caloric balance) of the human body is the difference between the CC (caloric consumption) and the sum of PEE, REE, and TEF.

$$\text{CB} = \text{CC} - \text{PEE} - \text{REE} - \text{TEF} \tag{2}$$

This section presents the calculation of PEE and REE. Notwithstanding the low contribution of the TEF (between 3% and 10%) to the total energy expenditure (TEE), it may have an impact on obesity. However, we do not handle it in this article due to its high measurement complexity [11] created by dependency on several other variables (e.g., measurement duration) [12].

#### *4.1. Resting Energy Expenditure*

REE is considered equivalent to the basal metabolic rate (BMR). BMR is the minimum number of calories required for basic functions at rest, whereas the resting metabolic rate (RMR) is the number of calories the body burns while at rest. Although the two definitions differ slightly, REE can be approximated by the Harris–Benedict equation [13,14] or by the other equivalent equations for calculating BMR presented in Table 2.
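For illustration, a sketch using the revised Harris–Benedict coefficients (Roza and Shizgal's version, one of several equivalent equations; see Table 2):

```python
def harris_benedict_ree(weight_kg, height_cm, age_years, sex):
    """REE/BMR in kcal/day via the revised Harris-Benedict equation."""
    if sex == "male":
        return 88.362 + 13.397 * weight_kg + 4.799 * height_cm - 5.677 * age_years
    return 447.593 + 9.247 * weight_kg + 3.098 * height_cm - 4.330 * age_years

# Example: a 70 kg, 175 cm, 30-year-old man -> roughly 1700 kcal/day.
print(round(harris_benedict_ree(70, 175, 30, "male")))
```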

#### *4.2. Physical Activity Energy Expenditure*

PEE calculation involves converting the metabolic equivalents of activities to calories expended per minute (cal/min), based on body weight and the varying exercise intensities. The physical activity level (PAL) is an inexpensive and accurate method for calculating PEE, based on the average values of 24 h of TEE and REE, as follows:

$$\text{PAL} = \text{TEE/REE} \tag{3}$$

Gender does not interfere with PAL calculation, because the BMR already absorbs the gender difference in energy needs, accentuated by the heavier average weight of men.

A table that associates physical intensity lifestyles to PAL values (Table 3) can simplify PAL calculation. In that context, TEE is the result of multiplying REE by the PAL value associated with the person's lifestyle category [15].

Another method for PAL calculation combines the time allocated to habitual activities and the energy cost of those activities (Table 4). In this case, PAL represents the 24 h energy requirement expressed as a multiple of BMR, computed from the physical activity ratio (PAR) of each activity. PAR is a factor of BMR (PAR is 1 when an activity requires no energy above REE). Intuitively, the energy cost (PAR) of each activity is multiplied by the activity time, and the results are averaged over 24 h to obtain PAL [15,16].
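A short sketch of this diary-based calculation (the PAR values below are illustrative assumptions, not taken from Table 4):

```python
def pal_from_diary(activities):
    """activities: (hours, PAR) pairs covering 24 h; PAL is the time-weighted
    mean PAR, and TEE = PAL * REE as in Equation (3)."""
    assert abs(sum(h for h, _ in activities) - 24) < 1e-6
    return sum(h * par for h, par in activities) / 24

diary = [(8, 1.0),   # sleep
         (8, 1.5),   # seated office work
         (1, 4.0),   # brisk walking
         (7, 1.4)]   # other light activities
pal = pal_from_diary(diary)        # ~1.41, a fairly sedentary lifestyle
tee = pal * 1696                   # with the REE from the example above
```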

#### *4.3. Distribution of Nutrients*

The TEE estimate represents the total calories in the food selected for the food plan. TEE is then broken down into macronutrients complemented with micronutrients.

Macronutrients are typically specified in grams per kilogram of body weight, as is the case for protein, carbohydrates, and fat (lipids). The exception is fiber, which is specified in total grams. Water is also frequently classified as a macronutrient [17]. However, water and fiber have zero calories, unlike protein, fat, and carbs. Notwithstanding that fiber does not usually count towards calories in food plans, one type of fiber, named soluble fiber [18], may be absorbed by the organism and thus provide the body with calories.

Compared with macronutrients, the number of micronutrients is vast, and for that reason, nutritionists select only a few to be used as control metrics during food plan creation. From conversations with several nutritionists, we chose iron, calcium, sodium, and magnesium, because they are relevant across several population groups. However, the selection of micronutrients always depends on the target population group (e.g., elderly people, young people, and athletes).




**Table 2.** Equations for calculating BMR (Harris–Benedict and equivalent equations).


**Table 3.** Classification of lifestyles according to physical intensity (PAL values).

**Table 4.** Total energy expenditure for a population group.


#### **5. Architecture and Implementation**

This section presents the architecture and implementation of the solution proposed in this article, divided between two front-ends: *nutritionist front-end* and *user front-end*.

#### *5.1. Nutritionist Front-End*

The nutritionist front-end (Figure 1) implements two important concepts: *appointment* and *food plan*.

The appointment is the concept responsible for managing the energy expenditure—and its distribution throughout macronutrients—and micronutrients, as presented in Section 4. Moreover, to support user monitoring between appointments, it should present all historical data entailing previous food plans and energy distribution by day of the week, event, and meal type.

Monitoring of physical conditions frequently resorts to the person's goals, specified in terms of:


Control and analysis of generic user goals depend on the previous metrics, although particular groups of people may require other specific metrics; such is the case for groups with diseases that require the control of specific body parameters.

Other important appointment data required for food plan making include the following:




Nutritionists rely on the appointment data for food plan creation. While adding new meals and foods to the food plan, the nutritionist can balance the food calories against the target energy and nutrients. They can also visualize other relevant information gathered during the elaboration of appointments.



**Figure 1.** Nutritionist front-end. (**a**) Appointment. (**b**) Client details. (**c**) Food plan.

#### *5.2. User Front-End*

The user front-end (Figure 2a) uses the food plan as the basis for preparing meals, searching for alternative foods, monitoring the consumption of water and calories during the day, and food logging. Food is presented both on a plate view (Figure 2b), which is useful for the elderly, people with vision impairment, or those who may find smartphones difficult to use, and in a list format.

Daily statistics (Figure 2c) are valuable assets for monitoring calories, macronutrients, micronutrients, and hydration during the day. These values are paired with target values defined by the nutritionist in the food plan.

**Figure 2.** User front-end. (**a**) Daily meals. (**b**) Meal visualization. (**c**) Daily statistics.

Notwithstanding the small screen sizes of smartwatches, they are practical for presenting meals (Figure 3a), sending notifications, and logging food intake. They also present statistics regarding nutrient intake (Figure 3b) and hydration (Figure 3c).

#### *5.3. Architecture*

Figure 4 presents the solution architecture composed of four different interfaces. The nutritionist interacts with the system to create appointments and food plans using a web application. On the other side, the user visualizes the current food plan or logs food ingestion using a mobile phone or smartwatch.

#### 5.3.1. Web Applications

The mobile application is delivered as a PWA (progressive web application). PWAs represent a new class of applications, an alternative to traditional mobile phone apps, with several advantages over them. Instead of being developed for a specific platform (e.g., iOS or Android), they are built as web applications that can work offline and be installed on any smartphone. A previous study reported PWAs 157 times smaller than React Native-based interpreted apps and 43 times smaller than Ionic hybrid apps [26]. The Twitter PWA consumes less than 3% of the device storage space compared with Twitter for Android [27], and the Ola PWA is 300 times smaller than its Android app [28]. Additionally, they are cross-platform, although current implementations may require adaptations for some browsers.

**Figure 3.** Smartwatch. (**a**) Food plan visualization and logging. (**b**) Daily control of nutrients. (**c**) Daily control of water.

Both the user and nutritionist front-ends were developed in LitElement [29], a base class for creating lightweight web components. The design of the user front-end for smartphones embraces the PWA principles [30] (e.g., web application installability and offline usage).

#### 5.3.2. Smart Bottle

Water consumption is logged either by the user, using the smartphone or smartwatch, or automatically by a smart bottle. We tested several smart bottles and decided on the Hidratespark [31], justified by its mature API and the good construction and usability of the bottle. Plus, it can be easily integrated with Fitbit [32], which is used as a gateway to relay data to the user's back-end.

Water intake goals defined in the food plan are adjusted according to the ambient temperature. Temperature sensors provide the inputs to make that adjustment according to the rules stated in the food plan.
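A minimal sketch of such a rule follows; the rule shape (a base goal plus extra millilitres per degree above a comfort temperature) and all numbers are illustrative assumptions, since the exact rules live in each food plan.

```python
# Sketch of a rule-based adjustment of the daily water goal; the comfort
# temperature and per-degree increment are illustrative assumptions.
def adjusted_water_goal_ml(base_goal_ml: float, temperature_c: float,
                           comfort_c: float = 25.0,
                           extra_ml_per_deg: float = 50.0) -> float:
    extra = max(0.0, temperature_c - comfort_c) * extra_ml_per_deg
    return base_goal_ml + extra

print(adjusted_water_goal_ml(2000.0, 31.0))  # -> 2300.0 ml on a 31 C day
```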

#### 5.3.3. Smartwatch

As explained in Section 4, determining a person's energy expenditure is one of the main challenges in the creation of a food plan. Modern smartwatches provide a good approximation of energy consumption during physical activity. They provide valuable information for food plan revision activities, enabling correction, during follow-up appointments, of the energy expenditure values predicted by traditional methods (Section 4.2). The pedometers and heartbeat monitors incorporated in these devices provide a good approximation of calories burned [33].

**Figure 4.** Architecture.

#### 5.3.4. Preference Learning

Exploring machine learning techniques on logged data makes it possible to help nutritionists model user food preferences. These techniques build a recommendation system [34], based on food preference models, that supports the selection of food during food plan creation. That system will also allow proposing food alternatives to the person following the plan, for instance when a food is unavailable or the person prefers another, equivalent food.

Reinforcement learning seems an adequate tool for applying preference learning to food recommendation [35]. Starting without knowledge, the agent helps the nutritionist choose the food and quantities for the food plan without breaking the constraints imposed by the goals established for macronutrients and micronutrients. The agent's accuracy improves with the feedback received from the nutritionist and with the food intake logged by the user. The same agent can help the user choose equivalent foods and quantities when executing the plan, based on learned preferences and nutrient goals.
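The following toy sketch illustrates this feedback loop in the style of an epsilon-greedy bandit, a simplified stand-in for the full reinforcement learning agent; food names, calorie values, and the reward scheme are illustrative assumptions.

```python
import random

# Toy sketch of the preference-learning loop: an epsilon-greedy agent that
# suggests a food fitting the remaining calorie budget and learns from
# feedback (+1 accepted, -1 rejected). All values are illustrative.
FOODS = {"oatmeal": 150, "yogurt": 120, "banana": 100, "omelette": 220}  # kcal

class FoodAgent:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.value = {name: 0.0 for name in FOODS}  # learned preference score
        self.count = {name: 0 for name in FOODS}

    def suggest(self, remaining_kcal: float) -> str:
        candidates = [f for f, kcal in FOODS.items() if kcal <= remaining_kcal]
        if random.random() < self.epsilon:                   # explore
            return random.choice(candidates)
        return max(candidates, key=lambda f: self.value[f])  # exploit

    def feedback(self, food: str, reward: float) -> None:
        # incremental mean update of the preference estimate
        self.count[food] += 1
        self.value[food] += (reward - self.value[food]) / self.count[food]

agent = FoodAgent()
food = agent.suggest(remaining_kcal=180)
agent.feedback(food, reward=+1.0)  # nutritionist/user accepted the suggestion
```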

#### *5.4. Security*

Security is a complex and wide-ranging problem. It spans human-related processes and the system level (e.g., network and application). Human misconduct is at the origin of several security threats in eHealth systems [36]. Training people and auditing security procedures is a natural way of reducing the risk of threat occurrence. Poor coordination between developers, users, organizations, and government regulators represents another source of security flaws [37].

In this work, we handle security at the system design level. E-health systems contain data that are sensitive to confidentiality, integrity, and availability threats [38]. There are different degrees of data sensitivity. Personal data are the most critical data under management; thus, ensuring the confidentiality of these data is of the utmost importance. Hence, we segregate user data in the application and provide a feature to remove these data at any time without compromising the food plan, which then belongs to an anonymous entity. The latter poses a lower security risk once it is unrelated to an identifiable person.

The design of the nutritionist application allows deletion of the user's personal data without compromising the food plan management features, as long as an *ID* can identify the user. The segregation of functionality and data between the user and nutritionist applications offers an additional protective barrier. The user application uses an application token to communicate with the nutritionist application, and the former does not store or handle personal data—an ID identifies the user.

Like personal data, authentication credentials are sensitive data demanding theft protection. HTTPS already ensures protocol-level privacy in the communication channel. In addition, the front-end encrypts passwords before transmitting them to the back-end, and they are then handled and stored in encrypted form.
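One conventional way to realize this protection is salted one-way hashing; the sketch below assumes PBKDF2, since the exact scheme used by the system is not detailed here.

```python
import hashlib
import os

# Sketch of credential protection on top of HTTPS, assuming the encryption
# mentioned above is realized as salted one-way hashing (PBKDF2); this is an
# assumption, not the system's documented scheme.
def hash_password(password: str, salt: bytes | None = None) -> tuple[bytes, bytes]:
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == stored

salt, stored = hash_password("s3cret")
print(verify_password("s3cret", salt, stored))  # -> True
```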

Feature-oriented access control constrains access to the features available on each web page. There are three profile types: nutritionists, administrators, and users.
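A minimal sketch of such feature-oriented control follows; the feature names and the decorator shape are illustrative assumptions.

```python
from functools import wraps

# Sketch of feature-oriented access control with the three profiles named
# above; feature names are illustrative assumptions.
FEATURE_PROFILES = {
    "create_food_plan": {"nutritionist"},
    "manage_accounts": {"administrator"},
    "log_meal": {"user"},
}

def requires_feature(feature: str):
    def decorator(handler):
        @wraps(handler)
        def wrapper(profile: str, *args, **kwargs):
            if profile not in FEATURE_PROFILES[feature]:
                raise PermissionError(f"profile '{profile}' cannot access '{feature}'")
            return handler(profile, *args, **kwargs)
        return wrapper
    return decorator

@requires_feature("create_food_plan")
def create_food_plan(profile: str, client_id: str) -> str:
    return f"plan created for {client_id}"

print(create_food_plan("nutritionist", "client-42"))  # allowed
```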

Risk management models, such as the one presented in [39], may complement our system design. Additionally, other protection schemes against complex attacks [40] are orthogonal to our system and may also be used.

#### **6. Case Study: Alzheimer's**

Alzheimer's disease is a progressive loss of mental function, characterized by degeneration of brain tissue, including loss of nerve cells, accumulation of an abnormal protein, and development of neurofibrillary tangles [41]. Alzheimer's patients become dependent on others, even for the most basic tasks. Controlling feeding and hydration for an Alzheimer's patient is thus a crucial activity performed by the person who supports their daily routine, called the informal caregiver (IC).

Conditions of malnutrition, overnutrition, and dehydration are common in people with diseases causing dementia. The loss of autonomy also manifests itself in their inability to express their food needs. Therefore, it is fundamental to support the nutritionist in the preparation and follow-up of a food plan aligned with the patient's needs. Food plan monitoring is undoubtedly a process that demands much discipline from the IC and the ability to deal with circumstantial adaptations, such as replacing foods prescribed in the food plan with equivalents or changing the quantity of water consumed as a function of ambient temperature.

This case study investigates the problem of creating and monitoring diet plans for patients with dementia, such as those with Alzheimer's. It allows the creation of nutritional plans by the nutritionists and the follow-up of these plans by the ICs through a mobile app, aiming to significantly increase the patient's quality of life. The app sends the IC notifications regarding proper nutrition and hydration at the due moment. It also controls hydration using the smart water bottle. In addition, the application suggests alternatives to the planned foods if they are unavailable or rejected by the patient. Another feature important for this group is the dynamic adaptation of water administration as a function of the environmental conditions observed by temperature and humidity sensors. This feature is vital when the patient is unable to express thirst.

#### **7. Case Study: Sports**

The recent growth in the pursuit of sporting activities, motivated by a widespread increase in the perceived importance of maintaining physical fitness, by campaigns explicitly aimed at combating physical inactivity, and by opportunities created by the emergence of lesser-known modalities, has brought forward fundamental questions such as the correct nutrition of practitioners. Several institutions and individuals involved in physical activity have integrated these concerns into their scope, including nutritionists.

Food plan elaboration and monitoring present two main challenges: (1) obtaining the person's biometric data, eating habits, and energy consumption, and (2) monitoring user food intake and providing dynamic adaptation of the food plan.

Sports nutrition is one of the most complex areas of nutrition. It requires observing a comprehensive set of metrics encompassing the athlete's physical aspects, physical activity, and eating habits. Fortunately, devices for measuring specific physical parameters are in common use among athletes. The creation of data repositories to help nutritionists build the plan is only possible by automatically integrating data collected by these devices with other data not directly observable, such as dietary habits and subjective metrics. These repositories also contain data that can help adapt the food plan at its execution stage. For example, variations in temperature or physical intensity may demand quick changes in individual energy or hydration needs. In these scenarios, the support system uses data collected by devices to dynamically adjust the food plan and send alerts to athletes to take food or water at the right time.

#### **8. Interaction Results**

This section presents the human–computer interaction cost associated with typical user tasks to visualize the food plan and log food intake.

Traditional methods used in the usability evaluation of an interface fall into two categories: (1) subjective opinion of users and experts, mainly applying questionnaires [42] and inspection methods [43,44], and (2) objective techniques such as rules [45], analytic modeling [46], and automated testing [47,48]. Although these approaches provide important tools for determining the usability of a user interface, implementing user interaction evaluation with acceptable coverage requires both cost and time, coupled with the need for experts to compensate for users' faults.

#### *8.1. Keystroke-Level Model*

We applied the keystroke-level model (KLM) [49] to the user interface depicted in Figure 2 to test the quality of the human–computer interaction and estimate the time spent on critical tasks. In this model, a unit task is defined in two parts: *task acquisition* and *task execution*. The total time to complete a unit task is given by *Ttask* = *Tacquire* + *Texecute*.

At the execution level, KLM provides physical, mental, and response operators with predefined time values. These operators are defined by a letter and include *K* (keystroke ≈ 0.12 s), *P* (point ≈ 1.1 s), *H* (homing the hand(s) on the keyboard or other device ≈ 0.4 s), *D* (draw is measured in real time), *B* (button press ≈ 0.1 s), *M* (mental preparation for action ≈ 1.35 s), and *R* (system response, which is a parameter measured in real time). The execution time is the sum of the time for each of the operators from the final KLM string *Texecute* = *TK* + *TP* + *TH* + *TD* + *TB* + *TM* + *TR*.
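As a concrete illustration, the sketch below sums the operator times quoted above for a given KLM string; the example sequence is illustrative and not taken from Table 5.

```python
# Sketch of KLM execution-time estimation with the operator values quoted
# above; D and R are measured in real time, so each 'R' consumes one
# measured value and D is omitted for brevity.
KLM_TIMES = {"K": 0.12, "P": 1.1, "H": 0.4, "B": 0.1, "M": 1.35}  # seconds

def execute_time(sequence: str, response_times: list[float] | None = None) -> float:
    """Sum operator times for a KLM string, e.g. 'MPB'."""
    responses = iter(response_times or [])
    total = 0.0
    for op in sequence:
        total += next(responses) if op == "R" else KLM_TIMES[op]
    return total

# Mentally prepare, point, press a button: M + P + B
print(round(execute_time("MPB"), 2))  # -> 2.55 s
```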

#### *8.2. Interaction Results*

Table 5 presents the time required to execute each application task. The generated KLM string is represented in the *sequence of operators* column, and the respective time required to execute each task in the *estimated time* column. The task *"update food entries for train and competition"* allows the creation of periodic food requirements and is specific to the sports scenario; the other tasks are shared by the Alzheimer's and sports scenarios. The results are presented for the user application since we aim to reduce user abandonment motivated by the interaction costs resulting from food logging activities.

As expected, results show that tasks that change the original food plan for logging purposes manifest higher interaction costs. Food plan visualization requires 1.2 or 2.3 s, depending on the UI view. Logging one meal by confirming the original food plan only requires 1.2 s. On the other hand, logging tasks regarding food intake not present in the food plan are costly. Each extra food added to the food plan requires 8.66 s of the user's time.

Manual logging of water using the application requires 3.6 or 4.8 s, depending on the view. The adoption of smart bottles avoids that interaction, which may repeat dozens of times during the day.

The interaction time of tasks performed by the smartwatch (e.g., energy expenditure logging) is not presented in this section: the data logging process is automated, and there is no equivalent task the user could perform manually.


#### *8.3. Analysis of Results*

The observed human–computer interaction times pinpoint the tasks requiring improvement. They provide a baseline for evaluating other interaction schemes and for assessing the contribution of automation (e.g., using IoT devices) to the goals established in this article. The lower the interaction time, the less user discipline is needed to maintain the food plan visualization and logging process, and the lower the user abandonment rate.

We designed the application to minimize human interaction, with the support of UI experts. The most challenging tasks in the UI (those with more significant interaction times) require manually searching for new food. Although specific design choices in the interface implementation may limit the generalization of the results, it is evident that there is little room for improvement when a generic text-based food search must be performed.

Machine learning techniques are natural solutions for reducing the time required to log extra food in addition to, or in replacement of, food present in the plan. As referred to in Section 2, there have been several attempts to recognize food objects in pictures taken with the mobile phone to reduce the burden of manually logging food. However, interaction is still required to take the picture, and less-than-perfect accuracy could even increase the interaction time, since the user would need to correct the data. This rationale leads to a different strategy for exploring machine learning to reduce interaction time. Creating a food preference model customized to each user would likely lessen the food search interaction time considerably. By resorting to historical data and observable features (e.g., user location, day of the week, and weather), the system can anticipate the consumption of specific food. In that scenario, the interaction time would be equivalent to confirming a meal in the food plan.

#### **9. Conclusions**

This article unveils the concepts, requirements, and technologies needed to build a system that supports the nutritionist in creating food plans aligned with the individual profile. Further, it presents an architecture and software developed for smartphones (PWA) and smartwatches. The software furnishes food plan visualization and logging of food and water intake, among other related features. It also integrates other devices, such as smart bottle technology and temperature sensors, to reduce human–computer interaction.

The availability of off-the-shelf devices has brought unprecedented ways of gathering data from physical phenomena without resorting to direct human–computer interaction. We propose an architecture that integrates the nutritionist back-office, the user application, and smart devices, focused on reducing interaction cost when users follow a food plan. We presented a baseline of the human interaction effort associated with several tasks, pinpointing the most critical (expensive) operations. This baseline supports the evaluation of future machine learning and IoT approaches targeting the reduction of human interaction effort when completing critical operations.

As future work, we plan to explore machine learning techniques to reduce interaction times in two demanding user groups: Alzheimer's patients and athletes. The Alzheimer's group poses interaction challenges since several caregivers are elderly and have difficulty using apps, or are not motivated to use apps as a data logging mechanism. Athletes, on the other hand, are very disciplined but need tight control of food intake before, during, and after physical activity.

**Author Contributions:** Conceptualization and project administration, C.A.S.C.; software interaction design, R.P.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is funded by National Funds through the FCT—Foundation for Science and Technology, I.P., within the scope of the project Ref. UIDB/05583/2020. Furthermore, we would like to thank the Research Centre in Digital Services (CISeD) and the Polytechnic of Viseu for their support. Moreover, the authors greatly thank IPV/CGD-Polytechnic Institute of Viseu/Caixa Geral de Depositos Bank, within the scope of the projects PROJ/IPV/ID&I/002 and PROJ/IPV/ID&I/007.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** We want to thank the nutritionist Marco Pontes, Sao Mateus Hospital, Social Works of City Hall at Viseu, Valter Alves, Maria Joao Sebastiao, Maria Joao Lima, Carlos Vasconcelos, Lia Araujo, and Carlos Albuquerque for their support in this project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

This appendix presents the use cases described using the unified modeling language (UML) related to the application requirements.

*Appendix A.1. Nutritionist Use Cases*

**Figure A1.** Nutritionist login.

**Figure A3.** Appointment creation.

**Figure A6.** Food plan creation.

**Figure A10.** Visualize statistics.

**Figure A11.** Update periodic food entries and train for competition.

**Figure A12.** Change active food plan.

**Figure A13.** Connect watch API.

**Figure A14.** Provide Fitbit consent.

#### **References**


### *Article* **Empowering People with a User-Friendly Wearable Platform for Unobtrusive Monitoring of Vital Physiological Parameters**

**Maria Krizea 1,2, John Gialelis 1,2,\*, Grigoris Protopsaltis 1, Christos Mountzouris <sup>1</sup> and Gerasimos Theodorou <sup>1</sup>**


**Abstract:** Elderly people feel vulnerable especially after they are dismissed from health care facilities and return home. The purpose of this work was to alleviate this sense of vulnerability and empower these people by giving them the opportunity to unobtrusively record their vital physiological parameters. Bearing in mind all the parameters involved, we developed a user-friendly wrist-wearable device combined with a web-based application, to adequately address this need. The proposed compilation obtains the photoplethysmogram (PPG) from the subject's wrist and simultaneously extracts, in real time, the physiological parameters of heart rate (HR), blood oxygen saturation (SpO2) and respiratory rate (RR), based on algorithms embedded on the wearable device. The described process is conducted solely within the device, favoring the optimal use of the available resources. The aggregated data are transmitted via Wi-Fi to a cloud environment and stored in a database. A corresponding web-based application serves as a visualization and analytics tool, allowing the individuals to catch a glimpse of their physiological parameters on a screen and share their digital information with health professionals who can perform further processing and obtain valuable health information.

**Keywords:** wrist-wearable device; PPG processing; physiological parameters; web-based applications; data analysis

#### **1. Introduction**

Average life expectancy has increased over the years, resulting in a rise in senior populations [1]. The attitude of society towards senior citizens and their well-being is an indicator of its organization and civilization. Elderly people are a demographic that needs expert care and dedicated assistance, sometimes even on an everyday basis. They tend to feel even more vulnerable especially after experiencing health issues and having been released from a health care facility. This is a crucial point in their recovery and well-being, and they need all the help they can get, in either physical or virtual form. Assistive technology based on the Internet of Things (IoT) can support unobtrusive health monitoring at home with the use of electrical devices, such as sensors and other gadgets (wearable or not) that provide feedback and remote access to the end user, aiming at improving inhabitants' quality of life by providing more independence and better care [2]. According to [3], existing smart home health monitoring technologies include physiological monitoring, functional monitoring/emergency detection and response, safety monitoring and assistance, security monitoring and assistance, social interaction monitoring and assistance, and cognitive and sensory assistance.

Treading on the groundwork of assistive technology, the aim of minimizing the hospitalization days of the elderly and sending them home without compromising their safety seems to be doable. The achievement of this goal has at least two advantages.

**Citation:** Krizea, M.; Gialelis, J.; Protopsaltis, G.; Mountzouris, C.; Theodorou, G. Empowering People with a User-Friendly Wearable Platform for Unobtrusive Monitoring of Vital Physiological Parameters. *Sensors* **2022**, *22*, 5226. https:// doi.org/10.3390/s22145226

Academic Editors: Ivan Miguel Serrano Pires and Georg Fischer

Received: 26 May 2022 Accepted: 8 July 2022 Published: 13 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

First, the elderly benefit from returning home to a safe environment as soon as possible, and this can work in favor of nurturing a positive psychology for them. Second, the health care system also profits from early yet safe discharge of the elderly to their home. The financial resources dedicated to the health sector are not unlimited [4] and the struggle to mitigate the consequences of the pandemic is ongoing. Unburdening the health care system is a tangible positive ramification of assistive technology development, and the work we envisaged could be a major abetment to this line of action.

Wrist-wearable devices are a common accessory of everyday life, initially worn by many people just to tell time. With the integration of proper sensors, they evolved into non-invasive monitoring units that aggregate vital signals. With the application of befitting embedded algorithms, several physiological parameters critical for health status assessment can be extracted. These devices can also log activities, steps, calories, and sleep patterns. The autonomy and portability of such an apparatus grant the user the possibility to wear it everywhere and at any time, which is a notable advantage. Identifying outliers or anomalies in heart rates and other features can help establish patterns that play a significant role in understanding the underlying causes of declining physical well-being. Additionally, the accumulation of this valuable information in a back-end system, in a secure manner, can be leveraged to the advantage of the end users for optimizing their living standard. This can be accomplished by providing medical experts with the tools to interact with the data and draw valuable conclusions regarding health status and everyday lifestyle.

Light, the essence of photoplethysmography (PPG), is used to measure the volumetric variations of blood circulation. This measurement provides invaluable information regarding humans' physiological parameters entailing their health status. PPG technology has been proven less complex operationally, more comfortable for the user, and more cost efficient compared with other monitoring techniques [5]. The continuous detection and monitoring of human physiological parameters such as heart rate (HR), blood oxygen saturation (SpO2), and respiratory rate (RR) are of paramount importance. A wearable device that unobtrusively grants the elderly an opportunity to continuously extract those parameters, without any intervention, constitutes a major achievement. The state-of-the-art techniques in modern practice approach the task of obtaining such measurements using validated pulse oximeters, which are worn on individuals' fingers. These pulse oximeters are based on transmission-mode PPG. The continuous monitoring of human vital signals has shifted from simply an appealing idea for fitness enthusiasts into an everyday habit for a plethora of people using a smartwatch. Sensors integrated within smartwatches use reflectance-mode PPG to gather vital signals. Although they are widespread amongst sportspeople, they have not been widely used in clinical practice [6,7]. This can be attributed to the fact that PPG signals are vulnerable to Motion Artifacts (MAs) caused by hand movements, which considerably affect the accuracy of the entailed physiological parameters [8,9].

PPG signal acquisition becomes increasingly challenging when additional factors such as environmental noise or sensor misplacement are considered, further affecting the accurate assessment of its features. Measuring physiological parameters with wrist-wearable devices can be more perplexing than at the finger or another part of the body due to the low blood perfusion in the wrist area. The design of a wrist-wearable device must consider factors such as the spacing of the light-emitting diodes (LEDs) and the photodiode surface, as well as biological factors such as skin tone. It is clear that efficient PPG processing must be performed to enable reliable reading of its features and, consequently, accurate extraction of the physiological values. All the aforementioned challenges highlight the need for a device capable of monitoring physiological parameters continuously and unobtrusively, which can easily be operated without the need for extensive prior training.

This work presents the architecture and the constituent components of a comprehensive, user-friendly system which provides well-being monitoring services promoting peace of mind for senior citizens. The overall solution consists of two subsystems that are integrated via well-defined interfaces, each performing autonomous functions in an opaque manner: the subsystem for bio-signal recording and physiological parameter extraction, and the back-end subsystem for storage, data processing, and services. The novelty of the proposed system predominantly lies in the minimal design of the wearable sensing device, which accurately captures all the parameters while allowing a minimally disturbing interaction with the end user. The system has the capacity to continuously extract, in real time, the physiological parameters of HR, SpO2 and RR, to perform well-being status assessment, and to provide personalized feedback to improve health status. Furthermore, in comparison with similar solutions, the proposed system has the advantage of functioning without the aid of a paired smartphone for collecting and delivering the bio-parameters to the back-end system, thus enhancing its user-friendliness for users with low digital literacy.

The remainder of this paper is organized as follows: Section 2 introduces the main challenges regarding the monitoring of physiological parameters using wearable devices and gives details on related works and their limitations. Section 3 presents design and implementation aspects of the proposed integrated system at both physical and operational level as well as performance evaluation features of the proposed wrist-wearable device. Section 4 describes the implementation and the core components of the web-based application, which comprises a front-end to display physiological parameters to end-users, and a back-end to allow data processing and service provisioning. Finally, Section 5 concludes this paper with future research directions.

#### **2. Related Work**

As shown below, research has been conducted on producing more than one physiological parameter simultaneously using the PPG method, through wrist-wearable devices as enablers of continuous health monitoring in everyday life. With the evolution of medical science and technology, wearables have incorporated multiple different sensors so that they can keep track of a wide range of measurements, such as heart rate, blood oxygen level, body temperature, and activity, among many others. Nowadays, wearables play an important role in making health care practices more efficient and cost-effective. These devices can be connected to smartphones or web apps, allowing people to store their data for future reference. An overview of such wearables follows.

Li et al. [10] predicted the outbreak of Lyme disease and inflammation by combining sensor data along with medical measurements. This work gave evidence of how wearables can monitor activity along with physiology.

Mishra et al. [11] detected COVID-19 by utilizing physiological (HR) and activity (steps) data acquired by wearable devices. A total of 5200 subjects participated in the analysis, including individuals with COVID-19. The study indicated elevated resting heart rates relative to the subjects' baselines. Two algorithms, resting heart rate difference (RHR-diff) and heart rate over steps anomaly detection (HROS-AD), were developed. The first algorithm is based on standardizing the resting heart rate over a fixed time frame to observe baseline residuals.

Seshadri et al. [12] performed a data-driven COVID-19 prediction employing an early detection algorithm (EDA) based on HR, HRV and RR collected from wearables devices. The EDA can detect physiological changes and alert users of possible infection with SARS-CoV-2 before they develop clinical symptoms.

Downey et al. [13] showed that only 16% of the subjects remained connected to obtrusive monitoring systems after 72 h. Furthermore, the cost of a complete vital sign monitoring system can be quite significant.

Zenko et al. [14] proposed a battery-powered wearable device along with a simple algorithm for the extraction of the physiological parameters of HR, Pulse Rate Variability, and SpO2. This work evaluates the acquired HR parameter, while calibration and verification of the SpO2 parameter still need to be performed.

Son et al. [15] introduced a wearable device which measures oxygen levels in the blood using a light reflection method while integrating hardware for wireless data transmission. Experimental results were compared to the Texas Instruments development board (SpO2 AFE44x0 EVM), and the maximum deviation was 6.7% in HR measurement and 4.4% in SpO2 measurement.

Jarchi et al. [16] integrated the AC components of the red and infrared PPG signals into a complex waveform and then, by applying the bivariate empirical mode decomposition algorithm, estimated the SpO2 value with an approximate error of 3%.

Preejith et al. [17] developed a wrist-based optical heart rate device which, in order to eliminate noise, ignores measurements when motion is detected. The accuracy of the HR measurements equals 0.9, expressed as Pearson's r. The coefficient r indicates the strength of the correlation between estimated and real values, and the magnitude of 0.9 indicates that the variables can be considered highly correlated.

Eugene et al. [18] designed a wearable device equipped with PPG sensors for extracting bio-information, and the Centralized State Sensing (CSS) algorithm was developed for estimating HR. Comparisons of readings taken across sensors showed that this algorithm achieved more accurate HR measurements.

Wojcikowski [19] proposed an algorithm for real-time HR estimation by a wrist-wearable device. The device incorporates PPG and accelerometer sensors; the acceleration signal is used to detect body movements which distort the PPG signal. The evaluation results evidenced that the developed algorithm for HR measurement outperformed the other algorithms from the literature.

Münzner et al. [20] presented methods for developing robust deep learning (DL) models for human activity recognition (HAR), addressing the problems of normalization and fusion of multimodal sensor HAR data. The results show that sensor-specific normalization increases the prediction accuracy of convolutional neural networks (CNNs). In the context of multimodal HAR, further normalization techniques should be investigated which focus on other modalities, such as physiological sensors.

Tang et al. [21] proposed a new CNN that uses hierarchical split (HS) for a large variety of HAR tasks, which can enhance multi-scale feature representation ability via capturing a wider range of receptive fields of human activities within one feature layer. Benchmarks demonstrated that the proposed HS module is an impressive alternative to baseline models with similar model complexity and can achieve higher recognition performance.

Zhang et al. [22] presented a Deep Neural Network (DNN) to detect lumbar-pelvic movements (LPMs), including flexion, lateral flexion, rotation, and extension, locally on-device, where the data were collected from a clinically validated sensor system. Continuous monitoring of these movements can provide real-time feedback to both patients and medical experts, with the potential of identifying activities that may precipitate symptoms of low back pain (LBP) as well as improving therapy by providing a personalized approach.

Aside from the prototypes that emerged from literature research, as depicted in Table 1, there are wrist-worn commercial products available which utilize PPG sensors for obtaining physiological measurements, as shown in Table 2. Empatica E4 [23] is a wearable wireless multisensory device for real-time data acquisition and computerized biofeedback. E4 comprises four embedded sensing modules: a photoplethysmography (PPG) module, an electrodermal activity (EDA) module, a 3DOF accelerometer module, and a temperature sensing module. E4 offers readings of HR, activity status, and temperature while being capable of characterizing the function of the autonomic nervous system, using EDA to assess sympathetic activation and HRV to assess parasympathetic activation. The device is compliant with international safety and emissions standards. MaxRefDes103# [24] is a physiological signal sensing band reference design available to the research community for further development. It is a wrist-worn wearable exhibiting high sensitivity and algorithmic processing capabilities, comprising an enclosure and a biometric sensor hub with an embedded algorithm that processes PPG signals in real time to extract HR and SpO2 only. Its output and raw data are streamed via Bluetooth to an Android application or PC GUI for demonstration, evaluation, and further elaboration. In addition to displaying the extracted HR and SpO2, the Android application furnishes additional algorithms for calculating RR, HRV, and sleep quality. Other wrist-worn wearables used for health monitoring are the Fitbit Versa 3 [25], Samsung Galaxy Watch 4 [26], and Apple Watch Series 7 [27]. The Fitbit smartwatch records the physiological features HR, SpO2, skin temperature variation, sleep stages, and RR during sleep. Fitbit utilizes the BLE communication protocol and has mobile applications compatible with Android and iOS. Both the Samsung Galaxy Watch 4 and the Apple Watch Series 7 offer ECG and sleep monitoring. Moreover, both wearables calculate the HR and SpO2 physiological parameters and provide their data wirelessly through BLE/Wi-Fi and BLE, respectively. The Samsung wearable is compatible with Android, while the Apple wearable is compatible with iOS. As can be seen, among the approaches and the wearable solutions, there are limitations concerning the set of physiological values provided.

**Table 1.** Indicative functionalities and features of prototypes emerged from literature research.


<sup>1</sup> N/A: non-available. <sup>2</sup> NS: non-stated.


Our proposed comprehensive system introduces a non-invasive wrist-wearable device capable of capturing the PPG signal and extracting the HR, RR, and SpO2 physiological parameters simultaneously and in real time, as well as rendering the raw PPG signal available for further processing in the service of new bio-parameter and feature assessment. The extraction process takes place in situ, ensuring optimal utilization of the device's resources as well as the network's. It is a lightweight embedded device with minimal add-ons, exhibiting optimized memory capacity and processing power. It supports direct connection to 802.11.x communication infrastructures, which makes it an ideal candidate for instantaneous, unhindered use in existing communication infrastructures offering high-speed information sharing. Additionally, a cloud-based back-end infrastructure offers all the required means to securely store the data aggregated via HTTPS in a time-series manner, where end users and health professionals can perform visualization and algorithmic processing, respectively.

#### **3. Proposed Device Components**

This section focuses on the design and development aspects of the proposed wrist-wearable device for the continuous monitoring of the PPG physiological signal and the real-time extraction of the HR, SpO2 and RR physiological parameters. Its scope is to provide an unobtrusive means to accurately assess the physiological data of the users, enabling them to monitor their well-being status. The proposed design follows a modular approach at both the physical (hardware modules) and operational (software modules) levels, as described below.

#### *3.1. Hardware Modules*

The wrist-wearable device is a microcontroller-based device designed for continuous monitoring of PPG. The device extracts the HR, SpO2 and RR physiological parameters by implementing dedicated algorithms and transmitting the information over Wi-Fi to a developed web-based platform. The device is powered by a Lithium Polymer (LiPo) battery which can be charged using a USB cable. It is also equipped with an on-off switch for turning on or shutting down the device accordingly. Figure 1 depicts the block diagram of the proposed embedded device along with its external peripherals.

**Figure 1.** Block diagram of the developed wearable device.

The hardware components of the device are surface mounted on a custom printed circuit board (PCB) which was designed considering effortless repair, analysis, and field modification of circuits with dimensions of 66.0 × 42.0 mm.

• Microcontroller and Radio. The device's microcontroller board incorporates the Espressif ESP-WROOM-02D, which is based on the ESP8266 chip implementing the Wi-Fi communication protocol [28]. Specifically, the core of the platform is the ESP8266 processor from Espressif Systems, a Wi-Fi SoC integrating the full TCP/IP stack. The developed firmware code for the acquisition and processing of the PPG signal, the extraction of the physiological measurements, and the wireless transmission is executed on the ESP8266 microprocessor. The ESP8266 integrates a Tensilica L106 32-bit RISC processor, achieving ultra-low power consumption and reaching a maximum clock speed of 160 MHz. Moreover, the ESP-WROOM-02D integrates an RF switch, matching balun, and a PCB antenna.


A custom casing was designed to enclose the device using a 3D printer with PLA material. Figure 2 presents the printed circuit board of the device as well as the complete wearable device mounted on the wrist. The device is designed to be worn on the left hand and the placement is approximately 2 cm from the beginning of the wrist. Constant pressure between the PPG sensor and the skin is applied with the aid of the attached wrist strap. Inappropriate device placement results in insufficient light detected by the photodetector, a condition which activates a notification for proper alignment on the interface of the web-based platform.

**Figure 2.** The proposed wearable device.

#### *3.2. Software Modules*

The main operations of the proposed SW modules include the following:


Figure 3 overviews the device's algorithmic operation.

**Figure 3.** Flow chart of the algorithm.

#### 3.2.1. PPG Acquisition

The principle of the device's operation is PPG signal recording. During the initial stage, the Automatic Led Emission Control (ALEC) technique is performed, which algorithmically mimics the Automatic Gain Control closed-loop feedback circuit. ALEC automatically adjusts the system to the specific characteristics of each user, as it regulates the LEDs' luminosity depending on the skin tone and wrist diameter of each user. During experimentation, it was observed that the ADC output tends to drift over time and to vary between subjects. The former observation was attributed to the fact that, even with perfect placement, the device is subject to slight movement; the latter, to the unique physiology of each subject's wrist and skin tone. To eliminate this problem, the LED level is adjusted until the ADC output reaches a level that is optimal for the imminent signal processing without reaching saturation. This technique was developed to lead the system to a consistent response.

Subsequent to ALEC, the device starts recording the IR and RED PPG signals along with acceleration data. The acceleration is monitored with the aim of detecting conditions of intense movement. Segments of the PPG signal captured during extensive movement carry catastrophic Motion Artifacts (MAs), rendering the extraction of any physiological parameter unfeasible [31]. To overcome this, segments of the PPG where motion is detected are excluded and the sampling restarts. To detect motion, the built-in logic of the ADXL362 is utilized, whose activity and inactivity events are used as triggers for manipulating the PPG sampling. An activity event is triggered when the acceleration of the device remains above a predetermined threshold for a specified time period. The accelerometer features two modes of operation: absolute and referenced. In the absolute mode, each incoming acceleration sample is compared with a user-defined threshold; when the threshold is surpassed for a certain amount of time, activity is detected. In the referenced mode, each acceleration sample undergoes a regularization to remove the effects of gravity (which can reach 1 g) and to account for the status of the device prior to sampling. This is achieved by subtracting an internally determined value, captured regularly during inactive periods, from the acceleration sample. The corrected acceleration value is then compared with the user-defined threshold, and if it is surpassed, an activity event is issued. Consequently, activity is detected only when the acceleration has deviated sufficiently from the initial orientation. The threshold selected for the activity event was set at 350 mg sustained for at least 1 s. This eliminates only movements whose intensity can obscure the estimation of the physiological parameters and allows lower-intensity motion to be handled by the signal processing algorithm.

#### 3.2.2. Signal Processing

The PPG signal consists of AC and DC components. The DC component corresponds to non-pulsatile tissue, while the AC component alternates according to the heart cycle. Only the variable part of the signal is relevant for HR and RR determination; thus, the mean value is usually subtracted from the signal used in the HR and RR measurement. The recorded raw PPG signal is shown in Figure 4.

**Figure 4.** Raw PPG signal.

The implemented device, which is thoroughly described in [32], deploys an algorithm for the digital processing of the PPG signal in the time domain to remove the effect of MAs and the DC component. Given that the PPG frequencies relevant to our analysis range from 0.1 to 5.0 Hz [33], an IIR Butterworth bandpass filter with a passband of [0.1, 5] Hz is applied prior to signal manipulation. The variable component of the signal after filtering is shown in Figure 5.

**Figure 5.** Filtered PPG signal.
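The filtering stage can be sketched as follows; the sampling rate and filter order are assumptions, and zero-phase filtering is used here for simplicity.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Sketch of the band-pass stage: a Butterworth filter with a [0.1, 5] Hz
# passband. Sampling rate (100 Hz) and order are illustrative assumptions.
FS = 100.0  # Hz

def bandpass_ppg(raw: np.ndarray, low: float = 0.1, high: float = 5.0,
                 order: int = 2) -> np.ndarray:
    b, a = butter(order, [low / (FS / 2), high / (FS / 2)], btype="band")
    # Zero-phase filtering; the DC component is removed along with out-of-band noise.
    return filtfilt(b, a, raw)

t = np.arange(0, 30, 1 / FS)
raw = 1000 + 20 * np.sin(2 * np.pi * 1.2 * t)  # synthetic 72 bpm pulse on a DC level
print(round(bandpass_ppg(raw).std(), 2))       # AC component preserved, DC removed
```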

Aiming at producing reliable estimations of the physiological parameters, extensive experiments were performed which resulted in the requirement for a PPG signal collection of at least 30 s. Within this time, useful raw PPG signal is certainly included, enabling the procedures of processing and physiological data extraction. Consequently, the measurements are produced every 30 s in cases of absence of considerable movement. As long as there are new measurements, the data are sent to the backend through Wi-Fi. Implementing the mentioned operating specifications, the device sustains a battery life of 48 h in continuous operation.

#### 3.2.3. Heart Rate Estimation

For the estimation of the HR, the signal acquired from the infrared (IR) LED source is utilized. After filtering the IR PPG signal, the Slope Sum Function (SSF) is applied [34], as shown in Equation (1). The procedure amplifies the peaks of each pulse and suppresses the noise represented by lower amplitudes.

$$\text{SSF} = \sum_{k=0}^{m} \Delta x_k^2, \quad \text{where } \Delta x_k = \begin{cases} \Delta s_k, & \Delta s_k > 0 \\ 0, & \Delta s_k \le 0 \end{cases} \tag{1}$$

At this stage, the emphasized peaks in each window of the SSF output signal are identified as local maxima, as shown in Figure 6. After locating the peaks and estimating the time difference d between them, the instantaneous HR is computed for every pair of successive peaks using Formula (2) [35]:

$$\text{HR}\_{\text{inst}} = \frac{6 \times 10^4}{\text{d}},\tag{2}$$

Finally, the HR measurement is reckoned as the average of the instantaneous HR values in a 30 s time window.
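A compact sketch of this HR stage, combining the SSF of Equation (1) with peak detection and the averaging of Equation (2), is given below; the window length and the simple threshold-based peak detector are assumptions.

```python
import numpy as np

# Sketch of the HR stage: SSF (Equation (1)) emphasizes upslopes, peaks are
# located as local maxima, and instantaneous HRs (Equation (2)) are averaged.
def slope_sum_function(signal: np.ndarray, window: int = 12) -> np.ndarray:
    ds = np.diff(signal)
    ds[ds <= 0] = 0.0                     # keep positive slopes only
    ds = ds ** 2                          # squared, per Equation (1)
    return np.convolve(ds, np.ones(window), mode="same")

def heart_rate_bpm(ppg: np.ndarray, fs: float = 100.0) -> float:
    ssf = slope_sum_function(ppg)
    thr = 0.5 * ssf.max()                 # assumed threshold for peak candidates
    peaks = [i for i in range(1, len(ssf) - 1)
             if ssf[i] > thr and ssf[i] >= ssf[i - 1] and ssf[i] > ssf[i + 1]]
    d_ms = np.diff(peaks) / fs * 1000.0   # peak-to-peak distances d, in ms
    return float(np.mean(6e4 / d_ms))     # average of instantaneous HRs, Equation (2)

t = np.arange(0, 30, 0.01)
ppg = np.sin(2 * np.pi * 1.2 * t)         # synthetic 72 bpm pulse
print(round(heart_rate_bpm(ppg)))         # -> ~72
```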

#### 3.2.4. Blood Oxygen Saturation Estimation

To enable the assessment of the blood oxygen saturation, two LEDs operating at the RED and IR wavelengths are utilized [36]. The principle of pulse oximetry is based on the comparison of the two waveforms, whose deviation is a direct indicator of the oxygen saturation. Their deviation occurs due to the different amounts of light absorbed and emitted by the two types of hemoglobin, namely oxyhemoglobin and deoxyhemoglobin. At the red wavelengths, deoxyhemoglobin absorbs a higher amount of light than oxyhemoglobin, while the opposite happens in the infrared region. Hence, the responses from the RED and IR LEDs captured by the photodetector are different.

**Figure 6.** SSF output.

The PPG waveform consists of two different components: the DC component corresponding to the light diffusion through tissues and non-pulsatile blood layers, and the AC (pulsatile) component due to the diffusion through the arterial blood. The developed algorithm locates the existing peaks and valleys and subsequently calculates the AC and DC components of both RED and IR PPG waveforms [37]. The DC component fluctuates slightly with respiration, while the AC component oscillates in concurrence with the changes appearing in the volume of arterial blood during the cardiac cycle [38]. Given the AC and DC components, a ratio R is calculated by the Equation (3):

$$R = \frac{\text{AC}_{\text{Red}}/\text{DC}_{\text{Red}}}{\text{AC}_{\text{IR}}/\text{DC}_{\text{IR}}}, \tag{3}$$

Eventually, the SpO2 value is estimated using Equation (4), provided by Maxim Integrated:

$$\text{SpO}\_2 = -45.006 \times \text{R}^2 + 30.354 \times \text{R} + 94.84,\tag{4}$$
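The SpO2 computation reduces to evaluating Equations (3) and (4) on the extracted components, as in the sketch below; the example component values are illustrative.

```python
# Sketch of the SpO2 stage: the R ratio (Equation (3)) from the AC and DC
# components of the RED and IR waveforms, mapped through Equation (4).
def spo2_from_components(ac_red: float, dc_red: float,
                         ac_ir: float, dc_ir: float) -> float:
    r = (ac_red / dc_red) / (ac_ir / dc_ir)          # Equation (3)
    return -45.006 * r**2 + 30.354 * r + 94.84       # Equation (4)

# Illustrative (assumed) component values, giving R = 0.5 -> SpO2 ~ 98.8%
print(round(spo2_from_components(ac_red=1.0, dc_red=100.0,
                                 ac_ir=2.0, dc_ir=100.0), 1))
```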

#### 3.2.5. Respiratory Rate Estimation

To perform RR estimation, two modulated signals need to be extracted from the original PPG signal obtained from the IR channel [39]. The two components, namely the Frequency Modulation (FM) and the Amplitude Modulation (AM), illustrate the effects of respiration, as a physiological process, on the PPG signal. Respiration is a complex process consisting of various mechanisms which cause many subtle changes to the original PPG signal. The most prominent of those effects are FM, which manifests the spontaneous increase in the heart rate during inspiration and the corresponding decrease during expiration [40], and AM, which results from the reduced stroke volume during inhalation lowering the pulse's amplitude [41].

Following the raw PPG acquisition, a bandpass filter is applied to eliminate frequencies not related to respiratory information. The process of peak characterization includes separating the waveform in individual pulses and detecting their maximum value. The FM can be defined as the time difference between two consecutive peaks as described in Equation (5), whereas the AM is formed by each individual amplitude peak of the signal as shown in Equation (6). Each time value assigned to the FM and AM signal samples is calculated as the mean of the time of occurrence of two peaks as shown in Equation (7).

$$x_{\text{FM}} = \left| t_{\text{peak}_{i-1}} - t_{\text{peak}_i} \right|, \quad i = 2, \ldots, N, \tag{5}$$

$$x_{\text{AM}} = \left| y_{\text{peak}_i} \right|, \quad i = 1, 2, \ldots, N, \tag{6}$$

$$t = \frac{t_{\text{peak}_{i-1}} + t_{\text{peak}_i}}{2}, \quad i = 2, \ldots, N, \tag{7}$$

The samples of the modulated signals are not evenly spaced in time, which inhibits further signal processing. To evenly sample the two waveforms, Shape-Preserving Piecewise Cubic Interpolation is performed on the acquired data points, and the sampling rate is set at 4 Hz. Prior to the Fast Fourier Transform (FFT), a Hamming window is applied to minimize the side lobes of the frequency response.

At this stage, the two power spectra are combined to amplify the potential peaks, the dominant frequency (Fd) in the plausible range is identified and the final RR value is then computed by Equation (8):

$$\text{RR} = \text{F}\_{\text{d}} \times 60 \,\text{(breaths/min)},\tag{8}$$
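The full RR pipeline can be sketched as follows; the synthetic peak series and the plausible-frequency band are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Sketch of the RR stage: FM/AM series from detected peaks (Equations (5)-(7)),
# shape-preserving interpolation to 4 Hz, Hamming window, FFT, and Equation (8).
def respiratory_rate(t_peaks: np.ndarray, y_peaks: np.ndarray,
                     fs_resample: float = 4.0) -> float:
    fm = np.abs(np.diff(t_peaks))                     # Equation (5)
    am = np.abs(y_peaks[1:])                          # Equation (6)
    t = (t_peaks[1:] + t_peaks[:-1]) / 2.0            # Equation (7)

    grid = np.arange(t[0], t[-1], 1.0 / fs_resample)  # even 4 Hz sampling
    spectra = []
    for series in (fm, am):
        x = PchipInterpolator(t, series)(grid)        # shape-preserving interpolation
        x = (x - x.mean()) * np.hamming(len(x))       # detrend + Hamming window
        spectra.append(np.abs(np.fft.rfft(x)) ** 2)
    combined = spectra[0] * spectra[1]                # amplify shared peaks

    freqs = np.fft.rfftfreq(len(grid), 1.0 / fs_resample)
    band = (freqs > 0.1) & (freqs < 0.7)              # plausible RR range (assumed)
    fd = freqs[band][np.argmax(combined[band])]       # dominant frequency Fd
    return fd * 60.0                                  # Equation (8), breaths/min

# Synthetic peaks: ~72 bpm with a 0.25 Hz (15 breaths/min) respiratory modulation
t_approx = np.cumsum(np.full(60, 0.833))
intervals = 0.833 + 0.05 * np.sin(2 * np.pi * 0.25 * t_approx)
t_peaks = np.cumsum(intervals)
y_peaks = 1.0 + 0.1 * np.sin(2 * np.pi * 0.25 * t_peaks)
print(round(respiratory_rate(t_peaks, y_peaks)))      # -> ~15
```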

#### 3.2.6. Data Transmission

The wearable device transmits data to an API endpoint via Wi-Fi, implementing the HTTPS communication protocol, which follows the HTTP protocol over a secure and encrypted connection. A time window of 30 s interposes between two data transmissions. A payload in URL-encoded format is generated, which includes the values of the physiological parameters collected by the wearable device: HR, SpO2, and RR. In addition, the payload includes the MAC address of the Wi-Fi Access Point (AP) and the battery level of the wearable device. The parameters of the payload are described in Table 3.



The API service, located on the server side of the proposed web-based application, provides an endpoint to which the wearable device can POST requests. Therefore, a parametric URL, presented in Figure 7, is utilized by the API service. When a POST request reaches the API service, it parses the URL, uses the GET variables to extract the value of each parameter, validates that each parameter has the appropriate format, and stores the parameters in the database.

**Figure 7.** API service parametric URL.
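A minimal server-side sketch of such an endpoint follows; the route name, parameter names, and storage layer are illustrative assumptions rather than the system's actual implementation.

```python
from flask import Flask, request

# Sketch of the API endpoint described above; parameter names mirror the
# payload fields (HR, SpO2, RR, MAC, battery), but the URL scheme and the
# storage layer are assumptions.
app = Flask(__name__)

@app.route("/api/measurements", methods=["POST"])
def store_measurements():
    try:
        record = {
            "hr": int(request.values["hr"]),
            "spo2": int(request.values["spo2"]),
            "rr": int(request.values["rr"]),
            "mac": request.values["mac"],
            "battery": int(request.values["battery"]),
        }
    except (KeyError, ValueError):
        return {"error": "missing or malformed parameter"}, 400
    # Insert `record` into the time-series database here.
    return {"status": "stored"}, 201

if __name__ == "__main__":
    app.run(ssl_context="adhoc")  # HTTPS, as required by the device
```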

The MAC information is included because the proposed device has the capability of connecting to the Wi-Fi AP with the best signal strength when there is more than one Wi-Fi AP in the surrounding space. The AP with the highest signal strength will be the closest to the user. Thus, the MAC address of a particular AP can be exploited to infer the approximate location of the user, given the location of each AP.

Typically, the SSID and the password of the Wi-Fi network are assigned to variables within the device's code. This would require the end users to enter their Wi-Fi credentials and upload new code to the device. To overcome this, the Wi-Fi Manager library is used, which allows end users to connect to different APs without having to interact with the firmware. More specifically, when the device is activated by a user, a connection to a previously saved AP is attempted. If this process fails, the AP mode is enabled, allowing the user to configure a new set of SSID and password. The user has to navigate to a web page with default IP address 192.168.4.1 and enter their Wi-Fi credentials into a form. Once a new valid set of SSID and password is set, the device automatically reboots and establishes a connection.

The device also has the capacity to locally store data in cases of dropped or lost Wi-Fi connection. The ESP8266 module provides 1 MB of flash memory, of which 0.4 MB is occupied by the device's firmware. The remaining memory is utilized for storing the measurements produced during the absence of a network connection. The saved data are transmitted when the Wi-Fi connection is restored, and the device then continues its regular operation. The device extracts the physiological parameters every 30 s and transmits them along with other data. As mentioned, the 30 s time period is a specification that emerged during the trials, since within this period useful raw PPG signal is definitely included, facilitating the production of reliable physiological data values.

#### 3.2.7. Measurements Evaluation

Aiming to evaluate the performance of the suggested wrist-wearable device and the accuracy of the extracted physiological parameters, commercial off-the-shelf certified devices were used. The values obtained by these devices are considered the reference and are compared with the values of the proposed wrist-wearable device.

For the evaluation of the HR and SpO2 physiological parameters, a medical finger pulse oximeter was utilized. The commercial finger pulse oximeter chosen is a certified medical device manufactured by Berry: the BM2000D Bluetooth Pulse Oximeter [42]. The accuracy of the RR determination methodology was evaluated utilizing the chest-worn Zephyr BioHarness device, a physiological monitoring system with proven reliability in determining RR [43].

Ten healthy subjects with varying wrist circumferences and skin tones were provided with the wrist-wearable device and the reference devices. In particular, the subjects were equipped with the proposed wrist-wearable device and the Berry Pulse Oximeter along with the Zephyr BioHarness as ground truth devices. The experiments were performed at a sedentary state and the total duration of the experiment for each subject was 1 h, yielding an aggregation of 10 h of data.

To analyze the agreement between the data acquired from our proposed device and those from the reference instruments, the Bland–Altman plot was deemed ideal. The Bland–Altman graph plots the difference between paired readings of two variables, in our case the derived and the reference values, against the average of those readings. Incorporated into the plot are the ±1.96 SD lines (the limits of agreement) parallel to the mean difference line.
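For reference, the following sketch computes and draws such a plot with NumPy/Matplotlib; it is a generic Bland–Altman implementation, not the authors' analysis code.

```python
# Paired differences against paired averages, with the mean-difference line
# and the +/-1.96 SD limits of agreement.
import matplotlib.pyplot as plt
import numpy as np

def bland_altman(derived, reference):
    derived, reference = np.asarray(derived), np.asarray(reference)
    diff = derived - reference           # derived minus reference reading
    mean = (derived + reference) / 2     # average of each pair
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    plt.scatter(mean, diff, s=10)
    plt.axhline(bias, color="k")
    plt.axhline(bias + half_width, color="k", linestyle="--")
    plt.axhline(bias - half_width, color="k", linestyle="--")
    plt.xlabel("Average of paired readings")
    plt.ylabel("Derived - reference")
    plt.show()
```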

Figures 8–10 present the comparative analysis of our data and render the proposed device a reliable system for the extraction of the desired physiological parameters.

**Figure 8.** Bland–Altman plot for HR.

**Figure 9.** Bland–Altman plot for SpO2.

**Figure 10.** Bland–Altman plot for RR.

The Bland–Altman plot can reveal four types of disagreement between methods: systematic error (a mean offset), proportional error (a trend), inconsistent variability, and excessive or erratic variability. It should be noted that the Bland–Altman comparison method innately assumes that the two methods compared produce closely comparable results, a condition satisfied in our analysis.

#### **4. Web-Based Application**

Nowadays, web-based applications dominate markets as they offer considerable advantages over traditional desktop applications. Web-based applications run directly on web browsers, demand minimal computer resources, do not require any installation process, are highly scalable, and maintenance tasks are performed centrally on the web server. In addition, they are portable and cross-platform available, which allows users to access them from any place and any type of device. Therefore, web-based applications are deemed to be the optimal solution for flexible projects.

Web-based applications can be structured with various architectural patterns. The proposed one adopts the client-server model. The client-side refers to the visible and interactive part of the application, whereas the server-side is responsible for processing client requests and responses. In most cases, the server-side includes a database to support transactions with data and API services to support interoperability with end-devices and systems.

#### *4.1. Front-End Implementation*

Front-end stands for the client side of a web-based application. It provides the graphical interface through which end-users interact with the web-based application, and it visualizes the information acquired by the wearable device. The front-end of the proposed web-based application employs the core web technologies: HTML5 structures the UI components, CSS styles them, and JavaScript enables interactivity between the UI components and the end-users. In addition, a series of external frameworks and libraries were utilized to optimize the development process, i.e., to meet the business objective of the web-based application with minimal effort, achieve high performance standards, and ensure the appropriate infrastructure for further scalability. More specifically, the Bootstrap framework, an open-source front-end framework, was employed to accelerate development, optimize overall performance in terms of computational resources, and provide a set of user-friendly and highly customizable UI components; it supports responsive design for web-based applications and incorporates web-accessibility standards. Chart.js, an open-source JavaScript library, was employed for advanced chart implementation within the front-end: it offers pre-built and highly customizable UI chart components, handles data objects efficiently, optimizes the chart-drawing process, and improves the style of the visualized data. Finally, jQuery, a lightweight open-source JavaScript library, was employed to simplify event handling, improve the manipulation of UI components, and strengthen the interactivity capabilities.

The dashboard, shown in Figure 11, is the core interface of the front-end and serves as a personalized analytics overview and real-time monitoring tool. Its primary objective is the visualization of the user's latest measurements, captured by the wearable device, in an intuitive and user-friendly way. It therefore employs UI components, mainly line charts embedded within cards, to depict the trend of each physiological parameter over the last hour. The charts establish an asynchronous connection with the database and are thus automatically updated in real time with the latest measurements recorded by the wearable device. The horizontal axis of each chart represents time, whereas the vertical axis represents the captured values of the displayed physiological parameter. Below each chart, a card footer informs end-users of the time of the chart's latest data update.

**Figure 11.** The UI of dashboard.

Another core component of the front-end is the horizontal navigation menu bar, placed in the UI's header area. It comprises a minimal information area, which displays the connection status of the wearable device and the battery level, as well as a list that maps to the analytics and account areas of the web-based application. The analytics areas (Figure 12) visualize historical data for the corresponding physiological parameter in a user-friendly way. In addition, they implement a date picker filter that allows end-users to apply the desired time span to the displayed data.

**Figure 12.** The UI of heart rate analytics.

The account area (Figure 13) displays a list of the available Wi-Fi networks and permits users to perform two actions: 'edit network' and 'delete network'. The 'edit network' action allows users to change the credentials of an existing Wi-Fi connection, and the 'delete network' action allows users to remove a specific network from the saved networks list. Moreover, from the 'Account' page, users can configure a connection with a new network.


**Figure 13.** The UI of user account page.

#### *4.2. Back-End Implementation*

Back-end stands for the server side of a web-based application. On the back-end side, the web-based application implements embedded PHP scripts within the front-end code to provide dynamic functionality, which enables data to be retrieved from the database and visualized in near real time. The back-end also comprises a RESTful API that allows the data collected by the wearable device to be submitted to the database. More specifically, it parses a JSON object containing the payload's data and decodes it to extract each value and insert it into the database. A data validation process ensures the quality of the data, with checks for duplicate entries, appropriate data format and type, and null values. In addition, SQL queries are executed to confirm that the extracted identifiers exist in the database, to avoid conflicts in the data selection process.
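A sketch of these validation steps, using sqlite3 as a stand-in for the production MySQL database; only the table names come from the text, and every column name is an assumption.

```python
# Parse the JSON payload, check for nulls, verify identifiers, reject
# duplicates, then insert.
import json
import sqlite3

def validate_and_insert(raw_json, conn):
    data = json.loads(raw_json)                         # parse the JSON payload
    required = ("device_id", "timestamp", "hr", "spo2", "rr")
    if any(data.get(key) is None for key in required):  # null / missing values
        return False
    cur = conn.cursor()
    # Confirm the extracted identifier exists, to avoid later conflicts.
    cur.execute("SELECT 1 FROM Devices WHERE id = ?", (data["device_id"],))
    if cur.fetchone() is None:
        return False
    # Reject duplicate entries for the same device and timestamp.
    cur.execute(
        "SELECT 1 FROM HealthRecords WHERE device_id = ? AND ts = ?",
        (data["device_id"], data["timestamp"]),
    )
    if cur.fetchone() is not None:
        return False
    cur.execute(
        "INSERT INTO HealthRecords (device_id, ts, hr, spo2, rr) "
        "VALUES (?, ?, ?, ?, ?)",
        (data["device_id"], data["timestamp"],
         data["hr"], data["spo2"], data["rr"]),
    )
    conn.commit()
    return True
```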

A user authentication mechanism also stands at the back-end. It has been implemented with Keycloak, an efficient, reliable, and extendable authentication and resource access management framework for web-based applications. The proposed application supports two roles: administrator and user. Administrators can access only an administrative dashboard page, from which they can perform administrative tasks such as manual password resets, user account management, and service health monitoring. Users can access only the application UI resources—dashboard, analytics pages, and account configuration page—from which they can keep up with visualized insights, apply changes related to their account information, and configure connections with networks.

The web-based application employs a MySQL database to perform transactions with data automatically collected by the wearable device and data manually provided by the end-users. The automatically collected data are included within the transmitted payload, i.e., the physiological parameters, timestamp, and MAC address. The database adopts a relational schema which structures the data into tables, as shown in Figure 14. The core functionality of the web-based application, i.e., data visualization, user authentication, and connection configuration, relies on four tables: "Users", "Devices", "Connections", and "HealthRecords".

**Figure 14.** Database schema.
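A plausible reconstruction of that schema, expressed through Python's sqlite3 for illustration; only the four table names are taken from the paper, and every column beyond them is an assumption.

```python
# Hypothetical four-table schema mirroring Figure 14.
import sqlite3

SCHEMA = """
CREATE TABLE Users         (id INTEGER PRIMARY KEY, username TEXT, role TEXT);
CREATE TABLE Devices       (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES Users(id));
CREATE TABLE Connections   (id INTEGER PRIMARY KEY,
                            user_id INTEGER REFERENCES Users(id),
                            ssid TEXT, ap_mac TEXT);
CREATE TABLE HealthRecords (id INTEGER PRIMARY KEY,
                            device_id INTEGER REFERENCES Devices(id),
                            ts TEXT, hr REAL, spo2 REAL, rr REAL);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```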


#### **5. Discussion**

Physiological parameters provide critical information about an individual's well-being status and can signal early signs of bodily dysfunction. For example, detecting an abnormally high HR could indicate that actionable measures should be taken to return to healthy levels and reduce the risk of cardiovascular disease, while monitoring RR can detect early signs of a respiratory illness or allergy. SpO2 is useful for determining the sufficiency of oxygen, or the need for supplemental oxygen, in any setting where an individual's oxygenation may be unstable or low.

Digital health technologies facilitate an individual-focused preventative approach through continuous monitoring of physiological parameters. This approach paves the way for personalized treatment, better care access and quality of service.

Provided that the subject is in a sedentary state, the proposed wearable apparatus can reliably detect PPG signals and extract the physiological parameters of HR, SpO2, and RR. Specifically, the device achieves a mean percentage error of 2.47% and 0.8% for HR and SpO2, respectively, while estimating the RR parameter with a deviation of ±1.4 breaths per minute. The evaluation procedures showed that the wrist-wearable device can accurately detect fluctuations in the physiological parameters in a sedentary state. Consequently, it can be used to effectively monitor well-being status and provide valuable information. Moreover, a portable and cross-platform web-based application has been developed, which serves as an informative and Wi-Fi connection establishment tool, from which individuals and health professionals can access the entailed parameters in a user-friendly and efficient way regardless of location, type of device, and operating system, and handle connections with networks in a simple way.
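The paper does not spell out the error formula behind these figures; a common definition consistent with them is the mean absolute percentage error over the n paired readings:

```latex
\mathrm{MPE} \;=\; \frac{100\%}{n} \sum_{i=1}^{n}
\frac{\left| x_i^{\mathrm{derived}} - x_i^{\mathrm{reference}} \right|}
     {x_i^{\mathrm{reference}}}
```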

The impetus for future work is to enhance the accuracy of the extracted parameters in all respects. We aim to adopt accelerometer-based detection and removal of faulty PPG segments, or a signal preprocessing approach for MA elimination. Alongside this, more trials with different subjects should be performed: collecting a larger amount of sample data from users across a broader spectrum of age or skin color would provide an opportunity for better algorithm calibration. Our future research interest also focuses on a multi-wavelength photoplethysmography approach, which has shown superior performance to the single-wavelength one. Recently, advances in sensing modalities for detecting light from multiple sources have enabled the development of a single-chip sensor, removing the need for spectrometers, whose size is prohibitive for a wearable device. Moreover, in an effort to expand our knowledge of the effect of light wavelength on the quality of the PPG signal, experiments are being conducted at wavelengths other than the red and infrared currently used. Last but not least, such a device paves the way for injury prevention, early detection of illnesses or disorders, and early interventions aimed at avoiding the deterioration of health conditions.

#### **6. Conclusions**

This work presents, at both the physical and operational levels, all the discrete components of a comprehensive, user-embracing system able to unobtrusively record and process vital physiological parameters. It comprises a non-invasive wrist-wearable device able to continuously detect PPG signals and solidly extract the physiological parameters of HR, SpO2, and RR, and a multimodal web-based application through which end users visualize real-time or historical data, while health professionals can interact with those data for further algorithmic processing. The configuration of the system and its Wi-Fi connection was designed to be effortless even for older individuals who are not accustomed to high-end technology. The wrist-wearable device is a lightweight modular embedded device with a microcontroller-based main board exhibiting optimized memory capacity and processing power, as well as long autonomy, and off-the-shelf sensors mounted for capturing the PPG signal. Moreover, by supporting direct connection to 802.11.xx communication protocols, it is an ideal device for utilizing existing communication infrastructures that offer high-speed information sharing. The cloud-based back-end infrastructure offers all the required means to securely store the data transmitted from the wearable device over the HTTPS protocol in a time-series manner. Both health professionals and end users have easy access to historical and real-time data. Professionals can utilize the collected historical data to perform statistical analyses or execute AI/ML methods, aiming to obtain valuable health information for an individual or a group, thus unlocking a vast field of possibilities. End users can glance at their data according to their preferences, simply by applying filters to adjust the friendly UI chart components of the front-end.

The accuracy assessment of the extracted physiological parameters, along with the evaluation of the system performance, were carried out against commercial off-the-shelf certified equipment, which was worn by healthy subjects with different anatomical characteristics.

Future actions include the increase in accuracy of the extracted parameters in all respects, the enhancement of the algorithmic processing capabilities and the execution of more trials on diverse subjects.

**Author Contributions:** Conceptualization, J.G.; Data curation, G.P., C.M. and G.T.; Methodology, J.G. and G.T.; Resources, J.G.; Software, M.K., J.G., G.P., C.M. and G.T.; Writing—review & editing, M.K., J.G. and C.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was co-funded by the Greek General Secretariat of Research and Technology, through ESPA 2014–2020, under the project T1EDK-02489 entitled "Intelligent System in the Hospitals ED and Clinics for the TRIAGE and monitoring of medical incidents—IntelTriage" and the European Union, under the Horizon 2020 project No. 957736 entitled "Intelligent Interconnection of Prosumers in PEC with Twins of Things for Digital Energy Markets—TwinERGY", H2020-LC-SC3-2020-EC-ES-SCCRIA.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** J.G. and G.T. made substantial contributions to the conception, the design and implementation of the hardware components and embedded software modules of the proposed integrated system, provided approval for publication of the content, and agreed to be accountable for all aspects of the work. G.P. and M.K. made substantial contributions to the design and implementation of the embedded software modules, provided approval for publication of the content, and agreed to be accountable for all aspects of the work. C.M. made substantial contributions to the design and implementation of the web-based application, provided approval for publication of the content, and agreed to be accountable for all aspects of the work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **PATROL: Participatory Activity Tracking and Risk Assessment for Anonymous Elderly Monitoring**

**Research Dawadi 1, Teruhiro Mizumoto 2, Yuki Matsuda 1 and Keiichi Yasumoto 1,\***


**\*** Correspondence: yasumoto@is.naist.jp; Tel.: +81-90-2460-3965

**Citation:** Dawadi, R.; Mizumoto, T.; Matsuda, Y.; Yasumoto, K. PATROL: Participatory Activity Tracking and Risk Assessment for Anonymous Elderly Monitoring. *Sensors* **2022**, *22*, 6965. https://doi.org/10.3390/s22186965

Academic Editors: Ivan Miguel Serrano Pires and Antoni Martínez Ballesté

Received: 4 July 2022 Accepted: 12 September 2022 Published: 14 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Abstract:** Advances in medicine and technology have contributed to an increase in the number of elderly people living alone. However, hospitals and nursing homes are crowded, expensive, and uncomfortable, while personal caretakers are expensive and few in number. Home monitoring technologies are therefore on the rise. In this study, we propose an anonymous elderly monitoring system to track potential risks in everyday activities such as sleep, medication, shower, and food intake using a smartphone application. We design and implement an activity visualization and notification strategy method to identify risks easily and quickly. For evaluation, we added risky situations to an activity dataset from a real-life experiment with the elderly and conducted a user study using the proposed method and two other methods varying in visualization and notification techniques. With our proposed method, 75.2% of the risks were successfully identified, compared with 68.5% and 65.8% with the other methods. The average time taken to respond to a notification was 176.46 min with the proposed method, compared to 201.42 and 176.9 min with the other methods. Moreover, the interface analyzing and reporting time was also lower (28 s) with the proposed method, compared to 38 and 54 s with the other methods.

**Keywords:** elderly monitoring; successful aging; mobile application; gerontechnology

#### **1. Introduction**

Advancements in medicine and health care technologies have led to an increase in life expectancy over the years. It is expected that, by 2050, there will be at least 2 billion people over the age of 60 years [1]. The statistical handbook of Japan released in 2021 by the Statistics Bureau, Ministry of Internal Affairs and Communications, Japan, revealed that, in 2015, there were about 22 million households with residents aged 65 and above, including 6 million who lived alone [2]. Living independently, especially for the elderly, is risky because, in addition to mental problems such as memory loss, depression, and loneliness, there can be physical problems such as falling down, issues with eyesight, hearing loss, back pain, etc. [3]. Though different remedies have been developed for different types of physical and mental ailments, with an increasing number of elderly people it is apparent that there is a need for monitoring and anomaly detection mechanisms. Much research has thus contributed to recognizing, predicting, and monitoring activities inside smart homes [4,5].

As people get older, their involvement in different physical and mental activities declines [6]. They go out less, engage less in activities related to physical fitness, have difficulty reading for long periods due to weakened eyesight, and so on. Similarly, they deal with issues they had not dealt with when they were younger, such as the need to take medication every day and the adverse effects of missing a meal. Falls or similar incidents also tend to make the elderly cautious in their activities, impacting their confidence, activity completion, and social interactions. Therefore, it becomes imperative to track whether the elderly have completed basic day-to-day activities every day in order to detect any abnormal conditions that might have occurred or might occur [5,7]. There have been many advancements in human monitoring, collecting vital health statistics, and tracking human behavior over recent years [8]. Off-the-shelf sensors can now be used in houses to provide information about light intensity, temperature, and the usage of doors and appliances [9], making it possible to determine activities inside the house.

Research has also been carried out in health care centers, but implementing such technology in the home environment is more suitable for the elderly. The elderly have made memories over the years in their homes and have possessions they cherish [10]. Hence, they feel more comfortable living in their own home and conducting their basic everyday activities there. Moreover, hospitals and health care centers are either expensive or overbooked; costs can be reduced by up to 52% when patients receive treatment and help at home rather than in hospitals [11]. It is therefore necessary to develop systems that can enhance elderly care in their own home rather than in hospitals or support homes. Professional caretakers are expensive as well, and with the increasing number of elderly people, they tend to be overbooked and busy [5]. Home monitoring technologies can help family members and relatives who are far away be assured about the safety and contentment of the elderly [1]. However, their busy schedules may not allow them to monitor activities regularly, which is why personnel dedicated to remote monitoring, such as remote caretakers or volunteers, should be assigned the monitoring responsibilities.

With these issues in consideration, in this paper we propose a monitoring system, PATROL (Participatory Activity Tracking and Risk assessment for anOnymous eLderly monitoring), that can anonymously track basic activities of the elderly inside their home and detect or prevent potential risks in their day-to-day activities using a smartphone application. For the successful implementation of the PATROL system, the following requirements need to be fulfilled, for the reasons given below: *(Req. 1) anonymous monitoring*, *(Req. 2) timely monitoring and reporting of activities*, and *(Req. 3) easy and intuitive risk detection*.

Home monitoring can be considered intrusive, as in some cases the elderly may prefer to hide things in their house if there is a video-based monitoring or surveillance system [12]. Similarly, they are usually concerned about privacy and security, and about the types of information about them that are disclosed [1]. This is why we propose anonymous monitoring (Req. 1), where no personal details of the elderly being monitored are disclosed to the monitoring person. Smartphones are a suitable device for regular tracking and monitoring, since many people carry them all day or keep them in their vicinity. Furthermore, notifications have become an essential feature of most smartphone applications [13]. This is why we propose a smartphone application that volunteers can use for tracking and monitoring the activities of elderly people. Similarly, we send frequent notifications in the smartphone application, which ensures that the monitors can access information about the activities of the elderly more quickly than with web pages (Req. 2). Continued usage of smartphone applications has in general been attributed to factors such as ease of navigation, ease of carrying out actions within the application, and appropriate visual cues [14]. We therefore focus on the visualization of activities and propose a method for visualizing activities and detecting risks in daily activities that not only helps to identify risks in the activity visualization easily, but also places less of a burden on the monitoring person (Req. 3).

Therefore, in this paper, we propose an elderly monitoring system that can be used by anonymous volunteers to check the everyday activities of the elderly and determine whether there are any risky situations in their day-to-day activities. Anonymity is maintained by not disclosing any personal or private information of the elderly to the volunteers and, similarly, by not disclosing any personal or private information of the volunteers to the elderly person. Using volunteers for elderly care is a very common practice in Japan [15], where part-time civil servants commissioned by the Minister of Health, Labor, and Welfare as volunteers, locally known as minsei-iin, are assigned to regularly check on elderly people personally, have conversations with them, etc. These part-time civil servants are people who volunteer themselves in the areas of helping children, elderly people, people with disabilities, etc., and have no mandatory obligation to serve in such areas. We believe that our system is an extension of such practice in the field of elderly care. Instead of visiting the elderly, our volunteers can check on the elderly using the smartphone application even if they are not in the vicinity of the elderly. This is helpful in cases where the elderly might not prefer an unknown person to visit them personally, and also in cases where the number of people serving as minsei-iin might not be enough. Since our system uses multiple monitors, we ensure that the activities of the elderly are regularly checked. To maintain anonymity, even if the volunteers discover a risky situation in the daily activities of the elderly, the handling of such a situation, in person, is carried out by the emergency contacts of the elderly, and not the volunteers themselves. For our system, we define risk as a deviation in the start/end time and duration of activities from the usual routine of the elderly.

We developed an Android-based smartphone application that provides information about the completion of seven basic activities: sleep, shower, medication, breakfast, lunch, dinner, and entertainment (use of television (TV)). We created a dataset by including some risky situations in the elderly activity dataset [16] to determine whether those situations can be detected using our application design. To make the monitoring process less burdensome and more intuitive, we also included visualization features such as a candlestick chart representation of activities, a single-interface design, and textual and color codes for the current state of each activity, through which it is easy to infer any deviation in the completion time and duration of activities. Similarly, we focused on quick tracking and monitoring of activities by including two types of notifications to trigger frequent use of the smartphone application: one sent every two hours, and another sent immediately after the elderly person completes an activity.

The main contributions of this paper are the following:


The rest of the paper is organized as follows: Section 2 introduces existing research and how it relates to our study. Section 3 introduces our system, followed by an explanation of our smartphone application. We explain the evaluation study and its findings in Section 4, and in Section 5 we discuss the significance of the results for our system along with the limitations of this study. Finally, we conclude with our contributions in Section 6.

#### **2. Related Studies and Challenges**

Increasing demands for safe, secure, and smart homes for the elderly have led to much research and many advances in the fields of home monitoring and home automation [4,5,17]. Similarly, with the increasing use of smartphone notifications to provide various information to users, we look into studies that explored reliable triggers to inspire people to respond early to mobile notifications. With these factors in mind, we review existing research in two subsections, dealing with activity detection and remote monitoring, and with the importance of smartphone notifications.

#### *2.1. Activity Recognition and Remote Monitoring*

In recent years, research dedicated to monitoring people and their activities inside their houses has been increasing rapidly, since the activities of people can be identified with the help of various sensors that can be attached to different household objects [18]. Most home monitoring methods utilize camera or video captures to learn about the activities of the elderly [7]. Video- and microphone-based monitoring can be time consuming, burdensome, and intrusive [12], and also restricts the area of the house the elderly can occupy for regular monitoring [5]. Numerous research studies have been carried out not only to tackle such problems, but also to improve recognition accuracy and reduce the burden of using wearable sensors. The daily activity pattern of elderly people was identified using only motion and domotic sensors, by identifying the duration of occupancy of a certain room by the elderly [1]. Similarly, using energy harvesting PIR (passive infrared) and door sensors, an activity recognition system was developed that was efficient as well as cost effective [19].

Many other activity recognition systems utilise non-wearable sensors such as motion sensors [20], Bluetooth Low Energy (BLE) beacons [4,17], wireless accelerometers [21], a combination of temperature, humidity, and illumination sensors [22], and a combination of ECHONET Lite appliances and motion sensors [8]. Similarly, by deploying a system that used motion sensors, environmental sensors, and a button to be pressed at the start and end of an activity, daily activities of the elderly were collected for a period of about two months in houses of elderly people [16]. All these studies highlight that it is possible to accurately collect activities in the house using sensors such as motion sensors and environmental sensors, without the use of any wearable sensors, in a cost-effective way that handles concerns for privacy and security.

Activity recognition systems also allow the elderly to live an independent life in their own house whilst their activities are monitored remotely [5]. There have been measures to monitor the vital signs and biomedical signals of adults with medical conditions [23] or of people working in extreme conditions, such as firefighters [24]. The Allocation and Group Awareness Pervasive Environment (AGAPE) system used on-body sensors to monitor the elderly and contacted nearby caregiver groups when it detected an anomaly in the sensor data [25]. Systems can also contact the emergency contact or caregivers of the elderly if any anomaly in the collected data is observed, for example, when the data exceed a predefined threshold [26,27]. When it comes to remote monitoring of the elderly, fall recognition systems are also very important, with some systems recording an average fall-detection response time of between 7 min and 21 min [28]. Such systems can detect falls using various sensing strategies, such as acoustic sensors [29], wearable sensors [30], or the accelerometers in smartphones [31].

Many commercial products are also used to monitor the elderly remotely. Systems such as Mimamori [32] and Canary [33] are specially designed so that children and close family members who live in a distant location can monitor the activities of the elderly. Another system, GreatCall Responder, uses a physical button, called a responder, that the elderly can press if they feel they have an emergency, upon which the system contacts their caregiver [34]. Similarly, there are systems that track numerous activities using motion sensors, which remote caregivers can monitor using a private and secure webpage [35,36]. There are also systems that include secure video communication between doctors and patients for regular or emergency situations, remote health monitoring, and emergency care services [37,38].

Many elderly people, however, regard new technologies as an invasion of their privacy and security [10], and tend to accept technologies only if they are beneficial to them or fit into their day-to-day activities without providing any hindrance [39]. A study revealed that being monitored in their house while conducting their day-to-day activities did not affect the regular daily behavior of the elderly [40]. That extensive study requested the elderly to answer online questionnaires weekly and included the daily activities of sending, reading, and deleting emails, along with tracking their total everyday activities, walking speed, and time spent outside their home. Hence, if issues of privacy and security are tackled, and the elderly feel that the activity recognition system will be valuable to them, then there is a higher chance of acceptance of such a system.

These systems also present some areas of concern. Alerts are sent to caretakers or health professionals via text or email [28] or direct phone calls [26]. However, the number of false alarms, which can be as high as 5 in one hour [29], can cause annoyance to the caretakers. Similarly, even though the accuracy of fall detection systems is high, such as 97.5% [28] or 94% [30], the time it takes such systems to inform the caretaker, or the time it takes caretakers to respond, is not explicitly evaluated. In another system, the activities of the elderly were divided into critical, stable, scheduled, and overlooked, and alerts were generated in a smartphone application according to the type, e.g., 5 min after the usual time for a critical activity such as medication and after 30 min for other activities [41]. These alerts were first sent to the elderly, and if they failed to respond, the caretakers were alerted. However, it is difficult to determine the exact time the elderly might prefer to do their daily activities. Similarly, in the case of an emergency, the elderly may not be physically able to respond to alerts [41] or press the emergency button [34].

#### *2.2. Smartphone Notifications*

Smartphones have become a daily necessity, as they help to tackle isolation as well as to stay in contact with family and friends easily [42]. They have also become an essential tool for staying updated about personal health, work, and news [43]. Smartphone owners interact with their phones an average of 85 times a day [44], which makes them a befitting tool for remote monitoring. Notifications are essential to keep users updated about news, emails from work, and information from social media [45]. Although initially intended for short message services (SMS) or emails, notification features are nowadays used by almost all applications to attract the attention of users. A study determined that notifications can be divided into two categories: personal notifications, like emails, SMS, or those from social networking sites; and mass notifications, like news and advertisements [46]. It concluded that people tend to attend to personal notifications faster and more frequently than to mass notifications.

The response to notifications depends on different factors, such as the sender, the type of alert, and the visual representation of the alert [47]. A recent study showed that users receive approximately 64 notifications each day [48]; hence, the context of a notification plays an important role in the response to it. The time of notification reception, the activeness of the user, and the amount of time the user will take to respond are influential for opening the notification promptly [46]. From a study of about 200 million notifications from more than 40,000 users [13], it was discovered that users view each notification differently and prefer to respond to notifications from social networking sites more quickly than to those from the smartphone system or emails.

Notifications can however lower task performance and affect attention of the user negatively [45]. Response time and response rate of notifications were determined by analyzing the current context of the user through audio from their smartphones [49]. They concluded that the present context of the user plays a very vital role in the response time as well as response rate of the notifications. Similarly, a systematic review on the effects of context aware notification management systems found that context aware notifications increase the response rate [50]. However, it is difficult to predict what time and context can be considered as appropriate for interruption. Since remote monitoring technologies can send multiple notifications in a day, it is essential to determine if such notifications will be viewed as disruptive. Similarly, to our knowledge, the effectiveness of smartphone notifications in remote monitoring systems, especially using multiple types of notification strategies, has not been investigated.

#### *2.3. Challenges*

We found that there are many methods with which activities can be detected accurately. However, in the case of elderly people, it is also necessary to monitor such activities on a regular basis [5]. A smartphone application equipped with adequate notification strategies can provide quicker remote monitoring than most remote monitoring platforms, which are currently web based [35,37,38]. The smartphone application that we have designed can be used to instantly monitor completed activities and receive quick feedback from the monitoring person. It is essential not only to track activities, but also to check whether any risks have occurred, and to predict or prevent potential risks in the daily life of the elderly. Hence, it is first necessary to determine which activities to monitor and whether those activities can be properly visualised in the application, and, furthermore, whether any deviation in the routine of the elderly can be distinguished so that any potential risky situation can be detected. Similarly, it is essential to identify whether using the application and monitoring activities regularly will put a burden on the monitoring person. With all this in mind, we propose the following research questions (RQs), which we try to verify with an experimental study:


#### **3. System Design**

In this section, we first explain the overview of the proposed PATROL (Participatory Activity Tracking and Risk assessment for anOnymous eLderly monitoring) system. Then, we describe the design and interface of our smartphone application in detail.

#### *3.1. System Overview*

The architecture of the PATROL system is shown in Figure 1, where we denominate the elderly person being monitored as *Target* and the person conducting the monitoring as *Monitor*.

**Figure 1.** System architecture of PATROL.

The monitoring can be conducted in different ways. One Target can be monitored by a single or multiple Monitors and one Monitor can conduct monitoring of a single or multiple Targets. Consequently, multiple Monitors can be used to monitor multiple Targets.

The overall system can be further divided into four sections: activity recognition, monitor generation, notification generation, and smartphone application, as highlighted in Figure 1. In this research, we focus mainly on two sections: notification generation and the smartphone application. We will now discuss each of the sections and their application in our overall system.

#### 3.1.1. Activity Recognition

Most elderly people have a definite time and duration for their activities, and follow a routine set of activities throughout the day [51]. It is important to check everyday basic activities because, in old age, these important basic daily activities can sometimes be missed, left incomplete, or not properly carried out [7]. For the purpose of our research, we assume that the Target resides in a smart home equipped with an activity recognition system, where it is possible to collect information related to daily activities like eating, sleeping, watching TV, taking medicine, etc., through the use of different kinds of sensors and power consumption meters available in the house [1,8,16]. We have designed our system so that it can incorporate any available activity recognition system; it is therefore easy to integrate into houses that already have one. The activities that we showcase in the smartphone application are shown in Table 1. We believe that the state of everyday basic activities can be used as a criterion to determine the wellness of the elderly person. There can be instances when anomalies occur during activities that are not listed in Table 1; however, such incidents will subsequently impact the occurrence of the basic activities that we aim to monitor. Therefore, our system can detect anomalies that occur during activities that are not directly monitored in our application. Since our aim is to disclose as little information about the Target as possible, whilst making it possible to determine their current status, we only use the completion time and duration of the activities to provide information about them. We assume that the activity recognition system outputs events (i.e., the start and end times of activities performed by the resident), which are utilized for data visualization and notification generation, as shown in Figure 1. This feature is further discussed in Section 3.2.


**Table 1.** Areas and activities to monitor.
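For concreteness, the events such a system emits can be modeled as simple records; the field names below are ours for illustration, not taken from the paper.

```python
# Illustrative shape of the events assumed from the activity recognition system.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ActivityEvent:
    target_id: str           # anonymized Target identifier
    activity: str            # e.g., "sleep", "medication", "breakfast"
    start: datetime
    end: Optional[datetime]  # None while the activity is still ongoing

    def duration_min(self):
        if self.end is None:
            return None
        return (self.end - self.start).total_seconds() / 60
```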

#### 3.1.2. Monitor Generation

The PATROL system is designed especially for monitoring the elderly, and is intended to be deployed in nursing homes, elderly residential areas, care homes, municipalities, etc. The overall system needs to be handled by a system administrator, who can be the head of the residence association or personnel who work in such institutions. If the system administrator changes, the outgoing system administrator, under the authority of the local welfare committee (and/or residents' association), must train the new system administrator immediately. In our context, Monitors are usually volunteers who work in the field of helping the elderly in care homes, elderly residential areas, etc. The Monitors participate in tracking the activities and determining risky situations in the activities of the elderly. The system administrators have the responsibility of training the Monitors to use the smartphone application, assigning Monitors to each Target, assessing the performance of Monitors, and determining whether any change needs to be made. If Monitors change, the training of new Monitors is likewise handled by the system administrators. Similarly, the initial testing and assessment of our application is handled by the system administrators as well, who check that the system is working properly and that the application is generating activity reports and notifications regularly. Since our application shows activities not just for the current day, but for a period of days (e.g., a week), including previous days, a new user can still become familiar with the start/end times and durations of activities over a range of days and easily deduce a pattern or routine of the Target.

The number of Targets assigned for each Monitor may vary based on the preference of each volunteer. The volunteers are free to choose a minimum or maximum number of Targets to monitor, after which the system administrator will assign them Targets. Therefore, the number may vary from a single Target to multiple ones based on each volunteer.

#### 3.1.3. Notification Generation

To encourage regular usage of the application, frequent notifications are sent to the Monitors. This functionality helps to track the recent activities of the Target in a timely manner and detect any change in the usual routine. We distinguish two types of generated notifications: emergency and general. General notifications are sent to remind Monitors to use the application and check the current activities of the Target. Emergency notifications are sent when the system itself detects abnormalities in the recent activities of the Target. We do not generate or analyze emergency notifications in this research, because we aim to determine how often general notifications are responded to by the Monitors, whether they motivate the Monitors to use the application frequently, and whether constant notifications are burdensome or disturbing.

The notification scheduling techniques that are commonly used can be divided into three types: randomized time points in a day, timed at specific intervals, and event dependent times [52]. In our system, general notifications are generated by using two types of notification strategies: timed at specific intervals and event dependent notifications. This ensures that the monitors are notified regularly to use the application, and can instantly check information about the activity completed.

#### 3.1.4. Smartphone Application

The information collected from the house of the Target is utilised to create a graphical representation of completed activities in time-series form, which helps to identify a pattern in the completion time of an activity and its duration, so that any deviation from the usual pattern can be identified with ease. Hence, we developed an Android-based smartphone application, PATROL, which can be used to view the activities completed and to send reports. Since smartphones have become a common gadget among the elderly as well [53], our application can be used by young volunteers as well as the elderly. For our research, we conduct the experiment using smartphones, but the application can also be used on other Android-based devices, such as tablets.

The interaction between the Monitor and the application is shown in Figure 2. We have tried to minimize the number of actions required of Monitors. In the application, the Monitors receive notifications as a trigger to check the completion time and duration of each activity for the current day and previous days, after which they can judge whether the Target is in a risky situation or not and submit a report. If the Monitor reports that the Target is in a high-risk situation, the application can notify the system administrator and the emergency contacts of the elderly via text, email, or automated phone calls, who can take the necessary actions immediately. The Monitors do not disclose any details of the Target, even in such situations, to maintain the anonymity of our system. The system administrators, who are in the vicinity of the Target, take responsibility for checking on the Target as soon as such reports are received. The reports sent by the Monitors are saved and analyzed to evaluate their monitoring capability.

**Figure 2.** Interaction between Monitor and smartphone application.

For an accurate analysis of our application, it is necessary that risky situations of the Targets are identified correctly. We, however, first need to define what these risks are and how they relate to real-life situations. We created a total of four risk stages, as shown in Table 2. These risks are based on changes in the routine of the Target. If there is no change in their routine, i.e., no noticeable deviation in their activity, we regard the risk as None. Low and Medium risks are defined based on the amount of deviation from the usual start/end time or duration of the activities. High risk refers to situations where an activity has not started or has not been completed, indicating that the Target needs urgent attention.



We have used the standard deviation to define *low* and *medium* level risks. We calculated the standard deviation of the duration and of the completion time of each activity, for each Target. Then, we defined the *low* and *medium* levels of risk as follows:

- **Low risk:**
	- **–** duration outside the mean ±1.5 × standard deviation of duration
	- **–** time outside the mean ±1.5 × standard deviation of activity completion time
- **Medium risk:**
	- **–** duration outside the mean ±3 × standard deviation of duration
	- **–** time outside the mean ±3 × standard deviation of activity completion time

The purpose of using this technique is that it gives us a wide range of duration and activity start/end times that we can relate with risks in real life scenarios. The low risk indicates that the deviation in time or duration was not so concerning, which meant that the elderly had some problems but were able to deal with them themselves. Medium risk indicates a higher deviation in time or duration of activity, which indicates that the elderly might not be doing so well and need to be attended to personally. Since we have ourselves defined these ranges of duration and start/end times for low and medium risks, they are flexible, and hence can be modified based on the activity data of the elderly.
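A sketch of this classification rule follows; High risk (an activity that never starts or completes) is detected separately, from missing events rather than from deviations.

```python
# Classify one observation against a Target's own history for that activity.
import statistics

def risk_level(value, history):
    """`value` is a duration or a completion time (in minutes); `history`
    holds past observations of the same quantity for the same Target."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    deviation = abs(value - mean)
    if deviation <= 1.5 * sd:
        return "None"    # within the usual routine
    if deviation <= 3 * sd:
        return "Low"     # small deviation, likely handled by the elderly alone
    return "Medium"      # larger deviation, personal attention may be needed
```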

#### *3.2. Application Design*

Even though recent technologies have been designed and developed with an average young user in mind, who is efficient at handling new systems or devices [54,55], we have tried to make the interface simple and intuitive so that it can be used conveniently by people of all ages. As shown in Figure 2, the number of tasks to be carried out by the Monitor in the application is minimal. Therefore, we believe that the application will be easy to use and that the burden of using it will be low for the Monitors. Our final goal is to achieve remote elderly care and prompt identification of risky situations; however, we believe that to achieve this, the design and interface of the application should be favorable to the Monitors. We aim to ensure that providing continuous and detailed elderly care and offering an easy and intuitive interface for Monitors are not mutually exclusive goals. The actions to be carried out in the application are: respond to notifications, check activities, and submit a report. Below, we explain the different features of the smartphone application's interface and the notification strategy that we developed.

#### 3.2.1. Features of the Application Interface

We have designed the application with various interface features that are aimed at helping the monitoring process. All the activities are shown in a single interface to reduce the burden of going back and forth between interfaces to monitor the activities. We will now discuss the features of the application interface.

#### Activity Report

The application shows the option to choose whom to monitor among a list of Targets, as shown in Figure 3a. Since our application is anonymous, the real names of the Targets are not shown. We used three commonly used names in Japan (Taro, Watanabe, and Yamazaki) to denominate the Targets in our application. Once the Target is chosen, then the activity report interface is shown, as shown in Figure 3b.

**Figure 3.** Snippet of the smartphone application for: (**a**) choosing Targets, (**b**) sleep card, (**c**) breakfast card, and (**d**) submitting report.

The activity report interface breaks down each activity into different cards, with each card showcasing the current status of the activity (incomplete, ongoing, or completed), the activity completion time (in graph as well as text form), and the duration of the activity, as shown in Figure 3b,c. For activities like TV and medication that can occur multiple times in a day, each occurrence is represented by a separate card. The candlestick chart style helps to identify a pattern in the completion time of an activity and its duration, so that any deviation from the usual routine can be recognized with ease. We use a candlestick chart to show activities because it showcases the time as well as the duration with clarity, and the difference between consecutive days is also understandable.

Ideally, the Monitor should be able to submit only one report per activity per day, and should provide the report for an activity only after the activity has been completed. Hence, in order to prevent multiple and erroneous reporting, we use two techniques: color codes in the activity cards and radio buttons for reporting. For activities that occur multiple times in a day (such as TV and medication), multiple activity cards of the same activity are shown. To avoid confusing users, only one activity card is shown at the start of the day, before any occurrence has taken place; further cards are then added soon after each occurrence.

#### Colors Codes in Activity Cards

Traffic light colors have been used in various research studies, from labelling food with traffic colors to indicate its edibility or freshness [56,57], to using traffic colors as a means of self-monitoring by recording weight and shortness of breath in a diary [58]. We use traffic color codes for the activity cards in order to make the current status of the Target's activities clear, as shown in Figure 4.

**Figure 4.** Use of color for representing activity state for: (**a**) activity not complete, (**b**) activity ongoing, (**c**) activity complete, and (**d**) activity reported.

The background color of the activity card is *red* when the Target has not completed the activity, as shown in Figure 4a. The current status information, shown as *Incomplete*, also indicates that the activity has not been performed yet for the current day. The information about the end time and duration of the activity is also empty at this stage.

The background color of the activity card remains *red* when the Target starts the activity, as shown in Figure 4b. The current status information changes to *Ongoing* in this case, and the information about the start time of that activity is updated. The information about the duration of the activity is still empty at this stage.

The background color changes to *yellow* when the activity is finished by the Target. The current status is also updated, to *Complete*, along with the information about the end time and duration of the activity. Along with the change in color, the radio buttons for reporting the status are shown below the card, as shown in Figure 4c.

When the Monitor reports about the activity, then the background color of the card is changed to *green*. Along with that, the radio buttons for reporting are hidden, as shown in Figure 4d. Thus, when the Monitor opens the application again after submitting a report, the option to report again is not available, and the color codes help them identify the activities they have already reported.

We believe that, since people are familiar with traffic colors and their functions, this feature of the application is intuitive and helps to clearly distinguish the states of an activity. The colors are also directly related to the state of the elderly as well as to the necessity of the Monitor's attention. When the background color is red, activities are either ongoing or not started at all, which means that the elderly person has not completed the activity. This state requires a higher amount of attention from the Monitor, because if the background color does not change from red for a prolonged time, the Monitor should deduce that the elderly might be in a risky situation and should report it via the overall report card. When the background color of the card changes to yellow, it indicates that the elderly has completed an activity, and the Monitor should now check the activity and submit a report. This state requires less attention from the Monitor than the red background state. Similarly, a green color gives Monitors confirmation that they have already completed the reporting task and need not pay any further attention to that particular activity.

#### Overall Report

Along with the activity cards for each activity, there is a separate card called the Overall card. It is used to report the overall impression of the status of the elderly. It can be submitted multiple times by the Monitor throughout the day and has the same reporting options for risk and confidence as the other activity cards, as shown in Figure 4d. Thus, when submitting reports for activities, the Monitor has the option to choose what they feel is the overall status of the Target, based on their judgement of the activities completed or not completed. In High risk situations, such as no activity or a long deviation, the Target will not register completion of activities regularly, which means that no notifications are sent and the activity cards are not updated. If no activity has been updated for a significant time, the Monitors can deduce that there is something wrong with the Target. In such situations, they can report the emergency using the Overall card. The card also shows the type and time of the previous response for the Overall card, to make it easier for the Monitor to recall their previous impression, as shown in Figure 5b.

#### Submit Report

The Monitor's task is to check the Target's activity report, analyze the information shown, and then submit their own report. A report can be submitted for one activity at a time, or for multiple activities at once. To submit the report for an activity, the Monitor scrolls down in the activity report interface and clicks the submit button at its end, as shown in Figure 3d.

If the Monitor responds with *high* risk and *high* confidence for any activity, the application can infer that the elderly person might be in an emergency situation and promptly notify the Target's emergency contact (friends, family, or health professionals) via text message, email, or automated phone call, so that they can take the necessary actions. Similarly, if more than two subsequent *medium* risks are reported with *high* confidence, the emergency contact can be notified immediately. To provide a basis for analyzing the confidence of reports, we divided the confidence level for each report into *Low*, *Medium*, and *High*, as seen in Figure 4c. The confidence levels thus act as reference points for the risk of each activity, especially when there are multiple Monitors: they convey each Monitor's certainty about their report and also help to analyze their monitoring capabilities.
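A minimal sketch of these two escalation rules, assuming reports are kept in a simple chronological list, might look as follows; all type and function names are hypothetical, not taken from the PATROL source.

```kotlin
// Hypothetical sketch of the escalation rules described above.
enum class Risk { NONE, LOW, MEDIUM, HIGH }
enum class Confidence { LOW, MEDIUM, HIGH }

data class Report(val risk: Risk, val confidence: Confidence)

/** Returns true when the Target's emergency contact should be notified. */
fun shouldNotifyEmergencyContact(history: List<Report>): Boolean {
    val latest = history.lastOrNull() ?: return false
    // Rule 1: a single high-risk, high-confidence report triggers notification.
    if (latest.risk == Risk.HIGH && latest.confidence == Confidence.HIGH) return true
    // Rule 2: more than two subsequent medium-risk reports with high confidence.
    val consecutiveMedium = history.takeLastWhile {
        it.risk == Risk.MEDIUM && it.confidence == Confidence.HIGH
    }.size
    return consecutiveMedium > 2
}
```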

**Figure 5.** Overall report card: (**a**) before report submission and (**b**) after report submission.

#### 3.2.2. Notification Strategy

We deploy two kinds of notification patterns in our application: recurring notifications (rN) and activity-based notifications (abN). We send a recurring notification every two hours to prompt the Monitors to use the application. We chose a two-hour period because, based on the activity completion times and the usual gap between activities, it is an appropriate interval for reminding users, whereas a notification every 30 min or every hour would be too disruptive. We have also analyzed users' perception of two-hour recurring notifications and empirically shown that they are not perceived as disturbing and were responded to about 87% of the time [59].

Apart from this, we also send an activity-based notification (abN) as soon as a Target completes an activity. As mentioned in Section 2.2, it is necessary to provide contextual information in notifications to enable quick responses. We therefore include the name of the Target and the completed activity in the notification, as shown in Figure 6. To distinguish between the two types of notifications, we mark abN with a red notification icon (see Figure 6a) and rN with a blue one (see Figure 6b).

**Figure 6.** Example of notifications generated: (**a**) activity based notification (abN) and (**b**) recurring notification (rN).
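The sketch below shows how the two notification types could be realized with standard Android APIs (WorkManager for the two-hour schedule and NotificationCompat for the contextual content). Since notifications in our deployment are delivered from the service server, this purely client-side arrangement, together with the worker class, channel id, and icon resource, should be read as an illustrative assumption rather than the actual implementation.

```kotlin
import android.content.Context
import androidx.core.app.NotificationCompat
import androidx.core.app.NotificationManagerCompat
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.Worker
import androidx.work.WorkerParameters
import java.util.concurrent.TimeUnit

// Worker that would post the recurring reminder (rN); body elided in this sketch.
class ReminderWorker(ctx: Context, params: WorkerParameters) : Worker(ctx, params) {
    override fun doWork(): Result = Result.success() // post the rN here
}

// Schedule the recurring notification (rN) every two hours.
fun scheduleRecurringNotifications(context: Context) {
    val request = PeriodicWorkRequestBuilder<ReminderWorker>(2, TimeUnit.HOURS).build()
    WorkManager.getInstance(context).enqueue(request)
}

// Post an activity-based notification (abN) carrying the Target's name and the
// completed activity; assumes the channel exists and permission is granted.
fun sendActivityNotification(context: Context, targetName: String, activity: String) {
    val notification = NotificationCompat.Builder(context, "abn_channel")
        .setSmallIcon(R.drawable.ic_abn_red) // hypothetical red abN icon resource
        .setContentTitle("$targetName completed an activity")
        .setContentText("Activity: $activity. Open PATROL to review and report.")
        .build()
    NotificationManagerCompat.from(context).notify(activity.hashCode(), notification)
}
```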

#### **4. Implementation and Evaluation**

In this section, we explain the details of the experiment conducted to analyze the application: the multiple versions of the PATROL application that we created, the dataset used, and the results of our study.

#### *4.1. Multiple Versions of PATROL Application*

In order to determine concretely whether our proposed graphical interface (GI), shown in Figure 3b, is intuitive and enjoys a higher degree of user acceptance, we needed to compare it with commonly used activity representation techniques. To make that comparison, we created a separate version of our application in which activities were shown in a tabular, text-based interface rather than as graphs. Figure 7 shows the activity report interface of this version. All the features of the application mentioned in Section 3.2 are included in this version as well, so the working principle is the same regardless of the interface. This reduces confusion for the participants and ensures that the performance and perception of users depends solely on the type of interface, and not on other features of the application.

**Figure 7.** Example of tabular interface (TI): (**a**) activity incomplete, (**b**) activity complete and (**c**) activity reported.

Similarly, we created a third version of our application (GR) in which we did not send activity-based notifications (abN) to the Monitors when an activity was completed by a Target; we only sent them recurring notifications (rN) every two hours. With this version, we aim to determine whether the Monitors are able to report on activities of the elderly even without abN, and thus whether our strategy of providing both abN and rN is effective in encouraging and motivating Monitors to use the application frequently and to deliver continuous reports on the activities of the Target.

Table 3 summarizes the three versions of the application; we will use the same labels (GAR, TAR, and GR) in the discussion that follows. GAR refers to the proposed version of PATROL, which consists of a Graphical interface, Activity-based notification, and Recurring notification. We investigate the accuracy of risk identification and the burden of use of our application by comparing GAR with TAR (Tabular interface, Activity-based notification, and Recurring notification). Similarly, we evaluate the effectiveness of activity-based notifications by comparing GAR with GR (Graphical interface and Recurring notification).


**Table 3.** Versions of the PATROL application.

#### *4.2. Dataset*

The dataset used in our experiment comes from a real-life experiment conducted in the houses of elderly residents over the age of 60 [16]. The activity dataset was collected by Matsui et al. over a period of two months, during which motion and environmental sensors were installed in each of the houses. In addition, a physical button was installed in each house, and the residents were requested to press it whenever they started and ended an activity [16]. The original dataset contains activity recognition data from single- as well as two-person households. For the purpose of this research, we selected only the single-resident households, of which there were three. We use the cleaned data from the above-mentioned study and assume that the activity recognition system is 100% accurate (i.e., we used the ground-truth activity labels in the dataset as the output of the activity recognition method).

The daily activities of the elderly that we want to track and monitor are listed in Table 1. The original dataset, however, does not contain data on the Medication activity, and we also wanted to include multiple activities related to frequent use of the TV. To obtain our desired dataset, we therefore added the aforementioned activities to the original data. Moreover, the two-month study period was longer than our intended experiment period of 10 days, so we selected only a 10-day span from the available two months of data, taking the same time period for all three single-resident households.

We inserted some risky situations into the dataset based on the definitions shown in Table 2. For the purpose of our research, we included only *low* and *medium* level risks. As defined, a *none* risk indicates that there is no problem with the elderly person; hence, we do not need to alter the dataset for such risks, since they coincide with the regular routine of the elderly. A *high* risk indicates that the elderly person is in a serious condition and in need of immediate medical care; in such cases, no activity will be completed by the elderly person, and the activity report in the application will not be updated.

Our aim, however, is to determine whether any deviation from the regular activity routine can be detected using our application. Although *high* risks can occur suddenly, we believe that if *low* and *medium* level risks are monitored and identified regularly, then *high* level risks can be prevented or predicted. For this reason, we did not include *high* level risks in our dataset.

#### *4.3. Experiment Details*

We recruited a total of nine participants (gender: 6 male, 3 female; age range: 25–34 years; average age: 28.6 years) to take part in our evaluation study. The participants played the role of Monitors throughout the experiment, and the modified datasets of the three single-person households were used for the three Targets in the application. The participants were divided into three study groups of three participants each. This allowed us to distribute the versions of our application such that, at any given time, each group used a different version from the other study groups. To implement this, we divided the experiment period into three phases. Table 4 summarizes the assignment of application versions to study groups.


**Table 4.** Study groups and assignment of PATROL application versions.

The three versions of the application were uploaded to the Google Play Store. Before the start of the experiment, we held an introductory session on the research and the experiment, which all participants were required to attend. We explained the theme of the study and the experiment in detail, their role as Monitors, and the tasks they had to complete while using the application. They were also provided with a document containing all the information about the working principles of the different versions of the application, along with QR codes for each version. The document also indicated which version of the application they were supposed to use in each phase of the experiment. As a reward for participation, each participant received a gift card worth 2000 JPY.

To make the transition between interfaces easier for the participants, we included a one-day break between phases. Each phase was designed to last three days. However, at the start of phase 2, we encountered complications with the server connected to our application, and the application did not work properly until mid-day; hence, we asked the participants to continue phase 2 for one more day. In total, the experiment thus spanned 12 days, including two days of breaks. After the end of each phase, we asked the participants to fill in a questionnaire created with Google Forms. Most of the questions were rated on a five-point Likert scale (1 = strongly disagree, 3 = neutral, 5 = strongly agree), while some were open-ended. The participants responded to questions and statements about their perception of the current version of the application, as well as the effect of changing versions, such as "*The activity related notifications were helpful in monitoring the elderly as it reminded me to check the application regularly.*", "*I found the change in the interface confusing.*", and "*I feel the new interface needed more mental effort.*" At the end of the experiment, the participants filled out a final questionnaire. The purpose of these questionnaires is to gain insight into the participants' impressions of the different versions and notification types.

#### *4.4. Results*

The results of our study are analyzed with respect to the following three conditions: accuracy of risk detection, burden of use, and timeliness of detection.


#### 4.4.1. Accuracy of Risk Detection

In order to verify the effectiveness of our visualization technique, it is necessary to check whether the risks inserted into the dataset, as described in Section 4.2, are identified correctly. In this section, we report the rate at which those risks were correctly identified in each phase, using the different versions of the application. Tables 5 and 6 show the rate of correct risk identification based on study groups and interfaces, respectively.

From Table 5, we can observe that StudyGroup C was the most consistent group, with the highest risk identification rate during all three phases of the experiment. Their rate of correct identification also increased as the experiment progressed, which suggests that familiarity with the application helped them analyze the activity reports and submit reports.


**Table 5.** Risk identification based on study groups.

There was a slight decrease in risk identification for StudyGroup A when the interface changed from graphical (GAR) to tabular (TAR) in phase 2 of the experiment. In their questionnaires after phase 2, all of the participants in StudyGroup A agreed that the new interface needed more time to analyze, with 66.7% agreeing that the tabular interface (TI) required more mental effort than the graphical interface (GI). When the interface changed back to a graphical layout (GR) in phase 3, the correct identification rate increased. When asked about the change, participants stated that it was easier to understand the routine with the graph than with the tabular layout (66.7% agree, 33.3% strongly agree).

StudyGroup B showed a considerable increase in correct risk identification in phase 2, as shown in Table 5, even though they had a graphical layout in both phase 1 (GR) and phase 2 (GAR). We attribute this change to growing familiarity with the application: in their questionnaire after phase 2, 66.7% strongly agreed that they were familiar with the application and found it easier to use during this phase. In phase 3, however, their interface changed to the tabular layout (TAR). This led to a reduction in risk identification, with 33.3% strongly agreeing that the change in interface was confusing.

As shown in Table 6, risks were identified correctly about 75.2% of the time on average using GAR, compared to about 65.8% of the time using TAR. GR, which in this context has the same visualization as GAR, had a risk identification accuracy of about 68.5%. The average rate of risk identification is thus lower for the tabular interface (TI) than for both of the graphical interfaces (GI), which indicates that graphical interfaces support better understanding and identification of risks.


**Table 6.** Risk identification based on interface types.

We also found statistically significant differences between the average risk identification rates of the three interfaces using a one-way ANOVA (*p* = 0.037). A Tukey HSD post hoc test revealed a significant pairwise difference between GAR and TAR (*p* = 0.032), whilst no difference was observed between GAR and GR (*p* = 0.2).
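For reference, the statistic behind these *p*-values is the usual one-way ANOVA F ratio of between-group to within-group variance; assuming the identification rates were compared per participant across the $k = 3$ interfaces, with $N$ observations in total:

$$
F = \frac{\mathrm{MS}_{\mathrm{between}}}{\mathrm{MS}_{\mathrm{within}}}
  = \frac{\sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2 / (k-1)}
         {\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 / (N-k)},
$$

where $x_{ij}$ is the risk identification rate of participant $j$ under interface $i$, $n_i$ is the number of observations for interface $i$, $\bar{x}_i$ is the mean for interface $i$, and $\bar{x}$ is the grand mean.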

To investigate this further, we combined the results of GAR and GR into a single group and compared it with TAR, to clearly determine the difference between graphical and tabular interfaces for risk identification. A paired *t*-test showed a significant difference between the two (*p* = 0.047).

#### 4.4.2. Low Burden Evaluation

We define burden as the time taken by a participant between opening the application to check the activity report of the Targets and submitting their report. We logged the time at which the application was opened, as well as the time of reporting, using the SharedPreferences API available to Android developers, and saved these timestamps together in the Firebase database. Using these data, we calculated an average burden time for each participant over the whole experiment period, shown in Figure 8 along with the average burden time for each version.
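A minimal sketch of this logging scheme is given below, using the Android SharedPreferences API and the Firebase Realtime Database client; the preference keys, database path, and participant identifier are illustrative assumptions rather than the actual PATROL implementation.

```kotlin
import android.content.Context
import com.google.firebase.database.FirebaseDatabase

// Record the moment the Monitor opens the application.
fun logAppOpened(context: Context) {
    context.getSharedPreferences("patrol_log", Context.MODE_PRIVATE)
        .edit()
        .putLong("last_open_ms", System.currentTimeMillis())
        .apply()
}

// On report submission, compute the burden and persist both timestamps.
fun logReportSubmitted(context: Context, participantId: String) {
    val prefs = context.getSharedPreferences("patrol_log", Context.MODE_PRIVATE)
    val openedAt = prefs.getLong("last_open_ms", -1L)
    if (openedAt < 0) return // no matching open event was recorded
    val submittedAt = System.currentTimeMillis()
    // Burden = time between opening the app and submitting the report.
    val burdenSeconds = (submittedAt - openedAt) / 1000
    FirebaseDatabase.getInstance()
        .getReference("burden/$participantId")
        .push()
        .setValue(
            mapOf(
                "openedAt" to openedAt,
                "submittedAt" to submittedAt,
                "burdenSeconds" to burdenSeconds
            )
        )
}
```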

**Figure 8.** Average burden time of participants.

We can see that the burden time for GAR is, on average, always less than for TAR. The mean burden times for GAR, TAR, and GR were 28 s, 38 s, and 52 s, respectively. As seen in Figure 8, the burden for participant 1 while using GR is very high compared to the other participants and to the other interfaces used by the same participant. Upon inspection, we discovered that, while using GR, this participant recorded an unusually high burden time for one particular report, which was uncharacteristic given his other responses. Discarding that outlier, the average burden time of participant 1 drops from 193 s to almost 20 s; for the final analysis, however, the skewed data are kept as they are. Similarly, the burden for participant 2 while using TAR is zero because that participant did not record any response during phase 2 of the experiment.

To analyze the link between the burden of using the application and engagement with it over time, we calculated the average time taken to report in each phase of the experiment. The results are shown in Figure 9. When the interface changed from graphical (GAR) to tabular (TAR) in phase 2 for StudyGroup A, the burden time was higher. In phase 3, when their interface changed back to graphical (GR), the burden time appeared extremely high (94 s) because of the unusual report by participant 1 explained above; discarding that particular incident, the burden time was in fact lower than in phase 2 (28 s).

For StudyGroup B, the burden time was highest in phase 1, with 47 s, when using GR. However, the burden time decreased in phase 2 (25 s) when using GAR. This can be attributed to the participants getting familiar with the interface. In phase 3, however, when the interface changed to tabular (TAR), we can see that the average burden time increased to 37 s.

**Figure 9.** Average burden time of study groups per phase.

Similarly, when the interface was changed from tabular (TAR) to graphical (GR), for StudyGroup C in phase 2 of experiment, we can see that the average burden time was lower (22 s). Even though the burden time increased in phase 3 (25 s), using GAR, it was still lower than the burden time in phase 1 (42 s). Therefore, over the course of the experiment period, we can observe that change in interface had some effect on the engagement with the application and burden time. Familiarity with the application lowered the burden time, especially using a graphical interface (GI).

We found a statistically significant difference in the burden time for the three interfaces using a one-way ANOVA (*p* = 0.012). A Tukey HSD post hoc test revealed a significant pairwise difference between GAR and TAR (*p* = 0.039), whilst no difference was observed between GAR and GR (*p* = 0.13).

For further investigation, we combined the results of GAR and GR into a single group and compared it with TAR; a paired *t*-test showed a significant difference between the two (*p* = 0.049). This analysis, together with the results in Figures 8 and 9, shows that there is a significant difference between tabular and graphical interfaces in the burden experienced while using the application, with the graphical interface resulting in a lower burden for the participants.

A lower burden also resulted in higher engagement with the application. Figure 10 shows that the total number of reports received using GAR was almost consistent across the three phases, and on average higher than when using TAR. There was a marked decrease in reports using TAR in phase 2 for StudyGroup A. This can be attributed to the change in their interface, since in the earlier phase they had used the graphical interface (GI); they also mentioned in the questionnaire after phase 2 that the tabular interface (TI) was difficult to understand, which resulted in a lower number of reports.

We can thus conclude that GAR imposes a lower burden on participants than TAR, and on average yields higher engagement and reporting. This further strengthens our proposal that a graphical interface (GI) with adequate textual information can help Monitors identify the routine of Targets and distinguish risky situations while spending less time and effort analyzing the interface.

#### 4.4.3. Timely Detection

Figure 11 shows the time taken to report a completed activity during each phase, based on the type of interface. Over the three phases of the experiment, we can observe that, with a graphical interface (GI), the reports for activities were received more quickly than with the tabular interface (TI): GAR (average = 176.46 min, median = 115.01 min), TAR (average = 201.42 min, median = 118.85 min), and GR (average = 166.9 min, median = 121.12 min). Even though such high response times are not favorable, we think that many factors affected the reporting time for activities.

**Figure 10.** Total number of reports received.

**Figure 11.** Response time for activities per phase based on study groups.

The time at which a notification was generated, which is also the time at which the corresponding activity was completed, was saved using the SharedPreferences API, as mentioned in Section 4.4.2. Similarly, we saved the time at which the activity report was submitted, and determined the time taken to report an activity as the difference between report submission and notification generation. For StudyGroup A, when the interface changed from graphical (GAR) to tabular (TAR) in phase 2, the reporting time was higher than in phase 1, even though they received both rN (recurring notifications) and abN (activity-based notifications) in both phases. This can be attributed to the change in interface because, when their interface changed back to graphical (GR) in phase 3, the response time was lower than in phase 2 even though they no longer received abN. This shows that the type of visualization can affect the response time to the notifications received.

StudyGroup B was almost consistent in its performance throughout the first two phases of the experiment. In phase 2, when their interface changed from GR to GAR, there was no significant change in their response time, even though abN were only received in phase 2. However, when their visualization changed to tabular (TAR) in phase 3, the response times were higher than in the previous two phases.

In contrast, StudyGroup C did not show any significant differences in response time attributable to changes in interface or to the reception of abN. When their interface changed from TAR to GR in phase 2, and from GR to GAR in phase 3, their response time to notifications did not change notably. StudyGroup C thus showed no conclusive effect of the change in visualization or notification strategy on the timing of activity reports.

Table 7 shows the average response time of each participant while using each of the interfaces, with the lowest response time among the three interfaces highlighted. Even though TAR included both abN and rN, none of the participants recorded their quickest responses while using it; moreover, the mean response time using TAR is the highest across all participants (except participant 2, who did not register any response during phase 2). Notably, even though they did not receive abN, four of the participants recorded their lowest mean response time using GR. GAR and GR had mean response times of about 176.46 min and 166.9 min, respectively, while TAR had a mean response time of 201.42 min. Although GR had the lower average response time, the median response time was lower for GAR (115.01 min) than for GR (121.12 min) and TAR (118.85 min), which suggests that typical reports were received more quickly using GAR than GR or TAR.


**Table 7.** Mean response time (in minutes) of each participant.

The quickest mean response time for each participant is highlighted in bold text.

Upon further analysis, we found statistically significant differences in activity response time between the three interfaces using a one-way ANOVA (*p* = 0.005). A Tukey HSD post hoc test revealed a significant pairwise difference between GR and TAR (*p* = 0.05), whilst no difference was observed between GAR and GR (*p* = 0.64) or between GAR and TAR (*p* = 0.055).

We then combined the results of the interfaces that received abN, i.e., GAR and TAR, into a single group and compared it with GR; a paired *t*-test showed a significant difference between the two (*p* = 0.022).

This indicates that the reception of abN does indeed affect the response time to activities. To investigate this further, we determined the time ranges within which responses to the activity notifications were received. Table 8 shows the cumulative percentage of reports received within given time ranges for the three versions of the application. We divided time into 30 min intervals; the table only extends to 210 min, since the highest average response time falls within the 180–210 min range. The proportion of responses received does not vary much between the two graphical interfaces; for the tabular interface, however, the response rate is lower even though abN were received. This shows that abN, when used with a graphical interface, gives a better result than with a tabular interface. We then investigated which interface provided the quickest response for each activity.


**Table 8.** Cumulative percentages of responses received per time range (in minutes).
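As an illustration, the cumulative percentages in Table 8 can be computed by simple 30 min binning, as in the following Kotlin sketch; the function name and the example data are ours, for demonstration only.

```kotlin
/** Given response times in minutes, returns the cumulative percentage of
 *  responses received within each 30-minute bound up to maxMinutes. */
fun cumulativePercentages(
    responseMinutes: List<Double>,
    maxMinutes: Int = 210
): Map<Int, Double> {
    if (responseMinutes.isEmpty()) return emptyMap()
    val total = responseMinutes.size.toDouble()
    return (30..maxMinutes step 30).associateWith { bound ->
        100.0 * responseMinutes.count { it <= bound } / total
    }
}

// Example: three responses at 12, 45, and 95 minutes.
fun main() {
    println(cumulativePercentages(listOf(12.0, 45.0, 95.0), maxMinutes = 120))
    // {30=33.3..., 60=66.6..., 90=66.6..., 120=100.0}
}
```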

We divided the notifications into those for regular activities and those for risky situations. Using the time taken to report each activity, we determined the minimum reporting time among all participants for each activity, together with the version of the application used to submit that report. In this way, we identified which version of the application produced the quickest response for each activity. The results are shown in Figures 12 and 13. We can see that risky situations were responded to more quickly when using interfaces that included abN, even though there is little difference between interfaces in the quickest response time to non-risky notifications.

**Figure 12.** Quickest response for risky situations.

In the final questionnaire, the participants gave reasons that could explain such high response times. Almost 45% of participants (*n* = 4) mentioned that they were busy with their research or private work and could not respond to the notifications on time. We received responses such as: *"I was so busy with my work"; "Busy with my research work or play a game"; "mentally busy with my own work"; "sometimes i was busy"*. Similarly, two of the participants mentioned that they often forgot to check the application, which can be attributed to the different interface types used and notifications received.

Two of the participants responded in the questionnaire that they did not use the application if they did not receive any notifications, while six (66%) said they did not wait for notifications to use the application but were busy with their work and could not respond immediately. We also wanted to know whether the notifications were perceived as distracting or disturbing, to analyze whether this perception played any role in the response time. When asked if the notifications received from the application were distracting, 2 (22%) strongly disagreed, 5 (55%) disagreed, 1 was neutral, and 1 agreed that he was distracted. Similarly, 8 (88%) (strongly agree: 4; agree: 4) said that they would prefer to receive abN so that they could be regularly notified and monitor frequently, while 1 was neutral.

**Figure 13.** Quickest response for non-risky situations.

#### **5. Discussion and Limitations**

In this section, we first discuss the results and evaluate the research questions RQ1–RQ3 posed in Section 2.3, and then present some remaining issues as limitations.

#### *5.1. Discussion*

Considering user engagement and the ability to identify the routine of individuals with the interface, we can conclude that the results are fairly positive towards GAR compared to TAR. Using GAR, 75.2% of risky situations were correctly identified as risks, compared to 65.8% and 68.5% for TAR and GR, respectively. Though the identification of risk using GAR varied between study groups (68.4% for StudyGroup A, 64.7% for StudyGroup B, and 92.6% for StudyGroup C), the overall identification rate is highest for GAR. This shows that risks can be identified using a graphical interface and the style of graph that we used. A response from a participant, *"I can see the difference of the duration directly from the graph. The table one need to scroll up and down to see all the information, which sometimes kind of annoying"*, also suggests that our visualization is effective. These findings support affirmative answers to our research questions RQ1 and RQ2: it is possible to identify the daily routine of individuals using a smartphone application, and it is possible to detect potential risks in such a routine based on the visualization provided.

Using GAR, participants experienced the lowest burden, 28 s, compared to 38 s with the tabular version (TAR). Similarly, none of the participants claimed that the application demanded a lot of time and effort from them. Regarding notifications, only one participant found them distracting, and 88.8% mentioned that they would prefer to receive activity-based notifications for monitoring purposes. Moreover, all of the participants (77.8% strongly agree, 22.2% agree) responded that the use of traffic colors was useful for quickly identifying the state of the activities. Therefore, we can verify RQ3: the regular notifications and the use of the application were not troublesome for the users.

We received a total of 1680 responses from participants over the experiment period, which we take as evidence of their willingness to use the application. When the interface changed from graph to table, there was a reduction in the number of reports obtained (45.6% for StudyGroup A in phase 2, and 9.8% for StudyGroup B in phase 3). Conversely, when the interface changed from table to graph, the number of reports increased by 96.7% for StudyGroup A in phase 3, although it decreased by 11.5% for StudyGroup C in phase 2. Overall, engagement with the application was high, which, together with the low interface-analyzing time, verifies RQ3: using the application is not a burden for the monitoring person.

At the end of the experiment, we asked the participants which representation of activities they preferred: table or graph. All of them agreed that the graphical representation was better. Responses such as *"Got on a quick glance the exact duration of past activities and could check exact time of the day"* and *"With graph, it's easy for me to compare the length of activity at the glance."* further strengthen our proposal that the proposed graphical interface can help to identify a daily routine in a clear and intuitive manner, and further justify RQ1: a smartphone application can be a good tool for identifying daily activities.

#### *5.2. Limitations*

Our system evaluation requires that certain risky situations occur in the activities of the elderly. We did not conduct real-time activity recognition of the elderly but instead used a pre-existing activity dataset because, in real-time scenarios, there is no guarantee that such risky situations will occur, and we would have had to ask someone to deliberately change their activity pattern so that others could detect it, which could provoke unfavorable reactions. Similarly, since activity recognition systems are not perfectly accurate, activities may sometimes be missed or falsely identified, which would hamper our evaluation. Moreover, we recruited students for the experiment, who are often busy with their academic work and/or personal lives, which might have affected the number and timing of the reports received.

#### **6. Conclusions**

In this study, we proposed a system, PATROL, that can be used to anonymously track the everyday activities of the elderly and to identify potential risks in their daily routine using a smartphone application. Our system is intended to be deployed in elderly residential areas or communities, and it does not disclose private information such as age or location to the monitoring person, in order to maintain the privacy and security of elderly residents. The monitoring person receives, from the service server, recurring notifications every two hours and activity-based notifications whenever an elderly person completes an activity, and assesses the elderly person's condition through a smartphone application that visualizes their activity history. We designed our application with features such as a single-interface design, an intuitive graphical user interface for activity and anomaly detection, and color and textual information on the state of activities. Together, these features enable quicker monitoring of the activities of the elderly while imposing a low burden on the monitoring person, who may be responsible for monitoring one or several elderly people at once.

We added risky situations to an activity dataset obtained from a real-life experiment with elderly residents and conducted a user study, comparing the proposed method with two baseline methods that varied in visualization and notification techniques, across three groups comprising nine participants. We found that, with our proposed method, 75.2% of the risks were successfully identified, compared to 68.5% and 65.8% with the other methods. The proposed method also yielded more timely reporting of activities: GAR (median = 115.01 min), TAR (median = 118.85 min), and GR (median = 121.12 min). Moreover, the interface analyzing and reporting time was also lower with the proposed method (28 s) compared to 38 and 52 s with the other methods. As future work, we will conduct real-time activity recognition and monitoring using our application. To achieve that, we will also investigate activity recognition systems based on other kinds of sensors, which can not only provide better activity recognition in real time but also remove the dependency on the elderly person for data collection. Moreover, we will explore the possibility of assessing the elderly person's activity state and detecting anomalies using measurements from ambient sensors (temperature, humidity, illumination, etc.). We will also include high-risk situations, such as a fall followed by no activity, and try to determine whether participants can deduce such emergency situations quickly. Finally, we aim to increase the number of participants so as to receive more reports and analyze the results by age, gender, and other factors.

**Author Contributions:** Conceptualization, R.D., T.M., Y.M. and K.Y.; Methodology, R.D., T.M., Y.M. and K.Y.; Software, R.D. and Y.M.; Validation, T.M.; Formal analysis, Y.M. and K.Y.; Resources, K.Y.; Writing—original draft preparation, R.D.; Writing—review and editing, R.D., T.M., Y.M. and K.Y.; Visualization, R.D. and K.Y.; Supervision, T.M., Y.M. and K.Y.; Funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partly supported by JSPS KAKENHI Grant Nos. JP21H03431, JP19KT0020, JP19K11924, and JP20H04177.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by Ethical Review Committee of Nara Institute of Science and Technology (2020-I-16).

**Informed Consent Statement:** Informed consent was waived because the collected data do not include private information.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

