**Detection of Participation and Training Task Difficulty Applied to the Multi-Sensor Systems of Rehabilitation Robots**

**Hao Yan 1, Hongbo Wang 1,2,\*, Luige Vladareanu 3,\*, Musong Lin 1, Victor Vladareanu <sup>3</sup> and Yungui Li <sup>1</sup>**


Received: 3 September 2019; Accepted: 23 October 2019; Published: 28 October 2019

**Abstract:** In the process of rehabilitation training for stroke patients, the rehabilitation effect is positively affected by how much physical activity the patients take part in. Most of the signals used to measure the patients' participation are EMG signals or oxygen consumption, which increase the cost and the complexity of the robotic device. In this work, we design a multi-sensor system robot with torque and six-dimensional force sensors to gauge the patients' participation in training. By establishing the static equation of the mechanical leg, the man–machine interaction force of the patient can be accurately extracted. Using the impedance model, the auxiliary force training mode is established, and the difficulty of the target task is changed by adjusting the K value of the auxiliary force. Participation models with three intensities were developed offline using support vector machines, for which the C and σ parameters are optimized by the hybrid quantum particle swarm optimization and support vector machines (Hybrid QPSO-SVM) algorithm. An experimental statistical analysis was conducted on ten volunteers' motion representation in training tasks of different difficulty, which are divided into three stages: over-challenge, challenge, and under-challenge; characteristic quantities with significant differences among the task difficulty stages were chosen as the training set for the support vector machines (SVM). Experimental results from 12 volunteers, with tasks conducted on the lower limb rehabilitation robot LLR-II, show that the rehabilitation robot can accurately predict patient participation and training task difficulty. The prediction accuracy reflects the superiority of the Hybrid QPSO-SVM algorithm.

**Keywords:** rehabilitation robot; human–robot interaction; training task; multi-sensor system; quantum particle swarm optimization; support vector machines

#### **1. Introduction**

Neuromuscular injuries such as stroke and spinal cord injury can lead to disability or impaired movement and have become important problems worldwide [1]. Nowadays, there are more than 33 million stroke patients in the world [2], the mortality rate is as high as 80%, and 75% of the survivors are disabled [3]. The necessity of developing rehabilitation robots has made them one of the research hotspots in the world [4,5]. As a robot that is in direct contact with the patient, the rehabilitation robot shoulders the responsibility of helping the patient recover smoothly and safely. The human–computer interaction strategy, including energy interaction and role distribution control, is therefore very important [6].

Clear detection of human–computer interaction and patient intention is the basis of flexible robot control. Most rehabilitation robots use force sensors to feed back mechanical information from patients, as in the work of Hongbing Tao [7] and Victor G. [8]. Hwang et al. judged human motion intention by collecting data from pressure sensors placed at the contact points between a standing-posture rehabilitation robot and stroke patients [9]. Wolf S. et al. connected an elastic element in series with the driving part and named it the series elastic actuator; by detecting the deformation of the elastic element, the joint moment can be measured and the motion intention of the patient can be judged [10]. Kim et al. used only one pressure sensor to realize robotic assistance of the patient's elbow-joint motion [11]. Some studies rely on current changes of the joint motors to detect motion intentions, such as Kim [12]. A few researchers use surface EMG (electromyography) signals and EEG (electroencephalogram) signals to predict the patients' motor intentions [13,14], such as Edward [15], Magdalena [16], and Tang [17]. Yepes et al. use electromyogram signals to determine the required moment of the knee joint [18]. Some researchers apply EMG signals in motion modal recognition for prostheses [19] and cooperative robots [20]. All these measurement methods have their own advantages, but EEG is susceptible to noise interference [21], and the EMG signal is not easy to collect when the skin surface changes [22,23]. Relatively speaking, joint force/torque measurement does not increase the cost of the robot or the complexity of the system. Therefore, in this paper, a hardware detection system consisting of joint torque sensors and a six-dimensional force sensor on the sole is used.

Meanwhile, the rehabilitation effect is not only related to scientific rehabilitation training methods and reasonable training planning, but is also greatly affected by the patients' active participation and active movement intention, which has been proven by clinical studies [24]. In order to improve the active participation of patients during the training process, it is necessary to provide assistance to patients according to the interaction situation during training [25] and to maximize the portion of the task that the patient completes independently. Introducing patients' movement characteristics and physical fitness into the control strategy has a positive impact on the rehabilitation effect [26]. Yatsenko et al. controlled and adjusted the movement speed of a robotic arm according to the amplitude ratio of the EMG signal of the affected limb [27], and the patient could quickly adapt to control the movement of the prosthesis [28], but this was inconsistent with the movement characteristics of the human body. Many studies have introduced velocity fields and virtual channel technology into the specific trajectories of rehabilitation training [29,30]. Cai uses impedance control to construct the velocity field in different directions of the expected trajectory and to provide a correction force within a certain range of the trajectory, with the correction force becoming a rigid force outside the interval threshold [31], but the threshold setting is not given. In order to provide more accurate training for patients, radial basis function (RBF) neural networks have excellent ability to analyze the patient's motor capacity, as researched by Wolbrecht [32] and Pehlivan [33]. However, during the training period, the rehabilitation robot frequently interferes with the training of patients, which easily causes the side effect of relying on the machine and fails to motivate the active participation of patients. In order to introduce the training state of patients into the control loop more accurately, many studies have judged the patient's psychological-level participation by collecting the patient's EMG signal, EEG signal, and other physiological information [34–36]. At the same time, in order to stimulate patients to actively participate in training, most rehabilitation robots currently use interesting games [37] or virtual reality technology [38].

In summary, this paper proposes a lower limb rehabilitation robot using joint torque sensors and six-dimensional force sensors on the foot soles. In the training task, man–machine interaction force information is collected, from which characteristic quantities can be extracted to predict the task difficulty using support vector machines. The rest of this paper is organized as follows: the second section introduces the rehabilitation robot structure with its multi-sensor system and the human–machine interaction mechanical model. In the third section, a multi-difficulty rehabilitation training task is proposed; under the impedance control model, a support vector machines algorithm is used to establish the model of the patients' active participation and task difficulty detection. The fourth section analyses the characteristic quantities of 10 healthy volunteers during training tasks of different difficulty and uses support vector machines (SVM) to predict the participation and task difficulty of two other volunteers.

#### **2. LLR-II Rehabilitation Robot**

#### *2.1. Structural Design of LLR-II*

In order to adapt to patients in the early stage of rehabilitation, the lower limb rehabilitation robot LLR-II designed in this paper can train in two postures, so as to prevent the mechanical leg from squeezing the patient [39,40]. As for the hardware platform of the rehabilitation system, LLR-II adopts a modular design and consists of five sub-modules: lower limb mechanical leg, main control system, sensor system, multi-function seat, and mechanical limit adjustment frame, as shown in Figure 1. This platform is used to validate that the rehabilitation robot can accurately predict subjects' participation and training task difficulty.

**Figure 1.** The LLR-II rehabilitation robot.

The mechanical leg is a planar three-degree-of-freedom serial mechanism, similar to the three joints of the human leg: hip, knee, and ankle. In order to solve the problem of excessive driving power at the hip joint, a self-balancing design is adopted. The knee drive component is installed behind the hip joint rotation axis to balance part of the weight of the mechanical leg, reduce the driving power of the hip joint, and improve the dynamic performance of the mechanical leg. An electric pushrod in the mechanical leg allows it to automatically adapt to patients with heights from 1500 mm to 1900 mm. In order to ensure the safety of sitting and lying posture training, a variable joint limit consisting of a fastened limit groove and a driven limit groove was designed, as shown in Figure 2.

Torque sensors are mounted inside the hip and knee joints of the LLR-II sagittal plane to detect the dynamic torque characteristics of the patient's training state in real time. The dynamic torque characteristic of the ankle joint is detected by a six-dimensional force sensor mounted on the sole of the foot. The torque sensor is manufactured by Sunrise Instruments Company in China, and the six-dimensional force sensor is manufactured by Junde Technology Co., Ltd. in China. The profiles of the sensors and their detailed parameters are shown in Figure 3. Mounting the sensors on the output side of the reducer increases the sensitivity and accuracy of the mechanical information detection, and this information is used to detect the patient's motion intention.

Based on the LLR-II rehabilitation training function, its electrical control system is divided into the Control Center system, Movement Control system, Signal Feedback system, and Human–Computer Interaction system, as shown in Figure 4. The control center system uses a variety of sensors to monitor the human–computer interaction state and uses a variety of signals to complete the planning and training tasks; the mechanical leg receives the instructions and drives the affected limb to perform multi-mode advanced rehabilitation training under the guidance of the driving system.

**Figure 2.** The detailed design of LLR-II leg mechanism.

**Figure 4.** The sensor system composition of LLR-II.

#### *2.2. Man–Machine Interaction Mechanics Model of LLR-II*

The joint no-load moment in LLR-II man–machine coupled motion is affected by the weight of the mechanical leg and the patient's leg, and it can be expressed as a nonlinear function of joint variables [41].

$$\mathcal{M}\_{n \times 1} = F(\theta\_{n \times 1}) \tag{1}$$

where, *Mn*×<sup>1</sup> is the column vector of joint no-load torque, *F*(•) is the mapping function, and θ*n*×<sup>1</sup> is the joint variable.

According to its structure, the mechanical leg can be simplified as a planar three-link serial mechanism. It should be noted that, considering the large weight of the leg, the self-balancing design concept was introduced in the design process in order to increase the stability of the hip joint. The specific model is shown in Figure 5.

**Figure 5.** Leg model of lower limb rehabilitation robot.

In the figure, *l*1–*l*<sup>3</sup> represent the lengths of the thigh, calf, and sole, respectively; *l*<sup>4</sup> represents the length of the self-balancing part; *O*, *A*, *B* represent the hip, knee, and ankle joints, respectively; *D*, *P* represent the first and last end points, respectively; *G*1, *G*2, *G*<sup>3</sup> represent the combined weights of the machine and the patient's thigh, calf, and foot, respectively; *G*<sup>4</sup> represents the weight of the self-balancing part; *R*1–*R*<sup>4</sup> represent the distances from the center of gravity of each part to its joint; θ1, θ2, θ<sup>3</sup> represent the joint variables, with counter-clockwise as the positive direction; θ4, θ<sup>5</sup> represent introduced intermediate variables.

The joint no-load moment equation is obtained as

$$
\begin{bmatrix} M\_1 \\ M\_2 \\ M\_3 \end{bmatrix} = \begin{bmatrix} \cos\theta\_1 & \cos\theta\_4 & \cos\theta\_5 \\ 0 & \cos\theta\_4 & \cos\theta\_5 \\ 0 & 0 & \cos\theta\_5 \end{bmatrix} \begin{bmatrix} G\_3l\_1 + G\_2l\_1 + G\_1R\_1 - G\_4R\_4 \\ G\_3l\_2 + G\_2R\_2 \\ G\_3R\_3 \end{bmatrix} \tag{2}
$$

By expressing the intermediate angles θ<sup>4</sup> and θ<sup>5</sup> in terms of the joint variables, Equation (2) can be rewritten as:

$$
\begin{bmatrix} M\_1 \\ M\_2 \\ M\_3 \end{bmatrix} = \begin{bmatrix} \cos\theta\_1 & \cos(\theta\_1 + \theta\_2 + \theta\_3) & \cos(-\theta\_2 - \theta\_1) \\ 0 & \cos(\theta\_1 + \theta\_2 + \theta\_3) & \cos(-\theta\_2 - \theta\_1) \\ 0 & 0 & \cos(-\theta\_2 - \theta\_1) \end{bmatrix} \begin{bmatrix} G\_3 l\_1 + G\_2 l\_1 + G\_1 R\_1 - G\_4 R\_4 + f\_1 \\ G\_3 l\_2 + G\_2 R\_2 + f\_2 \\ G\_3 R\_3 + f\_3 \end{bmatrix} \tag{3}
$$

It can be abbreviated as:

$$M\_{3 \times 1} = L\_{3 \times 3}(\theta) \bullet \mathcal{C}\_{3 \times 1} \tag{4}$$

In the formula, *M*3×<sup>1</sup> is the joint no-load torque term, *L*3×3(θ) is the joint variable term, and *C*3×<sup>1</sup> is the characteristic parameter term.

The characteristic parameter term *C*3×<sup>1</sup> is associated with patient information; it is unique to each patient and needs to be identified for each patient. The joint variables θ*<sup>i</sup>* and link lengths *li* can be measured by the sensor system on the robot body, but *Gi* cannot be measured directly because it includes the weight of the patient's leg. The measured torque *Ms* is the sum of the torque *Mh* applied by the patient and the no-load torque *M*:

$$M\_s = M\_h + M \tag{5}$$

From Equation (4), the characteristic parameter term can be obtained as

$$C\_{3 \times 1} = L\_{3 \times 3}(\theta)^{-1} \bullet M\_{3 \times 1} \tag{6}$$

First, the ankle joint is moved at a low speed V; the foot pressure value *fzd* is recorded at intervals of Δ*t*, the joint angles θ1, θ2, and θ<sup>3</sup> are calculated, and a total of k samples are recorded. Then the knee joint is rotated in the same manner, and k knee joint torque values *M*<sup>2</sup> and angle values θ1, θ2, and θ<sup>3</sup> are recorded; the hip joint is then rotated to record k hip joint torque values *M*<sup>1</sup> and angle values θ1, θ2, and θ3. Afterwards, *fzd*, θ3, θ2, θ<sup>1</sup> are converted into *M*3, θ4, θ5, and *C*31, *C*<sup>21</sup> and *C*<sup>11</sup> are calculated according to the following formulae.

$$\begin{aligned} M\_{3i} &= \cos\theta\_{5i} C\_{31i} \ (i = 1, \dots, k) \\ \overline{C}\_{31} &= \frac{1}{k} \sum\_{i=1}^{k} \frac{M\_{3i}}{\cos\theta\_{5i}} \end{aligned} \tag{7}$$

$$\begin{aligned} M\_{2i} &= \cos\theta\_{4i} C\_{21i} + \cos\theta\_{5i}\overline{C}\_{31} \ (i = 1, \dots, k) \\ \overline{C}\_{21} &= \frac{1}{k} \sum\_{i=1}^{k} \frac{M\_{2i} - \cos\theta\_{5i}\overline{C}\_{31}}{\cos\theta\_{4i}} \end{aligned} \tag{8}$$

$$\begin{aligned} M\_{1i} &= \cos\theta\_{1i} C\_{11i} + \cos\theta\_{4i}\overline{C}\_{21} + \cos\theta\_{5i}\overline{C}\_{31} \ (i = 1, \dots, k) \\ \overline{C}\_{11} &= \frac{1}{k} \sum\_{i=1}^{k} \frac{M\_{1i} - \cos\theta\_{5i}\overline{C}\_{31} - \cos\theta\_{4i}\overline{C}\_{21}}{\cos\theta\_{1i}} \end{aligned} \tag{9}$$

$$C\_{3 \times 1} = \begin{bmatrix} \overline{C}\_{11} & \overline{C}\_{21} & \overline{C}\_{31} \end{bmatrix}^{T} \tag{10}$$
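The identification procedure described above can be summarized numerically as in the short sketch below; the array names and sampling assumptions are ours, and the snippet only illustrates Equations (7)–(10), not the authors' implementation.

```python
import numpy as np

def identify_characteristic_params(M3, theta5_a, M2, theta4_k, theta5_k,
                                   M1, theta1_h, theta4_h, theta5_h):
    """Estimate the characteristic parameters from k samples per joint motion.

    M3, theta5_a                     : ankle-motion samples (Eq. (7))
    M2, theta4_k, theta5_k           : knee-motion samples  (Eq. (8))
    M1, theta1_h, theta4_h, theta5_h : hip-motion samples   (Eq. (9))
    All inputs are 1-D arrays of length k.
    """
    C31 = np.mean(M3 / np.cos(theta5_a))
    C21 = np.mean((M2 - np.cos(theta5_k) * C31) / np.cos(theta4_k))
    C11 = np.mean((M1 - np.cos(theta5_h) * C31 - np.cos(theta4_h) * C21)
                  / np.cos(theta1_h))
    return np.array([C11, C21, C31])   # C_{3x1}, Eq. (10)
```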

The force exerted by the patient's active intention is the main feature to be identified in rehabilitation training; we judge the rehabilitation effect by identifying the force that the patient can produce actively. In the training process, the actual human–machine interaction force is obtained from the data measured by the sensor system of the robot. The following equation establishes the equivalent terminal mechanical model of the patient.

$$f\_{2 \times 1} = H(\theta\_{3 \times 1}, M\_{3 \times 1}, M\_{s3 \times 1}) \tag{11}$$

In the formula, *f*2×<sup>1</sup> is the static terminal force in the plane of motion, θ3×<sup>1</sup> is the joint variable vector at the current position, *M*3×<sup>1</sup> is the three-joint no-load torque, *Ms*3×<sup>1</sup> is the measured force/torque at the three joints, and *H*(•) is the mapping function.

In the process of human–machine motion, since the force exerted by the patient mainly acts on the pedal, a six-dimensional force sensor is placed in the middle of the bottom of the pedal. Due to the arch structure of the foot, the forces at the heel and the forefoot are simplified to two points, B and P. The force at the heel mainly generates torque at the hip and knee, and the force at the forefoot generates pressure on the foot pedal, as shown in Figure 6.

**Figure 6.** End applied force model.

In Figure 6, *fg* represents the force applied by the patient at point B; *fgx*, *fgy* denote the horizontal and vertical components of *fg*, respectively; *fhzx*, *fhzy* represent the horizontal and vertical components of the forefoot force, respectively; *l*g*xO*, *l*g*xA* represent the moment arms generated by *fgx* at points O and A, respectively; *l*g*yO*, *l*g*yA* represent the moment arms generated by *fgy* at points O and A, respectively; *lhzxB*, *lhzyB* represent the moment arms generated by *fhzx* and *fhzy* at point B; *fx* and *fy* represent the forces measured by the sensor mounted on the sole of the robot foot.

The moment of hip and knee joint can be expressed as:

$$
\begin{bmatrix}
\mathbf{M}\_{\mathbf{h}1} \\
\mathbf{M}\_{\mathbf{h}2}
\end{bmatrix} = \begin{bmatrix}
l\_2 \sin \theta\_5 + l\_1 \sin \theta\_1 & l\_2 \cos \theta\_5 + l\_1 \cos \theta\_1 \\
l\_2 \sin \theta\_5 & l\_2 \cos \theta\_5
\end{bmatrix} \begin{bmatrix}
f\_{\mathbf{g}\mathbf{x}} \\
f\_{\mathbf{g}\mathbf{y}}
\end{bmatrix} \tag{12}
$$

The patient's heel force *fg* can be expressed as:

$$f\_g = \begin{bmatrix} f\_{gx} \\ f\_{gy} \end{bmatrix} = \begin{bmatrix} l\_2 \sin\theta\_5 + l\_1 \sin\theta\_1 & l\_2 \cos\theta\_5 + l\_1 \cos\theta\_1 \\ l\_2 \sin\theta\_5 & l\_2 \cos\theta\_5 \end{bmatrix}^{-1} \begin{bmatrix} M\_{h1} \\ M\_{h2} \end{bmatrix} \tag{13}$$

From this, we obtain

$$f\_g = \frac{\begin{bmatrix} l\_2 \cos\theta\_5 & -(l\_2 \cos\theta\_5 + l\_1 \cos\theta\_1) \\ -l\_2 \sin\theta\_5 & l\_2 \sin\theta\_5 + l\_1 \sin\theta\_1 \end{bmatrix} \begin{bmatrix} M\_{s1} - M\_1 \\ M\_{s2} - M\_2 \end{bmatrix}}{(l\_2 \sin\theta\_5 + l\_1 \sin\theta\_1)(l\_2 \cos\theta\_5) - (l\_2 \sin\theta\_5)(l\_2 \cos\theta\_5 + l\_1 \cos\theta\_1)} \tag{14}$$

The force exerted by the patient's forefoot is collected by the six-dimensional force sensor on the sole of the foot, whose axes are aligned with the pedal, so the end force *fhz* can be decomposed as follows (θ<sup>6</sup> denotes the pedal orientation angle, which corresponds to θ<sup>1</sup> + θ<sup>2</sup> + θ<sup>3</sup> + π/2 in Equations (17) and (18)):

$$f\_{hz} = \begin{bmatrix} f\_{hzx} \\ f\_{hzy} \end{bmatrix} = \begin{bmatrix} f\_x \cos\theta\_6 + f\_y \sin\theta\_6 \\ f\_x \sin\theta\_6 - f\_y \cos\theta\_6 \end{bmatrix} \tag{15}$$

Then, the equivalent terminal force of the patient can be calculated as:

$$f = f\_g + f\_{hz} = \begin{bmatrix} F\_x \\ F\_y \end{bmatrix} \tag{16}$$

where *Fx*, *Fy* are the horizontal and vertical components of the terminal force.

By combining Equations (13) and (15), the terminal static component can be expressed as follows:

$$\begin{array}{c} F\_x = \dfrac{l\_2 \cos\hat{\theta}\,(M\_{s1} - M\_1) - (l\_2 \cos\hat{\theta} + l\_1 \cos\theta\_1)(M\_{s2} - M\_2)}{(l\_2 \sin\hat{\theta} + l\_1 \sin\theta\_1)(l\_2 \cos\hat{\theta}) - (l\_2 \sin\hat{\theta})(l\_2 \cos\hat{\theta} + l\_1 \cos\theta\_1)} \\ +\, f\_x \cos(\theta\_1 + \theta\_2 + \theta\_3 + \pi/2) + f\_y \sin(\theta\_1 + \theta\_2 + \theta\_3 + \pi/2) \end{array} \tag{17}$$

In the formula, θˆ is an intermediate joint-angle variable, with θˆ = −θ<sup>1</sup> − θ2.

$$\begin{array}{c} F\_y = \dfrac{-l\_2 \sin\hat{\theta}\,(M\_{s1} - M\_1) + (l\_2 \sin\hat{\theta} + l\_1 \sin\theta\_1)(M\_{s2} - M\_2)}{(l\_2 \sin\hat{\theta} + l\_1 \sin\theta\_1)(l\_2 \cos\hat{\theta}) - (l\_2 \sin\hat{\theta})(l\_2 \cos\hat{\theta} + l\_1 \cos\theta\_1)} \\ +\, f\_x \sin(\theta\_1 + \theta\_2 + \theta\_3 + \pi/2) - f\_y \cos(\theta\_1 + \theta\_2 + \theta\_3 + \pi/2) \end{array} \tag{18}$$

Equations (17) and (18) fully determine the mapping between the terminal force and the joint variables, no-load torque, and measured torque described in Equation (11), and they provide the input parameters for the subsequent judgment of the patient's motion intention and for the control strategy.
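To show how Equations (14)–(18) are evaluated in practice, the sketch below computes the equivalent terminal force from the measured and no-load joint torques and the sole-sensor readings; the function and argument names are ours, and this is only an illustration of the equations above, not the authors' implementation.

```python
import numpy as np

def terminal_force(theta1, theta2, theta3, Ms1, Ms2, M1, M2, fx, fy, l1, l2):
    """Equivalent terminal force [Fx, Fy] of Eqs. (14)-(18).

    theta1..theta3 : joint angles (rad)        Ms1, Ms2 : measured hip/knee torques
    M1, M2         : no-load hip/knee torques  fx, fy   : sole-sensor forces
    l1, l2         : thigh and calf lengths
    """
    th = -theta1 - theta2                       # intermediate angle used in Eqs. (17)-(18)
    # Heel force from the hip/knee torque differences, Eq. (14).
    J = np.array([[l2*np.sin(th) + l1*np.sin(theta1), l2*np.cos(th) + l1*np.cos(theta1)],
                  [l2*np.sin(th),                     l2*np.cos(th)]])
    f_g = np.linalg.solve(J, np.array([Ms1 - M1, Ms2 - M2]))
    # Forefoot force resolved into the robot frame, Eq. (15), with the pedal
    # orientation taken as theta1 + theta2 + theta3 + pi/2 as in Eqs. (17)-(18).
    th6 = theta1 + theta2 + theta3 + np.pi / 2
    f_hz = np.array([fx*np.cos(th6) + fy*np.sin(th6),
                     fx*np.sin(th6) - fy*np.cos(th6)])
    return f_g + f_hz                           # [Fx, Fy], Eq. (16)
```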

#### **3. Participation Detection of LLR-II**

#### *3.1. Assist Force Training Control*

According to the change of the patient's participation in the training process, the assist mode and the active training mode are divided into different grades, so as to ensure that the patient completes the training while maximizing the patient's training enthusiasm and task completion. Using the impedance control model, the human–robot interaction force is represented by the end-position offset and is amplified by the game in the task. With the participation of the robot's assistive force, the task difficulty is classified to ensure that patients can find suitably challenging rehabilitation tasks. In order to improve the level of the patients' active participation, the magnitude of the assistive force is adjusted in real time according to the recognized level of physical participation, ensuring that patients maintain a high level of participation during training.

With the progress of rehabilitation training, the patient gradually gains a certain ability to control the affected limb; when this is not yet enough for full control, assistance training is introduced, in which the robot obtains the patient's motion intention through the force/torque sensors and then drives the affected limb for training. In order to improve the coordination of the joints, the patient needs to complete trajectory training, such as circular and linear trajectories. In many cases, the patient cannot perform the trajectory training task independently, and the robot needs to assist in suppressing wrong movements. The assistance training control mode introduces the impedance model shown in Figure 7. The forward kinematic solution of the current actual joint positions θ*a* is compared with the current desired position on the training trajectory, and the auxiliary force *F<sup>T</sup>* is calculated; *F<sup>T</sup>* is summed with the patient force *Fa*, and the result is sent to the impedance controller to obtain the end-position control value *Pd*. The inverse kinematic solution then yields the desired joint positions θ*d*, which are transmitted to the position controller to realize the assist control.

**Figure 7.** Assistance training control block diagram.
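A highly simplified sketch of the control loop of Figure 7 is given below. The kinematics and position-controller hooks are placeholders, the proportional form of the auxiliary force, the control period, and the use of the impedance offset relative to the desired trajectory are our assumptions; the impedance parameters follow Equation (19).

```python
import numpy as np

# Impedance parameters per axis, following Eq. (19): M*x_dd + B*x_d + K*x = F.
M_IMP, B_IMP, K_IMP = 0.0625, 5.0, 1000.0
DT = 0.01  # control period in seconds (assumed value)

# Placeholder kinematics/controller hooks; the real robot supplies these routines.
forward_kinematics = lambda theta: np.zeros(2)    # joint angles -> end position
inverse_kinematics = lambda p: np.zeros(3)        # end position -> joint angles
position_controller = lambda theta_d: None        # inner joint position loop

def impedance_step(force, x, v):
    """One Euler step of the impedance model; returns the updated offset state."""
    a = (force - B_IMP * v - K_IMP * x) / M_IMP
    v = v + a * DT
    x = x + v * DT
    return x, v

def assist_control_step(theta_a, p_desired, F_a, K_assist, state):
    """One cycle of the assistance training loop of Figure 7 (simplified sketch)."""
    p_actual = forward_kinematics(theta_a)
    F_T = K_assist * (p_desired - p_actual)          # auxiliary force from position error
    F = F_T + np.asarray(F_a)                        # add the patient's interaction force
    offsets = []
    for axis in range(2):                            # x and y axes handled independently
        x, v = impedance_step(F[axis], *state[axis])
        state[axis] = (x, v)
        offsets.append(x)
    P_d = p_desired + np.array(offsets)              # end-position control value
    theta_d = inverse_kinematics(P_d)                # desired joint positions
    position_controller(theta_d)                     # realize the assist control
    return theta_d, state
```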

In order to let patients intuitively understand the movement track of their affected limb, a task game was designed in which the patient operates a virtual mouse that walks in the safe area between two red lines. The position of the mouse on the screen reflects the position of the patient's limb end in the motion plane. The participation score increases with the time the mouse spends walking in the safe area and does not increase when the mouse is outside the safe area. The trajectory of the safe passage can be selected according to the length of the patient's limb, and the width of the safe passage is related to the parameters of the impedance model. The impedance model parameters are as follows:

$$\mathbf{M} = \begin{bmatrix} 0.0625 \\ 0.0625 \end{bmatrix} \mathbf{B} = \begin{bmatrix} 5 \\ 5 \end{bmatrix} \mathbf{K} = \begin{bmatrix} 1000 \\ 1000 \end{bmatrix} \tag{19}$$

In order to make the width of the safety passage between the red lines challenging for most patients without being tedious, 70 kg is selected as the standard reference value for the patient's weight, and the positional offset corresponding to this standard reference value is selected as the width of the safety channel. The standard reference value of 70 kg was selected as appropriate to the group of volunteers on which the machine was tested, and it provides an approximate starting point for the initial conditions of the algorithm. During the training task, patients need to resist the weight of their limbs and control the virtual mouse to walk at a constant speed in the safe passage. If the virtual mouse touches the red line during the task, its physical strength declines until the end of the training mission cycle. According to the size of the auxiliary force, the task difficulty is divided into nine levels, with K values ranging from 0 to 0.8. The degree to which patients participate in training is related to the degree of the patient's recovery and individual physical strength; these parameters are difficult to quantify. Therefore, the degree of the patients' participation in training is quantified in stages by means of an experimental questionnaire. In the course of the experiment, patients are asked to try nine training tasks of different difficulty and to complete questionnaires to determine whether the current task difficulty is appropriate for them. Tasks of different difficulty can thus be divided into three states: under-challenge, challenge, and over-challenge, which are expressed by −1, 0, 1.

#### *3.2. Patient Participation and Training Task Difficulty Prediction Model*

In order to predict the degree of the patient's task participation, a mathematical model based on support vector classification and regression was established, suited to the small-sample, nonlinear character of the patients' training data and questionnaire data. Characteristic parameters were extracted from the training data and analyzed; the implicit mathematical relationship between the input and output values predicts the actual participation of patients, so as to achieve the goal of selecting the appropriate task difficulty.

Using the characteristic quantities of the patient training data and the task difficulty states, a QPSO-MLSSVM (quantum particle swarm optimization and multi-output least squares support vector machine) model can be established and tested. This model is based on the LS-SVM (least squares support vector machines) model, a class of kernel-based learning methods normally used for regression and classification problems. The main distinction of LS-SVM is that it solves a set of linear equations rather than the quadratic programming problem of classical SVMs [42]. The QPSO (quantum particle swarm optimization) algorithm is used to optimize the key parameters of the model to improve its performance [43,44]. The sample set is (*xi*, *yi*), *i* = 1, 2, ... , *l*, where *xi* <sup>∈</sup> *<sup>R</sup><sup>n</sup>* is the input value of the *i*th sample and *yi* ∈ *R* is the output value of the *i*th sample. The assumption is

$$f\_i(\mathbf{x}) = \omega^{T} \bullet \Phi(\mathbf{x}\_i) + b, \quad i = 1, 2, \dots, l \tag{20}$$

where, Φ(*xi*) is the spatial conversion variable function, ω is the weight vector, *b* is the adjustment parameter.

We optimize the confidence interval under this condition, and transform the optimization problem into the minimum value problem according to the principle of structural risk minimization [45]:

$$\begin{cases} \min & \frac{1}{2}\|\omega\|^2 + C \sum\_{i=1}^{l} \xi\_i \\ \text{s.t.} & y\_i \bullet f\_i(\mathbf{x}) \ge 1 - \xi\_i,\; i = 1, 2, \dots, l \\ & \xi\_i \ge 0,\; i = 1, 2, \dots, l \end{cases} \tag{21}$$

where, *C* is the weight coefficient; ξ*<sup>i</sup>* is the relaxation factor.

The first term in the optimization problem reflects the generalization ability and the model complexity, the second term reflects the model error, and the parameter C adjusts the relative weight of these two terms. Introducing the Lagrange function for the above formula:

$$\begin{split} L &= \frac{1}{2}\|\omega\|^2 + C \sum\_{i=1}^{l}\left(\xi\_i + \xi\_i^{\*}\right) - \sum\_{i=1}^{l} a\_i\Big(\varepsilon + \xi\_i - y\_i + \big(\omega^{T}\Phi(\mathbf{x}\_i) + b\big)\Big) \\ &\quad - \sum\_{i=1}^{l} a\_i^{\*}\Big(\varepsilon + \xi\_i^{\*} + y\_i - \big(\omega^{T}\Phi(\mathbf{x}\_i) + b\big)\Big) - \sum\_{i=1}^{l}\left(\eta\_i \xi\_i + \eta\_i^{\*}\xi\_i^{\*}\right) \end{split} \tag{22}$$

where ξ<sup>(∗)</sup>, α<sup>(∗)</sup>, and η<sup>(∗)</sup> denote the corresponding variables with and without the asterisk; α<sup>(∗)</sup> and η<sup>(∗)</sup> are Lagrange multipliers. The relaxation variables ξ*i*, ξ<sup>∗</sup>*<sup>i</sup>* ≥ 0, *i* = 1, 2, ... , *l* are introduced, and ε is the insensitive loss coefficient.

The radial basis function is selected to calculate the spatial inner product of the kernel function in the support vector machine model. The result obtained by the above formula is inserted back into the Lagrange equation to obtain the dual equation of the optimization function:

$$\max W(a\_i, a\_i^{\*}) = -\frac{1}{2} \sum\_{i=1}^{l} \sum\_{j=1}^{l} \left(a\_i - a\_i^{\*}\right)\left(a\_j - a\_j^{\*}\right) K(\mathbf{x}\_i, \mathbf{x}\_j) + \sum\_{i=1}^{l} \left(a\_i - a\_i^{\*}\right) y\_i - \varepsilon \sum\_{i=1}^{l} \left(a\_i + a\_i^{\*}\right) \tag{23}$$

The constraint of this dual equation is:

$$\begin{cases} \sum\_{i=1}^{l} \left(a\_i - a\_i^{\*}\right) = 0 \\ a\_i, a\_i^{\*} \in (0, C) \end{cases} \tag{24}$$

where *C* > 0 is the penalty parameter.

When 0 < α*<sup>i</sup>* < *C*, ξ*<sup>i</sup>* = 0; when 0 < α<sup>∗</sup>*<sup>i</sup>* < *C*, ξ<sup>∗</sup>*<sup>i</sup>* = 0. The corresponding samples are the standard support vectors and express the reliability of the calculation. In general, the *b* value is calculated for each standard support vector and then averaged.

$$\begin{split} b &= \frac{1}{N\_{\mathrm{NSV}}}\Bigg\{ \sum\_{0 < \alpha\_i < C}\bigg[y\_i - \sum\_{\mathbf{x}\_j \in \mathrm{SV}}\left(\alpha\_j - \alpha\_j^{\*}\right)\Phi(\mathbf{x}\_j) \bullet \Phi(\mathbf{x}\_i) - \varepsilon\bigg] \\ &\quad + \sum\_{0 < \alpha\_i^{\*} < C}\bigg[y\_i - \sum\_{\mathbf{x}\_j \in \mathrm{SV}}\left(\alpha\_j - \alpha\_j^{\*}\right)\Phi(\mathbf{x}\_j) \bullet \Phi(\mathbf{x}\_i) + \varepsilon\bigg]\Bigg\} \end{split} \tag{25}$$

In order to reduce local-optimum problems, the QPSO algorithm and the SVM algorithm are combined into the Hybrid QPSO-SVM algorithm. The formula of QPSO is:

$$\begin{cases} m\_{\mathrm{best}} = \frac{1}{M} \sum\_{i=1}^{M} P\_i \\ P\_{c\_{ij}} = \phi P\_{ij} + (1 - \phi) P\_{gj} \\ x\_{ij}(t+1) = P\_{c\_{ij}} \pm \alpha \left| m\_{\mathrm{best}\,j} - x\_{ij}(t) \right| \ln\!\left(\frac{1}{u}\right) \end{cases} \tag{26}$$

And the particle swarm velocity formula is:

$$
v\_{ij}(t+1) = \omega \bullet v\_{ij}(t) + c\_1 r\_{1j} \left[ P\_{ij}(t) - x\_{ij}(t) \right] + c\_2 r\_{2j} \left[ P\_{gj}(t) - x\_{ij}(t) \right] \tag{27}
$$

where *pij* and *pgj* are the best positions of particle *i* and of the global best particle *g* in dimension *j*, respectively; *mbest* and *mbestj* are the mean of the current best positions of all particles over all dimensions and in dimension *j*, respectively; M is the particle swarm size; *pi* is the current best position of particle *i*; *pcij* is a random position between *pij* and *pgj*; α is the control coefficient.
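For concreteness, the QPSO position update of Equation (26) can be sketched as follows (the array shapes and the random sign convention are our assumptions; this is not the authors' code):

```python
import numpy as np

def qpso_update(X, P, Pg, alpha):
    """One QPSO position update following Eq. (26).

    X     : (M, D) array of current particle positions
    P     : (M, D) array of personal best positions
    Pg    : (D,)   global best position
    alpha : contraction-expansion (control) coefficient
    """
    M, D = X.shape
    m_best = P.mean(axis=0)                    # mean of all personal best positions
    phi = np.random.rand(M, D)
    Pc = phi * P + (1.0 - phi) * Pg            # random local attractor between P and Pg
    u = np.random.rand(M, D)
    sign = np.where(np.random.rand(M, D) < 0.5, 1.0, -1.0)
    return Pc + sign * alpha * np.abs(m_best - X) * np.log(1.0 / u)
```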

QPSO optimizes two key parameters of the MLSSVM, C and σ, and the optimization goal is minimizing the fitness function. The sample mean square error (MSE) is selected as the particle swarm fitness function.

$$fitness(\sigma, \gamma) = \frac{1}{M} \sum\_{i=1}^{M} \left( y\_i - \hat{y}\_i \right)^2 \tag{28}$$

where, *yi* is the actual value and *y*ˆ*<sup>i</sup>* is the predicted value. When *fitness* reaches its minimum, the optimal solution is obtained.
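As a hedged illustration of how the fitness of Equation (28) can be evaluated for a candidate (C, σ) pair, the sketch below uses an RBF support vector regressor from scikit-learn as a stand-in for the MLSSVM used in the paper (scikit-learn parametrizes the RBF kernel with gamma, taken here as 1/(2σ²)); each QPSO particle would encode one (C, σ) pair and the pair with the lowest fitness is kept.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

def fitness(C, sigma, Xs, Ys):
    """Sample mean square error (Eq. (28)) for one candidate (C, sigma) pair."""
    model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
    y_hat = cross_val_predict(model, Xs, Ys, cv=5)   # out-of-sample predictions
    return float(np.mean((Ys - y_hat) ** 2))
```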

#### **4. Experiment**

In order to verify the effectiveness of this method, 12 healthy volunteers were tested. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Yanshan University. Because the rate of physical expenditure for stroke patients differs among tasks of different difficulty, and because the assistive force parameter K changes, keeping the foot in the safe area requires a different level of initiative. As the difficulty coefficient increases, the K value of the assistive force decreases gradually; meanwhile, the target participation of the patients increases during the training process, which may lead to a rapid decline in the patients' physical strength. Therefore, the human–robot interaction force exerted by the patient at the end of the robotic chain is related to the degree to which the patient participates in the task, and the degree of participation is in turn closely related to the difficulty of the task. To verify this point, the force/moment of the three joints is first converted into the terminal force in robot coordinates. Secondly, the obtained data are used to observe the changes of the calculated end force over the training cycle and to compare the terminal position of the robot under different values of the assistive force parameter K.

Figure 8 shows the end force obtained by transforming the data collected by the sensors into the end force in robot coordinates after eliminating the self-weight of the robot. Under the assistive force task difficulty with K = 0.4, the human–robot interaction force stays at a relatively low level for a period of time at the beginning of training. At 380 s, the volunteer is too weak to bear his own weight to complete the task, and the human–robot interaction force reaches its first peak. At 400 s, the volunteer challenges himself again and strives to achieve the goal of the task, so the human–robot interaction force declines rapidly because the volunteer takes the initiative to bear the weight of his limb. However, this second phase of maintaining a lower level is shorter than the first, and a second peak of the interaction force appears. At the end of the training, it is difficult for the volunteer to bear part of the body weight again in order to achieve the goal of the task, so the interaction force, which is almost entirely composed of the volunteer's limb weight, stays at a high level. The variations over the time periods of repeated challenges for a volunteer at different task difficulties are shown in Figure 9.

**Figure 8.** Sensors measuring terminal force during training: (**a**) Human–computer interaction of six-dimensional force acquisition under the training task of assistant force parameter K = 0.4; (**b**) the calculated terminal force in robot coordinates under the training task of assistant force parameter K = 0.4.

**Figure 9.** Terminal force during different difficulty tasks training: (**a**) the training task of assistant force parameter K = 0.1; (**b**) the training task of assistant force parameter K = 0.5; (**c**) the training task of assistant force parameter K = 0.6; (**d**) the training task of assistant force parameter K = 0.8.

Figure 9 shows the terminal force of the first volunteer in training tasks with different K values of the assistive force. According to a questionnaire survey of the volunteer, the K = 0.1 assist-force task was rated as "over-challenge", the K = 0.5 and K = 0.6 assist-force tasks as "challenge", and the K = 0.8 assist-force task as "under-challenge". Under the over-challenge task, with a heavier limb weight to bear, the volunteer's physical exertion is fast and many peaks appear in the human–robot interaction force. With the increasing share of the assistive force in training, volunteers need less initiative to achieve the task goals. This phenomenon can be clearly seen by observing the relationship between the end position and the safe passage.

Figure 10 shows the relationship between the terminal position of the robot and the safe passage of the target in training tasks of different difficulty for the first volunteer. The blue line is the terminal position of the rehabilitation robot, the green line is the target terminal trajectory, and the red line is the safe passage. The K = 0, K = 0.1, K = 0.2, and K = 0.3 assist-force tasks are "over-challenge", the K = 0.4, K = 0.5, and K = 0.6 assist-force tasks are "challenge", and the K = 0.7 and K = 0.8 assist-force tasks are "under-challenge". It is difficult for the volunteer to complete the task goal in the over-challenge tasks, whereas in the under-challenge tasks it is easy to reach the goal position.

**Figure 10.** Terminal position during different difficulty tasks training.

The degree of patient participation under different task difficulties is reflected in the fluctuation of the human–machine interaction mechanical signals, and each feature describes the fluctuation of the signal data over that period. The greater the fluctuation of a feature, the greater its degree of dispersion, the more sensitive it is to signal fluctuation, and the more suitable it is as an input parameter of the detection model for patient participation. The sample data are processed using four indicators that describe the degree of dispersion of the signals, including the interquartile range and the variance. These features of the data are statistically analyzed to judge their significance and correlation under different volunteer states. Significance analysis compares the feature data of the "under-challenge" and "over-challenge" volunteer states with those of the "challenge" state, in order to judge whether the feature data differ significantly among the three states. Correlation analysis is then performed between the non-significant features and the degree of volunteer participation; if a correlation exists, the feature is still used as an input parameter to train the support vector machine. The preliminary feature variables extracted in this paper are listed in Table 1.
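As an illustration of the kind of dispersion indicators described above, the short sketch below computes several candidate features over one window of an interaction-force signal (the exact feature set and names of Table 1 are the paper's; the ones below are only examples):

```python
import numpy as np
from scipy import stats

def dispersion_features(window):
    """Candidate dispersion features for one window of a force/position signal."""
    x = np.asarray(window, dtype=float)
    return {
        "rms_dev": np.sqrt(np.mean((x - x.mean()) ** 2)),  # root-mean-square deviation
        "std":     x.std(ddof=1),                          # standard deviation
        "iqr":     stats.iqr(x),                           # interquartile range
        "var":     x.var(ddof=1),                          # variance
    }
```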



Ten volunteers were selected to carry out the experimental verification of participation and task difficulty detection. Each volunteer was trained in 10 tasks of different difficulty, each lasting from 15 min to 20 min. In order to eliminate the influence of physical energy consumption between experiments, the sessions were conducted one day apart. After each experiment, a questionnaire survey was conducted on the difficulty of the task, which was classified into three participation levels: "under-challenge", "challenge", and "over-challenge". As part of the experiment, the hip and knee joint torques, the plantar six-dimensional force, and the terminal trajectory were collected. With different shares of assistive force, the terminal force and position behave differently. The characteristic quantities shown in Table 2 were extracted from the training data, and the volunteers' training-data characteristic quantities were then grouped according to the evaluated task difficulty. Pairwise t-test comparisons of the characteristic quantities were statistically analyzed to verify whether the characteristic quantities differ significantly under different task difficulties, and comparisons of the characteristics between each pair of task difficulties were performed separately using one-way repeated-measures ANOVA. Table 2 shows the results of the significance analysis of the characteristic quantities for the three task difficulties. The *p* value is the test probability, and the F value is the ratio of the explained variance to the variance attributed to random error. When the *p* value is less than 0.05, the characteristic quantity differs significantly between the different difficulty tasks.
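The significance analysis can be sketched as follows, with a paired t-test between two difficulty levels and a one-way repeated-measures ANOVA across the three levels; the feature matrix, its shape, and the SciPy/statsmodels calls are our assumptions for illustration, not the authors' analysis scripts.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

# feature[v, d]: value of one characteristic quantity for volunteer v at difficulty d
# (0 = under-challenge, 1 = challenge, 2 = over-challenge); synthetic data for shape only.
rng = np.random.default_rng(0)
feature = rng.random((10, 3))

# Pairwise paired t-test between two difficulty levels.
t, p = ttest_rel(feature[:, 1], feature[:, 2])
print(f"challenge vs over-challenge: t = {t:.2f}, p = {p:.3f}")

# One-way repeated-measures ANOVA across the three difficulty levels.
long = pd.DataFrame({
    "volunteer":  np.repeat(np.arange(10), 3),
    "difficulty": np.tile(["under", "challenge", "over"], 10),
    "value":      feature.ravel(),
})
print(AnovaRM(long, depvar="value", subject="volunteer",
              within=["difficulty"]).fit().anova_table)
```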

The significance analysis of the characteristic quantities from the training data of the 10 volunteers shows that *P***RMSE**, *P***STD**, *T***SCA**, and *P***MAE** differ significantly among the different task difficulties. The *p* values of *F***Q**, *U***MAX**, and *F***<sup>D</sup>** are greater than 0.05 only for the comparison between the over-challenge and challenge groups; in the other comparisons they show clear differences, so they can distinguish the under-challenge tasks. Although *P***OR** shows significant differences, its value is coarse and its stability is not high. Accordingly, *P***RMSE**, *P***STD**, *T***SCA**, *F***Q**, *U***MAX**, *F***D**, and *P***MAE** are used as feature inputs for volunteer participation and task difficulty classification. Figure 11 shows more intuitively the difference in training characteristics among the three difficulty levels for each volunteer.


**Table 2.** Significance comparison of characteristic quantities.

**Figure 11.** Characteristic quantity of training data under different task difficulties: (**a**) characteristic quantity of Volunteer 1# training data; (**b**) characteristic quantity of Volunteer 2# training data.

Figure 11 shows the characteristics of two volunteers under different task difficulties. The red line corresponds to the over-challenge task difficulty, the green line to the challenge difficulty, and the blue line to the under-challenge difficulty. *P***RMSE**, *P***STD**, *T***SCA**, *F***D**, and *P***MAE** are positively correlated with the task difficulty evaluation, and *F***Q** and *U***MAX** are negatively correlated with it.

The 100 groups of training data from the 10 volunteers were used as training sample data. At the same time, two further volunteers were randomly selected as the prediction group. The two volunteers in the prediction group were trained in all tasks of different difficulty levels, and questionnaires on task difficulty were conducted; their training data were used as prediction sample data. The feature quantities *Xs* extracted from the 100 sets of experimental sample data were taken as input, and the known category information *Ys* of the training set (the task difficulty evaluated by the patient) was taken as output. The prediction model based on the QPSO-MLSSVM hybrid optimization algorithm and two comparison models based on the MLSSVM algorithm and a neural network algorithm were established. QPSO is an iterative optimization that tunes the parameters *C* and σ of the MLSSVM algorithm to improve the generalization ability and prediction accuracy of the model. The 20 sets of data in the test set are processed in the same way as those in the training set: seven feature quantities *Xl* are extracted as inputs to the trained model, and the predicted output *Yp* is obtained by the model operation. The accuracy of the model was evaluated by computing the mean square error between *Yl* (the task difficulty evaluated by the patient) and *Yp*.

In the analysis of the results, the fit to the training samples obtained by the mathematical model cannot directly reflect the prediction ability of the model. Further model evaluation can be achieved by comparing the model's predictions with the real data. Common evaluation indexes include the MSE, RMSE, correlation coefficient, and so forth. In this paper, the mean square error and the correlation coefficient are used to evaluate the model.
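The evaluation indexes mentioned above can be computed as in the generic sketch below (our illustration, not the authors' code):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Mean square error, its root, and the Pearson correlation coefficient."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse),
            "r": np.corrcoef(y_true, y_pred)[0, 1]}
```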

The data from 100 datapoints of the volunteers' participation status were input into the prediction model to train it, and the prediction was tested on 20 datapoints. Correlation analysis was conducted between the actual values and the predicted values, and the linear fitting results are shown in Figure 12.

**Figure 12.** Comparison between task difficulty prediction and reality of test group.

In order to further analyze the classification effect of the QPSO-MLSSVM support vector machine on the volunteer participation states, the mean square error (MSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) between the predicted and true values of the various volunteer participation states were obtained, as shown in Table 3.



The training set simply divides the task difficulty evaluation into −1, 0, and 1. Because patients have different evaluation criteria for difficulty, the dynamic trends in the training data differ, and the results of the algorithm test are distributed around these three values. If the test results are graded according to the difficulty ranges (−1.5, −0.75), (−0.25, 0.25), and (0.75, 1.25), the accuracy of the test can reach 100%. In order to eliminate results with a slightly larger offset, the determination ranges are adjusted to (−1.25, −0.5), (−0.5, 0.5), and (0.5, 1.5), and the test accuracy can still reach 80%. The matching result of the task difficulty evaluation shows that the predicted value of the task difficulty is close to the real value, which also verifies that the volunteers' evaluation of the training task difficulty can be obtained from the training data.
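The grading rule described above amounts to a simple binning step; the sketch below maps a continuous prediction to the −1/0/1 difficulty labels using the first set of determination ranges and marks values outside all ranges as unclassified (the return convention is ours):

```python
RANGES = {-1: (-1.5, -0.75), 0: (-0.25, 0.25), 1: (0.75, 1.25)}

def grade_prediction(y_pred, ranges=RANGES):
    """Map a continuous difficulty prediction to -1 / 0 / 1, or None if out of range."""
    for label, (low, high) in ranges.items():
        if low <= y_pred <= high:
            return label
    return None  # prediction falls outside every determination range

print(grade_prediction(0.9))   # -> 1, i.e., graded as over-challenge
```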

#### **5. Discussion**

Early rehabilitation training for stroke patients is very important and effective. While the rehabilitation robot can also be used in the later stages of recuperation and even as a workout enhancer, this research work is aimed at the early stages of post-trauma rehabilitation. The aim is to re-train the nerve control and brain–body associations for typical movements of the affected limb. Overall, the success of training is measured by how quickly and effectively a patient regains normal control of their limbs. Making patients participate fully on their own initiative, while not letting their physical strength drop so rapidly that their enthusiasm for training is dispelled, is a complex problem, and it is very important to choose the appropriate training task difficulty for each patient. Therefore, this paper determines whether the current task difficulty is suitable for the patient, based on the data recorded during the training task, in order to achieve the optimal training effect. The final experiment in the paper shows, from the fitting curve, that the matching degree of the task difficulty evaluation of the two volunteers in the test group was worse than that of the volunteers in the experimental group. This is due to the volunteers' subjective persistence and subjective evaluation of the difficulty of the task. However, the support vector machine task difficulty judgment model still achieves a prediction accuracy of 80% for the task difficulty evaluation of the test group. As the training data continue to increase and a greater variety of training information is introduced, the prediction accuracy of the judgment model will become higher and higher.

As can be seen from Figures 8 and 10, when volunteers perform multiple training tasks of different difficulty, the strength needed to reach the goal task position and the rate of physical consumption decrease as the difficulty decreases and the proportion of assistive power increases. During the clinical trials, most volunteers had emotional issues when performing over-challenging tasks: most of them had low mood, and some were irritated. They needed continued psychological counselling and verbal encouragement to support their task training. When the volunteers performed under-challenging tasks, most of them felt bored but emotionally stable. The frequency of verbal encouragement during training may have an effect on the experimental results; in this clinical trial, verbal encouragement was given four times in each training process to minimize the influence of this factor. In the future, the research team will use a variety of measurements to study the emotional and physical characteristics of volunteers and to verify their impact on rehabilitation training.

In this paper, the performance of volunteers in training tasks at different levels of difficulty is investigated in order to determine whether the task difficulty is appropriate, based on verification and judgment of past data. The results also demonstrate the validity and universality of the assistive training strategy, which can maximize the patients' ability to participate actively in training. In the future, the research team will continue to study and improve upon such clinical trial data. It is expected that the task difficulty can be judged and predicted online, so that the assistive force can be adjusted in real time and patients can participate in training actively and optimally.

#### **6. Conclusions**

This paper studies a lower limb rehabilitation robot for seated and reclining training with a multi-joint sensing system. In order to make the patient participate actively in the training task, an assistive force training control strategy and a corresponding task difficulty scheme are proposed. The multi-joint mechanical sensing system is used to solve a more accurate end mechanical model, from which the human–robot interaction force is detected. Clinical trials with 10 volunteers were conducted, and each volunteer underwent training at nine levels of difficulty. Through the optimized support vector machines algorithm, quantitative features of the training data are taken as the input set and the volunteer's evaluation of the task difficulty as the output set, yielding a task difficulty judgment model based on the volunteer training data. The training difficulty of two other volunteers, not in the original 10-person training set, was then predicted. It was verified that the task difficulty judgment model is general and can exclude the influence of body size and weight. By comparing the prediction results of the various algorithm models, the accuracy and convergence speed of the optimization algorithm are verified.

Future work will concentrate on extending the research to alternative models, such as those described in the introduction, with a detailed comparison providing possible improvements to the data pipeline. The application will also benefit from a continuous expansion of the dataset as more patient trials become available. This will also allow the training data to be judged and predicted online, and the difficulty of the task to be adjusted in real time to optimize the rehabilitation effect for the patient. As discussed throughout the paper, the patient's perception of the difficulty of the training exercise influences their mood, behavior, and performance. As such, matching the patient's perception is an important task in itself, even if the mechanical ground truth may be misrepresented. The desired end result for the rehabilitation robot, including future research, is a real-time online assessment which includes individual patient profiles, which should make patient subjectivity less relevant.

**Author Contributions:** Mechanical design and prototype debug, H.Y. and M.L.; conceptualization and supervision, H.W., L.V. and V.V.; data analysis and validation H.Y. and Y.L.; the main content of this manuscript was created and written by H.Y. and reviewed by all authors.

**Funding:** This research was funded by China Science and Technical Assistance Project for Developing Countries under grant number KY201501009; Key Research and Development Plan of Hebei Province, under grant number 19211820D, the forty-third regular meeting exchange programs of China Romania science and technology cooperation committee, under grant number 43-2; The European Commission SMOOTH project: Smart Robot for Fire-fighting, under grant number H2020-MSCA-RISE-2016:734875; Romanian Ministry of Research and Innovation, CCCDI-UEFISCDI, within PNCDI III, "KEYT HROB" project under grant number PN-III-P3-3.1-PM-RO-CN-2018-0144 / 2 BM ⁄ 2018; Postgraduate Innovation Research Assistant Support Project, under grant number CXZS201902.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Perceptive Interface for Intelligent Cyber Enterprises**

#### **Ioan Dumitrache 1, Simona Iuliana Caramihai 2, Mihnea Alexandru Moisescu 2,\*, Ioan Stefan Sacala 1, Luige Vladareanu <sup>3</sup> and Dragos Repta <sup>2</sup>**


Received: 15 September 2019; Accepted: 10 October 2019; Published: 12 October 2019

**Abstract:** Large-scale, complex, networked enterprises, such as (trans)national energy systems, multi-national manufacturing enterprises, smart cities, and so on, are structures that can be characterized as systems of systems (SoS) and, as such, require specific modelling paradigms and control architectures to ensure their successful operation. Their main characteristic is the necessity of solving practically one-of-a-kind problems with respect to the external context and internal configuration, thus dealing with dynamically evolving flows of data and information. The paper introduces the concept of the intelligent cyber-enterprise as an integrating paradigm that uses information and knowledge dynamics in order to model and control SoS, focusing especially on the importance of appropriately adapting the external and internal perception of an enterprise through a new generation of sensorial systems: the perceptive interfaces. The authors analyze sensing and perception in relation to the intelligent cyber enterprise model and propose an implementation of a perceptive system interface.

**Keywords:** cyber-physical systems; intelligent cyber enterprise; systems interface

#### **1. Introduction**

New technological developments have helped solve many scientific and socio-economic problems over the years. Nowadays we can produce goods in every economic sector and transport them in a very short time, no matter what distance is required, all around the globe, much faster than ever before. We can extract the required data from very large amounts of data, found in a single location or combined from multiple locations through heterogeneous interconnected networks. We can use the information provided within a very short period, so that we can easily find a solution and adapt depending on the circumstances.

Availability of resources, personalized products, social enterprise, environmental awareness, globalization of markets: all of these represent important factors in today's enterprise management and economic development. The problem is that they also represent pressure factors in the decision-making process, considering the following aspects:


• This complexity may either generate models whose analysis and performance evaluation take too much time with respect to the problem-solving time horizon, or encourage model simplification (including finding false similarities), which results in incorrect models.

Either way, correct decision making implies both an appropriate problem modelling and a correct selection of the relevant information and data to be used (gathered and exchanged) in its solving.

The approach of this paper is thus oriented on:


Some aspects related to process identification, needed for the appropriate design of perceptive interfaces, are also included. Information technologies used in control systems and knowledge management are the most important of the tools currently available on the market. Also, because the degree of complexity has increased in different engineering domains, new models and paradigms have appeared over the years. On the other hand, globalization and sustainable development require a new economic perspective that must include social and environmental impact in order to develop sustainable businesses and production systems [1]. In the present paper the authors analyze sensing and perception in relation to the intelligent cyber enterprise model and propose a new concept: the perceptive system interface.

#### **2. Related Work**

New emerging concepts, such as the internet of things (IoT) and the cyber-physical systems (CPS) paradigm, represent only a small part of the technological drivers of the next industrial revolution. In this section we address important concepts and paradigms related to the development of the intelligent cyber enterprise model.

CPS are related to the study of complex systems as well as systems of systems focusing on physical and virtual components, interactions, process interconnections and information processing and addressing a wide range of temporal and spatial scales [2]. CPS are characterized by the National Science Foundation [3] as "engineered systems that are built from and depend upon the synergy of computational and physical components. Emerging CPS will be coordinated, distributed, and connected, and must be robust and responsive."

Cyber-physical systems must be reconfigurable in order to integrate the human factor into system engineering and socio-technical systems. This trend emphasizes the constant need for the interoperability paradigm, seen both from classical considerations and from the new trends, as the smart system must be sensed and perceived taking into consideration all interoperating systems. This shift can have important consequences in the near future, in the design process as well as in the implementation phase. The adoption of cloud-based technologies will have an impact on the design and implementation of the next generation of sensing systems [4].

With the development of the internet of things and future internet paradigms, the systems that need to cooperate and interact will increase both in number and in complexity. Thus, the purpose of every system must be perceived in order to use the proper methodology to verify and validate each functionality. One solution can be model-based cyber-physical systems [5].

Advances in internet-oriented paradigms have led to the development of Edge/Fog computing model aiming to optimize smart applications to expand their functionalities close to the geographic location of the IoT devices, rather than outsourcing computations to far away datacenters. A proposed architecture includes an edge/fog/cloud monitoring system and a capillary container orchestrator that is able to handle highly dynamic IoT environments in which the edge node may be overloaded at runtime due to an increase in the workload [6].

Security is an important component of future internet systems. One important aspect in addressing security is trust. Edge, fog and cloud computing have to rely on resources and services under ownership of various entities [7]. A proposed trust management architecture that relies on the use of blockchain-based smart contracts (SCs) and specifically designed trustless smart oracles is presented in [7].

Correlated with these concepts, new enterprise models have emerged. Such models have been related to the factories of the future paradigm. Factories of the future aggregate these concepts and represent the future of an enterprise built on collaboration and on connectivity between different organizational structures, machines and human operators, in order to become part of interconnected networks of suppliers, transporters and customers.

Distributed manufacturing systems are an important component of factories of the future and require new scheduling optimization models. A proposed model is presented in [8] and is developed based on a discrete fruit fly optimization algorithm, with three heuristic methods proposed for its initialization.

Another important component of factories of the future is represented by material flow analysis. Results have been used with success in optimizing material flows and waste streams in production processes [9].

A sensing enterprise represents a new paradigm based on the digital innovation concept that deals with the adoption of sensing and future internet technologies. Sensing enterprises are mostly connected to virtual enterprises and their networks. Seen from the perspective of a smart system in general, sensing enterprises must have the capability to sense, model and interpret signals from the real world and thus to be able to adapt into a more agile configuration of the system [10]. In relation to such systems, a product service systems conceptual framework is proposed in [11] in order to facilitate the development of interoperable product systems and service systems in accordance with the stakeholders' needs.

Another important model associated with factories of the future is cognitive manufacturing. Cognitive manufacturing is a proposed manufacturing organization method, based on perception and cognition, that integrates IoT principles, artificial intelligence and data analytics technologies. One of the main objectives of cognitive manufacturing implementations is to integrate process-level as well as enterprise-wide data and information, so as to achieve improvements in the use of equipment, process reengineering based on decision-making and data analytics models, and enterprise integration of knowledge management models [12].

Cognitive manufacturing systems address the following:


#### **3. Intelligent Cyber Enterprise**

In this section, the authors briefly describe the concept of the intelligent cyber-enterprise (ICE) [13], seen as a socio-technical complex system. The concept of ICE was proposed in relation to the implementation of enterprise-wide CPS.

The connection between physical processes and their cybernetic representation is a very important aspect: it allows the behavior of a process to be modeled and designed exactly, and thus makes it possible to integrate hardware-in-the-loop and human-in-the-loop components [5].

A cyber-enterprise represents an aggregation of physical objects, knowledge represented through process models (workflows), control algorithms, humans and information represented by adequate software tools and communication processes [14]. Thus, within an ICE, all these components must be capable of interacting and cooperating in order to achieve the intelligent enterprise goals, in terms of production, costs, time bounds and resource limitations, so as to satisfy all clients, suppliers and economic partners in a sustainable environment. Such a proper interaction must take into consideration the level of required intelligence, the amount of required data and how information is extracted from heterogeneous flows within the enterprise, as well as the emergent intelligent behavior of the enterprise as it becomes adaptive, reactive or proactive, depending on the economic constraints and demands [15].

The ICE must be modeled so that it can solve any specific problem, being capable of decomposing any problem down to the lowest level of complexity. On the other hand, the problems to be solved are goals existing at the operational level that must involve the minimum amount of enterprise resources.

From the CPS perspective, achieving these results requires at least one control loop and at least one enterprise resource. As a problem becomes more complex, multiple control loops and specific resources must be included. Control loops must operate in real time and must be based on targeted algorithms and rules, using well-defined sets of data. As complexity grows further, heterogeneous information structures and heuristics must be integrated. An intelligence-based model for the ICE is represented in Figure 1.

**Figure 1.** CPS-based enterprise model. Adapted from [16].

Every component of the ICE architecture must be able to operate independently according to different pre-defined specifications and must have the capacity to communicate, allowing the transfer of meta-data, information and knowledge in order to provide reliable context-oriented behavior [17].

Another aspect that must be taken into consideration is the connection of subsystems into clusters, depending on the initial functional requirements. For example, machines are part of manufacturing cells that communicate based on protocols, standards and rules, all of them being embedded in control algorithms. Thus, functionalities can be added afterwards depending on the required specifications, either initially or during process implementation. Manufacturing cells are part of a manufacturing system, and are thus capable of addressing any given problem and developing various products [3]. The main characteristics of ICE include:

• *Perception*: this characteristic has the purpose to develop and integrate results from measurement science for sensing and perception. The role of perception is to better understand the system's complexity and thus to reduce the risks related to the adoption of new technologies.


Intelligent cyber enterprise systems design is focused on adopting new technologies in order to become agile, safe and productive, being also capable to interoperate with smart manufacturing applications and to better evolve.

#### *3.1. Sensing in Correlation to the Enterprise Environment*

In order to reconfigure easily, taking into consideration all existing constraints, a sensing system needs different resources and conditions in order to remain constantly under control. All the functions of a sensing system can be seen as a CPS. Such systems can easily switch between the physical and virtual worlds and merge the two. With the help of sensor networks and actuating systems, the sensing system can monitor all the physical processes, collect data from all connected devices, and interpret and expose data in different formats in order to make it widely available for every interested stakeholder and software application to use, so as to better understand the real world.

The artificial intelligence concept can be seen from two different perspectives: the traditional approach and, more importantly, the machine learning perspective. Thus, a smart system will need to be capable of integrating generating mechanisms (physical and virtual) in order to extract the exact data that is needed. This might mean that simulation tools must be used in order to better understand the behavior of the system and thus know how to react instantly based on its behaviors. Or, it might mean that data can be extracted from different web resources or multiple physical systems.

In order to develop a measurement mechanism for better understanding a smart system, and thus to sense and perceive the system's performance so as to reduce the risks related to the adoption of new technologies, sensors and algorithms need to be integrated to create productive manufacturing environments.

If in the past we could identify manufacturing systems formed from simple sensors, actuators and controllers, there is now an urgent need to integrate complex sensor networks and to use them for system perception. This is a complex task, and identifying the right solution is difficult for academia, system integrators and even manufacturers. The technical idea is to identify existing and emerging sensing and perception technologies and related software in order to understand the product and research solutions landscape.

Due to specific constraints related to flexibility and reusability, smart systems play a very important role in strengthening worldwide manufacturing competitiveness by stimulating responsiveness and innovation. To achieve this goal, smart systems must be capable of perceiving and adapting depending on the need for collaboration with other smart systems or humans. Thus, they must have the capability to learn, adapt and quickly integrate into the rest of the enterprise.

#### *3.2. Perception and Cognition in Correlation to Enterprise Environment*

The design of enterprise systems based on the ICE model includes the integration of perception, reasoning, learning and cognition models. These aspects become relevant in the development of perceptive system interfaces for the ICE. An important aspect of ICE is related to extending the enterprise sensing function to a perception function. The perception-reasoning-learning loop (PRL) is further addressed, from a functional point of view and with a special emphasis on the awareness concept.

The ICE functioning is based on problem solving. Usually, such problems are described in terms of system-specific production/functioning goals, with different levels of precision, from strategical (profit versus investments, sustainability, market coverage, utilities availability, durations for problem-solving, quality of processes and products, safety, etc.) to more resource oriented, as number of products/services delivered, cost/profit per product, a.s.o.

Every problem is solved by a sequence of activities, performed by specified resources, at given times (possibly triggered by specific events)—which we will denominate as workflows, and which are pieces of knowledge.

Resource-oriented problems are solved by deterministic, predictable, well-defined workflows; operational and strategic problems, implying collaboration between resources, are solved by ad-hoc combinations of workflows, with respect to the availability of resources and context. Such problems may usually be solved in different ways, implying very different resources, approaches (interconnections of resources), costs and times. Solving approaches are determined mainly by resource availability, but also by information availability, interpretation and the possibility of gathering it (sensing/perception). Also, networking is determined by the possibility of information interoperability between resources.

Appropriate interfaces for information transmission and appropriate perception of external information become crucial for such problem-solving approach. As designing specific interfaces for any possible problem is impossible, it is important to focus on the interface adaptability in order to allow the maximum flexibility for problem solving.

It results that an appropriate complex problem model should underline:


Such a model allows, once the appropriate solution is found, to design appropriate perceptive interfaces either between interconnected systems (including human operators) or with the environment.

Perception in an ICE environment may be referred as the active process that facilitates the interpretation of the environment by integrating information and "stimuli" from the sensory system. Two processing mechanisms can be relevant in the context of an enterprise environment:


Relevant aspects of human brain functions and processes that can be modeled in enterprise processes and systems, include:


Three phases of perception can be identified, in association with the following Gestalt principle: every person plays a role in their own perception process, designating a three-stage sequence:


Reasoning is considered the (brain) capability of solving problems. The main components of the reasoning process in an ICE environment are: the problem identification, the problem categorization and the identification of the solution.

The solution identification process of a specific enterprise problem can be associated with a reasoning mechanism. Action based on the proposed solution generates feedback that can be measured and interpreted in order to generate a new perception/reasoning cycle. A generic perception/reasoning cycle that can be integrated into the problem-solving mechanisms of enterprise systems can include the following:


Learning in an ICE environment can be associated with saving and retrieving a solved problem using a knowledge model. A deep learning process can also be associated with the enterprise system environment. In this process, knowledge used and validated a given number of times may be retrieved "reflexively", i.e., without a volitional act. Rule mining and self-learning are being used with success in various industrial cases. One such case is related to the development of a mixed-integer mathematical model to represent the direct energy consumption of machines and the indirect energy consumption on a shop floor [18].
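As an illustration of this learning mechanism, the following minimal Python sketch stores solved problems together with a validation counter and only retrieves them "reflexively" once they have been validated often enough; the class name, the threshold value and the deliberation hook are illustrative assumptions rather than part of the ICE model.

```python
class WorkflowMemory:
    """Toy knowledge store: learned workflows become 'reflexive' after repeated validation."""

    def __init__(self, reflex_threshold=3):   # assumed threshold, purely illustrative
        self.solutions = {}      # problem signature -> workflow (piece of knowledge)
        self.validations = {}    # problem signature -> number of successful reuses
        self.reflex_threshold = reflex_threshold

    def learn(self, problem_signature, workflow):
        """Save a solved problem together with its workflow."""
        self.solutions[problem_signature] = workflow
        self.validations.setdefault(problem_signature, 0)

    def validate(self, problem_signature):
        """Record one more successful reuse of the stored workflow."""
        self.validations[problem_signature] = self.validations.get(problem_signature, 0) + 1

    def retrieve(self, problem_signature, deliberate):
        """Return the workflow reflexively if well validated, otherwise re-reason about it."""
        workflow = self.solutions.get(problem_signature)
        if workflow is not None and self.validations.get(problem_signature, 0) >= self.reflex_threshold:
            return workflow                    # reflexive retrieval, no volitional act
        return deliberate(problem_signature)   # fall back to the perception/reasoning cycle
```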

#### *3.3. Intelligent System Interfaces*

An interface has various interpretations across technical literature: a connection, a composition of connections, a boundary, a linkage, an interaction between functional entities, a logical relation, a specification [19].

ISO/IEC standards (ISO/IEC 2382-1/1993) define interfaces as: "A shared boundary between two functional units, defined by various characteristics pertaining to the functions, physical signal exchanges, and other characteristics." [20].

System Interface design can be addressed taking into consideration the following aspects:


When addressing the necessity of developing intelligent system interfaces, the following characteristics are relevant:

*Adaptation.* The adaptation of a system to changes in the environment or in an interconnected system needs to be addressed as both a change in the system itself but also as a change in its interfaces. In this context adaptation can become a function of the interface.

*Learning.* Machine learning techniques can be used in association with interface components. Patterns in data exchanged through the interface can emerge as a result of machine learning techniques applied at interface level. Such patterns can be used to optimize the functions of an interface. An example can be related to the amount of time the interface is used.

*Prediction.* Prediction is not only associated with intelligent systems but can become a function of the interface. Components fulfilling this functional role should be able to generate predictive models related to the environment and/or behavior of the interface.

*Optimization.* An interface can be extended with components that can monitor the function and behavior. Aggregated data from such components can be used in interface optimization processes.

When developing intelligent interfaces, the following aspects need to be addressed:


#### **4. Design of Systems Interfaces for the Intelligent Cyber Enterprise**

In the current section the authors propose a model for a generic perceptive systems interface as a part of an ICE architecture. In order to implement the perception function two facilitator components are proposed: a process/behavior identification component and a semantic routing component. The semantic routing component facilitates the transfer of data and information between enterprise system components. The process identification component discovers behaviors associated to the function and current use of the interface (how the interface is used) in accordance with patterns identified in the data and event streams transferred through the interface. As the interface connects system components, the process identification system analyses data in order to determine a workflow or a process associated with data passing through the interface.

#### *4.1. Intelligent Cyber Enterprise Perceptive Systems Interface*

The main functional aspects of a perceptive interface are showcased in the context of an ICE system model, described in this section. The proposed interface relies on components that enable its adaptation through the discovery of behavioral models based on observations of transferred information and then by selecting the appropriate behavior from a behavior repository.

The ICE architecture enabling the usage of perceptive system interfaces is depicted in Figure 2 and has the following components:

*Integration Layer*—this layer is responsible for integrating data from all levels of the system and for providing the basis for the design and implementation of the cyber intelligent systems. This layer is also responsible for the integration of all nodes and layers into a homogeneous structure that hosts the intelligent entity instances needed to manage this structure. In order to implement the integration layer, the system must have specific components in order to fulfill different roles:

*Nodes*—represent computation units that are responsible with hosting the intelligent entity instances. The entire lifecycle of a node is managed by the "lifecycle manager". Each node must expose its own semantic interface which appears in the system as an Intelligent Entity. Through this interface, every node can receive different commands in order to check the state of every instance or to generate new events.

*Domain Knowledge Repository*—this repository is responsible for representing the system's domain in an easily machine-interpretable format. The elements of the domain must be uniquely addressable so that metadata can easily target the domain's concepts. Also, the relations between all components of a system domain must be clearly stated for the domain knowledge repository to provide reasoning capabilities and thus to be capable of determining the context in which the needed information is processed.

*Semantic Service Repository*—this repository is responsible for the storage of the semantic descriptions of all the services that are exposed by the intelligent entity instances. To achieve this, each service must contain metadata that refers to the system's knowledge base. Thus, these services can be described based on existing standards (e.g., OWL-S, SAWSDL, etc.).

*Semantic Event Subscription Manager*/*Event Routing*—this layer is responsible for subscribing all the system's components to the semantic events that are generated by the intelligent entities. This layer must select the exact events based on specific contexts in order to generate an event generation rule, such as a subscription to the events generated by sensors. The information regarding subscriptions must be sent to the "semantic event router", which is responsible for the transfer of information between the event generator and the subscriber (a minimal routing sketch is given after this list of components).

*Intelligent Interface Instance Registry*—this registry is responsible with the storage of all the information that is sent through the entire system and that is deployed in the intelligent instances, thus, providing a shortcut to the services that are exposed by the intelligent entities.

*Intelligent Interface Repository*—this repository is responsible for managing the creation, deployment and un-deployment of each intelligent interface. On the other hand, this repository is responsible for the management of every behavior of the Intelligent Interface.
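The routing sketch referenced above illustrates, under simplifying assumptions, how subscriptions described by semantic descriptors could be matched against published events; the descriptor keys and the dictionary-based matching are illustrative choices, not the actual middleware implementation.

```python
class SemanticEventRouter:
    """Toy event router: subscriptions are matched against semantic event descriptors."""

    def __init__(self):
        self.subscriptions = []  # list of (descriptor, callback) pairs

    def subscribe(self, descriptor, callback):
        """Register a subscriber, e.g. descriptor={'type': 'SensorReading', 'area': 'AREA 1'}."""
        self.subscriptions.append((descriptor, callback))

    def publish(self, event):
        """Forward the event to every subscriber whose descriptor is subsumed by the event."""
        for descriptor, callback in self.subscriptions:
            if all(event.get(key) == value for key, value in descriptor.items()):
                callback(event)

# usage sketch with assumed, illustrative event fields
router = SemanticEventRouter()
router.subscribe({"type": "SensorReading", "area": "AREA 1"},
                 lambda e: print("routed:", e))
router.publish({"type": "SensorReading", "area": "AREA 1", "value": 42})
```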

The perceptive system interface is deployed at the boundary of each intelligent entity instance and requires the following components: semantic interface, enterprise system adapters, physical adapters, lifecycle manager and behavior selection/execution component.

The proposed interface (Figure 3) will allow the connection of both virtual and physical resources of an enterprise system. The use of semantic data models facilitates service and event-based interactions between the system's components. The components of the interface: service interface, event sinks and event sources include a semantic description and the information transferred between the systems is semantically enhanced.

**Figure 2.** Generic ICE system interface and system model.

**Figure 3.** Block diagram of an intelligent entity instance.

Enterprise system adapters are components that facilitate the connection to services, business process execution engines or service orchestration engines. Behavior selection will allow the adaptation in accordance to the process requirements. The semantic augmentation of data will facilitate data interoperability between system components:

Physical adapters—adapters for the physical devices. These adapters are designed for data acquisition and integrate sensors, sensor networks and sensing systems. These adapters can also connect actuators, actuator networks and actuating systems, providing input from control systems.

Behavior Selection/Execution System. Behaviors are associated with execution engines. The selection system will facilitate the selection of the appropriate behavior from the behavior repository, in accordance with predefined criteria (a selection sketch is given after this list of components).

The lifecycle manager addresses the following operations: initializing, maintaining and managing the components of the current instance, and selecting behaviors or adapters. Using a semantic interface, the lifecycle manager can expose a set of services, including deploying components of the instance and routing to the appropriate event source.
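As a sketch of the behavior selection step mentioned above, the following function scores each behavior in a repository against the observed interface context and returns the best match; the criteria format and the scoring rule are assumptions made for illustration only.

```python
def select_behavior(behavior_repository, context):
    """Pick the behavior whose declared criteria best match the observed interface context.

    behavior_repository: list of dicts such as
        {"name": "bulk_transfer", "criteria": {"traffic": "high", "latency": "relaxed"}}
    context: dict of observed properties, e.g. {"traffic": "high", "latency": "relaxed"}.
    """
    def score(behavior):
        criteria = behavior["criteria"]
        return sum(1 for key, value in criteria.items() if context.get(key) == value)

    best = max(behavior_repository, key=score, default=None)
    if best is None or score(best) == 0:
        return None  # no behavior matches the context; keep the current behavior
    return best
```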

#### *4.2. Semantic Routing System*

The ICE interfaces need to be associated with a scalable, semantically enhanced middleware solution that facilitates the integration of required resources in a distributed system. The middleware solution is characterized by the following elements [21]:


On the other hand, the systems that are developed based on the middleware solution must have a homogeneous distributed nature, so that, within each node of the system, the middleware can have the same components and will execute the same operations. Each node of the system is determined by specific applications that, together with the distributed nature of each system built using this middleware solution, represent the basis for the creation and implementation of the ICE.

Thus, the middleware solution components are further detailed:

Components that manage the resources of each local node include: node manager and node application manager.

Components that manage the semantic layer include: semantic manager, semantic service executor, automatic event manager, distributed RDF store and the P2P overlay manager.

The P2P overlay manager supervises the P2P overlay network containing the nodes of the system. Scalable distributed systems can be developed using structured P2P overlays based on distributed hash tables (DHT). P2P overlay stores key-value pairs in the nodes of the system. The system facilitates nodes to become part and leave the network without interfering with the system's operation. A uniform distribution of data across the network is desired in order not to generate congestions. The proposed component uses overlay to manage RDF triples and to transfer messages containing results of the query evaluation.
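A minimal sketch of how such an overlay could assign RDF triples to nodes is given below; it uses a consistent-hash ring keyed on the triple's subject, which is one plausible keying scheme rather than the one used by the P2P overlay manager described here.

```python
import hashlib
from bisect import bisect_right

class TripleOverlay:
    """Toy DHT-style overlay: RDF triples are assigned to nodes on a consistent-hash ring."""

    def __init__(self, node_ids):
        self.nodes = {self._hash(n): n for n in node_ids}   # ring position -> node id
        self.positions = sorted(self.nodes)                 # sorted ring positions
        self.store = {n: [] for n in node_ids}              # per-node triple store

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def _owner(self, key):
        """First node clockwise from the key's position on the ring."""
        i = bisect_right(self.positions, self._hash(key)) % len(self.positions)
        return self.nodes[self.positions[i]]

    def put(self, subject, predicate, obj):
        """Store an RDF triple on the node responsible for its subject key."""
        self.store[self._owner(subject)].append((subject, predicate, obj))

    def query_by_subject(self, subject):
        """Route a lookup to the single node that owns the subject key."""
        return [t for t in self.store[self._owner(subject)] if t[0] == subject]
```

Because triples are distributed by hashed keys, nodes can join or leave with only local re-assignment, which matches the requirement of a uniform data distribution without congesting any single node.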

The semantic manager component facilitates the system components' access to the system's ontology. The ontology-referencing functions of the semantic manager are implemented using Apache Jena. The semantic API exposes functions that allow easier creation of individuals from the classes of the ontology, including application, node and event. The semantic service executor component allows system components to invoke remote semantic services.

The automatic event manager facilitates the event-based connection of the system components. New events will be accepted only if they are in accordance with the predefined requirements. The operation of the declarative event management system is associated with the automatic event manager component. The semantic manager component handles new definitions of EventSink individuals in the system's ontology. The component identifies the corresponding event sources by using semantic queries containing properties related to the event sink descriptors.

Therefore, the proposed middleware solution must have the capability to manage all the processes of most of the applications that are deployed in the "managed applications container" managed by the node application manager. Analyzing these applications, we have identified various operations that must be exposed, bundled into the following APIs:


In order to achieve the best results, the middleware solution must contain a set of predefined semantic definitions that are responsible with providing a consistent implementation of communication and interoperability at different levels and thus, to support various features (e.g., remote application management) [22,23]. On the other hand, on top of these semantic definitions, each system must define its own semantic class.

#### *4.3. Process Identification*

The focus of this section is to detail the application of process discovery techniques on data streams in order to generate behaviors for the adaptive interface. Although most of the steps required to generate the relevant sequences of events are specific to each implementation, for some of the common ones, domain-independent solutions are investigated [12,24]:

*Event identification*—using the data transferred through the interface, which might be in continuous form, to create sequences of events relevant to the application domain. This is a domain-specific problem as the rules for extracting events from observations are dependent on the layout of the monitored environment. Internet-of-Things oriented systems can be used in relation to processing large streams of data coming from heterogeneous devices, such as employing semantically-rich representations.

*Activity recognition*—transforming sequences of "low-level" events into sequences of activities that will be represented in the resulting process model; this task is especially important given the granularity of the input data streams considered here. Solving this problem requires domain-dependent knowledge in the form of "activity models".

*Process instance or process case identification*—this task involves associating each collected event or detected activity to one of more process instances; it should be noted that, depending on the application, this pre-processing step can be applied before or after the activity recognition one. The solutions are mostly limited to domain-specific rules, for example, based on the resources (and/or agents) involved in each process instance.

A process mining system implementation can be developed around the previously described tasks of event recognition/collection, process instance identification, action recognition and process model discovery, following an informational flow similar to that depicted in Figure 4.

**Figure 4.** Semantic routing system and middleware ontology.

Taking into consideration several design concerns, such as portability—making the system capable of handling several related application domains—and the minimization of the effort required from the experts in providing the necessary system KB, a system architecture similar to the one presented in Figure 5 can be implemented: a set of major functional blocks, each corresponding to one of the previously mentioned processing tasks, loosely arranged around a central data repository containing everything from raw collected events to recognized activity sequences and process instances.

**Figure 5.** Process identification system model.

The required input from domain experts, for each processing task, is:


The result of solving each planning problem will be an activity trace, from which the causal and independence relations between the activity instances can be derived. This information is subsequently used to build a workflow net. Unlike many process discovery methods that use an "event log" as input, newly proposed approaches based on Petri Net "unfoldings" accept (labelled) partially ordered sequences of events and as such offer better results from a smaller sample size in the case of processes with high concurrency (of course, with the downside that this information must be provided by an expert, or derived from another source).

#### **5. Implementation of the Intelligent Cyber Enterprise System Interfaces and Event Routing System**

#### *5.1. Implementation of the Event Routing System*

The proposed implementation involves the routing of data to a specific systems interface. The implementation of an event routing system facilitates the integration of new data sources such as sensors or system components without the reconfiguration of the communication network [25]. The data is routed to the appropriate source based on the semantic description. In order to perform the routing process, the following sequence of operations is proposed:


The implementation of the event routing system is based on a P2P overlay network that allows dynamic configuration of new nodes. Identified nodes are integrated in the system without interfering with the systems operation. Operations of the event routing process are distributed. The operations involved in the process of adding the node to the system are further described:


#### *5.2. Process Identification Based on Interface Traffic*

By addressing the enterprise as a CPS, different enterprise systems can be interconnected and data collected from sensors can be aggregated. An industrial scenario was chosen involving the following event sources: position detection devices, RFID readers and contact sensors. The scenario involves human activities and the use of sensors that are present in an industrial environment but used for different purposes and by different industrial systems. The scenario is related to the movement of parts in an industrial environment. The part is transported in a container, by a worker operating a vehicle. The contact sensor detects the opening of a gate. The RFID reader identifies the tags on the vehicle and the container. The vehicle travels through three areas, AREA 1 to AREA 3, and a position detection device detects the vehicle in the specific area.

The first step in extracting the process model involves the event identification tasks. Through the application of rules such as "An observation collected from RFID reader X of a tag Y implies that Y is currently in AREA 1", an event sequence such as the one in Table 1 can be interpreted in relation to a specific component.


**Table 1.** Case study observations from sensors.
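To make the event identification step concrete, the sketch below applies rules of the kind quoted above to a stream of raw observations; the observation field names, the reader-to-area mapping and the event types are assumptions, since the actual sensor payloads are not specified here.

```python
def identify_events(observations):
    """Turn raw sensor observations into domain events using simple rules.

    Each observation is assumed to look like
        {"t": 3, "source": "rfid_reader_1", "tag": "vehicle1"}
    i.e. the rule "an observation from RFID reader X of tag Y implies Y is currently
    in the area covered by X" is encoded in READER_AREAS below (assumed layout).
    """
    READER_AREAS = {"rfid_reader_1": "AREA 1",
                    "rfid_reader_2": "AREA 2",
                    "rfid_reader_3": "AREA 3"}
    events = []
    for obs in sorted(observations, key=lambda o: o["t"]):
        if obs["source"] in READER_AREAS and "tag" in obs:
            events.append({"t": obs["t"], "type": "entered_area",
                           "object": obs["tag"], "area": READER_AREAS[obs["source"]]})
        elif obs["source"] == "gate_contact_sensor":   # contact-sensor rule
            events.append({"t": obs["t"], "type": "gate_opened"})
    return events
```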

To showcase a possible solution for the activity recognition tasks using the state-based action model library mentioned previously, the movement of vehicles can be considered. The action models are designed in terms of the system's state space. Each action definition will specify the action's preconditions and effects as expressions based on state variables. As such, the action models will not directly refer event types, leading to the separation of the knowledge engineering tasks required in order to deploy the observation manager and activity recognition components.

As mentioned previously, the action recognition problem can be converted into a planning problem. This approach allows the reuse of existing tools developed in the field of automated planning such as PDDL editors and planners with various capabilities. For this approach, the development of a "template" planning domain and problem definitions is required. The template contains a type/object hierarchy, a set of predicates (for describing the state) and a set of action definitions.

Listing 1 details an action definition for "Transport". Procedures can be developed to infer the type hierarchy of the planning domain from the system's Knowledge Base.

**Listing 1.** Action definition for "Transport"

```
(:durative-action transport
  :parameters (?v - Vehicle ?l1 - Location ?l2 - Location)
  :duration (= ?duration 22)
  :condition (and (at start (at ?v ?l1))
                  (at start (next_to ?l1 ?l2)))
  :effect (and (at end (at ?v ?l2))
               (at end (not (at ?v ?l1)))))
```

A set of PDDL domain / problem files is created, based on the template defined for each instance of the action recognition problem. Using the information resulting from the event detection phase, the "init", "goals" and "actions" will be augmented (Listing 2).

**Listing 2.** Augmentation of "init", "goals" and "actions"

```
(:init
  ...
  (at 5023f674ae4d2e (event3_pred))
  (at 5023f674ae4d2e (not (event3_pred)))
)
(:goal (and
  ...
  (event3_constr_satisf)
))
(:action event3_constr
  :precondition (and (event3_pred) (at_loc vehicle1 Area3))
  :effect (and (event3_constr_satisf) (increase (total-cost) 2))
)
```
The schema used to constrain the trajectory relies on the use of timed initial literals, available since PDDL 2.2. In the proposed case, a timestamp and a "window" predicate are enabled, and the predicate is added as a goal. This predicate is related to a new action whose preconditions are related to an event. A conjunction of predicates describes the change in the monitored environment's state represented by the event, which in this case corresponds to "vehicle1" arriving in "Area2". Similar constructs will be added for all the available events.

Solving the problem with these actions will allow an existing planner to instantiate planning operators (actions) capable of "explaining" (at a higher level of abstraction) the trajectory in the monitored environment's state space, described by the collected events. For the proposed scenario (Table 1), in the case of events at t6 and t7, satisfying the goal requires predicates "event6\_constr\_satisf" and "event7\_constr\_satisf" to be true. These can only be set by adding instances of actions "event6\_constr" and "event7\_constr" to the plan. However, in order to satisfy the preconditions of "event7\_constr", a "transport" action instance for the areas "Zone2" and "Zone3" must be added before it.
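The following sketch shows how the trajectory-constraint constructs described above could be generated automatically for each detected event; the predicate naming scheme, the window handling and the action cost are illustrative assumptions based on Listing 2, not the authors' actual generator.

```python
def event_constraints(event_id, timestamp, window, state_predicate):
    """Emit the PDDL fragments that anchor one detected event on the plan trajectory.

    A timed initial literal opens a time window in which an extra action (whose
    precondition encodes the observed state change) can be applied, and whose effect
    predicate is required as a goal. All names are illustrative.
    """
    pred = f"event{event_id}_pred"
    satisf = f"event{event_id}_constr_satisf"
    init = [f"(at {timestamp} ({pred}))",
            f"(at {timestamp + window} (not ({pred})))"]
    goal = f"({satisf})"
    action = (f"(:action event{event_id}_constr\n"
              f"  :precondition (and ({pred}) {state_predicate})\n"
              f"  :effect (and ({satisf}) (increase (total-cost) 2)))")
    return init, goal, action

# e.g. the event "vehicle1 detected in Area2" at time 30 with an assumed 5-unit window
init, goal, action = event_constraints(3, 30, 5, "(at_loc vehicle1 Area2)")
```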

Several post-processing steps must be applied on the resulting plan. Such steps include the removal of the special actions used to create the trajectory constraints and the detection of the concurrency relations between activities extracted.

The process instance identification step can be performed before or after the action recognition phase, based on the initial sequence of events or the results of the activity recognition stage (partially-ordered sequences of activities). This can be accomplished using either domain-specific rules or one of the semi-automated, iterative approaches mentioned in the previous section.

The results obtained after the action recognition and instance identification steps can be used to compile an event log, referring high-level activities instead of low-level events. The activities can be subsequently used as the input for the process model discovery step. Given the availability of additional information regarding the concurrent execution of some of the activities (derived from the set of action models used in the recognition step), the study focused on methods capable of using this knowledge. As such, a method based on building event structures followed by folding them into Petri Nets, was used to generate a process model. An example of a process model is described in Figure 6. Some aspects of the process model such as "AND" and "XOR" blocks need to be further analyzed and modeled with the aid of a higher level Petri Net formalism. The system implementation was used to validate the proposed approach in a case study built around a simulated logistic process.

**Figure 6.** Process model of the proposed scenario.

#### **6. Discussion**

In this article, we have explored the possibility of developing a process identification model aimed towards identification of processes associated to data transferred through a system interface. Such processes usually occur in an enterprise environment and are beyond the scope of established process management systems [26]. The identified workflows or processes are associated to the interface as interface behaviors.

The proposed model enhances the capability of a system interface by attaching a perception function that facilitates the interpretation of the working environment and connected systems by integrating information and "stimuli" from the sensory system [27].

Advantages of the proposed system include:


• Enhancing system adaptation capabilities by adding the adaptation capability to the system interface.

This represents a novel approach that can provide benefits in enterprise environments adopting IoT and CPS technologies, in which a large number of events collected from sensors or system components are transferred through system interfaces [28]. The resulting interface behavior model offers meaningful insights into the behavior of the system and allows for the use of novel process mining techniques for conformance checking.

In order to address the entire complexity of the considered domain, ontologies based on description logics (DL) were used. Using this method, the tasks associated with processes can be formulated as reasoning problems. The proposed model demonstrates how data acquired from sensors or other system components can be analyzed and used in determining interface behaviors.

Two main categories of activity recognition problems have been analyzed with regard to the proposed model: model-learning based and specification based. Due to user requirement levels, portability and flexibility issues, only specification-based methods have been further addressed. The authors followed the main lines of research in applying and extending existing process workflow management techniques in CPS and particularly in the ICE environment. Due to the benefits for industrial automation systems, the authors focused on the applications of the developed techniques in process mining in a CPS environment.

Knowledge engineering tasks are still required to develop the planning domain used in the activity recognition step and the rules that enable the identification of individual instance traces in the stream of events. An important factor for the adoption of the proposed method is the availability of events that relate directly to state changes in the observed system. Usually, process discovery techniques are used in correlation with the logs of various enterprise systems.

Although the discovered process model has the same structure as the process used to generate the dataset, some performance and scalability issues related to the activity recognition step were discovered. Planning time is a relevant factor associated with the proposed method and is directly related to the increase in system resource consumption. An "object-based" decomposition method can be implemented in order to increase the efficiency of solving the simple planning problems and identifying "macro-operators" from existing cases. These aspects can be correlated to the original domain and lead to improvements in planning time, by creating "short-cuts" in the state space and transferring them to automated planners [13,18]. Such an approach allows the augmentation of enterprise systems and their components with system interfaces capable of analyzing patterns in the transferred data and associating the patterns with workflows or processes. This approach couples the system interface to the process associated with the actual system [12].

The event routing system facilitates the implementation of the process identification model and routes relevant data to the appropriate system interface.

The advantage offered by the process identification model is the ability to generate process models in the absence of a pre-modeled process, or to offer a basis for comparison of the existing pre-modeled process with the actual execution. The proposed approach utilizes the data already handled by the interface, augmenting the function of the interface and enabling future autonomous adaptation to process changes [12].

Further research is being conducted in the area of autonomous adaptation models for system interfaces that will allow automatic reconfiguration of an interface in accordance with the detected changes in interface behavior and based on the selection of appropriate interface behaviors from behavior repositories [29]. An extensive discussion regarding the available translation methods for different process modelling paradigms and their association with models of interface behaviors needs to be addressed in future research.

#### **7. Conclusions**

A new approach to the future enterprise as a complex system is proposed: the intelligent cyber enterprise. The authors address the enterprise model of the intelligent cyber enterprise, describing a key component, the perceptive interface, thus proposing a method to solve the operational context adaptability problems in complex infrastructures. A behavioral model is developed, based on data and information acquired from the enterprise environment.

The proposed model integrates key functions such as processing, perception, communication, learning, pattern recognition and data mining. The proposed case studies address an enterprise environment consisting of intelligent systems-based components with a high degree of autonomy. The case studies illustrate the advantages of the proposed model in relation to complex systems behavior modeling, thus facilitating system adaptation to a dynamic working environment.

**Author Contributions:** Conceptualization, I.D.; Formal analysis, M.A.M.; Methodology, S.I.C., M.A.M., I.S.S. and L.V.; Software, D.R.; Supervision, I.D.; Validation, D.R.; Writing – original draft, S.I.C., M.A.M. and I.S.S.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **3D Object Reconstruction from Imperfect Depth Data Using Extended YOLOv3 Network**

#### **Audrius Kulikajevas 1, Rytis Maskeliūnas 1, Robertas Damaševičius 2,3,\* and Edmond S. L. Ho <sup>4</sup>**


Received: 21 February 2020; Accepted: 2 April 2020; Published: 3 April 2020

**Abstract:** State-of-the-art intelligent versatile applications provoke the usage of full 3D, depth-based streams, especially in the scenarios of intelligent remote control and communications, where virtual and augmented reality will soon become outdated and are forecasted to be replaced by point cloud streams providing explorable 3D environments of communication and industrial data. One of the most novel approaches employed in modern object reconstruction methods is to use a priori knowledge of the objects that are being reconstructed. Our approach is different, as we strive to reconstruct a 3D object within much more difficult scenarios of limited data availability. The data stream is often limited by insufficient depth camera coverage and, as a result, the objects are occluded and data is lost. Our proposed hybrid artificial neural network modifications have improved the reconstruction results by 8.53%, which allows for much more precise filling of occluded object sides and reduction of noise during the process. Furthermore, the addition of object segmentation masks and individual object instance classification is a leap forward towards general-purpose scene reconstruction, as opposed to a single-object reconstruction task, due to the ability to mask out overlapping object instances and to use only the masked object area in the reconstruction process.

**Keywords:** 3D scanning; 3D shape reconstruction; RGB-D sensors; imperfect data; hybrid neural networks

#### **1. Introduction**

One of the pressing issues in computer vision is three-dimensional (3D) object reconstruction, as it is becoming a core technology in numerous high-end industrial applications such as smart manufacturing, industrial automation and Industry 4.0 [1]. Moreover, there exists a wide variety of applications that would benefit from real-time computer vision systems capable of fully reconstructing scenes, with the most notable examples being interactive media such as virtual reality (VR) games and simulations [2], augmented reality (AR) applications, or even the newest booming technologies such as extended reality (XR) [3]. Further examples of applications of such systems include gesture [4,5] and posture [6] applications, indoor mapping [7], obstacle detection [8], recreating environments in movies or even digital forensics [9] to allow for crime scene recreation, robotics [10], teleconferencing [11] with the use of holograms, and more. Therefore, we can safely assert that there is definitely a need for affordable, commercially viable solutions providing real-time reconstruction capabilities to the average user with as little complexity and as low a barrier of entry, in terms of both financial investment and knowledge about the field, as possible.

As we cannot expect an average user to have access to professional filming sets with mounted arrays of laser scanners capable of scanning the entire room, or to the computing resources required to stitch the data retrieved from multiple high-fidelity depth sensors, we need a solution that avoids these caveats. Therefore, we need a solution capable of working in real-time on a regular non-enthusiast grade workstation or even on a laptop. Furthermore, while we cannot expect the user to have a modern sensor array setup, we can try to minimize the initial setup cost to a single depth sensor available in electronics stores or even in quite a few modern mid-tier and flagship phones. While solutions for scene reconstruction from a single depth sensor already exist, these solutions require incremental building per each frame [12,13]. This is done based on camera localization information and delta frames, as in the scene reconstruction algorithms that make use of simultaneous localization and mapping (SLAM) [14]. To reliably fill all the holes in the areas that are occluded by other objects, or even because of self-occlusion, we would have to scan the entirety of the object from all sides to obtain its full profile. Furthermore, incremental methods tend to underperform because of one principal flaw: changes in the scene can disrupt the mesh [15]. This limits their application in non-static real-world scenes, where, instead of the entire view moving, some objects can change their location or even suddenly pop in or out of the frame. Other proposed methods, such as space carving [16], bypass some of the incremental building problems by performing what is essentially a subtractive reconstruction from multiple perspectives. However, these methods assume that the object mask can be accurately acquired, which can be impossible in certain lighting conditions.

A majority of current algorithms for performing 3D object reconstruction have limitations: objects must be monitored from a large number of views, or the views must follow a small baseline; thus the methods cannot function properly when provided with only a small number of views, or a single one. To solve these issues, one of the most novel approaches employed in state-of-the-art reconstruction algorithms is to use a priori knowledge of the objects that are being reconstructed [17,18]. These generally rely on black-box models such as neural networks (NN). One of the most obvious advantages of using a priori information is that the algorithm can approximate the occluded object information, which we as humans are capable of inferring quite easily. These methods have shown success in solving this task. For example, the 3D Recurrent Reconstruction Neural Network (3D-R2N2) for multi-view reconstruction, evaluated on the Stanford Online Products [19] and ShapeNet [20] datasets, has managed to achieve this task with fewer images available and with competitive results [21], and a proposed improvement uses a densely connected structure as the encoder and the Chamfer Distance as the loss function [22]. Additionally, Generative Adversarial Networks (GANs) can be used to generate 3D objects from multiple 2D views [23] or even from a single image [24]. GANs have also been shown to be able to predict the former geometry of damaged objects [25]. Other authors have used feedforward NNs to detect valid matches between points in an image using different views with more than 98% accuracy [26]. Additionally, it was shown that by adopting Bernstein Basis Function Networks (BBFNs) it is also possible to solve the task of reconstructing a 3D shape [27]. A trilateral convolutional neural network (Tri-CNN) that uses three dilated convolutions in 3D to extend the convolutional receptive field was applied to the ShapeNet and Big Data for Grasp Planning [28] data sets to obtain 3D reconstructions from a single depth image [29].

A majority of methods use voxel-based representations; e.g., PointOutNet [30] has shown the ability to predict and generate plausible 3D object shapes. This allows the model to perform multiple predictions from a single input and to use point cloud distribution modeling to refine the final results. Other approaches include: hierarchical surface predictions (HSPs) [31] for predicting high resolution voxel grids using convolutional neural networks (CNNs); discrete wavelet transform (DWT) and principal component analysis (PCA), which can be used to obtain targeted object models that serve as input to an artificial neural network (ANN) to recognize the 3D shape. Other authors have used a geometric adversarial loss (GAL) in order to regularize single-view 3D object reconstruction from a global perspective by training the GAN to reconstruct multi-view valid 3D models [32]. The RealPoint3D network, composed of an encoder, a 2D-3D fusion module and a decoder, accepts a single-object image and a nearest shape retrieved from ShapeNet to generate fine-grained point clouds [33]. Similarly, PGNet [34], a recurrent generative network, uses the original images and partial projection images for fine-grained 3D reconstruction. Finally, it was shown that using ANNs it is possible to produce a fully textured, appropriately proportioned 3D model from a single RGB [35] or RGB-D frame [36]; however, this approach was limited to basic volume primitives (rectangular boxes and spheres).

Even though the black-box methods have shown substantial improvements over existing state-of-the-art reconstruction algorithms such as incremental reconstruction, they can still be prone to severe mishaps due to poor illumination conditions and object material interaction with light (mainly reflectivity). Furthermore, due to the fact that these methods rely on the visible light spectrum, they are incapable of working in dark environments. Therefore, they would not be suitable for use in critical applications such as security.

From the *Microsoft Kinect*, released in 2010 [37], to the *Intel Realsense* [38], depth sensors are becoming the norm, and not only in flagship mobile phones. As of late, stereoscopic depth is becoming available in newer budget phones with the introduction of multiple back-facing cameras on a single device. For these reasons we have almost reached an era in which RGB-Depth (RGB-D) sensors are readily available. Therefore, focusing solely on RGB cameras misses the potential that RGB-D cameras may provide for object reconstruction tasks. For example, the depth data stream from the Kinect camera has been used to generate topologically correct 3D mesh models [39].

Applying the additional information provided by the RGB-D sensor is the logical next step in the lifecycle of object reconstruction algorithms, as we believe such methods are less dependent on ambient conditions and could potentially be used in pitch-black situations, since modern depth sensors use infrared cameras for object depth calculations at the hardware level. We concede that depth sensors have their own limitations, such as speckling due to surface properties [40,41] and distortions caused by infrared projections [42]. However, we believe that the addition of depth sensor information in conjunction with readily available color data adds useful information. This information helps ANNs to better generalize input data and increases robustness against different lighting conditions. This includes pitch-black environments, as the depth information is sufficient to reconstruct the captured scene in most cases.

We present an improved hybrid ANN architecture for reconstructing polygonal meshes using only a single RGB-D frame, and employing a priori knowledge, which allows the neural network to be deployed on low-end RGB-D sensor devices with low frame rates.

#### **2. Materials and Methods**

#### *2.1. Proposed Hybrid Neural Network Architecture*

Our hybrid NN architecture (Figure 1) consists of two major branches: a preliminary input branch used for object instance classification and mask extraction, and a secondary input branch, which uses the results of the preliminary branch in conjunction with its inputs to perform individual object reconstruction. However, unlike the preliminary branch, we do not use generalized branches for reconstruction; instead, we have *n* specialized branches, one for each object category. This allows us to train additional types of objects in the reconstruction branches more easily without having to re-train the classification, and it also allows any individual reconstruction branch to be re-trained on more models without losing the existing gradients [43]. The modularity of the system also reduces training times, as each branch can specialize in its own generalization task, and it gives us the ability to change the network configuration of each reconstruction branch: simpler configurations for easier objects and more elaborate ANN structures for more complex objects.

**Figure 1.** Our extended *YOLOv3* capable of extracting geometric object segmentation along with object bounding boxes.

#### *2.2. Classification and Segmentation Algorithm*

Our aim is to detect individual object instances in the scene in order to have a system that is usable in real-world environments. Therefore, we need a classifier that is capable of detecting more than a single object instance per frame; for example, having two cups and a toy plane on a table would require us to rebuild both cups and the toy plane model, respectively. Fortunately, some research has already been performed in the area of individual object instance classification [44–46].

For this reason, we perform our classification task with one of the existing state-of-the-art classifiers, *YOLOv3* [47], as it has been shown to produce some of the best classification results. We have adapted it to our needs to output an additional geometric segmentation mask (Figure 1), even though its authors mention in the original paper that they were unable to achieve object instance segmentation. Additionally, we define the term geometric segmentation as an extension of segmentation that allows discrimination between nearby object instances. This is done by generating a heatmap glow that radiates from the origin of the object. While other, more lightweight methods exist, such as *MobileNet* [48], in this paper we compare the classification results using three different inputs: only color information; only depth information; and both color and depth information. Therefore, we have decided to use a slower but more accurate algorithm to have the most representative results.

Like the majority of individual object instance classification algorithms, *YOLOv3* uses what are known as anchors for object detection. These anchors serve as starting bounding boxes when classifying objects; for example, a motor vehicle has a very different profile from a basketball. A basketball in most cases has a bounding box with an aspect ratio close to 1:1, meaning its width equals (or, when the image is distorted, nearly equals) its height, whereas a motor vehicle such as an automobile usually has a height lower than its width. For this reason, one anchor can specialize in detecting automobiles while another specializes in detecting basketballs. An additional feature, albeit a less useful one given the way our training and testing dataset is generated, is the specification of bounding box scales by the authors of *YOLOv3*. These size specializations group bounding boxes into three groups: small, medium and large. For example, small objects may include kitchen utensils, medium objects may include people, and large objects may include vehicles. However, unlike anchors, these bounding box groups are not exclusive to particular objects, as apparent size varies greatly with the camera distance from the object. Since our dataset generates object scales completely uniformly, this grouping loses some of its usefulness.

In our work, we have experimented with three types of inputs into the ANN: color space, front-to-back object depth field, and the combination of both. In the case of color space, we use 3 input channels representing the *red*, *green* and *blue* colors; when using the depth field, we use a single input channel containing only normalized depth values; and for the combination of both we use *RGBD* channels following the same principle. Depth value normalization is performed by dividing each pixel's *z* value by the *zmax* of the frame, thus landing the depth in the range *z* = [0, 1]. Our input layer is then connected to the *DarkNet53* network containing 53 convolutional layers as per its specification, which outputs three routes: the *main route*, generally used for larger objects, *route 2*, used for medium-sized objects, and, finally, *route 1* for smaller objects. Because the testing set is uniformly randomly generated and may contain the same object in all size categories, we lose some of the flexibility provided by this setup, and removing it affects classification performance only minimally. However, to stay true to the original algorithm and obtain as unbiased a result as possible, we have decided to keep all of the branches used in the source material. Additionally, these three routes provide good starting points for the shortcuts used in our segmentation extension (Figure 2).
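
A minimal sketch of how the three input variants could be assembled is given below, assuming the color and depth frames are already available as NumPy arrays; the function and argument names are illustrative and not taken from the original implementation.

```python
import numpy as np

def prepare_input(rgb, depth, mode="rgbd"):
    """Assemble the network input tensor from a color frame and a depth frame.

    rgb   : float array of shape (H, W, 3), values in [0, 1]
    depth : float array of shape (H, W), raw front-to-back depth values
    mode  : "rgb", "depth" or "rgbd"
    """
    # Normalize depth by the maximum depth of the frame, landing values in [0, 1].
    z_max = depth.max()
    depth_norm = depth / z_max if z_max > 0 else depth

    if mode == "rgb":
        return rgb                                         # 3 channels
    if mode == "depth":
        return depth_norm[..., np.newaxis]                 # 1 channel
    # "rgbd": stack color and normalized depth into a 4-channel tensor.
    return np.concatenate([rgb, depth_norm[..., np.newaxis]], axis=-1)
```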

**Figure 2.** Our proposed geometrical object segmentation extension.

Because each pair of neighboring routes is separated by a power-of-two scale, we use transposed convolutional layers [49] to upscale them gradually and then merge them into the desired final shape matrix. We construct our classless geometric segmentation mask by first upscaling the *main route* output and merging it with *route 2*; the resulting layer is then upscaled again and merged with the final *DarkNet* output (*route 1*), which gives us a layer containing latent information from all previous layers, each specialized in learning objects of a different size.

Next, we branch our resulting hidden nodes into four different layers. Each layer has a slightly different network configuration, allowing them to essentially vote on their influence in the final result by extracting different latent feature maps from the previous layers (Table 1). The first three branches (*A, B, C*) are convolutional branches containing one, two and three convolutional layers, respectively. For our final branch (*D*), instead of a convolutional layer, we use a max pooling layer to extract the most prominent features. We selected this parallel stacked approach because we found it more efficient at extracting the object masks than linearly stacked layers when training the segmentation layers independently from the rest of the model. This decoupling of the segmentation task from the classification task during training gives the additional benefit of allowing us to use transfer learning, which has been shown to have very good practical results [50].

Next, we run our concatenated branches through convolutional layers to extract the most viable features and normalize their output to the range (0, 1), giving us the final segmentation image. In our case, the final segmentation output is 80 × 60, which is more than sufficient to extract approximate depth masks, as we do not require pixel-perfect segment representations. Finally, we use a cascading flood-fill (Algorithm 1) to classify the mask pixel-wise. We do this because we found the generated binary masks to be impervious to false positives and false negatives, unlike classification using bounding boxes, which can exhibit three types of errors: false positives, false negatives and misclassification. This allows us to remove false positive bounding box detections when they do not intersect the origin of the mask. In our testing set, the best cascade parameters were 0.9 and *θ* = 0.01.
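
Algorithm 1 itself is not reproduced here. The sketch below illustrates one possible reading of such a cascading flood-fill over the predicted heatmap, interpreting the two parameters as a seed threshold (0.9) and a stop threshold θ (0.01); this interpretation, the function name and the 4-neighborhood flooding are assumptions made for illustration only.

```python
import numpy as np
from collections import deque

def flood_fill_labels(heatmap, seed_thr=0.9, theta=0.01):
    """Label mask pixels by flooding outwards from heatmap peaks.

    heatmap : (H, W) array with values in (0, 1), bright at object origins.
    Returns an integer label map of the same shape (0 = background).
    """
    h, w = heatmap.shape
    labels = np.zeros((h, w), dtype=np.int32)
    next_label = 0
    # Visit seed pixels from the brightest downwards.
    seeds = np.argwhere(heatmap >= seed_thr)
    seeds = seeds[np.argsort(-heatmap[seeds[:, 0], seeds[:, 1]])]
    for y, x in seeds:
        if labels[y, x]:
            continue                        # already claimed by another object
        next_label += 1
        labels[y, x] = next_label
        queue = deque([(y, x)])
        while queue:
            cy, cx = queue.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h and 0 <= nx < w and not labels[ny, nx] \
                        and heatmap[ny, nx] > theta:
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
    return labels
```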


**Table 1.** Geometric Segmentation architecture.



Additionally, we have modified the *YOLOv3* network because we had issues with it being unable to train: it consistently fell into local minima during gradient descent and got perpetually stuck in them. To solve this issue, we introduced periodic hyperparameters [51] during model learning. Specifically, we changed the learning rate to alternate within the range *lr*<sub>min</sub> = 10<sup>−6</sup> to *lr*<sub>max</sub> = 10<sup>−4</sup>.

$$y(x) = \begin{cases} \dfrac{x}{w_1} \times (lr_{\max} - lr_{\min}) + lr_{\min}, & \text{if } x < w_1\\[2ex] \dfrac{1 + \cos\left(\pi \dfrac{(x - w_1) \bmod (w_0 + 1)}{w_0}\right)}{2} \times (lr_{\max} - lr_{\min}) + lr_{\min}, & \text{otherwise} \end{cases} \tag{1}$$

This periodic learning rate (Equation (1)) vastly improved our model's ability to learn the underlying relationships in the input data by alternating between low and high learning rates, thereby jumping out of potential local minima that it might otherwise start orbiting during stochastic gradient descent. Our function has two stages: the first stage consists of two training iterations, with *w*<sub>1</sub> = 2 × *s*, and the second stage of four iterations, with *w*<sub>0</sub> = 4 × *s*, where *s* is the number of steps per batch. We selected this two-stage learning function because a high learning rate at the very start may cause the model to diverge. Therefore, during the first stage we increase the learning rate linearly. Once in the second stage, we use the cosine function and the modulus operator to make the model alternate between the two values. The shape of the alternating function can also influence model convergence, as some models need to stay near different extrema for different amounts of time. Therefore, a different dataset may require further fine-tuning of the parameters of this equation to obtain different slope shapes, while still maintaining the benefits of an alternating learning rate.
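
The following is a minimal sketch of this two-stage schedule, under the reconstruction of Equation (1) given above (linear warm-up followed by a cosine alternation); the function name and argument `steps_per_batch` are illustrative, not the authors' exact implementation.

```python
import math

def periodic_lr(step, steps_per_batch, lr_min=1e-6, lr_max=1e-4):
    """Two-stage learning rate: linear warm-up, then a cosine alternation."""
    w1 = 2 * steps_per_batch          # length of the linear warm-up stage
    w0 = 4 * steps_per_batch          # period of the alternating stage
    if step < w1:
        # Stage 1: ramp linearly from lr_min up to lr_max.
        return step / w1 * (lr_max - lr_min) + lr_min
    # Stage 2: alternate between lr_max and lr_min with a cosine profile.
    phase = (step - w1) % (w0 + 1)
    return (1 + math.cos(math.pi * phase / w0)) / 2 * (lr_max - lr_min) + lr_min
```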

Additionally, as we are training the NN from scratch, we noticed that our network, despite finding better convergence results thanks to the periodic learning rate jumping out of local minima, tended to overfit on our dataset. To address this additional issue, we modified the *YOLOv3* network by adding dropout layers with a dropout rate of *P*(*x*) = 0.5 after each branch of *DarkNet53* and before each of the final layers predicting the bounding boxes.

Furthermore, we had issues with the model overfitting to the training set; to solve this, we additionally modified the neural network by adding two more dropout layers. For comparison, we trained our model 6 times, each with 50 iterations using a mini-batch of size 8, because after about 50 iterations the standard *YOLOv3* model starts to overfit and lose precision on our validation dataset. Therefore, for the most objective comparison, we trained our modified network for the same number of epochs. Note that even though our method also starts to overfit, unlike the YOLOv3 network model, the accuracy of our modified model remains roughly at the same value when overfitting, from which we deduce that the changes make the model more stable.

Figure 3 shows the differences in the loss function when training on the RGB, RGB-D and Depth data as input. For the unmodified *YOLOv3* we use *lr* = 10<sup>−5</sup>, the midpoint between the minimum and maximum learning rates of our periodic learning rate function. As the graph shows, with a static learning rate the loss function on the RGB and RGB-D datasets reaches a local minimum, slowing the model's ability to learn new features, unlike our periodic learning rate, which temporarily forces the model to overshoot its target and thereby sometimes lets it fall into a better local minimum. This effect can be seen in the distinct peaks and valleys in the graphs. The outliers in these graphs are the depth-only data points. Although their loss appears lower and follows a better downwards trajectory during stochastic descent in both cases, we noticed that, despite the seemingly lower loss compared to RGB and RGB-D, the actual model accuracy is very unstable on an epoch-per-epoch basis. We assert that this is because depth alone provides very unstable data that is hard to interpret. We make this assumption because, even for an expert evaluating the depth maps alone, it is usually very hard to discern what type of object is present without knowing its texture; it is only possible to tell that there is an object in the frame. Finally, we can see that the RGB-D data is a clear winner during training in both cases, which means that depth data can indeed help model generalization.

**Figure 3.** Training loss comparison between baseline *YOLOv3* and our modified version when using RGB, RGB-D and depth data as training. Due to the loss function being inherently noisy for each of the mini-batches, we have used Savitzky-Golay [52] digital filter to perform the smoothing of the overall graph.

#### *2.3. Reconstruction Algorithm*

The proposed algorithm for 3D object reconstruction consists of two subsystems: voxel cloud reconstruction and post-processing (Figure 4). In the reconstruction step, we take the classifier's mask output for the object together with the original depth map and feed them into our reconstruction ANN (Figure 5), which performs the object reconstruction task for the given masked input frame. Unlike the classification algorithm, we use only the underlying depth input from the classifier, as it provides enough information for the reconstruction of a specific object. This is because we already know the class of the object; class information is required during classification because different objects can have very similar depth representations. During reconstruction, however, this is not an issue, because our ANN is designed so that each branch is responsible for reconstructing similar object representations.

Once the classifier-segmentation branch has finished its task, the appropriately trained reconstruction branch is selected for each object instance. In our case, every branch is highly specialized on a single type of object that it can reconstruct, which is why object classification is required. However, we believe there is no roadblock to having more generic object reconstruction branches; for example, all similar objects could be grouped into a single reconstruction task. This could allow some simplifications in the classification-segmentation step, as it would no longer be required to classify highly specific object instances, thus reducing the failure rate caused by object similarities. For example, a cup and a basket can be very similar objects and be misclassified. Additionally, the hybridization allows fine-tuning of the reconstruction branches without having to retrain the entire neural network model and potentially losing already existing gradients through on-line training skewing the results towards the newly provided data. This in turn reduces re-training time when new data points are provided for a specific object, as we no longer need to touch the established branches thanks to the modularity.

Inside our reconstruction network branch (Figure 5), for the given depth input we use convolutional layers to reduce the dimensionality of the input image during the encoding phase (see Table 2). For a given input, we create a bottleneck convolution layer which extracts 96 features, followed by a spatial 2D dropout [53] layer with *P*(*x*) = 0.1 to improve generalization. We use spatial dropout because it has been shown to improve generalization during training, as it reduces the effect of nearby pixels being strongly correlated within the feature maps. Afterwards, we add an inception [54] layer (Figure 6), which we use as a residual block [55], followed by another spatial dropout. We then add two additional bottleneck residual layers, each followed by further dropouts, with a final convolution giving us 256 features at a resolution of 20 × 15. Our final encoder layer is connected through a fully-connected layer to a variational autoencoder [56] containing 2 latent dimensions, as variational autoencoders have shown great capabilities in generative tasks. Finally, the sampling layer is connected to a fully-connected layer, which is then unpacked into a 4 × 4 × 4 matrix. We use transposed three-dimensional convolutional layers to perform up-sampling. This is done twice, giving us 4 feature maps in a 32 × 32 × 32 voxel space. Up to this point we have used Rectified Linear Units [57] (ReLUs) as our activation function; however, for our final 3D convolutional layer we use a softmax function in order to normalize its outputs, where each voxel contains two neurons: one neuron indicating the confidence of the voxel being toggled on, the other indicating the confidence of it being off. This turns the task from a regression task into a classification task, allowing us to use categorical cross-entropy to measure the loss between the predicted value and our ground truth.
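
A minimal sketch of a single reconstruction branch following this description is given below, written against the tf.keras API. It is an approximation, not the authors' exact model: the inception/residual blocks are replaced with plain strided convolutions, the filter counts of intermediate layers and the strides of the transposed 3D convolutions are assumptions, and the KL-divergence term of the variational autoencoder loss is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 2  # latent dimensions of the variational autoencoder

def sample_z(args):
    """Reparameterization trick: z = mu + sigma * eps."""
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def build_reconstruction_branch(input_shape=(240, 320, 1)):
    depth_in = layers.Input(shape=input_shape, name="masked_depth")

    # Encoder: strided convolutions with spatial dropout between blocks.
    x = layers.Conv2D(96, 3, strides=2, padding="same", activation="relu")(depth_in)
    x = layers.SpatialDropout2D(0.1)(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.SpatialDropout2D(0.1)(x)
    x = layers.Conv2D(192, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.SpatialDropout2D(0.1)(x)
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)  # 15 x 20 x 256

    # Variational bottleneck with two latent dimensions.
    x = layers.Flatten()(x)
    mu = layers.Dense(LATENT_DIM)(x)
    log_var = layers.Dense(LATENT_DIM)(x)
    z = layers.Lambda(sample_z)([mu, log_var])

    # Decoder: unpack into a 4x4x4 grid, then upsample with transposed 3D convolutions.
    d = layers.Dense(4 * 4 * 4 * 64, activation="relu")(z)
    d = layers.Reshape((4, 4, 4, 64))(d)
    d = layers.Conv3DTranspose(32, 4, strides=4, padding="same", activation="relu")(d)  # 16^3
    d = layers.Conv3DTranspose(4, 4, strides=2, padding="same", activation="relu")(d)   # 32^3
    # Two output channels per voxel (on/off), normalized with softmax.
    out = layers.Conv3D(2, 1, activation="softmax", name="voxel_state")(d)

    model = Model(depth_in, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```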

**Figure 4.** Workflow of object reconstruction from sensor data.

**Figure 5.** Diagram of a single object reconstruction network architecture branch. For the given depth frame, the depth encoder creates a bottleneck, which is directly connected to the VAE node; the resulting sampler is connected to the voxel decoder. The voxel decoder layer outputs a 32 × 32 × 32 × 2 matrix, which can be read as *x* × *y* × *z* × *s*, where the *x, y, z* components indicate the position in the 3D grid, and the *s* component indicates the voxel state encoded as one-hot.


**Table 2.** Architecture of the reconstruction neural network.

**Figure 6.** An example of the inception layer. An input layer is connected to three branches in parallel. If multiple inception layers are used, they are connected sequentially. The final inception layer output and a 1 × 1 convolution are then combined using addition. The result is then used for subsequent layers.

#### *2.4. Proposed Network vs. YOLOv3*

Our approach is the hybridization of two ANN architectures: a classification-segmentation branch and a reconstruction branch (see Figure 7). The classification-segmentation branch, as the name suggests, performs object instance classification and segmentation. This information is then fed to the object reconstruction branches. The object reconstruction branch contains a fleet of specialized pre-trained autoencoder models, where each auto-encoder can reconstruct a model's three-dimensional representation from only a single depth frame. The initial classification-segmentation branch is our expanded interpretation of YOLOv3, which adds a crucial output to the existing YOLOv3 network outputs, i.e., object instance segmentation. This extension provides the information required for the reconstruction step by extracting the object instance mask that can be applied to the initially captured depth for each object.

**Figure 7.** Full view of the proposed network model that extends the YOLOv3 network.

#### *2.5. Dataset*

As our method requires a priori information for the reconstruction of the captured object, a large, well-labeled dataset is needed. However, unlike object recognition, which has multiple datasets, e.g., the *COCO* [58] and *Pascal VOC* [59] datasets, there seems to be a lack of public datasets providing an RGB-D scene representation together with fully scanned point cloud information viable for our approach. While datasets like *ScanNet* [60] exist, they are missing the finer object details we are trying to preserve, as their scans focus on the full-room experience. Therefore, our training data consists exclusively of synthetically generated datasets, built from *ShapeNetCore*, a subset of the *ShapeNet* dataset that provides 3D object models spanning 55 categories (see an example of a coffee cup model in Figure 8). In addition, we use real-life data acquired with the *Intel Realsense ZR300* and *Intel Realsense D435i* (Intel Corp., Santa Clara, CA, USA) devices for visual validation, as it is impossible to measure it objectively without a 3D artist recreating a 1:1 replica of said objects, which is unfortunately an unfeasible option. However, using real-world samples as a validation set is not subject to training bias, because they are never used in the training process.

As mentioned, for the training of the black-box model we are using the *ShapeNetCore* dataset that we prepare using *Blender* [61] in order to create the appropriate datasets. Due to the fact that we are training a hybrid neural network, we need two separate training and testing sets, one for each task.

**Figure 8.** A coffee cup model from the *ShapeNetCore* dataset.

#### 2.5.1. Classification Dataset

To create this subset of data, we create random scenes using the following procedure. Firstly, we randomly decide how many objects we want in the scene, in the range *nobjects* = [1; 10), and pick that many random objects from the *ShapeNetCore* dataset to populate the scene. Before applying any external transformations, we transform the object geometry so that all objects are of uniform scale and have the same pivot point. To perform the required transformations, we first calculate the geometry extents. Once we know the object extents, we can move all the objects along the *Up* axis (in our case this is *z*) and scale down all vertices by the largest axis (Algorithm 2). This gives us uniformly scaled, normalized geometry that we can freely use.

#### **Algorithm 2** Normalize geometry


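The listing of Algorithm 2 is not reproduced here; the following is a minimal sketch of the normalization step described above (common pivot, object resting on *z* = 0, uniform scale by the largest axis), assuming the vertices are available as an (N, 3) NumPy array. The function name is illustrative.

```python
import numpy as np

def normalize_geometry(vertices):
    """Normalize object geometry: rest it on z = 0, center the pivot in x/y,
    and scale all vertices down by the largest axis extent."""
    v = np.asarray(vertices, dtype=np.float64).copy()
    mins, maxs = v.min(axis=0), v.max(axis=0)
    extents = maxs - mins

    # Move the object so its x/y center is the pivot and its bottom sits on z = 0.
    v[:, 0] -= (mins[0] + maxs[0]) / 2.0
    v[:, 1] -= (mins[1] + maxs[1]) / 2.0
    v[:, 2] -= mins[2]

    # Scale uniformly by the largest axis so every object has the same footprint.
    largest = extents.max()
    if largest > 0:
        v /= largest
    return v
```
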
We place the selected objects in the scene with random transformation matrices, making sure that the objects never overlap in space. To generate a random local transformation matrix (*L*) (Equation (3)), we need its three components: Scale (*S*), where *s* is a random value in the range *s* = [0.7, 2); Rotation (*Rz*), where the rotation is a random value in the range *θ* = [0, 2*π*) and is performed only around the *z* axis to ensure that randomly generated scenes are realistic and do not require artist intervention; and Translation (*T*), where the *x* and *y* values are non-intersecting values obtained from *r* = [−5, 5] and *α* = [0, 2*π*) (Equation (2)).

$$\begin{cases} x = r \times \cos \alpha \\ y = r \times \sin \alpha \end{cases} \tag{2}$$

$$L = S \times R_z \times T = \begin{bmatrix} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & s & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ x & y & z & 1 \end{bmatrix} \tag{3}$$
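
A minimal sketch of generating one such random local transform from Equations (2) and (3) is shown below (row-vector convention with the translation in the last row). The ranges are those given above; the function name is illustrative, and the check that objects do not intersect each other is omitted.

```python
import numpy as np

def random_local_transform(rng=np.random.default_rng()):
    """Build L = S x Rz x T with a random scale, z-rotation and translation."""
    s = rng.uniform(0.7, 2.0)                    # uniform scale
    theta = rng.uniform(0.0, 2.0 * np.pi)        # rotation around the z axis
    r = rng.uniform(-5.0, 5.0)                   # translation radius
    alpha = rng.uniform(0.0, 2.0 * np.pi)        # translation angle (Equation (2))
    x, y, z = r * np.cos(alpha), r * np.sin(alpha), 0.0

    S = np.diag([s, s, s, 1.0])
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0, 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0, 0.0],
                   [0.0,            0.0,           1.0, 0.0],
                   [0.0,            0.0,           0.0, 1.0]])
    T = np.eye(4)
    T[3, :3] = [x, y, z]                         # translation stored in the last row
    return S @ Rz @ T
```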

Once the selected objects are placed, we need to apply lighting in order to obtain realistic environments. To do this, we use the Lambertian shading model and directional lights. We randomly generate *nlights* = [1; 4) lights in the scene. We pick a random light rotation (we ignore translation, as it does not matter for directional lights); we generate a random color in the range *ColRGB* = [0.7, 1], with the minimum bound of 0.7 chosen to avoid unrealistic real-world lighting; and a random intensity *I* = [0.7, 1]. This light acts as our key light. To avoid hard shadows, which would not occur in the real world unless a spotlight were used, for each key light we create a backlight pointing in the opposite direction with half the intensity and an identical color.

Once the scene setup is complete, we render the scene in three modes: *color*, *depth* and *mask*. Color mode gives us the scene representation from a regular visible-light camera. As we do not put any background objects into the scene, the generated background is black; however, later on we use augmentation during training to switch backgrounds and improve recall rates. Once the color frame is extracted, we extract the mask; to do so, we assign each object an incremental *ID* starting at 1, which allows us to differentiate between objects in the frame. Finally, we render the depth representation of the scene. Before rendering depth, we place a plane on the ground that acts as our ground plane, which allows for more realistic depth representations because the objects are no longer *floating* in space. The depth is rendered front-to-back, meaning that the closer an object is to the camera, the closer its depth value is to zero; the front-to-back model was chosen because it matches the convention used by the *Intel Realsense*.

Each scene is rendered at *320* × *240* resolution *n* = 25 times by placing the camera in random locations (Algorithm 3) and pointing it at the center of the scene, where *r* = 10, *zmin* = 4, *zmax* = 6.



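The listing of Algorithm 3 is not reproduced here; the sketch below illustrates the camera placement described above, assuming a random azimuth on a circle of radius *r* = 10 with height *z* ∈ [4, 6] and the camera aimed at the scene center. The function name is illustrative.

```python
import numpy as np

def random_camera_pose(r=10.0, z_min=4.0, z_max=6.0, rng=np.random.default_rng()):
    """Place the camera on a circle around the scene and aim it at the origin.

    Returns the camera position and a unit forward vector pointing at the scene center.
    """
    alpha = rng.uniform(0.0, 2.0 * np.pi)
    position = np.array([r * np.cos(alpha),
                         r * np.sin(alpha),
                         rng.uniform(z_min, z_max)])
    forward = -position / np.linalg.norm(position)   # look-at vector towards (0, 0, 0)
    return position, forward
```
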
We save the perspectives in the *OpenEXR* format [62] instead of traditional image formats such as *PNG*, as the *OpenEXR* file format is linear, allowing retention of the full depth range without any loss of information, since it is not limited to 32 bits per pixel. The final *EXR* file contains the following channels: *R*, *G*, *B*, containing the red, green and blue color information, respectively; an *id* channel containing the mask information for each pixel; and a *Z* channel containing the linear depth data.

Once we create the input image, we additionally label the data and extract the segmentation mask that will be used as the output when training the artificial neural network. We perform this step after the scene is rendered in order to account for any occlusion that may occur when objects are in front of each other and overlap. We extract the object bounding boxes by finding the top-left-most and bottom-right-most pixels of the mask. The mask intensity is based on the pixel's squared distance from the center of the bounding box: the central pixels of the bounding box are completely white, and pixels become darker the closer they are to the edges. We use this non-flat segmentation, interpolating pixel intensity from the bounding box edge to the bounding box center, to be able to separate individual object instances in the mask when they overlap. The mask is then scaled down to *80* × *60* resolution, which is generally sufficient and reduces the required resources.
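
A minimal sketch of this labeling step is given below: the bounding box is taken from the extreme mask pixels, and the non-flat mask intensity falls off from the bounding box center towards its edges. The function name, the exact fall-off profile and the nearest-neighbour downscaling are illustrative assumptions.

```python
import numpy as np

def make_label(instance_mask, out_shape=(60, 80)):
    """Extract a bounding box and a radial (non-flat) mask from a binary instance mask."""
    ys, xs = np.nonzero(instance_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()   # top-left / bottom-right
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0

    # Pixel intensity interpolated from 1 at the box center down to 0 at the box edges.
    yy, xx = np.mgrid[0:instance_mask.shape[0], 0:instance_mask.shape[1]]
    half_h = max((y1 - y0) / 2.0, 1.0)
    half_w = max((x1 - x0) / 2.0, 1.0)
    dist = np.sqrt(((yy - cy) / half_h) ** 2 + ((xx - cx) / half_w) ** 2)
    radial = np.clip(1.0 - dist, 0.0, 1.0) * (instance_mask > 0)

    # Downscale to the 80 x 60 output resolution by simple striding.
    sy = instance_mask.shape[0] // out_shape[0]
    sx = instance_mask.shape[1] // out_shape[1]
    return (x0, y0, x1, y1), radial[::sy, ::sx][:out_shape[0], :out_shape[1]]
```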

#### 2.5.2. Anchor Selection

The existing anchors used with *COCO*, *Pascal VOC* and other datasets are not suitable for our dataset, as our bounding boxes rarely fit them. Therefore, we performed an analysis of the class data and selected the three most fitting anchors per classifier branch scale. As we can see from Figure 9, our classes generally tend to be biased towards a 1:1 aspect ratio, because the dataset is randomly generated, unlike in real-world applications.

However, while the classes tend to be biased towards 1:1 for the most part, it would be incorrect to assert that all individual object instances fit neatly into this aspect ratio, as they still retain a certain bias. According to previous Single Shot Detection (SSD) research [63], selecting inadequate base anchor boxes can negatively affect the training process and cause the network to overfit. Therefore, we chose to have 3 anchors per anchor size, as this seems to sufficiently cover the entire bounding box scale spread by including tall, wide and rectangular objects. We select the anchor points using *K-Means* to split the data into 9 distinct groups (Figure 10).

Once we have our cluster points for bounding box detections, we sort them in order to group them into small, medium and large anchor sets. This gives us three different anchor sets, each containing the most popular aspect ratios for that scale's detection branch, as can be seen in Table 3.
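
A minimal sketch of this anchor selection is shown below, using scikit-learn's KMeans with 9 clusters over the dataset's bounding box dimensions; the input array name and the sorting by box area are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_anchors(box_dims, n_anchors=9):
    """Cluster bounding box (width, height) pairs into anchor boxes.

    box_dims : (N, 2) array of bounding box widths and heights in pixels.
    Returns anchors sorted by area and grouped into small/medium/large sets of three.
    """
    kmeans = KMeans(n_clusters=n_anchors, n_init=10, random_state=0).fit(box_dims)
    anchors = kmeans.cluster_centers_
    anchors = anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]   # sort by box area
    small, medium, large = anchors[:3], anchors[3:6], anchors[6:]
    return small, medium, large
```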

**Figure 9.** Each individual point denotes the mean object bounding box scale for each class type.

**Figure 10.** Selected anchors using *K-Means* clustering algorithm. Different colors denote distinct anchor groups responsible for detecting objects in the spread.

**Table 3.** Anchor scales in pixels calculated using the *K-Means* clustering method.


The neural network architecture described in Section 2.2 was trained in three separate modes in order to infer how much the additional depth information improves the classification results. These three modes are the RGB, RGB-D and Depth training modes, where RGB mode means we train using only the color information generated from the dataset, RGB-D mode uses both depth and color information, and Depth mode trains the network using only depth information. We do not use any additional data augmentation when training in the RGB and Depth modes. We do, however, add additional augmentation when training in the RGB-D mode: there is a small chance that either the RGB or the Depth channel will not be included in a training sample. We perform this augmentation because both the RGB camera and the depth sensor may produce invalid frames. Therefore, we assert that both of these data sources are equally necessary for the classification task, that they must be generalized separately from each other, and that they should contribute equally to the classification task. The augmentation is decided randomly when preparing the mini-batch to be sent to the artificial neural network for training. There is a *λ* = 0.1 chance that a specific input data point will be picked for additional augmentation; if it is picked, there is an equal probability that either the RGB or the Depth data will be erased from the input and replaced with zeros. We decided on this augmentation approach because both RGB and depth frames from real sensors are prone to errors. For example, the RGB camera may fail in bad lighting or even be unavailable when the room is pitch black. Likewise, the depth frames are prone to errors due to inconsistencies in generating the depth map, which cause a speckling effect in the depth information; additionally, a camera that is too close to an object may be completely unable to extract proper depth information. Therefore, we chose this augmentation approach as it allows the sensors to work in tandem when both are available, but to fill in the gaps when one of them fails to provide accurate information.
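
A minimal sketch of this channel-dropping augmentation is given below, assuming a 4-channel RGB-D sample where the last channel is depth; the function and argument names are illustrative.

```python
import numpy as np

def augment_rgbd(sample, lam=0.1, rng=np.random.default_rng()):
    """Randomly blank the RGB or the depth part of an RGB-D training sample.

    sample : (H, W, 4) array where channels 0-2 are RGB and channel 3 is depth.
    """
    out = sample.copy()
    if rng.random() < lam:                 # this sample is picked for augmentation
        if rng.random() < 0.5:
            out[..., :3] = 0.0             # simulate a failed color camera
        else:
            out[..., 3] = 0.0              # simulate a failed depth sensor
    return out
```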

#### 2.5.3. Reconstruction Dataset

For the reconstruction training set, we use the same *ShapeNetCore* dataset to generate the corresponding depth images and ground truths for the individual objects' voxel clouds. We again used *Blender* to generate the training data; however, the generated input data is different. We assert that the object material does not influence the object's shape; therefore, unlike when generating the classification data, we no longer generate the color map and render only the depth information for each object. We render individual objects by placing the cameras in such a way that the specific object is visible from all angles from 45◦ to 90◦, at a distance from 1 to 1.5 m, excluding the bottom. As a result, we have 48 perspectives for each of the object models. Once again, we save the renders as *OpenEXR* files in order to preserve the depth values in a lossless format. Finally, we generate the voxel-cloud representation [64]. Voxelization is performed by partitioning space into equally sized cells, where the cell size is selected based on the largest object dimension axis. Following the space partitioning, we iterate over each of the cells and compute whether the specific cell should be filled using ray-polygon intersection [65].

#### *2.6. Evaluation*

In order to evaluate the correctness of our results, we evaluate the results of the proposed algorithm as a whole, and additionally we evaluate both subsystems individually. To evaluate the classification accuracy, we use the *mAP* metric to assess the quality of the classifier and its output bounding boxes. When performing the classification accuracy evaluation, we evaluate all three trained models: RGB, RGB-D and Depth. This allows us to determine the quality differences introduced by adding depth information to the classification task.

For the reconstruction task we require the output voxel representation of the object to be as close to ground truth as possible. For that, we define our reconstruction quality as the *Intersection-over-Union* metric. Furthermore, we use the *Correctness*, *Completeness*, and *Quality* metrics during evaluation.
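A minimal sketch of these voxel-level metrics is given below, computed from the true/false positive and negative counts between the predicted and ground-truth occupancy grids; the function name is illustrative, and the sketch assumes non-empty grids (no division-by-zero guard).

```python
import numpy as np

def voxel_metrics(pred, gt):
    """Compute IoU, Correctness, Completeness and Quality for binary voxel grids."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    iou = tp / (tp + fp + fn)          # Intersection-over-Union
    correctness = tp / (tp + fp)       # how much of the prediction is correct (precision)
    completeness = tp / (tp + fn)      # how much of the ground truth is covered (recall)
    quality = tp / (tp + fp + fn)      # combined measure
    return {"IoU": iou, "Correctness": correctness,
            "Completeness": completeness, "Quality": quality}
```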

#### **3. Results**

#### *3.1. Settings*

Our experiments were executed using two computers: (1) a workstation with an *Intel i7-4790* CPU, *16 GB of RAM* and an *nVidia 1070* graphics card with *8 GB GDDR5 VRAM*, which achieved 55.76 fps; and (2) a laptop computer with an *nVidia 960M* graphics chip with *4 GB GDDR5 VRAM*, an *Intel i5-4210H* CPU and *12 GB of RAM*, which reached 11.503 fps. We consider that these machines represent the target range of end-user devices.

#### *3.2. Quantitative Results*

#### 3.2.1. Object Instance Classification Results

In order to evaluate our model in all cases, we have used the mAP metric, a widely used method for evaluating the mean average precision of the predicted bounding boxes with respect to their Intersection-over-Union (IoU), provided that the object classes match. As per the suggested *COCO* evaluation, we filter out bounding boxes with *IoU* < 0.5 in order to compare all of our trained model versions.

As Table 4 suggests, our iterative training approach, in addition to the dropout layers, was substantially better in the object classification task than the originally suggested variant, which would either plateau with too low a learning rate or get stuck in a constant loop around a local minimum when the initial learning rate was too high. Therefore, we can assert that a periodic learning rate is a useful tool for improving model generalization and training speed by adding noise during training in the form of sudden overshooting. Furthermore, we can see that the addition of depth information as input greatly increases the recall rate in both cases, while depth information alone yields a similar recall rate in both cases. This suggests that depth cameras can not only greatly benefit the object classification task when used in conjunction with visible light spectrum cameras, but can also be used as a fallback when no light source is available, albeit with lower precision.

**Table 4.** Mean precision values in respect to *IoU* > 0.5 for each of our trained models.


One of the glaring issues we noticed with the *ShapeNetCore* dataset during our experiments is that, while a total of 55 classes is specified, many of those classes have major overlap in form and function, which may dramatically affect the overall mAP value. Classes that are categorized as distinct (e.g., *pistol* and *handgun*) could still be grouped into the same class, as they share key characteristics that may not be possible to differentiate at relatively low image resolutions. Additionally, some groups of objects that are distinct in their use (e.g., *mug*, *ashcan* and *basket*) in many cases have no differentiable features and would require each individual scene to be hand-crafted by an artist in order to provide visual cues about the objects in relation to the world, which could potentially allow differentiation between very similar objects (Figure 11). However, this is currently beyond the scope of our paper.

**Figure 11.** Prediction made with our extended *YOLOv3* network. Left to right: (1) input color image with predicted object instances; (2) input depth frame; (3) ground truth mask upscaled to 320 × 240; (4) predicted mask upscaled to 320 × 240. The same object is treated as two distinct classes due to the lack of cues for the artificial neural network about what the specific object may be, since the scenes are generated randomly.

#### 3.2.2. Mask Prediction Results

As one of our main goals is to extract individual object instances from the depth map, we extended the *YOLOv3* network architecture to be able to predict object masks. In order to compare the predicted mask similarity with the ground truth we use the structural similarity index metric (SSIM) that measures perceived similarity between two images.

As we can see from Figure 12, in all cases our *YOLOv3* extension for object mask prediction is capable of extracting the mask not only from the combined RGB-D frames but also from the RGB and Depth frames alone. This shows us that both color and depth information are individually generally sufficient for this task. However, both of these sensors may fail in different environments, so combining them would most likely produce the most accurate results. Additionally, while with both methods (static and periodic) the similarity is generally more than enough to extract an accurate mask, the periodic approach provides a much lower standard deviation and hence better expected results. The higher similarity also signals a tighter mask, which may improve reconstruction quality due to the reduction in bad data. While in our tests RGB has a slight advantage over RGB-D when generating a mask, it is worth noting that Depth adds an additional dimension to the data, which makes the dataset slightly harder compared to RGB alone. This is because RGB alone can drop the randomly generated background, unlike RGB-D, which has a non-uniform background due to the addition of the ground plane. As we can see in Figure 13, our approach is applicable not only to synthetic but also to real-world data. This indicates that the network managed to generalize well and its results can be used during the reconstruction step.

#### *3.3. Reconstruction Results*

#### 3.3.1. Quantitative Results

We can observe the results achieved by our proposed method in Figure 14, where they are compared to previously achieved results in hybrid neural-network reconstruction [66]. As we can see, the mean *IoU* metric value has significantly improved for some of the models compared to the results presented in [66]; more importantly, even where the improvement was minimal or the results were slightly lower, the error spread is smaller. This indicates that the achieved results are much more stable. Additionally, we can see that our reconstruction results are comparable to those of other state-of-the-art methods like *3D-R2N2*, which reports a mean *IoU* of 0.571.

**Figure 12.** Similarity of created mask to the mask of ground truth. The hashed bar denotes the similarity of masks predicted by the *YOLOv3* network, the solid bar denotes the similarity of masks predicted for Depth, RGB-D and RGB frames by the network model proposed in this paper.

**Figure 13.** An example of real world object classification using the proposed network model: segmented and classified RGB frame (**left**), depth frame (**middle**), and predicted depth mask (**right**).

**Figure 14.** Comparison between the predicted object shape and ground truth using the IoU metric for different objects in the training set. The hashed bars denote the results achieved using the network proposed in [66]. The solid bars denote the results for the proposed network.

#### 3.3.2. Visual Inspection and Result Validation

For every object that we have trained, we collected real-world examples using an *Intel Realsense* device in order to assess how well the synthetic results transfer to real-world data. The results for the given dataset can be seen in Table 5.

**Table 5.** Visual evaluation of object reconstruction. Table presents: RGB frame, original depth frame; reconstructed cloud of voxels; triangulated and smoothed surface created using predicted voxel cloud; and a corresponding similar object in the training set.

The reconstructed object shapes are generally recognizable. However, certain object angles cause the network to fail the task; for example, one of the bowls is missing half of its voxels, while the other bowl may be considered a near-perfect reconstruction. While the ANN has managed to reconstruct the *Book* and *Knife* datasets, it has generally only managed to reconstruct their base shape primitives, which makes the objects somewhat indistinguishable to experts without any prior information about what the objects may be. While a human observer may notice the minute structural differences between the knife handle and blade in terms of their width, we still consider this a failed reconstruction. The *Can* achieved great results in terms of reconstruction, while the pillow reconstruction could be considered near perfect. The *Mug* in our training set is one of the trickiest objects, as it contains a handle that should be reconstructed with a hole, and additionally the mug cannot be fully filled in with voxels, as in our case it is empty. While in all three cases the basic shape of the cup was maintained, there are issues with two test cases: one was missing the hole for the handle, while another is substantially distorted. However, the distortions may be explained by an extremely noisy dataset. The *Chair* dataset allowed us to reconstruct the shape of the chair, although some of the details were missing. The *Laptop* and *Bottle* datasets are the hardest ones in terms of depth sensor capabilities: the depth sensor has issues retrieving depth information for IR-reflective surfaces, which causes it to fully distort the images. Such surfaces in our case are the computer screen and a plastic bottle. However, the *Laptop* reconstruction has surprisingly managed to account for this error in the depth map, albeit with some distortions.

#### 3.3.3. Reconstruction of Multiple Objects

As a proof of concept, we have performed experiments to reconstruct multiple objects in the scene (see an example of the results in Figure 15). By extracting the individual object masks and performing object reconstruction individually, we have managed to reconstruct the specific objects in the scene. However, we are unable to predict an object's relative position, rotation and scale in relation to camera space. For this reason, we had to specify the object transformations in relation to the camera and the other scene objects manually to perform the final scene render.

**Figure 15.** An example of reconstruction of multiple objects in the scene: segmented and classified RGB frame (**left**), depth frame (**middle**), and predicted depth mask (**right**).

#### **4. Discussion and Concluding Remarks**

#### *4.1. Discussion*

One of the main advantages of our suggested hybrid NN based method is that, unlike other non-hybrid approaches, it is relatively easy to include additional objects into the dataset, because the network branches can be trained separately. Unlike other approaches, we do not need to re-train the model with all the previous data, and we do not risk losing any of the existing gradients through the network being skewed towards the new data points. The modularity of the approach allows us to train the network reconstruction nodes for each object category independently. Additionally, this modularity allows the model to vary per object class, meaning we can adjust the complexity of the ANN depending on the difficulty of the object being reconstructed. Furthermore, we believe that our approach is a step forward towards generic object reconstruction, as we are capable of extracting multiple objects from the same scene thanks to the masking during the classification step, which allows only the relevant object's depth information to be sent to the reconstruction node.

While our approach is capable of extracting individual object instances and reconstructing them, additional research is required for full scene reconstruction. This feat requires finding the camera space matrices, which would pave the way for application in Extended Reality systems. One of the standing issues with our current approach to reconstruction is that our ground truths are perspective-invariant. This makes training the network slightly harder and may somewhat reduce the quality of the results, as the network tries to adjust to the observation angle, making the IoU metric values lower despite the reconstructions being visually feasible. Solving the perspective invariance may also be a partial solution to the homography [67,68] problem, as our reconstructed object would already be rotated with respect to the camera space.

Additionally, improvements to the dataset may be obtained by creating and incorporating a real-world dataset along with the synthetic data for the depth encoding step; thus, we could potentially improve the results when using real depth sensors. Additional improvements to the network architecture may also be found by changing the complexity of the model [69]; pruning dead neurons [70]; using neuro-evolutionary and neuro-genetic algorithms to find a better fitting solution [71]; enhancing the learning of artificial neural networks by using a metaheuristic control mechanism [72]; or using multiple frames from a video feed instead of the current single-frame solution, as a large number of depth frames from a single view may reveal hidden features and improve the recall rate [73]. Using multiple frames would also allow exploration of what improvements could be achieved with recurrent neural networks (RNNs), as they have been shown to be capable of predicting sequential data [74–76]. Finally, using the RGB frames combined with the depth frames for reconstruction can potentially add features that are missing from the depth due to the inherent noisiness of the sensor, thereby improving the recall rate [77,78].

Finally, we have compared the complexity of the proposed network model with the YOLOv3 network as well as with other popular network architectures. The results presented in Table 6 show that the proposed network model is only slightly more complex than YOLOv3 in terms of the number of model parameters and operations, while outperforming other network architectures in terms of operations.


**Table 6.** Comparison of neural network complexity by the number of parameters, number of operations and model size.

#### *4.2. Concluding Remarks*

Our proposed hybrid artificial neural network modifications have allowed us to improve the reconstruction results with respect to the YOLOv3 network results by 8.53%, which allows for a much more precise filling of occluded object sides and the reduction of noise during the process. Additionally, the reconstruction results are a lot more stable when compared to previous results. Furthermore, the addition of object segmentation masks and individual object instance classification is a leap forward towards general purpose scene reconstruction, as opposed to the single object reconstruction task, due to the ability to mask out overlapping object instances and use only the masked object area in the reconstruction process. While further research is needed in order to retrieve object orientation and position with respect to camera space, we believe our method allows for a much broader application in comparison to previous research, which focused on single object reconstruction.

**Author Contributions:** Conceptualization, R.M.; methodology, R.M.; software, A.K.; validation, A.K. and R.M.; formal analysis, R.M.; investigation, A.K., R.M. and R.D.; resources, A.K. and R.M.; data curation, A.K.; writing—original draft preparation, A.K. and R.M.; writing—review and editing, R.D. and E.S.L.H.; visualization, A.K. and R.M.; supervision, R.M. The individual contribution of the authors is as follows: 0.5, A.K.; 0.2, R.M.; 0.2, R.D.; 0.1, E.S.L.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Exoskeleton Hand Control by Fractional Order Models**

#### **Mircea Ivanescu 1,\*, Nirvana Popescu 2, Decebal Popescu 2, Asma Channa <sup>2</sup> and Marian Poboroniuc <sup>3</sup>**


Received: 11 September 2019; Accepted: 20 October 2019; Published: 23 October 2019

**Abstract:** This paper deals with fractional order control for the complex systems, hand exoskeleton and sensors, that monitor and control human behavior. Control laws based on physically significant variables, for fractional order models with or without delays, are proposed and discussed. Lyapunov techniques and methods that derive from the Yakubovich-Kalman-Popov lemma are used, and the frequency criteria that ensure asymptotic stability of the closed loop system are inferred. An observer control is proposed for the complex models, exoskeleton and sensors. The asymptotic stability of the system, exoskeleton hand-observer, is studied for sector control laws. Numerical simulations for an intelligent haptic robot-glove are presented. Several examples regarding these models, with or without delays, using sector control laws or an observer control, are analyzed. The experimental platform is presented.

**Keywords:** exoskeleton hand; fractional order model; control

#### **1. Introduction**

The IHRG is an intelligent haptic robotic glove system for the rehabilitation of patients diagnosed with a cerebrovascular accident. The system is made of a thin textile in order to provide a comfortable environment for grasping exercises. An exoskeleton architecture ensures the mechanical compliance of the human fingers. The driving and skin sensor system is designed to ensure a comfortable and stable grasping function. This paper analyzes the dynamics of an exoskeleton hand using fractional order operators and proposes control solutions.

The number of applications of fractional order calculus (FOC) in system modelling has increased significantly in the last few decades. Many authors have shown that non-integer order integrals and derivatives are suitable for analyzing the properties of various materials. Recent achievements in the interpretation of FOC operators have allowed FOC to be applied to processes that are better described by fractional order models (FOM) rather than integer order models (IOM). The role of these models in soft matter physics and viscoelastic behavior, in the theory of complex materials, and their ability to capture effects with non-conservative forces and power-law phenomena suggest describing the complexity of human dynamics using FOM operators [1].

The idea is supported by the evidence of *s*<sup>β</sup> dynamics in muscles and joint tissues throughout the human musculo-skeletal system [2,3]. The interaction and dependence between biological systems and associated mechanical components were analyzed in [4–6]. Fractional order models for metal-polymer composites were discussed in [7]. In [8,9], viscoelastic properties of a large variety of biological entities were studied. A class of sensors based on fractional calculus was presented in [10]. Optimal techniques using fractional calculus for sensor networks were discussed in [11]. A fractional model to capture muscular dynamics in the movement process was proposed in [12]. In [13–15], a class of neural, muscular, and vascular processes were studied to minimize sensor placement. Recently, there has been a great deal of interest in so-called rehabilitation robotics, a branch of robotics and mechatronics that addresses the study of complex robotic systems aiming to restore human functions for those who suffer major trauma as a result of strokes and cerebrovascular accidents. Robotic therapy is a promising form of neurorehabilitation that can be delivered in more intensive regimens compared to conventional therapy [16]. The complexity of these systems, associated with a specified class of sensors, is well described by fractional order differential equations [17,18]. Methods for the analysis and design of fractional order operators can be found in [19]. The control and stability of FOM were investigated by Lyapunov techniques in [20,21]. In particular, the authors of [22,23] discuss the stability properties of solutions of nonlinear Caputo fractional differential equations. The exponential stability of nonlinear FOM using the Lyapunov method was analyzed in [24,25]. Other control problems for a class of FOMs with delay were rigorously investigated in [26,27]. Reference [28] proposed an observer for a class of linear and nonlinear FOM using Lyapunov methods. To our knowledge, this is the first paper to assess FOM for systems and sensors that monitor or control human behavior. The exoskeleton architecture, which ensures the mechanical compliance of the human fingers, including the driving and sensor system, determines comfortable and stable grasping functions.

The dynamics of the whole system, the exoskeleton hand (EXHAND) and the sensors, can be accurately described by FOM operators. A class of 3D FOM bending sensors is analyzed. Control laws based on physically significant variables, for linear and delay FOM or IOM systems, are proposed and discussed. Sector control laws for linear FOM, with or without delays, are studied. Lyapunov techniques and methods that derive from the Yakubovich-Kalman-Popov Lemma are used, and the frequency criteria that ensure asymptotic stability of the physically significant variable closed loop system are inferred. An observer control is proposed for the complex models, EXHANDs and sensors. The asymptotic stability of the whole system, the observer-system, is studied for sector control laws. Frequency criteria and conditions for asymptotic stability are determined. Numerical simulations for the intelligent haptic robot-glove (IHRG) are presented. Several examples regarding the FOM or IOM systems, with or without delays, using sector control laws or an observer control, are analyzed. The IHRG experimental platform is then discussed.

The paper is structured as follows: Section 2.1 discusses FOM sensors and FOM systems implemented in EXHAND; Section 2.2 presents the control systems; Section 3.1 analyzes IHRG numerical simulations; and Section 3.2 presents the IHRG platform. Section 4 provides concluding remarks and discussions.

#### **2. Methods**

*2.1. Fractional Order Models*

Notations:

1. The fractional order integral of order β is the Riemann-Liouville fractional integral:

$$I^{\beta} f(t) = \frac{1}{\Gamma(\beta)} \int_0^t f(\theta)(t-\theta)^{\beta-1} d\theta$$

2. The Caputo derivative of order β, 0 < β < 1, is:

$$D^{\beta}f(t) = \frac{1}{\Gamma(1 - \beta)} \int_0^t \dot{f}(\theta)(t - \theta)^{-\beta}d\theta$$

where β is the fractional order exponent and Γ(β) is the gamma (Euler's) function.
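A minimal numerical sketch of the two operators above is given below, approximating the integrals with a simple rectangle rule on a uniform grid; it is meant only to illustrate the definitions, not as a production-grade fractional solver, and it assumes f accepts NumPy arrays. The function names are illustrative.

```python
import numpy as np
from math import gamma

def riemann_liouville_integral(f, t, beta, n=20_000):
    """Approximate I^beta f(t) = 1/Gamma(beta) * int_0^t f(s) (t - s)^(beta - 1) ds."""
    s = np.linspace(0.0, t, n, endpoint=False)       # avoid the singular endpoint s = t
    ds = t / n
    return np.sum(f(s) * (t - s) ** (beta - 1.0)) * ds / gamma(beta)

def caputo_derivative(f, t, beta, n=20_000, h=1e-5):
    """Approximate D^beta f(t) = 1/Gamma(1-beta) * int_0^t f'(s) (t - s)^(-beta) ds, 0 < beta < 1."""
    s = np.linspace(0.0, t, n, endpoint=False)
    ds = t / n
    df = (f(s + h) - f(s - h)) / (2.0 * h)           # central-difference derivative of f
    return np.sum(df * (t - s) ** (-beta)) * ds / gamma(1.0 - beta)

# Example: for f(t) = t and beta = 0.5, the exact values are
# I^0.5 f(1) = 1/Gamma(2.5) ~ 0.752 and D^0.5 f(1) = 1/Gamma(1.5) ~ 1.128.
print(riemann_liouville_integral(lambda x: x, 1.0, 0.5))   # roughly 0.75
print(caputo_derivative(lambda x: x, 1.0, 0.5))            # roughly 1.13
```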

(a) 3D curvature sensors described by FOM

Bending sensors represent a class of sensors with large applications in the control of complex systems. They convert changes in bend to an electrical parameter variation. Conventional bending sensors handle cases in which bending is produced in the 2D plane. The most common are the resistive sensors, described by IOM operators of order 0. For a special class of systems, such as the hyper-redundant robots [29] where bending is produced in a 3D space (Figure 1a), a special class of bending sensors defined by FOM operators (Figure 1b) is used.

**Figure 1.** (**a**) 3D hyper-redundant robotic arm. (**b**) 3D FOM curvature sensor. (**c**) Measurement technique.

The architecture of this sensor consists of a main viscoelastic component determined by a long flexible backbone wrapped in a cylindrical elastic envelope. Three antagonist cables are implemented at the periphery of the system. In static behavior, curvature κ is obtained by the differential measurement of the cable lengths, Figure 1c [30]:

$$\kappa = F(\Delta L\_1, \Delta L\_2, \Delta L\_3) \tag{1}$$

The dynamic behavior is inferred considering constant curvature along the length. Employing the same technique as that developed in [17] yields (Figure 2):

$$
\ddot{\kappa}(t) = -c\_{\nu s} b\_s D^\beta \kappa(t) - k\_s c\_s \kappa(t) + k\_M \mathcal{M}(t) \tag{2}
$$

where *c*ν*s* and *ks* are the distributed viscous and elastic coefficients, assumed uniformly distributed along the length, *bs* and *cs* are material parameters, and M is the moment that determines the bending. The transfer function is derived from Equation (2) as:

$$H\_{\rm s}(s) = \frac{\kappa(s)}{\mathcal{M}(s)} = \frac{k\_M}{s^2 + c\_{\rm vs}b\_s s^\beta + k\_s c\_s} \tag{3}$$

**Figure 2.** Technological equivalent model of the 3D curvature sensor.

This corresponds to a second-order FOM operator.
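For intuition about this operator, the frequency response of Equation (3) can be evaluated by substituting s = jω; the short sketch below uses placeholder parameter values (not the identified sensor parameters) and is meant only as an illustration:

```python
import numpy as np

def sensor_response(omega, k_M, c_vs_b_s, k_s_c_s, beta):
    """Frequency response of H_s(s) = k_M / (s^2 + c_vs*b_s*s^beta + k_s*c_s),
    Equation (3), evaluated at s = j*omega."""
    s = 1j * omega
    return k_M / (s**2 + c_vs_b_s * s**beta + k_s_c_s)

if __name__ == "__main__":
    omega = np.logspace(-1, 2, 5)          # a few test frequencies [rad/s]
    H = sensor_response(omega, k_M=1.0, c_vs_b_s=0.5, k_s_c_s=2.0, beta=0.5)
    for w, h in zip(omega, H):
        print(f"omega={w:8.3f}  |H|={abs(h):.4f}  phase={np.angle(h, deg=True):7.2f} deg")
```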

(b) FOM systems

A large class of systems that monitor or control human behavior is well described by FOM operators. Figure 3 shows the control system of an intelligent haptic robot-glove (IHRG) for the rehabilitation of patients that have a diagnosis of a cerebrovascular accident. The IHRG is a medical device that acts in parallel to a hand in order to compensate for lost functions [16]. The exoskeleton architecture, which ensures the mechanical compliance of the human fingers through the driving system, determines comfortable and stable grasping functions.

**Figure 3.** IHRG system.

The dynamics of the system (EXHAND) can be accurately described by FOM operators,

$$D^\beta z(t) = A\_0 z(t) + f(z) + b u(t), \; t \in [0, T] \tag{4}$$

where *z* = [*z*1, *z*2, ... , *zn*]<sup>T</sup> is the state vector that defines the motion parameters, β is the fractional order exponent, and *A*0, *b* are (*n* × *n*) and (*n* × 1) constant matrices, respectively. In the FOM operator of the EXHAND, the vector components are defined as

$$D^\beta z_1 = z_2, \quad D^\beta z_2 = z_3, \; \dots \tag{5}$$

The nonlinear term *f*(*z*) is determined by the gravitational components and satisfies the inequality

$$\left\| f(z) \right\| < \eta \left\| z \right\| \tag{6}$$

The output of the system is generated by the bending sensors. Since the bending of the phalange musculoskeletal system occurs in a 2D plane, an Arduino Flex resistive sensor network was used in this project. This sensor operates as a zeroth-order IOM operator,

$$y(t) = \mathbf{c}^T \mathbf{z}(t) \tag{7}$$

where *c* is a constant (*n* × 1) vector.

A new model can be inferred if the delay time constant, associated with the neuro-muscular system, the driving system and the processing time, is introduced,

$$D^{\beta}z(t) = A\_0 z(t) + A\_1 z(t - \tau) + f(z) + bu(t), \ t \in [0, T] \tag{8}$$

The initial conditions are defined by

$$z(t) = \varphi(t), \ t \in [-\tau, \ 0],$$

where the function φ is associated with the initial states.

For Equations (4)–(8), we used the control system from Figure 4, with a FOM operator for the EXHAND and an IOM operator for the sensor system.

**Figure 4.** Control system.

*2.2. Control Systems*

Mathematical Preliminaries

**Lemma** [19]**.** *For any symmetric matrix P* ∈ *R<sup>n×n</sup>, the following inequality holds:*

$$
\lambda\_{\min(P)} I^\* \le \mathbf{P} \le \lambda\_{\max(P)} I^\* \tag{9}
$$

*where* λ*min*(*P*), λ*max*(*P*) *denote the minimum and maximum eigenvalue, respectively, of matrix P and I*<sup>∗</sup> *is the unit matrix.*

**Theorem 1** ([21,22,24])**.** *The system D*<sup>β</sup>*z*(*t*) = *Az*(*t*), 0 < β < 1, *is asymptotically stable if*

$$\left| \operatorname{Arg}(\operatorname{eig}(A)) \right| > \beta \frac{\pi}{2} \tag{10}$$
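As a quick illustration of this criterion, the eigenvalue arguments can be checked numerically; the matrix and fractional order below are arbitrary examples chosen for demonstration only:

```python
import numpy as np

def is_fom_stable(A, beta):
    """Check the Theorem 1 condition |Arg(eig(A))| > beta*pi/2
    for the system D^beta z = A z, 0 < beta < 1."""
    eigenvalues = np.linalg.eigvals(A)
    return bool(np.all(np.abs(np.angle(eigenvalues)) > beta * np.pi / 2))

if __name__ == "__main__":
    beta = 0.5
    A = np.array([[0.0, 1.0],
                  [-2.0, -3.0]])   # illustrative matrix with eigenvalues -1 and -2
    print(is_fom_stable(A, beta))  # True: |Arg(-1)| = pi > beta*pi/2
```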

**Theorem 2** ([22–24])**.** *The system D*<sup>β</sup>*z*(*t*) = *f*(*z*(*t*)), *z*(*t*0) = *z*0 *is asymptotically stable if there exists a continuously differentiable function V*(*t*, *z*) *that satisfies*

$$\alpha\_1 \|\mathbf{z}\|^2 \le V(t, \, z(t)) \le \alpha\_2 \|\mathbf{z}\|^2 \tag{11}$$

$$D^{\beta}V(t,\ z(t)) \le -\alpha\_3 \|\mathbf{z}\|^2\tag{12}$$

*where* α1, α2, α3 *are positive constants and* 0 < β < 1.

#### 2.2.1. Control for the EXHAND Without Delays

Consider the system from Figure 4 defined by Equations (4)–(7) without a delay time. Assume the control law

$$u(t) = -k(y(t) - y\_{ref}(t))\tag{13}$$

where the control gain *k* verifies the condition

$$k\sigma \le 1\tag{14}$$

where σ is a positive constant (for simplicity, *y*<sup>ref</sup>(*t*) = 0).

**Remark 1.** *For the EXHAND model with the state variables defined by Equation (5) and the output vector c* = [*c*1, *c*2, *c*3, *c*4]*<sup>T</sup>*, *the control law (Equation (13)) becomes a PD*<sup>β</sup> *law*

$$u(t) = -k(c\_1 z\_1 + c\_2 D^\beta z\_1 + c\_3 z\_3 + c\_4 D^\beta z\_3),$$

*or*

$$u(t) = -k_1 z_1 - k_2 D^\beta z_1 - k_3 z_3 - k_4 D^\beta z_3 \tag{15}$$

*If c is selected as c* = [*c*1, 0, *c*3, 0]*<sup>T</sup>*, *the control becomes a PD law*

$$
u(t) = -k_1 z_1 - k_3 \dot{z}_1 \tag{16}
$$

**Control System 1.** The system (Equations (4)–(7)) with the controller defined by Equations (13) and (14) is asymptotically stable if:

The matrix *A*<sup>∗</sup> = *A* − *R* is Hurwitz stable, where *R* = *R<sup>T</sup>* > 0.

$$\left(\text{Re}\left(\mathbf{c}^T(j\omega\mathbf{I} - (\mathbf{A} - \mathbf{R}))^{-1}\mathbf{b}\right) \ge -\sqrt{\sigma}\right) \tag{17}$$

$$
\varrho \ge \rho_{PR} + 2\lambda_{\max(P)}\eta \tag{18}
$$

where ϱ = ||(*q* + *k*√σ*d*)(*q* + *k*√σ*d*)*<sup>T</sup>*||, ρ*PR* = 2λmax(*PR*), and *Q* = *qq<sup>T</sup>*; *P* are solutions of the Lyapunov equation [20,29].

**Proof.** Consider the Lyapunov function

$$V(z) = z^T \mathbf{P} z \tag{19}$$

where *P* = *P<sup>T</sup>* > 0. The first asymptotic stability condition (Equation (11)) is verified for α1 = λmin (*P*), α2 = λmax (*P*), [21] (Theorem 1), where λmin (*P*), λmax (*P*) denote the minimum and maximum eigenvalues of P. The fractional derivative of Equation (19) will be [22,24],

$$D^{\beta}V(z) \le \left(D^{\beta}z^{T}\right)\mathbf{P}z + z^{T}\mathbf{P}\left(D^{\beta}z\right) \tag{20}$$

By substituting Equation (4) into Equation (20), one derives

$$D^\beta V(z) \le z^T (\mathbf{A}^T \mathbf{P} + \mathbf{P} \mathbf{A}) z + \mathbf{2} z^T \mathbf{P} \mathbf{b} u + \mathbf{2} z^T \mathbf{P} f \tag{21}$$

Employing the condition (a) yields

$$D^{\beta}V(z) \le z^{T} \Big( (A - \mathsf{R})^{T} \mathsf{P} + \mathsf{P}(A - \mathsf{R}) \Big) z + 2z^{T} \mathsf{P}bu + 2z^{T} \mathsf{P} \mathsf{R}z + 2z^{T} \mathsf{P}f \tag{22}$$

Considering Equations (13) and (14), this inequality becomes

$$D^{\beta}V(z) \le z^T \Big( (A - R)^T P + P(A - R) \Big) z + 2z^T \Big( Pb - \frac{1}{2} c \Big) u - \sigma u^2 + 2z^T P R z + 2z^T P f \tag{23}$$

By employing the condition (Equation (17)) and the Yakubovich-Kalman-Popov (YKP) Lemma [31], it results that

$$(A - R)^T P + P(A - R) = -qq^T \tag{24}$$

$$Pb - \frac{1}{2}c = \sqrt{\sigma}q\tag{25}$$

Now, considering the control law (Equation (13)), it follows that

$$D^{\beta}V(\mathbf{z}) \le -\mathbf{z}^{T} (\mathbf{q} + k\sqrt{\sigma}\mathbf{d}) \mathbf{(q} + k\sqrt{\sigma}\mathbf{d})^{T}\mathbf{z} + \rho\_{\text{PR}}\mathbf{z}^{T}\mathbf{z} + 2\lambda\_{\text{max}(P)}\eta\tag{26}$$

or, by Equation (18),

$$D^{\beta}V(z) \le -\alpha_3 \|z\|^2 \tag{27}$$

where

$$
\alpha_3 = \varrho - \rho_{PR}. \tag{28}
$$


#### 2.2.2. Control for the EXHAND with Delay

**Control System 2.** The system described by Equation (8) with the control law defined by Equation (13) is asymptotically stable if:

1. *A*0<sup>∗</sup> = (*A*0 − *R*) is Hurwitz stable, where *R* = *R<sup>T</sup>* > 0

$$\operatorname{Re}\left(c^T \left(j\omega I - A\_0^\*\right)^{-1} b\right) \ge -\sqrt{\sigma} \tag{29}$$

$$\varrho - \left( 2\eta \lambda_{\max(P_1)} + \frac{1}{2} \lambda_{\max(D)} + 2\lambda_{\max(P_1 R)} + \lambda_{\max(P_2)} \right) > 0 \tag{30}$$

$$
\lambda\_{\min(P\_2)} - \frac{1}{2} \lambda\_{\max(D)} > 0 \tag{31}
$$

where *Q* = *qqT*, *P*<sup>1</sup> are solutions of the Lyapunov equations and

$$\varrho = \left\| \begin{pmatrix} q+k \ \sqrt{\sigma}c\_1 \end{pmatrix} \begin{pmatrix} q+k \ \sqrt{\sigma}c\_1 \end{pmatrix}^T \right\|.$$


$$D = \left(A\_1^T P\_1 + P\_1 A\_1\right)^2$$

**Proof.** Consider the following Lyapunov function:

$$V(z(t)) = I^{1-\beta}\left(z^T(t)P_1 z(t)\right) + \int_{t-\tau}^{t} z^T(\theta)P_2 z(\theta)d\theta \tag{32}$$

where *P*1, *P*2 are (*n* × *n*) positive definite and symmetric matrices, *P*1 > 0, *P*2 > 0, *P*1<sup>T</sup> = *P*1, *P*2<sup>T</sup> = *P*2. *V*(*z*) satisfies the condition (Equation (11)) of Theorem 2.

$$V(z) \ge \lambda_{\min(P_1)} \|z(t)\|^2 \tag{33}$$

$$V(\mathbf{z}) \le \lambda\_{\max(P\_1)} \left\| \mathbf{z}(t) \right\|^2 + \lambda\_{\max(P\_2)} \int\_{t-\tau}^t \left\| \mathbf{z}(\theta) \right\|^2 d\theta \le M \left\| \mathbf{z}(t) \right\|^2\tag{34}$$

The derivative *D*β*V*(*z*) is computed from:

$$D^{\beta}V(\mathbf{z}) = I^{1-\beta}\dot{V}(\mathbf{z})\tag{35}$$

where *I*<sup>1−β</sup> is the Riemann-Liouville fractional integral of order (1 − β). The derivative V̇(*z*) is evaluated from Equation (32)

$$\dot{V}(z) = D^{\beta}\left(z^T(t)P_1 z(t)\right) + \frac{d}{dt}\int_{t-\tau}^{t} z^T(\theta)P_2 z(\theta)d\theta \tag{36}$$

which leads to the inequality

$$\dot{V}(z) \le (D^\beta z^T(t))\mathbf{P}\_1 z(t) + z^T(t)\mathbf{P}\_1 \mathbf{(D}^\beta z(t)) + z^T(t)\mathbf{P}\_2 z(t) - z^T(t-\tau)\mathbf{P}\_2 z(t-\tau) \tag{37}$$

By evaluating Equation (37) along the solutions of Equation (8), it turns out that

$$\begin{split} \dot{V}(z) \le & z^T(t) \Big( \mathbf{A\_0^\* P\_1} + \mathbf{P\_1 A\_0^\*} \Big) \mathbf{z}(t) + z^T(t) \Big( \mathbf{A\_1^T P\_1} + \mathbf{P\_1 A\_1} \Big) \mathbf{z}(t - \tau) + 2 \mathbf{z}^T(t) \mathbf{P\_1} \mathbf{b} u(t) + 2 \mathbf{z}^T \mathbf{P} f \\ &+ 2 \mathbf{z}^T(t) \mathbf{P\_1} \mathbf{R} \mathbf{z}(t)) + \mathbf{z}^T(t) \mathbf{P\_2} \mathbf{z}(t) - \mathbf{z}^T(t - \tau) \mathbf{P\_2} \mathbf{z}(t - \tau) \end{split} \tag{38}$$

By applying the control law (Equation (13)), it yields

$$\begin{aligned} \dot{V}(z) \le & \; z^T(t) \Big( A_0^{*T} P_1 + P_1 A_0^* \Big) z(t) + 2z^T(t)\Big(P_1 b - \frac{c}{2}\Big) u(t) - \sigma u^2(t) + 2z^T P_1 f \\ &+ z^T(t) \Big( A_1^T P_1 + P_1 A_1 \Big) z(t-\tau) + 2z^T(t) P_1 R z(t) + z^T(t) P_2 z(t) \\ &- z^T(t-\tau) P_2 z(t-\tau) \end{aligned} \tag{39}$$

The following inequality will be used [23]

$$\left\|z^{T}(t)Dz(t-\tau)\right\| \le \left\|z(t)\right\| \left\|D\right\| \left\|z(t-\tau)\right\| \le \lambda_{\max(D)}\left(\frac{\left\|z(t)\right\|^2}{2} + \frac{\left\|z(t-\tau)\right\|^2}{2}\right) \tag{40}$$

Additionally, considering the YKP Lemma as in the previous Control System, yields

$$A_0^{*T} P_1 + P_1 A_0^* = -qq^T \tag{41}$$


$$\mathbf{P\_1b} - \frac{1}{2}\mathbf{c} = \sqrt{\sigma}q\tag{42}$$

Substituting this result into Equation (39), considering the inequalities of Equations (6) and (40), one derives that

$$\begin{array}{lcl}\dot{V}(\mathbf{z}) \le & -\mathbf{z}^T(t) \Big(\mathbf{q} + k\sqrt{\sigma}d\Big) \Big(\mathbf{q} + k\sqrt{\sigma}d\Big)^T \mathbf{z}(t) \\ & + \Big(2\eta\lambda\_{\max(P\_1)} + \frac{1}{2}\lambda\_{\max(D)} + 2\lambda\_{\max(P\_1R)} + \lambda\_{\max(P\_2)}\Big) \Big|\mathbf{z}(t)\Big|^2 \\ & - \Big(\lambda\_{\min(P\_2)} - \frac{1}{2}\lambda\_{\max(D)}\Big) \Big|\mathbf{z}(t-\tau)\Big|^2 \end{array} \tag{43}$$

Employing Equations (30) and (31), yields

$$\dot{V}(\mathbf{z}) \le -\left(\varrho - \left(2\eta\lambda\_{\max(P\_1)} + \frac{1}{2}\lambda\_{\max(D)} + 2\lambda\_{\max(P\_1R)} + \lambda\_{\max(P\_2)}\right)\right) \left\|\mathbf{z}(t)\right\|^2\tag{44}$$

Denoting

$$\alpha_3 = \varrho - \left(2\eta\lambda_{\max(P_1)} + \frac{1}{2}\lambda_{\max(D)} + 2\lambda_{\max(P_1R)} + \lambda_{\max(P_2)}\right)$$

and using Equation (35), it results that

$$D^{\beta}V(z) \le -\alpha_3 \|z(t)\|^2. \tag{45}$$


#### 2.2.3. Control System with Observer for the EXHAND with Delay

Consider the linearized model of Equation (8) rewritten as

$$D^{\beta}z(t) = A\_L z(t) + A\_1 z(t - \tau) + b u(t), \ t \in [0, T] \tag{46}$$

$$y(t) = \mathbf{c}^T \mathbf{z}(t) \tag{47}$$

where the nonlinear term was approximated by

$$\mathbf{f}(\mathbf{z}) \cong \left. \frac{\partial \mathbf{f}(\mathbf{z})}{\partial \mathbf{z}} \right|\_{\mathbf{z}\_0} \Delta \mathbf{z} = \left. \frac{\partial \mathbf{f}(\mathbf{z})}{\partial \mathbf{z}} \right|\_{\mathbf{z}\_0} (\mathbf{z}\_1 - \mathbf{z}\_0) = \left. \frac{\partial \mathbf{f}(\mathbf{z})}{\partial \mathbf{z}} \right|\_{\mathbf{z}\_0 = \mathbf{0}} \mathbf{z}\_1 \tag{48}$$

**Remark 2.** *For the EXHAND model, the pair (AL*, *b*) *is controllable and the pair* (*C*, *AL*) *is observable.*

Consider the system defined by Equation (46). The following observer is proposed:

$$D^{\beta}\hat{z}(t) = A_L\hat{z}(t) + A_1\hat{z}(t-\tau) + bu(t) + L_1\left(y_1(t) - \hat{y}_1(t)\right) + L_2\left(y_2(t-\tau) - \hat{y}_2(t-\tau)\right) \tag{49}$$

$$
\hat{z}(t) = \hat{\varphi}(t), \ t \in [-\tau, 0] \tag{50}
$$

$$\boldsymbol{\hat{y}}(t) = \begin{bmatrix} \boldsymbol{\hat{y}}\_1(t) \ \boldsymbol{\hat{y}}\_2(t) \end{bmatrix}^T, \quad \boldsymbol{\hat{y}}\_1(t) = \mathbf{c}\_1^T \boldsymbol{\hat{z}}(t), \ \boldsymbol{\hat{y}}\_2(t) = \mathbf{c}\_2^T \boldsymbol{\hat{z}}(t-\tau) \tag{51}$$

where *z*ˆ ∈ *R<sup>n</sup>* is the observer state, *y*ˆ ∈ *R*<sup>2</sup> is the estimated output, and *L*1, *L*2 are (*n* × 1) observability vectors. The observer error is

$$
\Delta z(t) = z(t) - \hat{z}(t) \tag{52}
$$

defined by the following equation:

$$D^{\beta}(\Delta z(t)) = \left(A_L - L_1 c_1^T\right) \Delta z(t) + \left(A_1 - L_2 c_2^T\right) \Delta z(t-\tau) \tag{53}$$

$$
\Delta z(t) = \Delta\varphi(t), \quad t \in [-\tau, 0] \tag{54}
$$

Consider the control law

$$u(t) = u\_1(t) + u\_2(t) = -k\_1 \mathbf{c\_1^T} \hat{\mathbf{z}}(t) - k\_2 \mathbf{c\_2^T} \hat{\mathbf{z}}(t - \tau) \tag{55}$$

The global state (*z*ˆ, *z* − *z*ˆ) = (*z*ˆ, Δ*z*) is considered for the system "EXHAND-observer".

**Control System 3.** The whole system, "EXHAND-observer", Equations (46), (47), and (49)–(51) (Figure 5) with the control law (55), is asymptotically stable if

**Figure 5.** Control system with observer.

*A*L<sup>∗</sup> = (*A*L − *R*) is Hurwitz stable, where *R* = *R<sup>T</sup>* > 0.

$$k\_1 \sigma \le 1, \quad \sigma > 0 \tag{56}$$

$$\operatorname{Re}\left(c_1^T \left(j\omega I - A_L^*\right)^{-1} b\right) \ge -\sqrt{\sigma} \tag{57}$$

$$
-\varrho + \lambda_{\max(D_1)} + \lambda_{\max(D_2)} + 2\lambda_{\max(P_1 R)} + \lambda_{\max(P_3)} < 0 \tag{58}
$$

$$
\lambda\_{\max(E\_1)} - \frac{1}{2}\lambda\_{\max(D\_1)} - \lambda\_{\max(P\_2)} > 0 \tag{59}
$$

$$
\lambda\_{\text{max}(P\_2)} - \frac{1}{2} \lambda\_{\text{max}(E\_2)} > 0 \tag{60}
$$

$$
\lambda_{\max(P_3)} - \lambda_{\max(D_2)} > 0 \tag{61}
$$

where ϱ = ||(*q* + *k*1√σ*c*1)(*q* + *k*1√σ*c*1)*<sup>T</sup>*|| and

$$D\_2 = \left(Lc\_1^T\right)^T P\_1 + P\_1 Lc\_1^T; \quad D\_1 = \left(A\_1^T P\_1 + P\_1 A\_1\right) \tag{62}$$

$$E_1 = A_L^* - L_1 c_1^T; \quad E_2 = A_1 - L_2 c_2^T \tag{63}$$

**Proof.** Consider the Lyapunov function

$$V(\mathbf{\hat{z}}, \Delta \mathbf{z}) = I^{1-\beta} \Big( \mathbf{\hat{z}}^T(\mathbf{t}) \mathbf{P\_1} \mathbf{\hat{z}}(\mathbf{t}) + \frac{1}{2} \Delta \mathbf{z}^T(\mathbf{t}) \Delta \mathbf{z}(\mathbf{t}) \Big) + \int\_{t-\tau}^{t} \Big( \Delta \mathbf{z}^T(\theta) \mathbf{P\_2} \Delta \mathbf{z}(\theta) + \mathbf{z}^T(\theta) \mathbf{P\_3} \mathbf{z}(\theta) \Big) d\theta \tag{64}$$

where *P*1, *P*2, *P*3 are (*n* × *n*) positive definite and symmetric matrices. *V*(*z*ˆ, Δ*z*) satisfies the first condition (Equation (11)) of Theorem 2.

Applying the same procedures as in the previous control system, yields

$$\begin{aligned} \dot{V}(\hat{z},\Delta z) \le & -\hat{z}^{T}(t)\Big(q+k_{1}\sqrt{\sigma}c_{1}\Big)\Big(q+k_{1}\sqrt{\sigma}c_{1}\Big)^{T}\hat{z}(t) \\ & + \Big(\lambda_{\max(D_{1})}+\lambda_{\max(D_{2})}+2\lambda_{\max(P_{1}R)}+\lambda_{\max(P_{3})}\Big)\big\|\hat{z}(t)\big\|^{2} \\ & -\Big(\lambda_{\max(E_{1})}-\frac{1}{2}\lambda_{\max(D_{1})}-\lambda_{\max(P_{2})}\Big)\big\|\Delta z(t)\big\|^{2} \\ & -\Big(\lambda_{\max(P_{2})}-\frac{1}{2}\lambda_{\max(E_{2})}\Big)\big\|\Delta z(t-\tau)\big\|^{2} \\ & -\Big(\lambda_{\max(P_{3})}-\lambda_{\max(D_{2})}\Big)\big\|\hat{z}(t-\tau)\big\|^{2} \end{aligned} \tag{65}$$

By employing conditions (58)–(61) this inequality becomes

$$\begin{aligned} \dot{V}(\hat{z}, \Delta z) \le & -\left(\varrho - \lambda_{\max(D_1)} - \lambda_{\max(D_2)} - 2\lambda_{\max(P_1 R)} - \lambda_{\max(P_3)}\right) \big\|\hat{z}(t)\big\|^2 \\ & - \left(\lambda_{\max(E_1)} - \frac{1}{2}\lambda_{\max(D_1)} - \lambda_{\max(P_2)}\right) \big\|\Delta z(t)\big\|^2 \end{aligned} \tag{66}$$

and, using Equation (35), yields

$$D^{\beta}V(\mathbf{\hat{z}},\,\Delta\mathbf{z}) \le -\alpha\_3 \left\| \begin{array}{c} \mathbf{\hat{z}}(t) \\ \Delta\mathbf{z}(t) \end{array} \right\|^2. \tag{67}$$


**Remark 3.** *The asymptotic stability conditions of Control System 2 and Control System 3 are independent of the time delay* τ.

#### **3. Results**

#### *3.1. IHRG Control—Numerical Simulations*

#### 3.1.1. EXHAND with Sensors Without Delays

Consider the IHRG system of Figure 3. The exoskeleton drive system is decoupled for each finger. The following parameters of the hand and exoskeleton mechanical architecture [16] will be used: the equivalent moment of inertia is *J* = 0.005 kg·m<sup>2</sup>, the equivalent mass is *m* = 0.015 kg, the viscous and elastic coefficients of the equivalent Kelvin-Voigt model of the joint tissues throughout the phalange musculoskeletal system and exoskeleton are [6,7] *c*ν = 0.22 Nm·s·rad<sup>−1</sup> and *c*e = 2.8 Nm·rad<sup>−1</sup>, respectively, and the damping coefficient is *c*d = 7.8 Nm·s·rad<sup>−1</sup>.

$$J\ddot{\theta}(t) = -c_\nu D^\beta \theta(t) - c_e \theta(t) - c_d \dot{\theta}(t) + mg\sin\theta + bu(t) \tag{68}$$

$$
\theta(0) = \left[\frac{\pi}{3}, 0\right] \tag{69}
$$

where the nonlinear component verifies the inequality (Equation (6)) for η = 0.2. The sensor is considered as an IOM operator and the output is defined as

$$y(t) = \mathbf{c}^T \theta(t) \tag{70}$$

The fractional order exponent is β = 1/2. The FOM model (Equations (4) and (5)) is defined as

$$\theta_1 = \theta; \quad D^{\frac{1}{2}}\theta_1 = \theta_2; \quad D^{\frac{1}{2}}\theta_2 = \theta_3 = \dot{\theta}; \quad D^{\frac{1}{2}}\theta_3 = \theta_4; \quad A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ -0.6 & -0.1 & -7.8 & 0 \end{bmatrix}; \quad b = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 4.5 \end{bmatrix}; \quad c = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}.$$

The pairs (*A*, *b*) and (*A*, *c*) are controllable and observable, respectively. The IOM sensor output without delays is given by Equation (7). A control law (Equation (13)) (for *y*ref(*t*) = 0) with *k* = 200 is applied. This control verifies the sector constraint (Equation (14)) with σ = 5 × 10<sup>−3</sup>. The matrix *R* was chosen as *R* = diag(3, 3, 3, 3), for which *A*1 = *A* − *R* is Hurwitz stable. The vector *q* = 0.05 × [1 1 1 1]<sup>T</sup> and a matrix *P* were inferred with λmax(*P*) = 0.725. The polar plot of *c<sup>T</sup>*(*j*ω*I* − *A*1)<sup>−1</sup>*b* is shown in Figure 6. The closed-loop system satisfies the frequency criterion (Equation (17)), and condition (18) is verified for *Q* = 14.5, ρ = 2.17. MATLAB/SIMULINK and techniques based on the Mittag-Leffler functions are used for the simulation [1,2]. Figure 7 shows the trajectories of the fractional order variables.

**Figure 7.** Fractional-order variable trajectories for FOM system with IOM sensor.
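A qualitative reproduction of this closed-loop simulation can be obtained, for example, by discretizing the commensurate-order model with an implicit Grünwald-Letnikov scheme. The sketch below uses the matrices and gain quoted above, but it is only a rough numerical illustration and not the MATLAB/SIMULINK Mittag-Leffler implementation used for Figure 7 (the nonlinear gravity term is omitted and the initial-condition handling is simplified):

```python
import numpy as np

# Commensurate-order model D^0.5 z = A z + b u, y = c^T z (Section 3.1.1),
# with the matrices, output vector, and gain quoted in the text.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [-0.6, -0.1, -7.8, 0.0]])
b = np.array([0.0, 0.0, 0.0, 4.5])
c = np.array([1.0, 0.0, 1.0, 0.0])
k, beta, h, T = 200.0, 0.5, 1e-3, 2.0

A_cl = A - k * np.outer(b, c)            # closed loop with u = -k * c^T z
n = int(T / h)
w = np.ones(n + 1)                       # Grunwald-Letnikov binomial weights
for j in range(1, n + 1):
    w[j] = w[j - 1] * (j - 1 - beta) / j

z = np.zeros((n + 1, 4))
z[0] = [np.pi / 3, 0.0, 0.0, 0.0]        # theta(0) = pi/3
M = np.linalg.inv(np.eye(4) - h**beta * A_cl)
for j in range(1, n + 1):
    memory = -(w[1:j + 1] @ z[j - 1::-1])   # fractional memory term
    z[j] = M @ memory                       # implicit GL update

print("theta(T) =", z[-1, 0])               # angular position at the final time
```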

The steady-state behavior can be analyzed by using the transfer function of the linearized model (Equation (68)):

$$H_{EX}(s) = \frac{\theta(s)}{u(s)} = \frac{900}{s^2 + 44s^{\frac{1}{2}} + 1560s + 530} \tag{71}$$

The control law (Equation (13)) can be rewritten as a PD control

$$
u_{PD} = -k_1 z_1 - k_3 D^{\frac{1}{2}}\left(D^{\frac{1}{2}} z_1\right) = -k_1 z_1 - k_3 \dot{z}_1 \tag{72}
$$

and the controller transfer function will be

$$H\_{C(PD)}(\mathbf{s}) = k\_1 + k\_3 \,\mathbf{s} = \,\, 200(1+\mathbf{s})\tag{73}$$

For the PD0.5 control law, we have

$$H\_{\rm C(PD^{0.5})} = k\_1 + k\_2 \text{s}^{0.5} = 200 \left( 1 + \text{s}^{0.5} \right) \tag{74}$$

The steady-state error can be inferred from Equations (71)–(74) as [21,32,33]

$$e\_{s(PD)} = e\_{s(PD^{0.5})} = 0.0042$$

The behavior of the linearized model (Equation (68)) for both control laws (Equations (73) and (74)) is studied. The trajectories of the angular position θ for the target signal θtarg = π/6 are shown in Figure 8.

**Figure 8.** Trajectory θ(*t*) for the PD<sup>0.5</sup> and PD control laws.

#### 3.1.2. EXHAND with Delay

The sensor dynamics are

$$K\_{s1}z\_1(t) = y\_1(t) \tag{75}$$

$$K_{s2}\, z_2(t - \tau) = y_2(t) \tag{76}$$

where τ ∈ [−0.1; 0]. Substituting Equations (75) and (76) into Equation (68) and using the control law (Equation (55)) yields

$$J\ddot{\theta}(t) = -c_\nu D^\beta \theta(t) - \left(c_e + K_{s1}k_1\right)\theta(t) - c_d\dot{\theta}(t) - K_{s2}k_2\dot{\theta}(t-\tau) + k_2 b c_2^T D^\beta \theta(t-\tau) + mg\sin\theta \tag{77}$$

with initial conditions

$$
\theta(t) = \frac{\pi}{3}, \text{ } \dot{\theta}(t) = -1, \ t \in [-0.1; 0] \tag{78}
$$

The delay component of the dynamic model is defined by τ = 0.1 *s*. The controller parameters are selected as *k*<sup>1</sup> = 200, *k*<sup>2</sup> = 15 that satisfy Equations (29)–(31) by employing the same parameters for *q*, *R* as in the previous example. The evolution of the fractional order variables is shown in Figure 9.

**Figure 9.** Fractional-order variable trajectories for FOM EXHAND with delay.

#### 3.1.3. EXHAND with Delay and Observer

An observer (Equations (49)–(51)) with *L*1 = *L*2 = [1.5 1.5 1.5 0]<sup>T</sup> is associated with the linearized dynamic model. The matrix *R* = diag(1, 1, 1, 1) verifies the condition that (*A*L − *R*) is a Hurwitz matrix. For *q* = [0.5 0.5 0.5 0.5]<sup>T</sup>, a solution *P*1 is obtained with λmax(*P*1) = 0.0085. A control law (Equation (55)) with *k*1 = 20, *k*2 = 8.5, and σ = 0.05 was selected. Equations (57)–(61) are easily verified. Figure 10 shows the trajectories of the physical significance variables, position and velocity, for the system and the observer.

**Figure 10.** Trajectories of position and velocity for the system and observer.

#### *3.2. IHRG Experimental Platform*

The IHRG is an exoskeleton that supports the human hand and hand activities by using a control architecture for dexterous grasping and manipulation. IHRG is a medical device that acts in parallel to a hand in order to compensate for lost function. It is easy to use and can be a helpful tool in the home [16,34].

The mechanical architecture consists of articulated serial elements whose design covers the functional and anatomical finger phalanges. The glove is created from a thin textile that provides an infrastructure suitable for actuation wires and sensors. A distributed actuation system is used for implementing the operations of the hand. An Arduino Flex Sensor network (with zeroth-order sensors) is used to control the motions. An Arduino Mega 2560 hardware platform determines the movement of the glove's actuators for exercises such as opening or closing of the fingers (Figures 11 and 12).

**Figure 11.** IHRG-human hand exoskeleton.

**Figure 12.** IHRG general architecture.

All the movements of the hand are controlled by the software of the hybrid IHRG system, which was developed in MATLAB and Simulink. The performance of each patient following the exercise program can be recorded by the same software. The control scheme of Control System 1 is implemented. Figure 13 shows the sensor signals during an open-close-open hand exercise.

**Figure 13.** (**a**) Bending sensor characteristics; (**b**) Output values of the bending sensors.

#### **4. Discussion**

I. We designed, built, and tested an intelligent haptic robotic glove for the rehabilitation of patients diagnosed with a cerebrovascular accident. The glove is created from a thin textile in order to provide a comfortable environment for the grasping exercises. This thin textile creates an infrastructure suitable for wire actuation and sensors. This exoskeleton architecture ensures the mechanical compliance of the human fingers. The driving and skin sensor system is designed to determine comfortable and stable grasping functions. The dynamics of the exoskeleton hand are modeled by fractional order operators. To our knowledge, this paper is the first in which the interaction between biological systems (the human hand) and the associated mechanical components (the exoskeleton) is analyzed by fractional order models. These new models are used to develop a class of algorithms for the control of the stable grasping function. The control systems are based on the physical significance variable control that is generated by the sensor classes implemented in the system. These sensors are also modeled as operators with delays. The paper proposes control solutions and determines the criteria for controller parameter tuning for several classes of models. The observer techniques are also discussed and implemented. The quality and the stability of motion are analyzed by Lyapunov methods and techniques that derive from the Yakubovich-Kalman-Popov Lemma.

Despite the model complexity, the control systems are very simple, and the controllers are easily implemented on an Arduino Mega 2560 hardware platform. This platform offers many advantages, since the board provides PWM ports for the signals sent to the actuators and ports for reading the signals coming from the bending sensors.

In order to help patients follow an after-stroke recovery program, the system uses a set of predefined rehabilitation exercises, such as opening the hand, closing it, trying to grab an object, or a simple wave. The system is very easy to use at home, with minimal training. The predefined set of rehabilitation exercises was created for this purpose.

II. The control systems discussed in the previous sections focus on the control problems of the IHRG system, where the EXHAND model is described by FOM operators and the sensor system is based on zeroth-order sensors. These control solutions can also be used for a larger class of complex systems, such as hyper-redundant systems, that use complex FOM sensors (Figure 14).

**Figure 14.** Control system with FOM sensors.

III. In addition, we consider that the control systems discussed in the previous sections can be applied to a class of control problems associated with persons with disabilities. Figure 15 presents a wheelchair control system for this class of persons. In this case, the human operator is represented by persons with hemiparesis/hemiplegia, with motor restriction (arm- or leg-emphasized hemiparesis) and with serious brain damage [35].

**Figure 15.** Control with disability human operator.

$$H_h(s) = k_h \frac{e^{-\tau s}}{s^{\beta}}$$

The transfer function of this human operator has a model that corresponds to a time delay fractional order model with time constant τ and fractional order β. These parameters are determined by the characteristics of the damaged brain, viscoelastic properties of the atrophied muscles, and propagation time along the nervous terminals.

We consider that these models can be studied by using the techniques developed in this paper.

**Author Contributions:** Conceptualization, M.I. and N.P.; methodology, N.P. and A.C.; software, D.P.; validation, M.I. and A.C.; formal analysis, M.P.; investigation, M.I.; resources, N.P.; data curation, D.P.; writing—A.C.; writing—review and editing, D.P.; visualization, M.P.; supervision, M.I.; project administration, N.P.; funding acquisition, N.P.

**Funding:** The authors gratefully acknowledge funding from the European Union's Horizon 2020 Research and Innovation program under the Marie Sklodowska Curie grant agreement No. 813278 (A-WEAR: A network for dynamic wearable applications with privacy constraints, http://www.a-wear.eu/). This work does not represent the opinion of the European Union, and the European Union is not responsible for any use that might be made of its content.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Detection of Anomalous Behavior in Modern Smartphones Using Software Sensor-Based Data**

#### **Victor Vlădăreanu 1, Valentin-Gabriel Voiculescu 2, Vlad-Alexandru Grosu 2, Luige Vlădăreanu 1,\*, Ana-Maria Travediu 1, Hao Yan 3, Hongbo Wang <sup>3</sup> and Laura Ruse <sup>4</sup>**


Received: 31 March 2020; Accepted: 9 May 2020; Published: 13 May 2020

**Abstract:** This paper describes the steps involved in obtaining a set of relevant data sources and the accompanying method using software-based sensors to detect anomalous behavior in modern smartphones based on machine-learning classifiers. Three classes of models are investigated for classification: logistic regressions, shallow neural nets, and support vector machines. The paper details the design, implementation, and comparative evaluation of all three classes. If necessary, the approach could be extended to other computing devices, if appropriate changes were made to the software infrastructure, based upon mandatory capabilities of the underlying hardware.

**Keywords:** software sensor data; machine-learning classifier; smartphone security

#### **1. Introduction**

Among our gadgets, smartphones are our closest companions. They provide the primary access to the Internet and modern amenities, they hold our private data, and they are becoming one of the primary means of attack against the user, be it through power viruses (or other means of consuming resources) or more ordinary malware menaces (calling or texting tolled numbers, installing unwanted software, sending the attacker private information about the device or its owner, spying on the owner using the camera or microphone, etc.).

For our research we picked an Android smartphone over an iPhone, primarily due to the open access to its software stack. This open source stack, provided by Google for their devices via the Android Open Source Project (AOSP), includes a high-level Android framework and an open source kernel. This level of openness into the source code of the stack allows access to both public and non-public API information, giving our research access to the smartphone sensors. It also forms the basis of the data collection infrastructure (used to obtain the raw data behind the initial dataset for this paper) described in [1], with an application for malware detection [2], using measurable events collected for a set of Android applications, including both samples obtained from scientists in the field (the Malgenome application set described by the authors of [3]) and more recent Internet sources (including [4,5] and Google Play for benign Android samples).

The purpose of this study was to assess whether anomalous behavior could be detected through machine-learning classifiers based on input data sources from a variety of sensors within the device. We took into consideration that the smartphone itself can provide a large pool of data sources about runtime behavior. Some of this data is accessible through public APIs, and an additional set can become accessible by making suitable changes to the smartphone software stack. While typical developers do not have access to non-public APIs, such an approach could provide benefits for telecom or smartphone manufacturers which have access to the AOSP stack or its equivalent from the smartphone system-on-chip (SoC) manufacturers or similar providers. By adding such an application within their phone, smartphone manufacturers could increase the intrinsic value of their product.

Many recent papers try to tackle the problem of detecting anomalous behavior in modern smartphones using software sensors (measurable events). For modern smartphones, approaches vary but usually involve a combination of static and dynamic behavioral extraction of interesting data pertaining to sensors of the target smartphone application.

The measurable sources of data used in static analysis fall into four categories according to a recent study of over 80 frameworks by Bakour et al. [6]: manifest-based, code-based, semantic-based, or application metadata-based. According to the same authors, using this data requires offline processing and is prone to weaknesses when obfuscation techniques are employed by the developers of malicious applications.

Recent frameworks use a combination of static [7–9] and dynamic [8,9] extraction of data from sensors before being processed through machine-learning techniques. SandDroid is a recent sandbox analysis project classifying Android applications on a set of features using both static and dynamic analysis [8].

Current anomaly-detection applications span a wide array of technical solutions, from traditional machine learning and statistics, to deep learning models for computer vision related tasks, with applications in virtually all fields of technology [10–12].

Gaussian Mixture Models (GMMs) are one of the traditional approaches to anomaly detection. Zong et al. [13] use a deep auto-encoder to generate a low-dimensional representation, which is further fed into a GMM for unsupervised anomaly detection in medical applications. Aljawarneh et al. [14] use a Gaussian dissimilarity measure for anomaly detection in Internet of Things (IoT) systems, while Li et al. [15] use GMM-based clustering for an application of early warning in civil aviation.

In viewing anomaly detection more as a classification problem, there are many options for the choice of classifier. Erfani et al. [16] use Support Vector Machines and Deep Belief Networks to mitigate the impact of high dimensionality in large-scale anomaly detection. Li et al. [17] use a modified Support Vector Machine (SVM) as an alternative to computer and information system security. The excellent surveys by Chalapathy [10] and Ye [11] give a detailed outlook of possible applications using deep learning and data mining solutions.

The literature involving detection of anomalies in modern smartphones usually focuses on centralized large-scale frameworks or collaborative efforts. Centralized frameworks are used for either static code analysis [7,8] or dynamic analysis using virtual machines on host systems [8], which are highly efficient in running Android apps in emulation. Collaborative efforts, such as Contagio [18], rely on a mix of online tools and offline community support. We considered the approach of running applications within an actual smartphone, as there are specific limitations (e.g., thermal throttling) or information (e.g., device identifiers) that do not appear in virtual machine simulation but have an impact on application behavior on real hardware (through metrics such as CPU load, network data frequency, and throughput, which we planned to measure through our experiments). We used as starting points data acquired on actual smartphones, not via simulated/virtualized frameworks. In this way, our environment also took into consideration the influence of additional constraint factors within the smartphone (such as thermal throttling, dynamic frequency changes, and other types of computation throttling), which can modify an application's resource requirements and through this influence the collected raw data used as input for the machine-learning algorithms. Virtualized infrastructures do not simulate specific smartphone behavior such as thermal throttling, dynamic processor frequency changes, etc.

Furthermore, the use of such methods, as described in the paper, is relatively limited in the context of smartphone technology. Most smartphone applications tend to veer towards classical anomaly detection and statistics, such as GMMs [13–15] or Hidden Markov Models (HMMs) [19], data mining and deterministic decision systems [20,21], or hybrid algorithms that incorporate learning classifiers (usually a version of SVMs) and belief networks [16,17,22,23]. An extensive literature review has found no comparative studies using the proposed methods on a challenging dataset, which is itself unique, as obtained through the methods described above, and very few instances of any one of the proposed methods being employed in anomaly detection on smartphones, despite their otherwise ubiquitous presence in machine-learning and anomaly-detection applications.

While all detections are aimed at malware applications, the detection itself is done through looking for patterns of anomalous behavior, as captured by the measurable sources of information (i.e., software sensors). The use of the proposed class of methods has to do with the way in which such potentially anomalous behavior could be learned, therefore it is intentionally modelled as a classification problem, rather than a standard anomaly-detection problem (for example through GMMs), due to the format of the data and the underlying assumptions of the learning model.

The choice of algorithm has very much to do with the available dataset. For the purposes of this paper, deep learning and some data mining solutions are not well suited to a small- to medium-sized dataset. In addition, GMMs are most frequently used when there is a large non-anomalous subset to train on, or when the dataset itself is imbalanced, and when it is thought that a classifier is unlikely to learn a coherent model for the positive (anomalous) samples, which is not the case here. As such, the paper investigates three types of classifiers for the machine-learning application: Logistic Regression, a shallow neural network for pattern recognition, and SVMs. The three are evaluated on several metrics, the most important of which is the F1 score on the test set. The full details of the design, implementation, and evaluation of the learning algorithms are presented in their respective sections. Finally, the results show that all three investigated algorithms perform reasonably well, with SVMs having a slight edge.
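For readers who wish to prototype a comparable pipeline, the three model families can be trained and compared, for instance, with scikit-learn; the snippet below runs on synthetic stand-in data (the dataset of this study is not reproduced here) and is not the MATLAB implementation evaluated later in the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the 361-sample, 19-feature dataset described in the text.
X, y = make_classification(n_samples=361, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "shallow neural net": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: F1 = {f1_score(y_test, model.predict(X_test)):.3f}")
```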

The remainder of the paper is organized as follows: Section 2 outlines the acquired data and the steps taken to curate the dataset, as well as giving a brief theoretical overview of the design of the classifier learning methods involved. Section 3 discusses the particulars of implementation and the obtained results for each model. Section 4 discusses the highlighted results and attempts to draw a conclusion based on the work reported. Opportunities and directions for future research are also discussed.

#### **2. Materials and Methods**

The general architecture, described in [1] includes measurable events as sources of data, extracted with a set of data collectors [1] into a server for further processing, as shown in Figure 1. The sources of data are made available through the Android framework or Linux user-space (via procfs virtual file system entries) and are provided originally by hardware sensors or software hooks added into the framework.

Synthesizing the information available in [1], Table 1 below shows the measurable events that were collected on a Nexus 4 smartphone running Android 4.2.2 JellyBean.

As described in [1], for the "SMS" event there are hooks for the destination number and message content. For the "Call" event there is a hook for the destination number. For "WiFi", "Camera", "NFC", as well as "Bluetooth", we get information about sensor activity (turning on/off). "Sensors" offers information about which peripherals are registered as sensors in the Android framework terminology, as well as real-time data from them. "Camera" indicates whether the application wishes to access the camera peripheral or photo content within the phone.

**Figure 1.** General data collection architecture.



Throughout the data acquisition procedure, the smartphone remained mostly stationary, in an artificially lit room with weak GPS reception. As such, the device was not able to offer pertinent GPS or other motion (accelerometer, gyroscope, magnetometer) or lighting (ambient light) information for our experiments. We chose not to include in our smartphone a functionality (normally missing from commercial devices) that artificially injects GPS data, considering that any additional GPS data injection component would in itself represent an example of an atypical application (hard to install by regular users) and behavior (and would also be potentially hazardous if it required administrative privileges on the phone in order to operate). For our research we focused on the scenario in which the abnormal behavior of malware applications manifests primarily through disguising themselves as top (or banking) applications (for the purpose of extracting confidential login data) or by sending phone information to toll numbers or various URLs.

Another reason not to include GPS information was that, for the majority of applications in our pool, this information was not required, the main part of the experiments not including applications focused on user movement information (whether from GPS or other types of movement-based sensors).

"App Install" detects the installation of a new application within the smartphone and it offers the app name for it. "Activity" intercepts information regarding the state of the Android application: "Create", "Start", "Stop", "Pause", "Resume" and "Destroy". The "Runtime Crash" detects the similarly named event at runtime. The "ANR" intercepts "Application Not Responding" type events generated by the Android framework following an application entering a freezing/non-functioning

state. These hooks were implemented in Java and they offer the ability to acquire data for the data collection server upon request, for the purpose of detecting anomalous behavior.

Due to the user-experience-oriented programming in modern smartphone apps, each feature is accessible within a short number of Graphical User Interface (GUI) interactions (for example, in email apps you have the login, message lists, folders/categories, individual email windows, and a limited set of clicking and scrolling to do in each such window), in our case exposing behavior coded within Android applications to our software sensors. Empirically, as a trade-off between the time spent collecting data for each target application and the implication on the overall collection time for the entire application pool we studied, we found that approximately 5 min was enough to stop acquiring new types of events, and thus new types of behavior, from the application.

The final application set comprises 361 Android applications; however, not all of them were installed at once, but consecutively during the experiments. For each sample application, the dataset acquisition procedure involved side-loading (installing Android applications in apk format via the adb interface), starting the sample application and monitoring its behavior while interacting with it manually (in foreground/background) for a short period of time (approximately 5 min), and then uninstalling it.

The initial data was collected through the framework described in [1,2], which relies on a data collection framework within the smartphone and the installation of an additional application that performs the real-time monitoring while running in the background (called AMDS in [2]). The application pool was built by using both reputable applications (the ones that the phone came with, top Google Play applications, or similar stores for other markets such as SlideMe [4]) and known malware applications. The data acquisition process involved installing new applications one by one, manually interacting with each in an application-specific manner (e.g., clicking buttons, swiping) while also toggling between foreground/background for the duration of the experiment (around 5 min for each app). After this step, the target application was uninstalled, both to prevent running out of storage space during the experiment and to prevent some applications from influencing one another, and the next target application from the list was installed and experimented with. During the training phase, the monitoring data for each target application was stored locally in the phone. During the testing phase, a different pool of applications was used: each application was installed and monitored, the locally stored data was updated, and the real-time monitoring application described at large in [2] was running in the background. If abnormal behavior was detected, the application would be prevented from running and flagged to the user, who would make the final decision (to continue preventing the application from running or to allow it to continue) on a case-by-case basis.

The raw data obtained from the experimentation is organized as a set of unique application names, a set of labels for each application (clean or malware) and 19 sets of feature data, one for each investigated feature. Inside the feature data, there are multiple values for the same {application name; feature} pair, due to multiple sensor interrogations for the same feature on the particular application. Table 2 shows an example of the raw data obtained from the sensors.


**Table 2.** Example of raw data obtained from software sensors.

Table 3 shows the investigated features for which sensor triggers were programmed and a short description, as detailed in [1,2].


**Table 3.** Description of sensor data features.

We considered the possibility that while there is a usual correlation between the number of packets and bytes transferred via WiFi, in the case of applications disguising themselves as another application, there may be a difference in communication protocols between the fake and original application that could influence the correlation between number of packets and bytes transferred. Generally speaking, even if the two features were to be correlated, the presence of two correlated features among a total of 19 would not adversely affect the learning process to any significant extent. In this particular case, however, the possible lack of correlation could, potentially, in itself be a valuable learnable pattern for detecting behavior. Therefore, it was believed that including both features would result in new information being available to the model.

The measurable events pool provided by the infrastructure described in detail in [1] and used through the monitoring application described in [2] includes events collectable from Android (e.g., SMS, Bluetooth, Call etc.), events collectable from Linux user-space (e.g., via procfs, including CPU load, memory load etc.) and there were non-numerical metrics (e.g., app name, version) which were eliminated when we curated the data set. Features for which data was not available in all instances were then eliminated to preserve the integrity of the dataset, as a fully numeric, complete set of samples is required for the learning process.

As stated in [1,2] our collection infrastructure includes data collectors, a collection server, and an Android monitoring application. Data is acquired for both Android events (SMS, WiFi, and Bluetooth) as well as Linux user-space (CPU load, memory load, and network statistics) and kernel-space metrics

(performance counters). The collection server maintains the list of running Android applications (and correlation with corresponding process identification numbers) and based on application state manages monitoring via the data collectors. The Android application usually resides in background and allows high-level user interaction and control (including picking the infrastructure operating mode: idle, training, testing on a system-wide and per application level). Furthermore, part of the initial pre-processing of the dataset is that clearly anomalous readings are eliminated as data-points. This is partly achieved in the initial composition of the dataset, and partly done automatically when processing the numerical data.

The ground-truth target labels were manually set. Applications downloaded from reputable sources (top Google Applications) were marked as clean, while applications from the known malware sources (e.g., applications from the Malgenome set) were considered malware.

The dataset is first processed by parsing the feature data and identifying all the values associated with a unique application name on a particular feature set. These values are then averaged to obtain a single value for the {application name; feature} pair. The average value is obtained by running a first pass on the values and calculating their mean and variance, after which a new mean is computed, taking into consideration only the values that fall within a 95% confidence interval from the natural mean. In the extreme case that no values fall within this range, the median value of the set is selected instead. This contributes to the rejection of extreme sensor values, which are thought to most likely be the result of erroneous momentary interrogation.
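A possible implementation of this robust averaging step is sketched below (function and variable names are illustrative; readings for one {application name; feature} pair are assumed to be already grouped):

```python
import numpy as np

def robust_feature_average(values, z=1.96):
    """Average repeated sensor readings for one {application; feature} pair,
    keeping only values within a 95% confidence band around the raw mean;
    fall back to the median if no value survives the filter."""
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()
    kept = values[np.abs(values - mu) <= z * sigma]
    return kept.mean() if kept.size > 0 else np.median(values)

# Example: one noisy interrogation (50.0) is rejected before averaging -> 3.01
print(robust_feature_average([3.1, 2.9, 3.0, 3.05, 3.0, 50.0]))
```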

Once all unique {application name; feature} values are parsed for a particular feature, the application names are checked against the existing list and only the pairs for which the application name already exists are kept. This is done to ensure the integrity of the dataset, in which all samples (i.e., applications) must have existing values for all features. The end result is a dataset with 361 samples for 19 features, plus the target labels, which are the inputs to the learning algorithms.

The features are all numeric, so there is no need to apply any type of coding for categorical variables. The target class labels are either clean (coded to 0) or malware (coded to 1).

The final dataset has its features normalized. The value for each sample within a feature is updated by subtracting the mean (μ*j*) and dividing by the standard deviation (*sj*) of that feature (*j*), as shown below.

$$x_{ij} := \frac{x_{ij} - \mu_j}{s_j} \tag{1}$$

This ensures that all input features are of similar range and centered on zero, which helps the speed and convergence of the learning process, as well as removing potential biases caused by the varying magnitudes. A cross-section of the dataset is exemplified in Table 4. Notice that the unique application names have been removed and each sample is now uniquely identifiable only implicitly through its row number. The actual data used for training is structured with samples as rows and features as columns, but is shown here transposed, for readability. It also includes an intercept term, which is a column of 1-s, not shown here. The sample and feature labels are not present in the actual dataset and are shown in the table only for convenience.

The dataset is then split into three sets, for training, cross-validation, and testing. The training set is used to learn the optimized parameters of the prediction model, the cross-validation set is used to tune its meta-parameters, and the testing set is used exclusively to rate the model performance on data it had not previously seen. The parameters and meta-parameters differ for each learning model and are explained in their respective sections.


**Table 4.** Example of dataset samples.

The split is done randomly each time the overall script starts, so the data is different for each complete run which learns and evaluates the three models, meaning the overall results differ slightly at each run. However, the same training, cross-validation, and test sets are used for learning and evaluation on each of the three learning models, within a complete run of the script, so that their performance can be properly compared.

The samples are split into 70% for the training set, 15% for the validation set and 15% for testing. This is done by generating a random index with the same length as the number of samples and then taking the appropriate percentage splits, using the index as the row number inside the dataset. In the context of 361 total samples, this works out to 253 samples for training, 54 samples for validation, and 54 samples for testing.
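The normalization of Equation (1) and the 70/15/15 split can be reproduced, for example, as follows (the data matrix here is a random placeholder for the curated dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(361, 19))                 # placeholder for the curated dataset

# Feature normalization, Equation (1): x_ij := (x_ij - mu_j) / s_j
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Random 70/15/15 split via a shuffled index, as described in the text.
idx = rng.permutation(X.shape[0])
n_train, n_val = 253, 54                       # 70% and 15% of the 361 samples
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))   # 253 54 54
```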

A True Positive (TP) is a sample correctly identified as malware, a True Negative (TN) is a sample correctly classified as clean, a False Positive (FP) is a clean sample misclassified as malware, and a False Negative (FN) is a malware sample misclassified as clean. A chart showing the number and rates of samples thus classified is called a Confusion Matrix. The results of each algorithm include such a chart for each sample set and for the overall set.

Two standard loss functions are used for the classification algorithms, Cross-Entropy Loss and Hinge Loss. Cross-entropy is the default loss function used for binary classification problems. It is intended for use with binary classification where the target values are in the set {0, 1}, as is the default coding in our dataset. It is used for the Logistic Regression and Pattern net learning algorithms [24]. Using Cross-Entropy Loss, the cost function to be minimized is shown below; *h*θ(*x<sup>i</sup>*) is the hypothesis outputted by the model (see also Equation (5)) and *y<sup>i</sup>* is the ground-truth label for that sample.

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ y^i \left( -\log h_{\theta}(x^i) \right) + \left( 1 - y^i \right)\left( -\log\left( 1 - h_{\theta}(x^i) \right)\right)\right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \tag{2}$$

where *h*θ(*x<sup>i</sup>*) = 1/(1 + *e*<sup>−θ<sup>T</sup>x<sup>i</sup></sup>). Hinge Loss is an alternative to cross-entropy for binary classification problems, primarily developed for use with SVM models. It is intended for use with binary classification where the target values are in the set {−1, 1}, which are set automatically by the learning algorithm. The Hinge Loss function encourages examples to have the correct sign, assigning more error when there is a difference in sign between the actual and predicted class values [24]. The formula of the Hinge Loss function is shown below.

$$J(\theta) = C \sum\_{i=1}^{m} \left[ y^i \text{cost}\_1\left( \theta^T \mathbf{x}^i \right) + \left( 1 - y^i \right) \text{cost}\_0\left( \theta^T \mathbf{x}^i \right) \right] + \frac{1}{2} \sum\_{j=1}^{n} \theta\_j^2 \tag{3}$$

where $\text{cost}(x) = \begin{cases} 0, & y \ast k(x) \geq 1 \\ 1 - y \ast k(x), & \text{otherwise} \end{cases}$, with *k*(*x*) being the similarity (kernel) function [25–27].
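For concreteness, the two loss functions can be sketched in Python as below, with labels coded as {0, 1}. Excluding the intercept coefficient from the regularization term is a common convention and an assumption of this sketch, since the paper does not state how it is handled; the intercept is taken to be the last column, as in Equation (4).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y, lam):
    """Regularized Cross-Entropy Loss, as in Equation (2)."""
    m = len(y)
    h = sigmoid(X @ theta)
    eps = 1e-12                                   # numerical safety for log(0)
    data_term = -(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)).mean()
    reg_term = lam / (2 * m) * np.sum(theta[:-1] ** 2)   # intercept not penalized (assumption)
    return data_term + reg_term

def hinge_cost(theta, X, y, C):
    """Hinge Loss in the SVM form of Equation (3), with cost_1 and cost_0
    implemented as the usual hinge branches for {0, 1}-coded labels."""
    z = X @ theta
    cost1 = np.maximum(0, 1 - z)                  # applied when y = 1
    cost0 = np.maximum(0, 1 + z)                  # applied when y = 0
    return C * np.sum(y * cost1 + (1 - y) * cost0) + 0.5 * np.sum(theta[:-1] ** 2)
```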

The following metrics are used for evaluation, as shown in Table 5.

**Table 5.** Evaluation metrics.

Accuracy is the rate of correctly classified instances, irrespective of class. Precision is the fraction of relevant instances among the retrieved instances, i.e., the number of correctly predicted positives out of the samples that were predicted positive. Recall is the fraction of total relevant instances retrieved, i.e., the number of correctly predicted positives out of the samples that actually were positive. The F1 score is the harmonic mean of precision and recall [28].
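A hedged sketch of how these counts and metrics can be computed from predicted and ground-truth labels is given below; the malware class is assumed to be coded as 1 (positive) and the clean class as 0, matching the definitions above.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts and the metrics of Table 5 (illustrative sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))    # malware correctly detected
    tn = np.sum((y_pred == 0) & (y_true == 0))    # clean correctly detected
    fp = np.sum((y_pred == 1) & (y_true == 0))    # clean flagged as malware
    fn = np.sum((y_pred == 0) & (y_true == 1))    # malware missed
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "F1": f1}
```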

Table 6 below gives an overview of the parameters and criteria used for optimization at each level of the training process.


**Table 6.** Parameters, meta-parameters and criteria.

Logistic Regression is used to obtain a model for the classification and prediction of a binary outcome from a set of samples, in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial [29,30].

The optimization problem is described as *Ŷ* = θ*X*, where *X* is the input data, namely the explanatory (independent) variables, containing all data-points plus an intercept term, which helps prevent overfitting. *Y* is the output data that the algorithm is trying to learn, and θ is the matrix of coefficients used to estimate *Y* from *X*. The challenge is finding the best coefficients, which minimize the error between the actual *Y* and the estimate. This is obtained from *Ŷ* = θ<sub>LR</sub> ∗ *X*, or, in extended form:

$$
\begin{bmatrix}
\hat{y}\_1 \\
\hat{y}\_2 \\
\hat{y}\_3 \\
\vdots \\
\hat{y}\_n
\end{bmatrix} = \begin{bmatrix}
\theta\_{11} & \cdots & \theta\_{1(m+1)} \\
\vdots & \ddots & \vdots \\
\theta\_{n1} & \cdots & \theta\_{n(m+1)}
\end{bmatrix} \begin{bmatrix}
\mathbf{x}\_1 \\
\mathbf{x}\_2 \\
\vdots \\
\mathbf{x}\_m \\
\mathbf{int}
\end{bmatrix} \tag{4}
$$

Once the best available set of coefficients (θ) is found, a prediction for a given sample is done using the function:

$$p(\mathbf{x}) = \begin{cases} \ 0, & h\_{\theta}(\mathbf{x}) < th \\ \ 1, & h\_{\theta}(\mathbf{x}) \ge th \end{cases} \tag{5}$$

where $h\_{\theta}(\mathbf{x}) = \frac{1}{1 + e^{-\theta^T \mathbf{x}}}$ and *th* is the prediction threshold.

The cost function, shown previously in Equation (2), is computed as a measure of the difference between estimated values and the ground truth for the input data. A visualization of the cost for a particular sample is shown in Figure 2, where *y* is the ground truth and *h*θ(*x*) is the hypothesis output by our model [25].

**Figure 2.** Logistic Regression cost function.

If there are too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples, which is called overfitting [26]. Conversely, if there are not enough representative features, or if the number of training examples is too small to properly constrain the model, it underfits. Parameters such as alpha (α), the learning rate, and lambda (λ), the regularization parameter, must also be taken into account. The learning rate α should not be too large, because the algorithm may diverge, yet a value that is too small causes slow learning and can leave the cost function stuck in local minima [31]. Faster gradient descent can be obtained via an adaptive learning rate. Lambda (λ) is the regularization parameter and determines how strongly a model is penalized for fitting the training data too closely [25].

Gradient descent is an optimization algorithm in which the candidate solution is improved at each iteration by moving along the negative gradient of the cost function in the parameter space. While it requires the target function to be differentiable and is somewhat susceptible to local minima, gradient descent provides a stable and computationally inexpensive algorithm for function optimization. Figure 3 shows a visualization of the Gradient Descent algorithm for two thetas [25]. This corresponds to a one-variable learning problem, for which θ1 is the coefficient of the independent variable x, and θ0 is the intercept term. *J*(θ0, θ1) is the associated cost for each tuple of the two theta parameters, shown as the landscape on which the gradient descent algorithm searches for optima.

**Figure 3.** Gradient Descent algorithm.
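A minimal sketch of such a gradient descent loop for the regularized Logistic Regression cost of Equation (2) is shown below, reusing the `sigmoid` and `cross_entropy_cost` helpers sketched earlier; the default values of alpha and lambda are placeholders, not the tuned values reported later in the paper.

```python
import numpy as np

def gradient_descent_logreg(X, y, alpha=0.1, lam=1.0, iters=400):
    """Batch gradient descent on the cost of Equation (2), starting from an
    all-zero theta, as described in Section 3.1 (illustrative sketch)."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iters):
        h = sigmoid(X @ theta)                # hypothesis, Equation (5)
        grad = X.T @ (h - y) / m
        grad[:-1] += lam / m * theta[:-1]     # intercept not regularized (assumption)
        theta -= alpha * grad
        history.append(cross_entropy_cost(theta, X, y, lam))
    return theta, history

def predict(theta, X, th=0.5):
    """Thresholded prediction of Equation (5)."""
    return (sigmoid(X @ theta) >= th).astype(int)
```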

SVMs, also called Large Margin Classifiers, are supervised learning models used for classification and regression. The input data is classified by optimizing the hyperplane equation so that the distance to the data representing the classes is maximum [32]. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using kernels, implicitly mapping their inputs into high-dimensional feature spaces [25].

Having the training data, the problem is to determine the parameters of the hyperplane that best separates the data into classes. The cost function for SVMs was shown previously in Equation (3). The parameter C is important because it controls the trade-off between fitting the training data and regularization, weighting the misclassification term of the cost function. If the value of C is too large, the SVM will bend the decision boundary in an attempt to accommodate the outliers, which produces overfitting; if C is too small, there is a risk of high bias [25]. The similarity function, named the kernel, is used by the SVM to draw the decision boundary. Kernels must be positive semi-definite and symmetric. The paper investigates the use of three of the most common options: linear, polynomial, and Gaussian (Radial Basis Function), in order to capture normally distributed features across the feature map, while also being able to handle features that take unusual values [33]. Because of faster execution, an SVM with a Gaussian kernel might outperform Logistic Regression for a medium-sized dataset. The SVM finds a solution hyperplane that is as far as possible from both categories, while Logistic Regression does not have this property. In addition, Logistic Regression is more sensitive to outliers than the SVM, because the Logistic cost function diverges faster than Hinge Loss [33]. Figure 4 provides a visual representation of the difference between the two cost functions, using the notations of *y* and θ<sup>T</sup>*x* discussed earlier in the paper [34].

Artificial neural network models are based on a layer of hidden features (i.e., neurons), which replace the specific features used in other models and which control the classification. Neural networks have an input layer that matches the dimension of an input sample and an output layer that matches the target variable. The model is optimized by successively tuning the weights of these neurons, through a process called back-propagation. A standard neural network is exemplified in Figure 5 [35,36], where *xi* are the features of the input layer, i.e., the dataset features shown previously, *wij* are the weights associated with the transition from the input layer to the hidden layer, *uj* are the artificial features of the hidden layer, the number of which is an important meta-parameter to be chosen by the designer, *wjk* are the weights associated with the transition from the hidden layer to the output layer, and *uk* are the output features, or dependent variables, of which there is only one (*y*) in the case of binary classification, such as presented in this paper.

**Figure 5.** Artificial neural network with one hidden layer.
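A forward pass through such a one-hidden-layer network can be sketched as follows, reusing the `sigmoid` helper defined earlier; the logistic activation in the hidden layer and the handling of the bias are assumptions made for the illustration, as the paper does not specify them.

```python
import numpy as np

def forward_pass(x, W_ih, W_ho):
    """Forward pass of the network of Figure 5 (illustrative sketch):
    x    -- input feature vector,
    W_ih -- weights w_ij from the input layer to the hidden layer,
    W_ho -- weights w_jk from the hidden layer to the single output."""
    x_b = np.append(x, 1.0)             # bias / intercept term on the input
    u = sigmoid(W_ih @ x_b)             # hidden-layer activations u_j
    u_b = np.append(u, 1.0)             # bias term on the hidden layer
    y_hat = sigmoid(W_ho @ u_b)         # single output for binary classification
    return y_hat
```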

#### **3. Results**

#### *3.1. Logistic Regression*

The first learning algorithm investigated is Logistic Regression with polynomial features. The initial dataset is expanded by mapping the second- and third-degree polynomials of the original features, to allow the algorithm to better account for the nonlinearities in the dataset. This means that:

$$X := \begin{bmatrix} X & X^2 & X^3 \end{bmatrix} \tag{6}$$

for all training, cross-validation, and test sets. There are 57 resulting features, plus an intercept term (see also Equation (4)), used to model all other phenomena on which the target might be dependent, which also helps prevent overfitting. This comes out to 58 total features, training on 253 samples. The polynomial degree was chosen to provide a good equilibrium between accounting for nonlinearities and the risk of overfitting the dataset.
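The element-wise mapping of Equation (6) might be implemented as in the sketch below; the count of 19 original features is inferred from the 57 expanded columns and is therefore an assumption of the example.

```python
import numpy as np

def polynomial_expand(X):
    """Degree-2 and degree-3 element-wise feature mapping of Equation (6),
    applied before appending the intercept column (illustrative sketch)."""
    X_poly = np.hstack([X, X ** 2, X ** 3])       # e.g., 19 -> 57 features
    intercept = np.ones((X_poly.shape[0], 1))
    return np.hstack([X_poly, intercept])         # 58 columns in total
```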

One training run of the algorithm takes place for up to 400 iterations, starting from an initial parameter vector (θ) of all zeroes. The decrease of the cost function in an example run using gradient descent is shown in Figure 6.

**Figure 6.** Logistic Regression Cost Function over algorithm iterations.

Cross-validation is done simultaneously for two meta-parameters, lambda (λ) and the prediction threshold (th). The ranges of values for the two meta-parameters are shown in Table 7.

**Table 7.** Ranges of values for lambda and threshold.


There are therefore 40 combinations of possible {lambda; threshold} parameters. Each of these is used to perform a training run, for which the theta parameter vector is stored, and the F1 score of each resulting model is evaluated, with the best model being kept as the overall Logistic Regression model. Table 8 shows an example of the results obtained for a complete suite of training runs using Logistic Regression. For brevity, results are shown for each lambda, but with the threshold level set at the chosen best.
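This selection procedure can be sketched as follows, reusing the helpers and dataset splits introduced earlier; the candidate lambda and threshold values below are placeholders standing in for the actual ranges of Table 7, which are not reproduced here.

```python
import itertools

# Placeholder grids: 8 lambdas x 5 thresholds = 40 combinations, as in the text.
lambdas = [0, 0.01, 0.1, 1, 3, 10, 30, 100]        # hypothetical range
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]             # hypothetical range

best = {"f1": -1.0}
for lam, th in itertools.product(lambdas, thresholds):
    theta, _ = gradient_descent_logreg(X_train, y_train, lam=lam)
    f1 = classification_metrics(y_val, predict(theta, X_val, th))["F1"]
    if f1 > best["f1"]:
        best = {"f1": f1, "lambda": lam, "threshold": th, "theta": theta}
```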

**Table 8.** Logistic Regression results on varying lambda.


An alternative approach was explored, whereby cross-validation was done separately, first for the lambda (λ) parameter, using the evaluation of the cost function. The best lambda (λ) and its respective theta parameter vector were chosen, after which the best prediction threshold was evaluated based on the F1 score. However, this approach consistently obtained poorer results in terms of F1 score compared to the simultaneous cross-validation method mentioned earlier.

As can be seen from Table 8, the best F1 score on the validation set is 0.89, which leads to the choice of λ = 10 and *th* = 0.5. Depending on the initial dataset split, which is random, the results do have a slight variation, both in the actual best meta-parameters chosen, as well as the resulting best F1 score for the validation set. The results obtained in Table 8 are among the best recorded for Logistic Regression, after multiple runs, with the evaluation score on the validation set being about as high as can be expected. The final evaluation for the Logistic Regression algorithm is the F1 score of 0.84 recorded on the test set.

The Confusion Matrix for the chosen Logistic Regression model is shown in Figure 7, displaying the Confusion Matrices for each of the training, cross-validation, and test sets, as well as the overall Confusion Matrix, for the entire dataset.

**Figure 7.** Confusion Matrix for Logistic Regression.

#### *3.2. Shallow Neural Network for Pattern Recognition*

The following model is a shallow pattern recognition neural network (PatternNet), with a single hidden layer composed of 20 neurons. The indices for the training, cross-validation, and test sets were kept from the initial dataset split and provided to the pattern net, so that the same sets are used for training and testing, which allows a proper comparison of the results obtained by the models. Figure 8 shows the architecture of a PatternNet neural network, with w being the transition weights of the network and b being the weights of the bias, i.e., intercept, term, which is automatically added by the network upon initialization.

**Figure 8.** PatternNet architecture.

Training is performed using Scaled Conjugate Gradient, with a similar Cross-Entropy Loss function as the Logistic Regression. The point of cross-validation, in this case, is to supervise the performance of the network and prevent overfitting. Figure 9 shows the performance of the network on the three sets, with the set of weights chosen for best performance on the validation set, displaying the loss function value throughout the training iterations, i.e., epochs, for the training, cross-validation, and test sets.

**Figure 9.** PatternNet performance.
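The network appears to be a MATLAB-style pattern recognition net; as a rough, non-equivalent Python analogue, the sketch below mirrors only the validation-supervised early stopping described above, not the Scaled Conjugate Gradient optimizer, and reuses the dataset splits introduced earlier. All parameter values here (patience, maximum epochs) are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import log_loss

# One hidden layer of 20 neurons; warm_start lets us train epoch by epoch
# and monitor the cross-entropy loss on the validation set.
net = MLPClassifier(hidden_layer_sizes=(20,), activation="logistic",
                    solver="adam", max_iter=1, warm_start=True)

best_val_loss, wait, patience = np.inf, 0, 6
for epoch in range(1000):
    net.fit(X_train, y_train)                         # one more epoch
    val_loss = log_loss(y_val, net.predict_proba(X_val))
    if val_loss < best_val_loss:
        best_val_loss, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                          # stop when validation loss stalls
            break
```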

The results of the PatternNet model are shown in Table 9.

**Table 9.** PatternNet results.


After multiple runs, the best F1 score obtained by a PatternNet model on this challenging dataset is 0.86. The associated Confusion Matrix is shown in Figure 10 for each of the training, cross-validation, and test sets, as well as the overall Confusion Matrix, for the entire dataset.

**Figure 10.** PatternNet Confusion Matrix.

#### *3.3. Support Vector Machines*

The SVM algorithm is trained using the same dataset split as for the other two models. Cross-validation is done on the choice of kernel, with the three hardcoded options *K* ∈ {*linear*, *gaussian*, *polynomial*(3)}. The maximum degree of the polynomial kernel was chosen to be comparable to the maximum degree of the feature mapping from Logistic Regression. An SVM model is trained for each of these kernel options and the best one is chosen, based on their respective F1 scores. The results are shown below in Table 10.
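A hedged sketch of this kernel selection, using scikit-learn's SVC as a stand-in for the original implementation and the `classification_metrics` helper and data splits sketched earlier, is shown below.

```python
from sklearn.svm import SVC

# Candidate kernels matching the three options named in the text.
candidates = {
    "linear": SVC(kernel="linear"),
    "gaussian": SVC(kernel="rbf"),
    "polynomial(3)": SVC(kernel="poly", degree=3),
}

best_name, best_f1, best_model = None, -1.0, None
for name, model in candidates.items():
    model.fit(X_train, y_train)
    f1 = classification_metrics(y_val, model.predict(X_val))["F1"]
    if f1 > best_f1:
        best_name, best_f1, best_model = name, f1, model
```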

**Table 10.** SVM results on varying the kernel.


The associated Confusion Matrix is shown in Figure 11 below, for each of the training, cross-validation, and test sets, as well as the overall Confusion Matrix, for the entire dataset.

**Figure 11.** SVM Confusion Matrix.

#### **4. Discussion**

Table 11 provides an overall comparison of obtained metrics on the training, validation, and test sets, for each proposed model, using the meta-parameter values selected during validation. The results show that all of the three investigated algorithms perform reasonably well on a challenging dataset, with SVMs having an edge in performance on the test set, as well as maintaining a good equilibrium between the training, cross-validation, and test results.


**Table 11.** Overall comparison of model results.

A Cochran's Q test [37,38] was performed to evaluate the results, with the null hypothesis being that the three algorithms have similar performance in classification. The test obtains a *p*-value of 5.7268 × 10<sup>−10</sup> and a Cochran Q value of 42.5614, for which the null hypothesis can be safely rejected at the α = 0.05 significance level, indicating that the three models do not all perform equally well; which model performs better is addressed by the pairwise post-hoc tests below.

Furthermore, a comparative post-hoc analysis of the statistical significance of the results obtained through the three models was performed using three McNemar tests [39,40] for each of the three pairs of models. The tests are undertaken on pairs of models, since a McNemar Test can only be run on two models at a time. The first test (exact–conditional McNemar) compares the accuracies of the two models, while the second (Asymptotic McNemar) and third (Mid-*p*-value McNemar) tests assess whether one model classifies better than the other. The null hypothesis for the first test is that the two models have equal predictive accuracies. The null hypothesis for the second and third tests is that one of the models is significantly less accurate than the other. The comparison is not cost sensitive, i.e., it assigns the same penalty for different types of misclassification. Detailed descriptions of McNemar tests and the procedure used can be seen in [40,41]. Table 12 shows the results of the statistical significance tests, where h is the hypothesis test result (*h* = 1 indicates the rejection of the null hypothesis, while *h* = 0 indicates a failure to reject the null hypothesis), p is the *p*-value of the test, and e1 and e2 are the classification losses or misclassification rates of the respective two models (which are the same, irrespective of test).


**Table 12.** McNemar tests for significance of model performance.

The comparison between Logistic Regression and PatternNet is somewhat inconclusive. While the exact–conditional test cannot reject the null hypothesis that the two methods have similar accuracies, neither can the other two tests reject the assumption that one model performs better than the other, although it should be noted that the *p*-value of the first test is twice as large as the values for the second and third tests. However, in the comparisons with the SVM model, both for Logistic Regression as well as for PatternNet, the null hypothesis of the first test is rejected, meaning that the two methods have dissimilar accuracies. This is further corroborated by the second and third tests, where there is strong evidence that the null hypothesis cannot be rejected, meaning that the SVM model significantly outperforms both the Logistic Regression and the PatternNet models.
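For illustration only, both kinds of test can be reproduced with statsmodels as sketched below; the `correct_lr`, `correct_nn`, and `correct_svm` vectors (0/1 indicators of whether each test sample was classified correctly by each model) are assumed names, and only the exact McNemar variant is shown, not the asymptotic or mid-*p* versions used in the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Omnibus test over the three models on the same samples.
outcomes = np.column_stack([correct_lr, correct_nn, correct_svm])
q_result = cochrans_q(outcomes)
print(q_result.statistic, q_result.pvalue)

# Pairwise exact McNemar test, e.g., Logistic Regression vs. SVM: the 2x2 table
# counts samples by (correct/incorrect) under each of the two models.
table = np.array([
    [np.sum((correct_lr == 1) & (correct_svm == 1)),
     np.sum((correct_lr == 1) & (correct_svm == 0))],
    [np.sum((correct_lr == 0) & (correct_svm == 1)),
     np.sum((correct_lr == 0) & (correct_svm == 0))],
])
print(mcnemar(table, exact=True).pvalue)
```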

The obtained *p*-values of the exact–conditional McNemar tests are manually corrected using the Discrete Bonferroni–Holm Method for Exact McNemar tests [42–44]. As stated by Holm (1979) [42], "except in trivial non-interesting cases the sequentially rejective Bonferroni test has strictly larger probability of rejecting false hypotheses and thus it ought to replace the classical Bonferroni test at all instants where the latter usually is applied". Table 13 shows the corrected *p*-values and the resulting decision on the significance of the test.


**Table 13.** Corrected exact McNemar tests for significance of model performance.

The corrected *p*-values are obtained as described in Westfall et al. [43] and implemented in [44]. For the test result to be considered significant, its corrected *p*-value should be below the α = 0.05 threshold. The first test is evaluated as expected, since the initial series of McNemar tests already showed no statistical difference in the performance of Logistic Regression and PatternNet, with a *p*-value well above the threshold. The second test is most interesting, as the hypothesis that the Logistic Regression model and the SVM model have significantly different performance is now rejected, upon strengthening the test. However, it should be noted that it is a borderline decision, and the test would have been accepted at an α = 0.1 significance level. The third test reinforces the conclusion that the SVM model performs significantly better, this time compared to the PatternNet model, meeting the significance requirement by a wide margin.
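As a simplified illustration (not the discrete method actually used), a plain Holm–Bonferroni step-down adjustment of the three exact McNemar *p*-values can be obtained with statsmodels as follows; the *p*-value variable names are placeholders, and the discrete correction of Westfall et al. [43] is less conservative than this standard adjustment.

```python
from statsmodels.stats.multitest import multipletests

# Placeholder p-values from the three pairwise exact McNemar tests.
raw_pvalues = [p_lr_vs_nn, p_lr_vs_svm, p_nn_vs_svm]
reject, corrected, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")
```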

However, all their performances are satisfactory and comparable on the initial metric used. The intention is that the dataset be expanded in the future, at which time further testing will be undertaken to evaluate the ability of each model to generalize, as well as the performance of various ensemble methods, possibly comprising two or more of these same models.

The literature review has yielded few comparable studies for the chosen methods on smartphone applications, without even taking into account the considerably different datasets, features, extraction techniques, etc. that other studies work with. Most papers discuss detection accuracy, with very few using the F1 score metric. The reported accuracy also varies considerably across methods and especially datasets. The final metric values, on the test set, for the linear kernel SVM, compare favorably with most papers in the literature review, for which similar metrics were available. The closest comparison is the application described in [2], which obtains slightly higher metrics on a previous version of the dataset.

While the final hard numbers of the F1 score metric and the statistical significance tests clearly favor the SVM implementation, the difference in performance is oftentimes a relatively small number of misclassified samples, which is one of the issues to keep in mind when working with a small number of data-points.

Therefore, one of the main directions for future research is expanding the dataset, both in terms of the number of data-points (samples), as well as the number of features, where applicable. This would be a great benefit to the ability to train and test the models, as well as allow the investigation of deeper and more complex model architectures, some of which were touched upon in the discussion on the state of the art.

Another interesting future option is ensemble learning, which leverages the positive results of having three good classifiers, whose outputs can then be composed into a final prediction. This is particularly appealing, given the comparatively small difference in accuracies of the three investigated models.

In addition, detection of anomalous behavior from the user's interaction pattern standpoints could be a promising topic for further research, starting from the framework already described.

The obvious future step for the application is that of actual hardware implementation on a working prototype, through which future data can also be more easily gathered. The dataset split procedure, which was discussed at length throughout the paper, should give the model a good ability to generalize to as yet unseen data.

**Author Contributions:** Conceptualization, V.V., A.M.T. and V.L.; methodology, V.-G.V., L.R. and V.-A.G.; software, V.V. and A.-M.T.; validation, V.V., V.-G.V., A.-M.T., H.W. and V.-A.G.; formal analysis, H.Y., H.W.; investigation, A.-M.T., L.V. and H.Y.; resources, L.R.; data curation, A.-M.T., V.-G.V. and L.R.; writing—original draft preparation, V.V., V.-G.V. and A.-M.T.; writing—review and editing, V.V., L.V. and V.-A.G.; visualization, V.-A.G. and H.Y.; supervision, L.V., V.-G.V. and H.W.; project administration, V.-A.G. and L.V.; funding acquisition, L.V. and H.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** The paper was funded by the European Commission Marie Skłodowska-Curie SMOOTH project, Smart Robots for Fire-Fighting, H2020-MSCA-RISE-2016-734875, Yanshan University: "Joint Laboratory of Intelligent Rehabilitation Robot" project, KY201501009, Collaborative research agreement between Yanshan University, China and Romanian Academy by IMSAR, RO, and the UEFISCDI Multi-MonD2 Project, Multi-Agent Intelligent Systems Platform for Water Quality Monitoring on the Romanian Danube and Danube Delta, PCCDI 33/2018, PN-III-P1-1.2-PCCDI2017-2017-0637. This work was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI-UEFISCDI, MultiMonD2 project number PNIII-P1-1.2-PCCDI-2017-0637/2018, within PNCDI III, and by the European Commission Marie Skłodowska-Curie SMOOTH project, Smart Robots for Fire-Fighting, H2020-MSCA-RISE-2016-734875, Yanshan University: "Joint Laboratory of Intelligent Rehabilitation Robot" project, KY201501009, Collaborative research agreement between Yanshan University, China and Romanian Academy by IMSAR, RO.

**Acknowledgments:** The authors gratefully acknowledge the support of the Robotics and Mechatronics Department, Institute of Solid Mechanics of the Romanian Academy.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
