**Semantic Segmentation of Intralobular and Extralobular Tissue from Liver Scaffold H&E Images**

**Miroslav Jirik 1,2,\*, Ivan Gruber 1, Vladimira Moulisova 2, Claudia Schindler 3, Lenka Cervenkova 2, Richard Palek 2,4, Jachym Rosendorf 2,4, Janine Arlt 3, Lukas Bolek 2, Jiri Dejmek 2, Uta Dahmen 3, Milos Zelezny 1 and Vaclav Liska 2,4**


Received: 3 November 2020; Accepted: 7 December 2020; Published: 10 December 2020

**Abstract:** Decellularized tissue is an important source for biological tissue engineering. Evaluation of the quality of decellularized tissue is performed using scanned images of hematoxylin-eosin (H&E) stained tissue sections and is usually observer dependent. The first step in creating a tool for assessing the quality of a liver scaffold without observer bias is the automatic segmentation of the whole slide image into three classes: the background, the intralobular area, and the extralobular area. Such segmentation enables texture analysis in the intralobular area of the liver scaffold, which is a crucial part of the recellularization procedure. Existing semi-automatic methods for general segmentation (e.g., thresholding, watershed) do not meet the quality requirements, and no methods are available to solve this task automatically. Given the small amount of training data, we propose a two-stage method. The first stage is based on the classification of simple hand-crafted descriptors of pixels and their neighborhoods and is trained on partially annotated data. Its outputs are used to train the second-stage approach, which is based on a convolutional neural network (CNN). Our architecture, inspired by U-Net, reaches very promising results despite the very small amount of training data. We provide qualitative and quantitative results for both stages. With the best training setup, we reach 90.70% recognition accuracy.

**Keywords:** H&E; decellularization; liver; tissue engineering; semantic segmentation; convolutional neural networks

## **1. Introduction**

Decellularized tissue scaffolds consisting of extracellular matrix proteins after complete cell removal represent natural three-dimensional matrices with great potential in tissue engineering [1,2]. Recellularization of the decellularized scaffold can be used for in vitro engineering of artificial organs [3,4], providing an alternative strategy to other methods such as cell repopulation of synthetic matrices [5] or growing chimeric organs in genetically altered animals [6].

Nevertheless, despite research efforts, the construction of liver tissue in vitro remains very challenging. The quality of decellularized scaffold is crucial for the initial cell-scaffold interaction [7,8], and thus determines the success of the cell repopulation process. However, the assessment of the scaffold quality prior to recellularization represents one of the remaining problems to be solved. The assessment criteria available are very fragmented and concentrate mainly on bulk properties. Morphological evaluation is mostly qualitative and rather superficial [9,10].

Whole Slide Scan (WSS) microscopy has been widely used in recent years. It allows detailed images of whole samples to be studied and archived. Image processing techniques make it possible to design efficient semiautomatic and automatic procedures for quantitative analysis. General algorithms available in free software can often be used successfully to solve simple tasks. In [11], the authors used the ImageJ application, based on the Gray Level Co-occurrence Matrix and the Run-Length Matrix [12], to analyze liver fibrosis in H&E images. For more complex tasks, the combination of an image processing tool with a scripting language may be necessary. In [13], Matlab and its scripting language were used for quantitative analysis of cells and tissues. The most challenging tasks require the most advanced algorithms. Convolutional neural networks, introduced by LeCun [14,15], have also shown promising results in WSS microscopy. The most common tasks are image classification and image segmentation. A convolutional neural network-based approach to segmentation of nuclei and cells can be found in [16].

The first method for the quantitative evaluation of the structure quality with respect to particular liver scaffold features, such as intralobular sinusoidal vessel structures, was introduced in [8]. However, this method requires initial user input and is therefore observer dependent. The first step in creating an observer-independent and reproducible evaluation method of the scaffold structure quality is the semantic segmentation into three classes: background, intralobular area, and extralobular area.

Thanks to the improvements in neural networks in recent years, most hand-crafted feature descriptors for semantic segmentation become obsolete when enough data are available. However, no suitable dataset of liver tissue exists, and creating a new one would require per-pixel labeling of high-resolution data, which is very time-consuming and costly.

Therefore, in this paper, we propose a two-stage method. In the first stage, we train a Naive Bayes classifier [17] on simple texture descriptors. The outputs of this classifier are then used as training data for a convolutional neural network.

The main contributions of this paper are the following:


## **2. Materials and Methods**

### *2.1. Scaffold Sample Preparation*

After explantation from domestic pigs (Sus scrofa), the liver was decellularized by perfusion with detergent solutions (1% Triton X-100, 1% SDS) via the portal vein and hepatic artery, and finally washed with saline using a system of peristaltic pumps (Masterflex L/S, Cole-Palmer, Vernon Hills, IL, USA). Scaffold samples were fixed in 10% buffered formalin, embedded in paraffin, and cut on a microtome into 4 µm thick sections. The tissues were taken with ethical approval from the Ministry of Education of the Czech Republic (no. MSMT-4428/2018-2).

### *2.2. Histological Staining and Imaging*

Histological sections were mounted on glass slides, deparaffinized, and subjected to hematoxylin-eosin staining, resulting in blue-stained nuclei and pink-stained cytoplasm. Whole slide scans were produced using a Nanozoomer 2.0HT Digital Slide Scanner (Hamamatsu, Hamamatsu City, Japan). The source lens used for data acquisition was 40×. The typical size of the scanned area was about 15 × 10 mm. The resolution of the images is 227 nm per pixel. The size of the uncompressed image data was 7 to 19 GB.

### *2.3. Image Processing*

The input image of H&E stained scaffolds is described by selected texture features. Because of the small amount of training data and the lack of full image annotations, we used a two-stage method. In the first stage, a classifier is trained on a set of partially annotated images and then applied per pixel to segment the WSS. To increase accuracy, the classifier is additionally trained on a simple annotation for each particular image. The segmentations obtained in this way are used in the second stage to train a convolutional neural network that does not require any further per-image adjustment.

### *2.4. Preprocessing and Data Annotation*

WSS data are stored in the NDPI file format and partial annotations are stored in the NDPA format. Background, intralobular, and extralobular areas are annotated with magenta, black, and red color, respectively (see Figure 1). The area with the particular type of tissue is selected by drawing a polygon. In this way, a few representative parts of the image were selected. A full annotation of the whole slide image was not created because of the large time demands. Annotations were produced by an operator supervised by a tissue engineering expert.

**Figure 1.** Example of partially annotated H&E Whole Slide Scan (WSS). The manually selected extralobular area is labeled in red. The magenta delineation shows the scan background and the intralobular area is annotated in black. The green, cyan, blue, and yellow annotation represents the rough delineation of the central vein.

Based on the metadata, the pixel size for each layer of the pyramid representation of the NDPI file format was extracted. The vertices of the annotation polygons were recalculated to the proper resolution. A bitmap with a pixel size of 10 µm was created from the pyramid representation of the NDPI files. The image was then divided into tiles of 255 × 255 px for easier processing.
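The resolution recalculation and tiling described above can be sketched as follows. This is a minimal illustration using the 227 nm native pixel size and the 10 µm working resolution stated earlier; the function and variable names are ours, not the authors'.

```python
import numpy as np

NATIVE_PIXEL_NM = 227.0      # scanner resolution at level 0 of the NDPI pyramid
TARGET_PIXEL_NM = 10_000.0   # 10 µm working resolution
TILE_PX = 255                # tile edge used for processing

def rescale_polygon(vertices_level0, native_nm=NATIVE_PIXEL_NM, target_nm=TARGET_PIXEL_NM):
    """Recalculate annotation polygon vertices from level-0 pixel
    coordinates to the working resolution."""
    scale = native_nm / target_nm            # ≈ 0.0227, i.e., a ~44× downscale
    return np.asarray(vertices_level0, dtype=float) * scale

def tile_grid(width_px, height_px, tile=TILE_PX):
    """Number of tiles needed to cover the downscaled image."""
    nx = int(np.ceil(width_px / tile))
    ny = int(np.ceil(height_px / tile))
    return nx, ny
```

For a typical 15 × 10 mm scan this yields a working image of roughly 1500 × 1000 px, i.e., a 6 × 4 tile grid.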

### *2.5. Handcrafted Texture Feature Segmentation (HCTFS)*

The texture features were designed to describe the pixel intensity and the neighborhood texture. We started from our formerly designed method for scaffold texture segmentation [8] and extended the feature vector to better distinguish the "background" from the "intralobular" area. The flow-chart of the algorithm can be seen in Figure 2. To keep the computational demands low, the texture features are as simple as possible. The first three features originate from the RGB intensity, which takes into account the color information in the H&E stained scaffold images. Only the red channel, which is strongly correlated with the other color channels, is used in the calculation of the remaining features. The next two features are obtained by a Gaussian filter [18] with a standard deviation of the Gaussian kernel of 2 and 5 pixels. The Sobel filter [19] is used to describe the local discontinuity in the image, and its response at the pixel location is used as one feature. The information about neighborhood discontinuity is generated by a Gaussian filter applied to the Sobel response with a standard deviation of 2 and 5 pixels. The last feature is the median of the Sobel response over a neighborhood 10 pixels in diameter. The responses of each feature extractor can be found in Figure 3.
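The feature stack described above can be sketched with standard `scipy.ndimage` filters. This is a minimal illustration; the function name is ours, and the square median footprint is an approximation of the 10-pixel-diameter neighborhood.

```python
import numpy as np
from scipy import ndimage

def hctfs_features(rgb_tile):
    """Build the 9-dimensional per-pixel feature vector: RGB intensities,
    two Gaussian smoothings of the red channel, the Sobel magnitude, two
    Gaussian smoothings of the Sobel magnitude, and its median filter.
    `rgb_tile` is an (H, W, 3) float array."""
    r, g, b = rgb_tile[..., 0], rgb_tile[..., 1], rgb_tile[..., 2]
    sobel = np.hypot(ndimage.sobel(r, axis=0), ndimage.sobel(r, axis=1))
    feats = [
        r, g, b,                                    # raw RGB intensities
        ndimage.gaussian_filter(r, sigma=2),        # Gauss(2)
        ndimage.gaussian_filter(r, sigma=5),        # Gauss(5)
        sobel,                                      # local discontinuity
        ndimage.gaussian_filter(sobel, sigma=2),    # Gauss(2) of Sobel
        ndimage.gaussian_filter(sobel, sigma=5),    # Gauss(5) of Sobel
        ndimage.median_filter(sobel, size=10),      # Med(10) of Sobel
    ]
    return np.stack(feats, axis=-1)                 # shape (H, W, 9)
```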

**Figure 2.** Handcrafted Texture Feature Segmentation algorithm. The input H&E stained image is divided into tiles, and each tile is processed separately. The Red (R), Green (G), and Blue (B) image channels are used as the first features. The Sobel filter and Gaussian smoothing with standard deviations of 2 and 5 pixels (Gauss(2) and Gauss(5)) are applied to the Red channel. The output of the Sobel filter is used to calculate three further features: the Gaussian of the Sobel filter with standard deviations of 2 and 5 pixels (Gauss(2) of Sobel and Gauss(5) of Sobel) and the median of the Sobel filter over a neighborhood of size 10 (Med(10) of Sobel). These features are used for image segmentation based on per-pixel classification.

The features obtained from the partially annotated areas of the image are then used to train a Gaussian Naive Bayes classifier. Studies of this classifier can be found in [20,21]. The *scikit-learn* implementation [22] was used for our experiments. The annotations distinguish the three following classes: background, intralobular areas, and extralobular areas. The classifier was pre-trained on a general dataset and then used for per-pixel segmentation. Before each use, it is additionally trained on the target image data using the partial annotations available for that image.
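The pre-train-then-refine scheme can be sketched with the scikit-learn classifier named above. The incremental `partial_fit` calls and the synthetic stand-in data are our assumptions about how the two training phases combine; this is a sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
CLASSES = np.array([0, 1, 2])          # background, intralobular, extralobular

# Stand-ins for the 9-D feature vectors from annotated regions (synthetic here).
general_X = rng.normal(loc=CLASSES.repeat(100)[:, None], size=(300, 9))
general_y = CLASSES.repeat(100)

# Phase 1: pre-train on the general dataset of partially annotated areas.
clf = GaussianNB()
clf.partial_fit(general_X, general_y, classes=CLASSES)

# Phase 2: before segmenting a particular WSS, refine the classifier with
# that image's own partial annotations via incremental learning.
image_X = rng.normal(loc=CLASSES.repeat(20)[:, None], size=(60, 9))
image_y = CLASSES.repeat(20)
clf.partial_fit(image_X, image_y)

# Per-pixel segmentation: flatten (H, W, 9) features, predict, reshape.
pixels = rng.normal(size=(64 * 64, 9))
labels = clf.predict(pixels).reshape(64, 64)
```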

**Figure 3.** Features used for per-pixel Handcrafted Texture Feature Segmentation (HCTFS). In each subfigure, the intralobular, extralobular, and empty areas are on the left, in the middle (the vertical structure), and on the right, respectively. The Red, Green, and Blue image channels are in the first row. The Gaussian smoothings with standard deviations of 2 and 5 pixels, together with the Sobel filter, are in the second row. The Gaussians of the Sobel filter with standard deviations of 2 and 5 pixels are in the third row. The last feature in the figure is the median of the Sobel filter over a neighborhood of size 10.

### *2.6. Fully-Convolutional Neural Network*

The second tested method, inspired by [23–25], is built upon a feed-forward fully-convolutional neural network (CNN) with an encoder–decoder structure. Based on our previous research [26], we believe that such a structure is well suited for semantic segmentation tasks. First, the encoder compresses the raw image pixels on the input into a feature vector representation. Second, based on the feature vector, the decoder produces output maps of the same size as the input. One map is produced for each class, i.e., our network produces three maps in total.

Our architecture is based on U-Net [24]; however, we have made a few minor changes. First, our architecture also utilizes skip connections between corresponding layers of the encoder and decoder, but unlike the skip connections in the original implementation of U-Net, ours are implemented as element-wise addition. Second, due to the relatively small amount of training data, we employ a much smaller architecture to prevent overfitting. To be more specific, our architecture, called UNet-Mini, uses only 128k parameters, whereas the original implementation of U-Net uses over 17M parameters. Our encoder and decoder are each composed of only four (de)convolutional layers with 16, 32, 64, and 64 kernels, respectively, with kernel size *ks* = 3 × 3 and stride *s* = 1.

Apart from these differences, our architecture follows a standard setup of (de)convolution followed by batch normalization and the ReLU activation. The four deconvolutions in the decoder are followed by a convolution with kernel size *ks* = 1 × 1 and stride *s* = 1. This layer performs the classification task and therefore utilizes the classical Softmax activation function. A detailed description of the architecture can be found in Figure 4.
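The reported parameter budget can be sanity-checked with a quick count. The decoder's exact layer widths are not specified in the text, so the mirrored layout below (grayscale input, decoder widths 64, 32, 16, 16, batch-normalization scale and shift included) is our assumption; it lands close to the stated ~128k figure.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k×k (de)convolution."""
    return c_in * c_out * k * k + c_out

def bn_params(c):
    """Trainable scale and shift of batch normalization."""
    return 2 * c

enc = [16, 32, 64, 64]   # encoder kernel counts, as stated in the text
dec = [64, 32, 16, 16]   # assumed mirror of the encoder widths

total, c_in = 0, 1       # grayscale input
for c in enc + dec:
    total += conv_params(c_in, c) + bn_params(c)
    c_in = c
total += conv_params(c_in, 3, k=1)   # final 1×1 classification conv, 3 classes

print(total)             # → 123219, i.e., ≈ 123k trainable parameters
```

This is roughly two orders of magnitude below the >17M parameters of the original U-Net, which supports the overfitting argument above.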

**Figure 4.** Structure of the UNet-Mini architecture. The encoder is composed of four convolutional layers, each followed by batch normalization (BN) and the ReLU activation function. The decoder mirrors this structure. *N* in the last convolutional layer convF of the decoder represents the number of classes (i.e., 3). convF is followed by the Softmax activation.

The neural network is implemented and trained in Python using Chainer deep learning framework [27,28]. Experimental settings and results can be found in Section 3.2.

## **3. Experiments and Results**

### *3.1. Handcrafted Texture Feature Segmentation*

To train the first-stage classifier in Handcrafted Texture Feature Segmentation, we used a dataset containing 60 annotated areas from 8 WSS. This pre-trained classifier, with a small additional annotation for each image, was then used to produce segmentations of 33 WSS for the second, CNN-based stage. The first-stage segmentation output can be seen in Figure 5.

**Figure 5.** Output of the handcrafted texture feature-based segmentation. The background class is in dark purple, the intralobular area is represented by teal color, and the extralobular area is in yellow.

### *3.2. Semantic Segmentation via CNN*

Generally, a huge amount of data is necessary for network training. For this initial experiment, we used only 33 WSS (with an average resolution of approximately 3000 × 2000 pixels) without any original labels. The annotations resulting from the HCTFS of the individual scans were instead utilized as the labels. We believe our network should handle occasional mislabels of the HCTFS, learn the correct structure of each class, and outperform the first method.

The data were converted to gray-scale and split into three subsets: a training (25 scans), development (4 scans), and testing (4 scans) set. Considering the size of the scans, we decided to cut each of them into crops of 224 × 224 pixels with a 100-pixel overlap. Furthermore, to produce more training data, we resized each scan to half of the original resolution and repeated the cutting process; this halving was performed twice in total. In the last step, we resized the original scan to 224 × 224 pixels. Thanks to this process, we obtained 11,384 training images, 2739 development images, and 2425 testing images. This still represents quite a small dataset for neural network training. To overcome this problem and improve the network's robustness, we also used data augmentation. Specifically, each image crop was modified with a random number of the following augmentations: horizontal flip, vertical flip, white noise, and Gaussian blur. This process was repeated three times for each image crop, leading to 45,536 training images in total. All pixel values were normalized to the range from 0 to 1.
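The augmentation step can be sketched as follows. This is a minimal illustration: the noise amplitude and blur width are our assumptions, and for segmentation training the flips would also have to be applied to the label map, while noise and blur apply only to the image.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(42)

def augment(crop):
    """Apply a random subset of the four augmentations listed above
    to a (224, 224) grayscale crop with values in [0, 1]."""
    out = crop
    if rng.random() < 0.5:
        out = np.flipud(out)                         # vertical flip
    if rng.random() < 0.5:
        out = np.fliplr(out)                         # horizontal flip
    if rng.random() < 0.5:
        out = out + rng.normal(0, 0.02, out.shape)   # white noise (σ assumed)
    if rng.random() < 0.5:
        out = ndimage.gaussian_filter(out, sigma=1)  # Gaussian blur (σ assumed)
    return np.clip(out, 0.0, 1.0)                    # keep values in [0, 1]
```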

UNet-Mini is trained for the semantic segmentation of an input image into one of the three following classes: intralobular, extralobular, and background. For updating UNet-Mini's parameters, we tested the Adam optimization method [29] with its standard parameter setup and the standard SGD optimizer with an initial learning rate *l* = 0.01 and a step decay *d* = 0.1 every 10 epochs. In both cases, we use the cross-entropy loss for the network training and a mini-batch size of 32. The training is stopped after 35 epochs. We used 1 NVidia 1080Ti GPU for training.
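The SGD schedule above is a standard step decay, which can be written as a one-line function (the function name is ours):

```python
def sgd_learning_rate(epoch, base_lr=0.01, decay=0.1, step=10):
    """Step decay: the learning rate is multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Over the 35 training epochs this yields learning rates of 0.01, 0.001, 1e-4, and 1e-5 for the four successive 10-epoch blocks.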

Both optimizers reach comparable results, with the best recognition accuracy of 92.35% on the development set. It is necessary to note that this accuracy is calculated by comparing the network outputs with the results from HCTFS. As already mentioned, the HCTFS results contain some mislabels; therefore, our goal was not to replicate the original results exactly, but to filter out these mistakes and segment the scans more precisely.

To compare both methods objectively, we manually labeled additional ground-truth patches on the original images. The results can be found in Table 1. UNet-Mini outperforms the HCTFS method by more than 4% on both sets. This means that UNet-Mini generalizes better than the original method despite the incorrect data in the training set. Moreover, UNet-Mini does not need any additional image-specific labels.
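Assuming the recognition rate is measured as per-pixel agreement with the reference labeling, it reduces to the following computation (the function name is ours):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the reference class."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())
```

For example, a 2 × 2 prediction that disagrees with the ground truth on one pixel scores 0.75.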

**Table 1.** Comparison of classification recognition rates. Bold font indicates best results.


Furthermore, we provide examples of qualitative results comparing both methods. Figures 6 and 7 show the results, where the UNet-Mini corrected or partially corrected the original mistakes in labels. On the other hand, an example of obvious mislabels made by the network can be found in Figure 8. Finally, Figure 9 provides an example of equally good results from both tested methods.

**Figure 6.** Example of semantic segmentation, where the neural network reached better results. The original image (on the **left**), the result from the HCTFS (in the **middle**), and the results from the neural network (on the **right**).

**Figure 7.** Example of semantic segmentation, where the neural network reached better results. The original image (on the **left**), the result from the HCTFS (in the **middle**), and the results from the neural network (on the **right**).

**Figure 8.** Example of semantic segmentation, where the neural network reached worse results. The original image (on the **left**), the result from the HCTFS (in the **middle**), and the results from the neural network (on the **right**).

**Figure 9.** Example of semantic segmentation, where the neural network reached comparable results. The original image (on the **left**), the result from the HCTFS (in the **middle**), and the results from the neural network (on the **right**).

## **4. Discussion**

The scaffold function is directly linked to its structure [30]. Current approaches to analyzing scaffold quality include the quantification of the residual DNA content, the amount (or ratio) of structural proteins such as collagen I, collagen IV, laminin, fibronectin, or elastin, and the presence of glycosaminoglycans [31,32].

The morphological assessment consists of a subjective evaluation of scaffold structure preservation, which is supposed to be as close to the native liver structure as possible. H&E staining represents a fast and simple histological method to visualize the scaffold structure as well as cell removal from samples. The typical structural unit of the liver is the lobule, ideally a hexagonally shaped structure with intralobular space occupied by sinusoidal vessels surrounded by hepatocytes. The scaffold consists of the extracellular matrix of the vessel walls forming conduits, empty inter-sinusoidal space left after the removal of hepatocytes, and interlobular septa formed by thick protein fibers.

The presence and distribution of individual structural proteins is usually confirmed by immunohistochemistry, a more time- and cost-consuming method. The ultrastructure can be visualized by scanning electron microscopy; however, the cost and extended sample processing time make this powerful technique not always available. Scaffold images obtained by any of these methods have the potential to be quantitatively analyzed. However, for the development of a new quantitative method, we selected H&E stained images. They can be produced quickly and easily while still carrying the information necessary to evaluate the structural integrity of the scaffold.

The segmentation of liver scaffolds from H&E stained images based on handcrafted texture features works well in the interactive mode, where an additional partial annotation of the particular image is given. Without this additional per-image classifier training, the segmentation algorithm provides unstable results, making it dependent on the manual annotation of each examined image. Considering the very promising results reached in our initial experiments, we would like to further investigate the use of semantic segmentation via neural networks. In future research, we would like to extend our training set with additional slides and perform extensive testing of other neural network architectures.

## **5. Conclusions**

Whole slide segmentation can successfully serve as the first step in decellularized liver analysis. Due to the lack of completely annotated WSS, we designed a two-stage solution. The first stage is a segmentation based on hand-crafted features trained on partially annotated WSS. The second stage uses a CNN with a U-Net scheme. The two-stage approach has proved useful in compensating for the lack of training data; it reaches a semantic segmentation accuracy of over 90% and outperforms the handcrafted features by more than 4%. In future work, we would first like to enrich our dataset. Images obtained with different scanners are especially desirable, because such data can give the classifier greater robustness and better generalization capacity. Second, with more data, we believe it would be possible to utilize more complex neural network architectures. We also plan to use the suggested algorithm in an open-source application for scaffold tissue evaluation.

**Author Contributions:** M.J., V.M., and V.L. conceived of the presented idea. M.J. prepared the Handcrafted Texture Features Segmentation method, wrote important parts of the software and manuscript, and also annotated the data. Vladimira Moulisova is a key person from the tissue engineering field. She collected the data, performed annotations, and wrote the introduction and data preparation paragraphs. I.G. prepared the experiments with convolutional neural networks—the software and the manuscript. M.Z. provided critical reading and suggested important changes in the manuscript. C.S., L.C., R.P., J.R., J.A., L.B., J.D., U.D., M.Z., and V.L. contributed to the data preparation. All authors provided critical feedback and helped shape the research, analysis, and manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Ministry of Education of the Czech Republic, project No. LTARF18017, and by Charles University Research Centre program UNCE/MED/006 "University Center of Clinical and Experimental Liver Surgery" and Ministry of Education project ITI CZ.02.1.01/0.0/0.0/17\_048/0007280: Application of modern technologies in medicine and industry. The research was also supported by the project LO 1506 of the Czech Ministry of Education, Youth and Sports. The authors appreciate the access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the program "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042).

**Acknowledgments:** All tissue samples were taken with ethical approval from the Ministry of Education of the Czech Republic (no. MSMT-4428/2018-2).

**Conflicts of Interest:** The authors declare no conflict of interest.

## **Abbreviations**

The following abbreviations are used in this manuscript:


## **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**Imaging Tremor Quantification for Neurological Disease Diagnosis**

**Yuichi Mitsui 1, Thi Thi Zin 1,\*, Nobuyuki Ishii 2 and Hitoshi Mochizuki 2**


Received: 16 October 2020; Accepted: 20 November 2020; Published: 22 November 2020

**Abstract:** In this paper, we introduce a simple method based on image analysis and deep learning that can be used for the objective assessment and measurement of tremors. A tremor is a neurological disorder that causes involuntary and rhythmic movements in one or more parts of the human body. There are many types of tremors, depending on their amplitude and frequency. Appropriate treatment is only possible when there is an accurate diagnosis; thus, a need exists for a technique to analyze tremors. In this paper, we propose a hybrid approach using imaging technology and machine learning techniques for the quantification and extraction of the parameters associated with tremors. These extracted parameters are used to classify the tremor for subsequent identification of the disease. In particular, we focus on essential tremor and cerebellar disorders by monitoring the finger–nose–finger test. First, test results obtained from both patients and healthy individuals were analyzed using image processing techniques. Next, the data were grouped in order to determine classes of typical responses. A machine learning method using a support vector machine was used to perform unsupervised clustering. Experimental results showed the highest internal evaluation for a distribution into three clusters, which could be used to differentiate the responses of healthy subjects, patients with essential tremor, and patients with cerebellar disorders.

**Keywords:** tremor; essential tremor; ataxia; finger–nose–finger test

## **1. Introduction**

A tremor is one of the most common involuntary movements seen in neurological disorders. It is characterized as a rhythmic, involuntary oscillation of a body part caused by muscle innervations that produce repetitive contractions [1–4]. Various types of tremors occur, depending on their causes. In general, tremors can be divided into two types: resting and action tremors. Action tremors can be further classified into postural tremors, kinetic tremors, task-specific tremors and intention tremors [5,6]. In clinical practice, a tremor is most commonly classified by its appearance and cause or origin. There are actually more than 20 types of tremors. Among them, the most common cause of resting tremors is Parkinson's disease (*PD*). The most common causes of postural and kinetic tremors are essential tremors (*ET*) and cerebellar disorders (*CD*). It is easy to distinguish *PD* resting tremors from other tremors because the trembling occurs at rest and weakens when the target muscles contract [4,5,7]. On the other hand, there are various causes of action tremor, and it is not easy to identify the cause.

The most common causes of action tremors are essential tremors (*ET*) and cerebellar disorders (*CD*). *ET* tremor behaves regularly, whereas *CD* tremor behaves irregularly and sometimes includes intention tremor [5,6]. Both *ET* and *CD* patients share several common features, such as increased tremor when mentally stressed and restricted fine movements. In clinical practice, clinicians try to distinguish these two tremors, *ET* and *CD*, by neurological examinations such as the FNF (finger–nose–finger) test [1,6,7]. However, it is not easy to detect subtle irregularities of finger movement or to observe where the finger tremor becomes stronger during the FNF test by ordinary observation. For this reason, distinguishing between *ET* and *CD* can be difficult even for a skilled neurologist. Therefore, we propose a non-contact method of distinguishing *ET* from *CD* featuring image processing technology, including measurement of tremor severity to confirm its effectiveness.

The rest of this paper is organized as follows: In Section 2, we review the literature relating to our research. In Section 3, the materials and methods are proposed. The method of disease diagnosis and analysis is described in Section 4. Then some experimental results are described in Section 5 using datasets collected by two neurologists who specialize in patients with tremors. In Section 6, we present discussions and plans for future research. Finally, we conclude the paper by giving remarks in Section 7.

## **2. Related Works**

*ET* is a disease in which a tremor is the only symptom, and it is not life-threatening. However, it is clinically important to make an early diagnosis, allowing treatment specific to *ET* when available and improving the patient's quality of life [8]. The prevalence of *ET* is approximately 2.5–10% of the population [9]. Although it can develop in any age group, it is mainly seen in the elderly: 4% of people over 40 years old and 5–14% of people over 65 years old have *ET* [10–12]. Although the cause of *ET* is not well understood, speculation exists that a hyperexcited state of the sympathetic nerve is involved because the symptoms increase with stress [13]. *CD* results from causes such as cerebellar infarction, inflammation, demyelination, autoimmunity, trauma, degeneration and tumors. Symptoms of *CD* include cerebellar ataxia, intention tremor and cerebellar sway of the upper limbs [5,7,12]. *ET* is rarely life-threatening, while *CD* can be. However, *ET* often disturbs a patient's quality of life. Therefore, an early diagnosis differentiating *ET* from *CD* is important. In addition, even for those who are highly skilled, it may be difficult to differentiate tremors when the patient first presents with the symptom.

Currently, the diagnosis of a patient's disease using the characteristics of tremor is performed subjectively based on the experience and skills of specialists. However, there are various problems with this. Doctors who are not specialists, such as family doctors and on-duty doctors, do not have a means of quantitatively evaluating a tremor, incurring the risk of misdiagnosis. Such quantification is important in determining the proper treatment.

Various tremor rating scales have been used to evaluate symptoms [14–16], but these are qualitative and subjective, and errors may occur depending on the person who assesses them [17,18]. Therefore, how to more accurately quantify tremor characteristics has become an urgent subject of research. For example, such research has included the acquisition of tremor signals using devices such as multipolar *EMGs*, electromagnetic tracking devices, accelerometers and gyroscopes, using the resulting data for evaluation and diagnosis [19]. However, since these methods require a large-scale dedicated device, it is unrealistic to use them in an examination room. In addition, these methods often require attaching a sensor or the like to the patient, resulting in different symptoms due to stress or the burden imposed during the examination. Recent tremor-related research has been done using magnetic resonance imaging (MRI) and a neurophysiological assessment [20]. Results indicate a significant association between severe tremors and malfunctions in specific areas of the brain. Moreover, the literature includes applicable developments in image processing in the framework of deep learning [21,22].

The severity assessment of *ET* or *CD* is determined by expert opinion and is likely to be subjective in nature. Several investigators have tried to quantify these symptoms. Analysis of the FNF test, a classic neurological examination method, has been reported using an accelerometer or inertial sensor. Using inertial sensors, changes in spatiotemporal parameters were related to the disability level in patients with multiple sclerosis [23] or cerebellar ataxia [24]. Using a three-dimensional motion capture system, analyses of body movements during the FNF test could discriminate between poststroke patients with mild and moderate upper limb impairments [25]. These studies have been successful in quantifying the severity of symptoms. However, no research has ever tried to capture and distinguish between the characteristics of *ET* and *CD*. Furthermore, the fine movements of the fingertips during the FNF test have never been analyzed by monitoring them with a video over time.

#### **3. Materials and Proposed Imaging Method for Tremor Quantification**

In this section, we describe the architecture of our proposed system, in which tremors are characterized based on visual data collected while performing the *FNF* (finger–nose–finger) test. Specifically, these visual data are collected using a smartphone, as in an actual diagnostic setting, for distinguishing *ET* from *CD*. The purpose of this analysis is to automatically and objectively diagnose the disease and measure its severity. In the *FNF* test, patients move their index finger back and forth between their nose and the examiner's finger to see whether tremors occur. As a result, non-specialist doctors such as family doctors or on-duty doctors can avoid misdiagnosis, detect life-threatening problems, and avert *MRI* imaging and other unnecessary medical costs. In addition, without the need for sensors, no burden is placed on the patient, and the *FNF* test analysis can be performed easily. The system is composed of the following four components: the dataset collection system, image preprocessing, feature extraction, and disease diagnosis and analysis.

## *3.1. Subjects and Design of Data Collection System*

Data collection for the *FNF* test was conducted in an examination room at the Miyazaki University Hospital for *ET* patients (*N* = 10; female, *n* = 4; age, 71.5 ± 8.1, mean ± *SD*) and *CD* patients (*N* = 18; female, *n* = 9; age, 68.1 ± 7.5), with images captured in a side view. Figure 1 illustrates the process of the data collection system. The smartphone camera is fixed at a distance of about 1–1.5 m from the doctor and the patient. The recorded video has a resolution of 480 × 640 pixels, and the frame rate is 30 frames per second (fps). The doctor places his index finger at various locations in front of the patient. The patient touches his/her index finger to the doctor's index finger and then touches his/her index finger to his/her own nose. This is repeated several times, with the doctor moving the target finger each time.

**Figure 1.** The illustration of the data collection system.

Two neurologists made diagnoses and evaluated the severity of the patients' tremors based on the essential tremor rating assessment scale [15] or the scale for assessing and rating ataxia [14]. According to these scales, patients were classified into two groups: mild (*ET*, upper limb tremor < 1 cm; *CD*, *FNF* test tremor < 2 cm) or severe (*ET*, upper limb tremor ≥ 1 cm; *CD*, *FNF* test tremor ≥ 2 cm). The video data were recorded by two neurologists using smartphones. The data for healthy subjects (*N* = 8; female, *n* = *X*; age, *X* ± *X*) include data recorded with the cooperation of members of the laboratory, featuring a recorded video of the diagnostic test performed on two ordinary healthy people at the Miyazaki University Hospital. The *FNF* test was performed by winding red or green tape around the subject's finger. This protocol was approved by the Ethics Committee of the University of Miyazaki, with a waiver of written informed consent obtained from all participants.

#### *3.2. Image Preprocessing Component*

In this component, image preprocessing is performed in preparation for further analysis. In the first step, the patient's finger area must be extracted from the video image in each frame. Figure 2 shows the algorithm for extracting the finger area. First, a background image is detected as a noise source, and then the region of interest is set by removing the noise. Subsequent processes include inputting an image for extracting the finger area, threshold processing, noise processing and finger area estimation, finally obtaining the coordinates of the finger area.

**Figure 2.** Finger area extraction algorithm.

Due to the location for recording video in the examination room, noise can occur for many reasons when extracting the finger area. In the presence of noise, we perform a noise removal process using background modeling, with the initial background taken as an image that does not include the target object. By converting the RGB image to HSV, we apply a defined threshold to the hue information in order to obtain the finger object. The input image and converted HSV image are shown in Figure 3a,b, respectively [16]. The Hue information representing the hue of the HSV image, the Saturation information representing the saturation, and the Value information representing the brightness are thresholded, and the finger is obtained by taking the logical product with the region of interest.
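The HSV thresholding step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the per-pixel `colorsys` conversion, and the hue/saturation/value threshold values (chosen here for a red-taped finger) are our own assumptions.

```python
import colorsys

def finger_mask(rgb_pixels, hue_min=0.95, sat_min=0.4, val_min=0.2):
    """Classify pixels as finger (True) or background (False) by
    thresholding the H, S and V channels.  `rgb_pixels` is a list of
    (r, g, b) tuples in [0, 255]; threshold values are illustrative."""
    mask = []
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        red_hue = h >= hue_min or h <= 0.05  # red hue wraps around 0
        mask.append(red_hue and s >= sat_min and v >= val_min)
    return mask
```

The resulting boolean mask would then be combined (logical product) with the region-of-interest mask, as described above.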

We also performed noise processing by calculating the aspect ratio of each area resulting from the labeling process. Since the finger area has a shape close to a square, threshold values are applied to the calculated aspect ratio to remove vertically or horizontally elongated areas of similar size to the finger area that could not be removed by label-based noise processing.
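The aspect-ratio filter can be sketched as follows; the function name and the `low`/`high` bounds are illustrative assumptions, since the paper does not state its threshold values.

```python
def filter_by_aspect_ratio(boxes, low=0.5, high=2.0):
    """Keep only bounding boxes whose width/height ratio is roughly
    square, since the finger region is close to square.  `boxes` are
    (x, y, w, h) tuples; elongated regions are discarded."""
    kept = []
    for (x, y, w, h) in boxes:
        if h == 0:
            continue  # degenerate region, treat as noise
        if low <= w / h <= high:
            kept.append((x, y, w, h))
    return kept
```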

**Figure 3.** (**a**) Input image. (**b**) The converted HSV image for the input image.

## 3.2.1. Estimation of Finger Area when Multiple Labels Exist

Even if noise processing is performed, not all noise can be removed. In addition, noise processing sometimes removes the finger area itself. In such a case, the finger area is estimated as follows: if multiple labels exist, as in frame *t*, select the label closest to the coordinates of the finger area (surrounded by the red circle) detected in the previous frame. The coordinates of the finger area are obtained by calculating the center of gravity of the labeled object.
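The nearest-label selection can be sketched as follows, assuming each candidate label is already reduced to its centroid (center of gravity); the function name is our own.

```python
import math

def select_nearest_label(centroids, prev_center):
    """Among candidate label centroids (cx, cy), return the one closest
    to the finger position detected in the previous frame."""
    return min(centroids,
               key=lambda c: math.hypot(c[0] - prev_center[0],
                                        c[1] - prev_center[1]))
```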

#### 3.2.2. Estimation of Finger Area when No Label Remains

If the finger area is mistakenly removed during noise processing, all labels can be lost. In that case, a smoothing process is performed using finger area coordinates from the preceding and subsequent frames to estimate the finger area. For example, as shown in Figure 4, when unlabeled frames occur in two successive frames, the differences in the *X* and *Y* coordinates between the preceding and subsequent labeled frames are calculated, and coordinate values are interpolated evenly across the unlabeled frames. The finger area is estimated by substituting these interpolated values.


**Figure 4.** Estimation of finger area when no label remains.
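The even interpolation across unlabeled frames can be sketched as a simple linear interpolation; the function name is our own.

```python
def interpolate_missing(prev_xy, next_xy, n_missing):
    """Evenly interpolate finger coordinates for n_missing consecutive
    unlabeled frames between the last labeled frame (prev_xy) and the
    next labeled frame (next_xy)."""
    (x0, y0), (x1, y1) = prev_xy, next_xy
    step = n_missing + 1  # number of intervals between the two labeled frames
    return [(x0 + (x1 - x0) * k / step, y0 + (y1 - y0) * k / step)
            for k in range(1, n_missing + 1)]
```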

#### *3.3. Feature Extraction Process*

Now, we present the feature extraction process for the detected finger areas in an *FNF* test. In this process, we extract the following six measures.

#### 3.3.1. Root-Mean-Square Deviation (RMSD)

This measure quantifies the vertical distance of the patient's up and down finger movements. In Figure 5, the finger trajectory is shown as a graph in the rectangular coordinate plane. We fit a quadratic function to this trajectory using the least squares method. Since the quadratic function represents a parabolic curve, we can estimate the vertical distance that the finger moves up and down by computing the root-mean-square deviation (RMSD). The formula for calculating RMSD is shown below:

$$RMSD = \sqrt{\frac{\sum\_{i=1}^{n} \left(y\_i - \widehat{y}\_i\right)^2}{n}},\tag{1}$$

where *n* is the number of plotted data points, *y*<sub>*i*</sub> is the plotted value, and *ŷ*<sub>*i*</sub> is the value of the approximated quadratic function.

**Figure 5.** Plot of finger trajectory and approximate curve.
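Equation (1) with the quadratic fit can be sketched as follows; the function name is our own, and we use NumPy's least-squares polynomial fit as one reasonable implementation choice.

```python
import numpy as np

def rmsd_from_quadratic_fit(xs, ys):
    """Fit a quadratic (parabola) to the finger trajectory by least
    squares and return the root-mean-square deviation of the points
    from the fitted curve, as in Equation (1)."""
    coeffs = np.polyfit(xs, ys, 2)   # quadratic least-squares fit
    y_hat = np.polyval(coeffs, xs)   # fitted (approximated) values
    return float(np.sqrt(np.mean((np.asarray(ys) - y_hat) ** 2)))
```

A trajectory lying exactly on a parabola yields an RMSD of (numerically) zero; the further the finger deviates from a smooth arc, the larger the value.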

#### 3.3.2. Dispersion of Acceleration

As the second measure in consideration, the dispersion of acceleration digitizes variations in speed. This measure was selected because healthy people have constant finger movements, whereas patients with tremor symptoms have varying rates of change. Therefore, acceleration is calculated for each frame using the coordinate data of the finger region. Next, by calculating the variance of the calculated acceleration, we can quantify the dispersion of acceleration. The formula for calculating this variance is shown below.

$$\text{Variance} = \frac{1}{n} \sum\_{i=1}^{n} \left( \mathbf{x}\_{i} - \overline{\mathbf{x}} \right)^{2},\tag{2}$$

where *n* is the number of frames, *x*<sub>*i*</sub> is the acceleration in each frame, and *x̄* is the average value of the acceleration. Using the dispersion of acceleration, which digitizes variation in speed change, we can quantify how the patient's finger decelerates near the examiner's finger.
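Equation (2) can be sketched as follows; the function name is our own, and we assume a constant frame interval so that frame-to-frame differences serve as velocity and acceleration.

```python
import numpy as np

def acceleration_variance(positions):
    """Variance of per-frame acceleration (Equation (2)).  Velocity is
    the frame-to-frame displacement and acceleration its difference,
    assuming a constant frame interval."""
    p = np.asarray(positions, dtype=float)
    velocity = np.diff(p)             # displacement per frame
    acceleration = np.diff(velocity)  # change of velocity per frame
    return float(np.var(acceleration))
```

A finger moving at constant speed gives zero variance; a tremulous, stop-and-go movement gives a large one.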

#### 3.3.3. Histogram Feature

In the case of tremor patients, particularly *ET* patients, their fingers often slow down near the examiner's finger when performing the *FNF* test. Therefore, the following equation is used to determine whether the finger is moving back and forth with a constant rhythm.

$$\text{Histogram} = \frac{(h\_{\text{max}} - h\_{\text{med}})}{n},\tag{3}$$

where *n* is the number of frames, *h<sub>max</sub>* is the maximum value of the histogram, and *h<sub>med</sub>* is the median value of the histogram. The simple difference between the maximum and median values would depend on the round-trip time (number of frames), which differs between videos, so normalization is performed by dividing by the number of round-trip frames. In the histogram method, we first construct a histogram by taking the X coordinate of the finger on the image as the vertical axis and the frame number as the horizontal axis. We then divide the range between the maximum frequency of the X coordinate and the minimum frequency of the X coordinate into four equal parts [12]. Finger movement can be considered unstable if angle analysis indicates that the finger moves up and down an excessive number of times. Moreover, the ratio between the time required for the initial movement of the patient's finger from the examiner's finger to the patient's nose and the time required for the return trip is larger for tremor patients. In this case, the average moving distance is used to digitize how much the finger is shaking.
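Equation (3) can be sketched as follows. The function name and the exact binning are our reading of the paper (we bin the X coordinates into four equal parts as described in the text).

```python
import numpy as np

def histogram_feature(x_coords, n_bins=4):
    """Histogram measure of Equation (3): build a histogram of the
    finger's X coordinates, then normalize the gap between the largest
    and median bin counts by the number of frames."""
    counts, _ = np.histogram(x_coords, bins=n_bins)
    h_max = counts.max()
    h_med = np.median(counts)
    return float((h_max - h_med) / len(x_coords))
```

A finger dwelling near one endpoint (e.g. slowing down near the examiner's finger) piles frames into one bin, inflating the feature; constant back-and-forth motion keeps the bins balanced and the feature small.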

#### 3.3.4. Angular Feature

In the *FNF* test, the fingers move horizontally, but in patients with tremor symptoms, fine up and down vibrations occur. In order to detect this, the angle through which the finger moves between frames is calculated using the following equation:

$$\theta = \tan^{-1} \frac{y\_n - y\_{n-1}}{x\_n - x\_{n-1}},\tag{4}$$

The range of θ is −180° ≤ θ ≤ 180°. Here, (*x*<sub>*n*</sub>, *y*<sub>*n*</sub>) is the coordinate data for the finger region of the *n*th frame, and (*x*<sub>*n*−1</sub>, *y*<sub>*n*−1</sub>) is the coordinate data for the (*n* − 1)th frame. The total number of frames satisfying −150° < θ < −30° or 30° < θ < 150° is determined using the angle obtained from the above equation; this gives the number of times the finger swings up and down. An example of the finger movement angle is shown in Figure 6.

**Figure 6.** Angle of finger movement.
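Equation (4) and the angle counting can be sketched as follows; the function name is our own, and we use `atan2` to obtain the full (−180°, 180°] range stated above.

```python
import math

def count_vertical_swings(coords, low=30.0, high=150.0):
    """Count inter-frame movements whose angle (Equation (4)) falls in
    (30°, 150°) or (−150°, −30°), i.e. predominantly vertical motion
    indicating up-and-down finger vibration."""
    count = 0
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        theta = math.degrees(math.atan2(y1 - y0, x1 - x0))
        if low < abs(theta) < high:
            count += 1
    return count
```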


#### 3.3.5. Measure of Round-Trip Time Ratio


In order to detect abnormality in the *FNF* test, the following equation is used to determine the ratio between the time required for initial and return paths in finger movement:

$$\text{Time Ratio} = \frac{f\_0}{f\_r},\tag{5}$$

where *f*<sub>0</sub> is the average number of frames on the initial path and *f*<sub>*r*</sub> is the average number of frames on the return path. Here, a threshold value is set for the amount of finger movement in each frame, as shown in Figure 7. The frames in which finger movement exceeds the threshold value are extracted, and the average number of frames for the initial and the return path is obtained. Then, the ratio between the time required for initial and return trips is calculated.

**Figure 7.** Amount of finger movement in each frame.
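Equation (5) can be sketched as follows, assuming the per-trip frame counts have already been extracted by the movement threshold described above; the function name is our own.

```python
def round_trip_ratio(f_initial, f_return):
    """Equation (5): ratio of the average number of frames spent on the
    initial path (examiner's finger to nose) to the return path.
    Inputs are lists of per-trip frame counts."""
    f0 = sum(f_initial) / len(f_initial)
    fr = sum(f_return) / len(f_return)
    return f0 / fr
```

A ratio well above 1 means the patient takes disproportionately long on the initial approach, a pattern the text associates with tremor patients.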

#### 3.3.6. Measure of Average Travel Distance

The final feature to be extracted is the average moving distance, which digitizes how much the finger shakes when first touching the examiner's finger. Tremor patients, especially *ET* patients, shake their fingers when touching the examiner's finger in the *FNF* test. Therefore, the initial and return frames extracted during the calculation of the round-trip time ratio are reused, and the average moving distance of the fingers during those periods is calculated using the following formula.

$$\overline{d} = \frac{1}{(m\_2 - m\_1) + (m\_4 - m\_3) + 2} \left( \sum\_{i=m\_1}^{m\_2} d\_i + \sum\_{i=m\_3}^{m\_4} d\_i \right) \tag{6}$$

where *d̄* represents the average moving distance, *d*<sub>*i*</sub> is the moving distance in frame *i*, *m*<sub>1</sub> is the first frame of the first touch, *m*<sub>2</sub> is the last frame of the first touch, *m*<sub>3</sub> is the first frame of the second touch, and *m*<sub>4</sub> is the last frame of the second touch.
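Equation (6) can be sketched as follows; the function name is our own, and the frame indices are taken as inclusive, matching the sums and the (*m*₂ − *m*₁) + (*m*₄ − *m*₃) + 2 frame count in the equation.

```python
def average_travel_distance(d, m1, m2, m3, m4):
    """Equation (6): mean per-frame moving distance over the two touch
    intervals [m1, m2] and [m3, m4] (inclusive frame indices into d)."""
    total = sum(d[m1:m2 + 1]) + sum(d[m3:m4 + 1])
    n_frames = (m2 - m1) + (m4 - m3) + 2  # inclusive interval lengths
    return total / n_frames
```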

#### **4. Method of Disease Diagnosis and Analysis**

A classifier is trained using the feature values obtained in Section 3, and the disease is diagnosed by classifying the data with that classifier. In this study, we classify by supervised learning, which learns a classifier that correctly outputs the relationship between data and class using labels provided in advance. The methods used as classifiers are as follows: linear discriminant analysis, logistic regression analysis, support vector machine (*SVM*), and the *k*-nearest neighbor method (*k*-*NN*). Verification of the classifier is performed by *k*-fold cross-validation. We briefly describe these classifiers as follows:
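The *k*-fold cross-validation split can be sketched as follows; the function name and the simple strided fold assignment are our own illustrative choices (libraries such as scikit-learn provide equivalent utilities).

```python
def k_fold_splits(n_samples, k=5):
    """Index splits for k-fold cross-validation: each fold serves once
    as the test set while the remaining folds form the training set."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # strided fold assignment
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for j in indices if j not in test]
        splits.append((train, test))
    return splits
```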

#### *4.1. Linear Discriminant Analysis*

Linear discriminant analysis is a method of finding the straight line that best separates groups in previously provided, group-labeled data; when new data are obtained, they are assigned to a group according to which side of that line they fall on.

#### *4.2. Logistic Regression Analysis*

In medical statistics, logistic regression analysis is one of the statistical methods used in multivariate analysis. In this method, when the objective variable (class) is binary, the probability *P* that the event class occurs is expressed by the equations in (7).

$$\begin{array}{l}P' = \ln\left(\frac{P}{1-P}\right) = b\_0 + b\_1\mathbf{x}\_1 + b\_2\mathbf{x}\_2 + \dots + b\_p\mathbf{x}\_p\\P = \frac{1}{1+e^{-P'}}\end{array} \tag{7}$$

where *b*<sub>0</sub> is a constant, *b*<sub>*p*</sub> is a partial regression coefficient, and *x*<sub>*p*</sub> is a covariate (feature amount).
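Equation (7) can be sketched as follows; the function name is our own, and the coefficients would in practice be estimated from the training data.

```python
import math

def logistic_probability(features, coefficients, intercept):
    """Equation (7): linear predictor P' = b0 + sum(b_i * x_i) mapped
    through the logistic function P = 1 / (1 + exp(-P'))."""
    p_prime = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-p_prime))
```

When the linear predictor is zero the probability is exactly 0.5, the decision boundary between the two classes.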

#### *4.3. Severity Measurement in ET Patients*

Figure 8 shows the algorithm for measuring severity in *ET* patients. Threshold processing is applied to the feature amounts calculated by the histogram analysis, the angle analysis, and the average moving distance proposed in Section 3, and one point is added for each value that is equal to or greater than its threshold. If the total score is less than 2 points, the severity is rated mild; if it is 2 or more, it is rated severe.

**Figure 8.** Severity measurement algorithm for essential tremors (*ET*) patients.
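The scoring rule for *ET* patients can be sketched as follows; the function name is our own, and the threshold values are tuned on training data in the paper, so those used in any call are assumptions.

```python
def et_severity(features, thresholds):
    """Severity score for ET patients: one point per feature (histogram,
    angle, average moving distance) at or above its threshold; fewer
    than 2 points is rated mild, 2 or more severe."""
    score = sum(1 for f, t in zip(features, thresholds) if f >= t)
    return "severe" if score >= 2 else "mild"
```

The *CD* rule of the next subsection is analogous, with *RMSD* as a fourth feature and a cut-off of 3 points.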

#### *4.4. Severity Measurement in CD Patients*

Figure 9 shows the algorithm for measuring severity in *CD* patients. Threshold processing is applied to the feature amounts calculated by *RMSD*, the histogram analysis, the angle analysis, and the average moving distance proposed in Section 3, and one point is added for each value that is above its threshold. If the total score is less than 3 points, the severity is rated mild; if it is 3 points or more, it is rated severe.

**Figure 9.** Algorithm for measuring severity in cerebellar disorders (*CD*) patients.

#### **5. Experimental Results**

This section describes the results of experiments using the proposed method. We conducted an experiment using 8 recordings from healthy subjects, 10 from *ET* patients, and 18 from *CD* patients. For training and testing data separation, we applied the *k*-fold cross-validation technique. In our system, we set *k* = 5; therefore, the dataset is split into five folds. Machine learning is performed in experiments, each with multiple classes, as follows: (1) healthy subjects and tremor patients, (2) *ET* patients and *CD* patients, and (3) healthy subjects, *ET* patients and *CD* patients. After learning is completed, *k*-fold cross-validation is performed. The results are shown in Tables 1–3, respectively.


**Table 1.** Classification results of healthy subjects and tremor patients.

**Table 2.** Classification results of *ET* patients and *CD* patients.

**Table 3.** Classification accuracy of healthy subjects, *ET* patients and *CD* patients.


We have also conducted experiments measuring the tremor severity of *ET* and *CD* patients using the method proposed in Section 4. In these experiments, the threshold was determined using a total of four training data items from the examinations by Doctor A and Doctor B that had a low rate of accuracy in determining the severity of *ET* and *CD* patients. Tables 4 and 5 provide samples of the experiment results.

**Table 4.** Accuracy of severity measurement in *ET* patients.



**Table 5.** Accuracy of severity measurement in *CD* patients.

#### **6. Discussion**

As a result of training the classifier using the proposed feature quantities and *k*-fold cross-validation, in the classification experiment featuring healthy subjects and tremor patients, Table 1 shows the best results using *SVM*, with an accuracy of 86.7%. In all three misdiagnoses, the examining doctors incorrectly assessed the symptoms to be mild and had difficulty in correctly assessing the symptoms even when reviewing the videos.

In the classification of *ET* and *CD* patients, Table 2 shows that the best result was obtained with *SVM*, with an accuracy of 83.6%. Four misdiagnoses occurred, in three of which physicians incorrectly assessed the symptoms to be mild. The data for the single remaining *ET* patient presented *CD*-like characteristics, such as a high *RMSD* value due to severe symptoms, with much up-and-down shaking of the finger and much time for the initial trip from the examiner's finger to the patient's nose. These severe symptoms seem to be the cause of the misclassification.

As shown in Table 3, in the classification experiment featuring healthy subjects, *ET* patients and *CD* patients, the best result was obtained using *SVM*, with an accuracy of 76.1%. The accuracy was slightly lower than in the above two classification experiments. In order to improve the accuracy, new features that better differentiate the classes must be studied, and the dataset should be enlarged.

#### *6.1. About Severity Measurement*

Improving the measurement of severity first involves adapting the threshold to the feature data, calculating the score, and then taking the measurement. As a result, the measurement accuracy for both *ET* and *CD* patients could be improved to 75.0%. In one example, the inaccuracy resulted from the *FNF* test having been repeatedly practiced: the score was low, and the experimental results indicate that the symptoms were actually mild, although the doctor assessed them as severe. In addition, since the purpose of this study is to facilitate measurements in the examination room, the conditions for recording video, such as camera placement, have not yet been optimized. For this reason, some erroneous results were obtained, because the feature amount for the average moving distance of the finger increases when the camera is close and decreases when the camera is far. To solve this problem, normalization processing could be added so that the feature amount does not change depending on the video recording conditions.

#### *6.2. Future Outlook*

In the future, the main issues will be the examination of new features and the use of new methods. The *ET* tremor is regular, and its frequency is generally 4–12 Hz [4,26,27]. By focusing on the frequency component of tremors in future research, we expect that diagnostic accuracy will be improved by performing analysis using the fast Fourier transform.

Because severity is difficult to define, we only differentiated mild and severe symptoms rather than attempting a more granular assessment. As a future challenge, we will quantify the degree of severity. Doing so will allow doctors to understand the effect of therapy and to modify prescriptions when symptoms do not improve.

#### **7. Conclusions**

In this paper, we proposed a non-contact method of discriminating *ET* and *CD* and a method of measuring tremor severity by analyzing the *FNF* test using image processing technology. We proposed feature quantities to quantify what a doctor actually focuses on in the *FNF* test and trained a classifier using these feature quantities. As a result of performing *k*-fold cross-validation on the classifier, *SVM* obtained an accuracy of 83.6% in classifying *ET* and *CD* patients. In addition, threshold processing was applied to the amount of feature data in each dataset, the score was calculated, and the severity was evaluated. As a result, the severity of symptoms for both *ET* and *CD* patients could be evaluated with an accuracy of 75.0%.

In the future, we expect to improve diagnostic accuracy by examining new features and using new methods, including increasing the frame rate (fps) and applying Eigen background models, focusing on tremor frequency and on detecting the nose of the patient and the finger of the examiner. Since the Eigen background model is based on principal component analysis, we expect that a clearer foreground image (in our case, the finger area) would be extracted. It is also necessary to consider various approaches to quantifying severity. In addition, we will increase the amount of data collected and aim to build a more reliable system. Moreover, in future work, we would like to explore and analyze the raw recorded data using a machine learning approach, such as recurrent convolutional neural networks, to extract prominent features from the data.

**Author Contributions:** This paper is organized and written by the second author T.T.Z. The second author also laid down the conceptual model and supervised experiments by the first author Y.M. who was her Master's student. The third author N.I. and the fourth author H.M. provided the experimental environments to obtain real-life data and gave medical interpretations. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Sensors* Editorial Office E-mail: sensors@mdpi.com www.mdpi.com/journal/sensors


www.mdpi.com ISBN 978-3-0365-4031-3