# Define network
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # two convolution + pooling stages followed by three fully
        # connected layers; the final layer outputs two classes
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 96 * 146, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 96 * 146)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
```
To make it easier to reproduce our research results, here are the versions of the libraries we installed to carry out the experiments (see Listing 2).

#### Listing 2: Versions of the libraries used.

```
# version for GPU
conda install -c pytorch torchvision
conda install pytorch seaborn numpy matplotlib
# version for CPU
conda install pytorch==1.2.0 torchvision==0.4.0 cpuonly -c pytorch
conda install seaborn numpy matplotlib
```
#### **5. Experimental Session Details**

To carry out the research we used the bitmap library.hpp library—see [63]—which allows simple image processing. We used the RIYADH dataset (37 objects with strong defects, 72 without damage and 27 with weak artefacts), the MOSCOW dataset (53 correct, 26 difficult and 20 with strong artefacts) and the ASMOW dataset (data from our project, with only a few defects: 50 correct objects and only 3 damaged) to check our methods. With the RIYADH and MOSCOW collections, our aim was to find a pre-processing technique that improves learning by means of a deep neural network. With ASMOW (due to its small class with artefacts), we were looking for a technique to effectively mark artefacts in an image, without classifying them with a network. In the experiment with deep neural network classification—see details in Section 4.6—the image sets were divided into a training subset, on which the network was taught, and a validation test set of 20 percent of the objects, on which the final neural network was tested. To estimate the quality of the classification on the RIYADH and MOSCOW data we used the Monte Carlo Cross Validation technique [59,64] with five repetitions (MCCV5, i.e., five times train and test), presenting average results. In the tests, we considered two binary classifications and one classification with three classes. In the case of the separation of three classes, in the network settings (from Section 4.6) we set the number of classes accordingly and considered three outputs from the network.
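For concreteness, the sketch below illustrates the MCCV5 protocol just described: five random train/test splits with 20 percent of the objects held out, averaging the resulting accuracies. The `train_and_eval` callback is a hypothetical placeholder for training and testing the network from Section 4.6.

```
import random

def mccv5(dataset, train_and_eval, repeats=5, test_frac=0.2):
    # Monte Carlo Cross Validation: repeated random train/test splits
    accuracies = []
    for _ in range(repeats):
        shuffled = random.sample(dataset, len(dataset))
        n_test = int(len(shuffled) * test_frac)
        test_set, train_set = shuffled[:n_test], shuffled[n_test:]
        # train_and_eval trains the CNN and returns test accuracy
        accuracies.append(train_and_eval(train_set, test_set))
    return sum(accuracies) / len(accuracies)
```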

#### *5.1. Artefacts Detection in the RIYADH Dataset*

Consider examples of artefact detection problems from our experimental groups of quick-look images. The first is the RIYADH quick-look dataset. In Figure 3 we present sample images with artefacts and undamaged ones.

**Figure 3.** RIYADH quick-looks—exemplary pictures—the two on the left with artefacts, two on the right undamaged.

Next, we describe the steps of searching for our method of extracting artefacts on the RIYADH collection.

The first step was to convert the image into a greyscale form—according to the formula from Section 4.2.
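As an illustration, a minimal sketch of this step; the luminance weights below are an assumption, the exact formula being the one given in Section 4.2.

```
import numpy as np

def to_greyscale(rgb):
    # Weighted sum over the RGB channels of an H x W x 3 image;
    # the standard luminance weights are assumed here -- see
    # Section 4.2 for the exact formula used in the paper.
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights
```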

In the next step, we tested feature detectors based on pixel convolution.

#### *5.2. Overview of Feature Detectors Based on Convolution*

In this section, we present the selected feature detectors on the basis of an image with artefacts—see the left picture (Figure 4). We treated this step as a pre-processing stage allowing us to extract specific features of images. For example, masks allowed us to create frames of artefacts potentially useful for shape detection in the image. Some examples of masks that function well in this context are seen in Figures 4 and 5. Consider the effect of using selected popular filters [49]: in the middle picture (Figure 4) we have Sobel's gradient sharpening, in the right picture (Figure 4) we have Laplace filtering (sharpening), in the left picture (Figure 5) we have the Gaussian blur mask and finally in the right picture (Figure 5) we have the Emboss mask. According to our tests, the Gaussian blur filter is one of the best at extracting artefacts from the RIYADH images. We applied this mask in the hybrid technique together with thresholding.

We have considered the following feature detectors: Sobel's gradient sharpening, Laplace filtering (sharpening), the Gaussian blur mask and the Emboss mask.

**Figure 4.** On the left side we have a sample image for testing feature detectors. In the middle there is convolution based on Sobel's gradient sharpening. On the right, convolution based on Laplace filtering (sharpening).

**Figure 5.** On the left side we have convolution based on the Gaussian blur mask. On the right, convolution based on the Emboss mask.

We provide more extensive testing of the Gaussian blur on the RIYADH data in Figure 6.

**Figure 6.** Comparison of the effects of Gaussian blur 3 × 3 convolution in the pictures, from the left: with artefacts, without artefacts and difficult—the discrimination of classes is not explicit.
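As a sketch of this pre-processing step, a 3 × 3 Gaussian blur applied by pixel convolution (assuming SciPy; the mask values are the standard 3 × 3 Gaussian kernel):

```
import numpy as np
from scipy.ndimage import convolve

# standard 3 x 3 Gaussian blur mask
gaussian_3x3 = np.array([[1, 2, 1],
                         [2, 4, 2],
                         [1, 2, 1]], dtype=float) / 16.0

def gaussian_blur(grey):
    # pixel convolution of a greyscale image with the Gaussian mask
    return convolve(grey, gaussian_3x3, mode="nearest")
```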

#### *5.3. Application of Thresholding*

Next, we discuss the results of experiments with the application of the thresholding techniques from Section 4.4. In Figure 7 we have a demonstration of the first thresholding method from Section 4.4. As we can observe, discrimination is not very accurate when using this technique alone. Using the hybrid method, combining Gaussian blur and thresholding—see Figure 8—we get a slightly better artefact separation effect. The application of the second thresholding method, based on the pixel histogram, can be seen in Figure 9.

**Figure 7.** Comparison of the effects of thresholding in the pictures, from the left: with artefacts, without artefacts and difficult—the discrimination of classes is not explicit.

**Figure 8.** Demonstration of the application of two steps, the convolution with Gaussian blur 3 × 3 and thresholding at 0.1, from the left: with artefacts, without artefacts and difficult. Example of Python (Jupyter Notebook) thresholding. The level of artefact separation is higher after these two steps. Classification may consist of calculating the narrowest possible uniform white surface.

**Figure 9.** Visualization of the application of pixel frequency threshold. In this case a histogram is created for [0, 255], and the frequency threshold is set to 300. All color values that occur more often are replaced by black. After thresholding on the basis of frequency, it is clear that the artefacts have become exposed; however, there are also many unnecessary white pixels in the picture. From the left: with artefacts, without artefacts and difficult.
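A minimal sketch of the two thresholding variants described above, with parameter values taken from the figure captions; which side of the fixed level is kept white in the first method is an assumption of this sketch.

```
import numpy as np

def threshold_fixed(grey, level=0.1):
    # first method: binarize against a fixed normalized level;
    # keeping pixels above the level white is an assumption
    return np.where(grey / 255.0 > level, 255, 0)

def threshold_frequency(grey, max_count=300):
    # second method: histogram over [0, 255]; color values that
    # occur more often than the frequency threshold become black
    values = grey.astype(np.uint8)
    counts = np.bincount(values.ravel(), minlength=256)
    out = grey.copy()
    out[(counts > max_count)[values]] = 0
    return out
```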

#### *5.4. Application of Nearest Neighbor Filtering*

In Figures 10 and 11, we present hybrid methods which combine thresholding based on the pixel histogram with a noise filtering technique using the nearest neighbors of pixels. The use of these methods clearly shows the possibility of separating artefacts.

**Figure 10.** Visualization of the application of pixel frequency threshold and one nearest neighbor filtering. In this case, a histogram was created for [0, 255], and the frequency threshold was set to 300. All color values that occur more often are replaced by black. The next step of the image processing was to swap pixels with their nearest neighbors. This step filtered out some unstructured pixels. The level of discrimination has clearly increased. From the left: with artefacts, without artefacts and difficult.

**Figure 11.** Visualization of the application of pixel frequency threshold and one nearest neighbor filtering. In this case a histogram is created for [0, 255], and the frequency threshold is set to 300. All color values that occur more often are replaced by black. The next step of the image processing was to swap pixels with their nearest neighbors. In the present case, we have done the swap procedure twice. This step filtered out some unstructured pixels. The level of discrimination has clearly increased. From the left: with artefacts, without artefacts and difficult.
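One plausible reading of the nearest neighbor filtering used above is sketched below: a white pixel with no white 4-neighbour is removed as unstructured noise (the exact 1 nn rule is the one defined in Section 4.5). Running the filter twice corresponds to the variant in Figure 11.

```
import numpy as np

def one_nn_filter(binary):
    # remove single, unstructured pixels: a white pixel with no
    # white 4-neighbour is set to black; this reading of the 1 nn
    # swap is an assumption -- see Section 4.5 for the exact rule
    padded = np.pad(binary, 1).astype(int)
    neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:])
    return np.where((binary > 0) & (neighbours == 0), 0, binary)
```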

#### *5.5. Summary of Results for Artefact Detection in the RIYADH Dataset*

Summarizing the results obtained, we can state that the most effective technique of detecting artefacts, among the ones we have studied, is the use of thresholding based on histograms of pixel values combined with noise filtering using the closest neighbors method. The good performance of this combination was predictable, because artefact colors usually have a low frequency in pixel histograms; therefore, it is quite easy to visualize clear artefacts. The filtering method then allows us to remove single, unstructured pixels. Let us present the results (before and after the application of our method) on the RIYADH dataset using the neural network described in Section 4.6. Exemplary epochs of learning for class *ok* (without artefacts) and class *er* (with strong artefacts), before and after the application of our method, are shown in Figure 12. A similar result for classification of the *ok* class and the *difficult* class (with weak artefacts) is in Figure 13. In Figure 14 we have learning between the three mentioned classes. The exact results from the MCCV5 test can be seen in Table 1. The results are promising: considering the separation of the undamaged image class from the heavily damaged ones, the reference classification level was about 74 percent, and after applying our method the degree of class distinction (artefact detection) increased to a level close to 92 percent accuracy. Similarly, the detection accuracy of weak artefacts increased from about 68 percent to 84 percent. A spectacular increase can be observed in the level of discrimination between undamaged images and weak and strong artefacts, from 54 percent to nearly 81 percent on the validation set. As Figures 12–14 show, the process of learning the neural network after applying our method seems to be more stable. The standard deviation (see Table 1) without pre-processing reaches 13 percentage points; after applying our method, it is within 5–6 percentage points for the variant (er vs. ok) and for classifying all three classes. In the case of the classification of weak artefacts it is similar in both cases, within 8 percentage points.

**Figure 12.** Exemplary learning effect before (left side) and after using our technique (right side). Result for the RIYADH dataset. Class ok (without artefacts) vs. class er (with strong artefacts). The line graphs show accuracy (blue) and loss (orange) for 15 epochs (30 batch steps). Although the image dataset is relatively small for a Convolutional Neural Network (CNN), a good trend of increasing accuracy and decreasing loss can be observed.

**Figure 13.** Exemplary learning effect before (left side) and after using our technique (right side). Result for the RIYADH dataset. Class ok (without artefacts) vs. class difficult (with weak artefacts). The line graphs show accuracy (blue) and loss (orange) for 15 epochs (30 batch steps). Although the image dataset is relatively small for a CNN, a good trend of increasing accuracy and decreasing loss can be observed.

**Figure 14.** Exemplary learning effect before (left side) and after using our technique (right side). Result for the RIYADH dataset. Class ok (without artefacts) vs. class er (with strong artefacts) vs. class difficult (with weak artefacts). The line graphs show accuracy (blue) and loss (orange) for 15 epochs (30 batch steps). Although the image dataset is relatively small for a CNN, a good trend of increasing accuracy and decreasing loss can be observed. The accuracy of classification on the validation set was nearly 55 percent.

**Table 1.** Summary of results for RIYADH–MCCV5 technique; nil.ok.er = accuracy of classification of er and ok classes before pre-processing, ok.er = accuracy of classification of er and ok classes after application of our method, nil.ok.tr, ok.tr, nil.all, all = analogous parameters, showing the class separation before and after application of our technique, SD = standard deviation of results, avg = average result.


Next, we discuss the results for the artefact collection from the MOSCOW database.

#### *5.6. Results for MOSCOW Dataset*

We considered three classes: *0* (no artefacts), *I* (weak artefacts) and *II* (strong artefacts)—see Figure 15. We conducted experiments with similar network settings as in Section 5.1. The sizes of classes *0*, *I* and *II* were 53, 26 and 20, respectively.

**Figure 15.** MOSCOW quick-looks—exemplary pictures, from left to right: without artefacts (class *0*), with weak artefacts (class *I*) and with strong artefacts (class *II*).

We applied the same steps as for the RIYADH dataset. We used a frequency range of 1000 for thresholding. Samples of data after applying our method on the MOSCOW collection can be seen in Figure 16.

**Figure 16.** MOSCOW quick-looks after the application of our method—exemplary pictures, from left to right: without artefacts (class *0*), with weak artefacts (class *I*) and with strong artefacts (class *II*).

Next, we briefly present our results.

#### *5.7. Summary of Results for MOSCOW Dataset*

Detailed test results using the MCCV5 method are shown in Table 2. An example of the learning effect on these data before and after the application of our method can be seen in Figures 17–19. The results are not as good as those for RIYADH, but they indicate the positive effects of our method. For large artefacts, efficiency increased from 61 percent to 80 percent after using our method. The results for the classification of weak artefacts (class *I*) are comparable on the validation set. Although the quality of neural network learning on three classes increased significantly (see Figure 19), the classification on the validation set increased by only around 8 percentage points. The standard deviation (see Table 2) without pre-processing reaches around 15 percentage points; after applying our method, it is within 5–6 percentage points for the variant (*II* vs. *0*) and for classifying all three classes. In the case of the classification of weak artefacts the results are comparatively low, with a high standard deviation of up to 18 percentage points.

**Table 2.** Summary of results for MOSCOW–MCCV5 technique; nil.ok.er = accuracy of classification of er and ok classes before pre-processing, ok.er = accuracy of classification of er and ok classes after application of our method, nil.ok.tr, ok.tr, nil.all, all = analogous parameters, showing the class separation before and after application of our technique, SD = standard deviation of results, avg = average result.


**Figure 17.** Exemplary learning effect before (left side) and after using our technique (right side). Result for the MOSCOW dataset. Class *0* (without artefacts) vs. class *I* (with weak artefacts). The line graphs show accuracy (blue) and loss (orange) for 15 epochs (30 batch steps). Although the image dataset is relatively small for a CNN, a good trend of increasing accuracy and decreasing loss can be observed.

**Figure 18.** Exemplary learning effect before (left side) and after using our technique (right side). Result for the MOSCOW dataset. Class *0* (without artefacts) vs. class *II* (with strong artefacts). The line graphs show accuracy (blue) and loss (orange) for 15 epochs (30 batch steps). Although the image dataset is relatively small for a CNN, a good trend of increasing accuracy and decreasing loss can be observed.

**Figure 19.** Exemplary learning effect before (left side) and after using our technique (right side). Result for the MOSCOW dataset. Class *0* (without artefacts) vs. class *I* (with weak artefacts) vs. class *II* (with strong artefacts). The line graphs show accuracy (blue) and loss (orange) for 15 epochs (30 batch steps). Although the image dataset is relatively small for a CNN, a good trend of increasing accuracy and decreasing loss can be observed.

Let us move on to test the ASMOW collection, an important one for our project.

#### *5.8. Results for ASMOW Dataset*

In this section, we describe the detection of the artefacts which appear in the ASMOW quick-look dataset. Performing the reference deep learning classification was not possible due to the small number of RFI-affected images available. Here we simply present the procedure for detecting artefacts in these data—with the position of the artefact marked in the picture.

The detection procedure, illustrated in Figures 20–24, is as follows: (1) pixel convolution with the Gaussian blur mask, (2) thresholding and noise reduction with the 1 nn method, (3) binarization of the pixels, and (4) detection of a horizontal plane of pixel blocks.


**Figure 20.** Example of original data from the ASMOW quick-look dataset; there is a sample with an artefact in the middle.

**Figure 21.** Step 1: We carried out the pixel convolution with Mask14 (Gaussian blur) on the data from Figure 20.

**Figure 22.** Step 2: The thresholding described in Section 4.4 and the noise reduction by the 1 nn method described in Section 4.5.

**Figure 23.** Step 3: In this step, we binarized the pixels for better recognition of the artefact.

**Figure 24.** Step 4: In the last step, we detected the artefact: a horizontal plane that passes unevenly through the image. We established that we are looking for a sufficiently long plane, consisting of seven-pixel blocks passing horizontally through the image at the same height. Our acceptance threshold for detecting the artefact, resulting from the experiments, is 40; that is, the artefact consists of a minimum of 40 seven-pixel blocks arranged horizontally.

#### *5.9. Summary of Results for ASMOW Dataset*

The experimental results show that the detection of these barely visible artefacts requires the following steps. The first three are analogous to the detection of artefacts in the RIYADH dataset, that is, (1) pixel convolution by means of Gaussian blur, (2) thresholding on the basis of the color occurrence frequency, and (3) noise filtering using the 1 nn technique; these are followed by (4) binarization and (5) detection of a plane consisting of pixel blocks, as sketched below. Artefacts can be successfully detected, as we can see in Figure 24.
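A minimal sketch of the final detection step (5), using the acceptance threshold from Figure 24; treating the 40 seven-pixel blocks as one contiguous horizontal run is a simplifying assumption of this sketch.

```
import numpy as np

def detect_horizontal_artefact(binary, block=7, min_blocks=40):
    # scan each row for a horizontal run of white pixels at least
    # block * min_blocks pixels long (40 seven-pixel blocks)
    required = block * min_blocks
    for y, row in enumerate(binary):
        run, best = 0, 0
        for value in row:
            run = run + 1 if value > 0 else 0
            best = max(best, run)
        if best >= required:
            return y  # height at which the artefact passes
    return None
```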

#### **6. Conclusions**

In this work, we tested a group of image processing techniques to separate clear RFI artefacts from undamaged images. We reviewed the masks used to select features in the process of convolution. Then we tested methods of thresholding. The first is based on the selection of pixels in a fixed standardized range. The second involves filtering pixels that do not meet an established criterion of color frequency (based on the color histogram). Then we tested the hybrid solution combining thresholding with a filtration method based on the nearest neighbors of pixels. We verified the results of our methods of separating artefacts using a convolutional neural network as a reference classifier. The classification was carried out on raw data and on data prepared by our methods—with the Monte Carlo Cross Validation 5 (MCCV5) scheme. In our work, we considered three datasets with artefacts: the first comes from RIYADH, the second from MOSCOW and the third from the ASMOW quick-look dataset. In the case of the RIYADH and MOSCOW datasets we see that the Gaussian blur gives the best level of separation of our RFI artefacts. The best method, among the tested ones, that gives a clear separation of large artefacts is a hybrid of frequency-based thresholding and filtration with the nearest neighbors method. In the case of the ASMOW dataset of RFI artefacts, we applied the same steps as with RIYADH and MOSCOW, additionally applying binarization and detection of a plane arranged horizontally on the image, consisting of pixel blocks.

The initial goal was achieved: we found an easy-to-implement method of separating large RFI artefacts—ones not repairable with image filtering methods. The proposed solution can effectively support the automatic PSInSAR processing workflow by recognizing RFI-affected data and, as a consequence, removing them from the stack of SAR images required to determine the ground displacements. Our method improves (compared to the classification on raw data) the efficiency of artefact detection by up to 27 percentage points, depending on the classification context under consideration. The standard deviation of the results after application of our methods is nearly 5–6 percentage points (except for the unstable classification of weak artefacts). The obvious conclusion is that it is difficult to find a method that generalizes the problem of searching for artefacts. Each dataset containing artefacts should be treated individually, and the model structure should be selected in a personalized way.

A further way of developing our detection system is the use of techniques for the recognition of specific shapes relevant to the appearing RFI artefacts, together with the use of complex convolutional neural networks—this is the focus of our future research.

**Author Contributions:** Conceptualization, P.A., J.R. and A.C.; methodology, software, validation, formal analysis and investigation, P.A.; data acquisition, J.R. and A.C.; resources, A.C. and P.A.; writing—original draft preparation, A.C. and P.A.; writing—review and editing, A.C. and P.A.; funding acquisition, J.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was conducted under the project entitled "Automatic system for monitoring the influence of high-energy paraseismic tremors on the surface using GNSS/PSInSAR satellite observations and seismic measurements" (Project No. POIR.04.01.04-00-0056/17), co-financed from the European Regional Development Fund within the Smart Growth Operational Programme 2014–2020.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. NASA. Landsat 1. Landsat Science. 2020. Available online: https://landsat.gsfc.nasa.gov/landsat-1/ (accessed on 28 February 2020).


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Waterfall Atrous Spatial Pooling Architecture for Efficient Semantic Segmentation**

#### **Bruno Artacho and Andreas Savakis \***

Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY 14623, USA; bmartacho@mail.rit.edu

**\*** Correspondence: andreas.savakis@rit.edu

Received: 25 October 2019; Accepted: 29 November 2019; Published: 5 December 2019

**Abstract:** We propose a new efficient architecture for semantic segmentation, based on a "Waterfall" Atrous Spatial Pooling architecture, that achieves a considerable accuracy increase while decreasing the number of network parameters and memory footprint. The proposed Waterfall architecture leverages the efficiency of progressive filtering in the cascade architecture while maintaining multiscale fields-of-view comparable to spatial pyramid configurations. Additionally, our method does not rely on a postprocessing stage with Conditional Random Fields, which further reduces complexity and required training time. We demonstrate that the Waterfall approach with a ResNet backbone is a robust and efficient architecture for semantic segmentation obtaining state-of-the-art results with significant reduction in the number of parameters for the Pascal VOC dataset and the Cityscapes dataset.

**Keywords:** semantic segmentation; computer vision; atrous convolution; spatial pooling

#### **1. Introduction**

Semantic segmentation is an important computer vision task [1–3] with applications in autonomous driving [4], human–machine interaction [5], computational photography [6], and image search engines [7]. The significance of semantic segmentation, in both the development of novel architectures and its practical use, has motivated the development of several approaches that aim to improve the encouraging initial results of Fully Convolutional Networks (FCN) [8]. One important challenge to address is the decrease of the feature map size due to pooling, which requires unpooling to perform pixel-wise labeling of the image for segmentation.

DeepLab [9], for instance, used dilated or Atrous Convolutions to tackle the limitations posed by the loss of resolution inherited from unpooling operations. The advantage of Atrous Convolution is that it maintains the Field-of-View (FOV) at each layer of the network. DeepLab implemented Atrous Spatial Pyramid Pooling (ASPP) blocks in the segmentation network, allowing the utilization of several Atrous Convolutions at different dilation rates for a larger FOV.

A limitation of the ASPP architecture is that the network experiences a significant increase in size and memory required. This limitation was addressed in [10], by replacing ASPP modules with the application of Atrous Convolutions in series, or cascade, with progressive rates of dilation. However, although this approach successfully decreased the size of the network, it presented the setback of decreasing the size of the FOV.

Motivated by the success achieved by a network architecture with parallel branches introduced by the Res2Net module [11], we incorporate Res2Net blocks in a semantic segmentation network. Then, we propose a novel architecture named the Waterfall Atrous Spatial Pooling (WASP) and use it in a semantic segmentation network we refer to as WASPnet (see segmentation examples in Figure 1). Our WASP module combines the cascaded approach used in [10] for Atrous Convolutions with the larger FOV obtained from traditional ASPP in DeepLab for the deconvolutional stages of semantic segmentation.

**Figure 1.** Semantic segmentation examples using WASPnet.

The WASP approach leverages the progressive extraction of larger FOV from cascade methods, and is able to achieve parallelism of branches with different FOV rates while maintaining reduced parameter size. The resulting architecture has a flow that resembles a waterfall, which is how it gets its name.

The main contributions of this paper are as follows.


#### **2. Related Work**

The innovations in Convolutional Neural Networks (CNNs) by the authors of [12–15] form the core of image classification and serve as the structural backbone for state-of-the-art methods in semantic segmentation. However, an important challenge with incorporating CNN layers in segmentation is the significant reduction of resolution caused by pooling.

The breakthrough work of Long et al. [8] introduced Fully Convolutional Networks (FCN) by replacing the final fully connected layers with deconvolutional stages. FCN [8] addressed the resolution reduction problem by deploying upsampling strategies across deconvolution layers. These deconvolution stages attempt to reverse the convolution operation and increase the feature map size back to the dimensions of the original image. The contributions of FCN [8] triggered research in semantic segmentation that led to a variety of different approaches that are visually illustrated in Figure 2.

**Figure 2.** Semantic segmentation research overview.

#### *2.1. Atrous Convolution*

The most popular technique shared among semantic segmentation architectures is the use of dilated or Atrous Convolutions. An early work by Yu et al. [16] highlighted the uses of dilation. Atrous convolutions were further explored by the authors of [9,10,17,18]. The main objectives of Atrous Convolutions are to increase the size of the receptive fields in the network, avoid downsampling, and generate a multiscale framework for segmentation.

The name Atrous is derived from the French expression "algorithme à trous", translated to English as "algorithm with holes". As alluded to by its name, Atrous Convolutions alter the convolutional filters by the insertion of "holes", or zero values, in the filter, resulting in an increased size of the receptive field, resembling a hybrid of convolution and pooling layers. The use of Atrous Convolutions in the network is shown in Figure 3.

In the simpler case of a one-dimensional convolution, the output of the signal is defined as follows [9],

$$y[i] = \sum\_{k=1}^{K} x[i + rk] \cdot \omega[k] \tag{1}$$

where *r* is the rate at which the Atrous Convolution is dilated, *ω*[*k*] is the filter of length *K*, *x*[*i*] is the input, and *y*[*i*] is the output at pixel *i*. As pointed out in [9], a rate value of one results in a regular convolution operation.
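A minimal NumPy sketch of Equation (1), written 0-indexed:

```
import numpy as np

def atrous_conv1d(x, w, r):
    # y[i] = sum_k x[i + r*k] * w[k]; with r = 1 this reduces to a
    # regular (valid) convolution
    K = len(w)
    n_out = len(x) - r * (K - 1)
    return np.array([np.dot(x[i:i + r * K:r], w) for i in range(n_out)])
```

In PyTorch, the same operation is exposed through the `dilation` argument of `nn.Conv1d` and `nn.Conv2d`.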

**Figure 3.** Input pixels using a 3 × 3 Atrous Convolution with different dilation rates of 1, 2, and 3, respectively.

Leveraging the success of the Spatial Pyramid Pooling (SPP) structure by He et al. [19], the ASPP architecture was introduced in DeepLab [9]. The special configuration of ASPP assembles dilated convolutions in four parallel branches with different rates. The resulting feature maps are combined by fast bilinear interpolation with an additional factor of eight to recover the feature maps in the original resolution.

#### *2.2. DeepLabv3*

The application of Atrous Convolution following the ASPP approach in [9] was later extended in [10] to the cascade approach, that is, the use of several Atrous Convolutions in sequence with rates increasing through the flow of the network. This approach, named DeepLabv3 [10], allows the architecture to perform deeper analysis and increase its performance using approaches similar to those in [20].

Contributions in [10] included module realization in a cascade fashion, investigation of different multi-grid configurations for dilation in the cascade of convolutions, training with different output stride scales for the Atrous Convolutions, and techniques to improve the results when testing and fine-tuning for segmentation challenges. Another addition presented by [10] is the inclusion of a ResNet101 model, pretrained on both ImageNet [21] and JFT-300M [22] datasets.

More recently, DeepLabv3+ [17] proposed the incorporation of ASPP modules with the encoder–decoder structure adopted by [23], reporting a better refinement in the border of the objects being segmented. This novel approach represented a significant improvement in accuracy from previous methods. In a separate development, Auto-DeepLab [24] uses an Auto-ML approach to learn a semantic segmentation architecture by searching both the network level and the cell level of the structure. It achieves results comparable to current methods without requiring ImageNet [21] pre-training or hierarchical architecture search.

#### *2.3. CRF*

A complication resulting from the lack of pooling layers is a reduction of spatial invariance. Thus, additional techniques are used to recover spatial definition, namely, Conditional Random Fields (CRF) and Atrous Convolutions. One popular method relying on CRF is CRFasRNN [25]. Aiming to better delineate objects in the image, CRFasRNN combines CNN and CRF in a single network to incorporate the probabilistic method of the Gaussian pairwise potentials during inference. That enables end-to-end training, avoiding the need for postprocessing with a separate CRF module, as done in [9]. A limitation of architectures using CRF is that CRF has difficulty capturing delicate boundaries, as these have low confidence in the unary term of the CRF energy function.

The postprocessing module of CRF performs refining of the prediction by Gaussian filters and iterative comparisons of pixels in the output image. The iteration process aims to minimize the "energy" *E*(*x*) below.

$$E(\mathbf{x}) = \sum\_{i} \theta\_i(\mathbf{x}\_i) + \sum\_{ij} \theta\_{ij}(\mathbf{x}\_i, \mathbf{x}\_j) \tag{2}$$


The energy consists of the summations of the unary potentials *θi*(*xi*) = −log *P*(*xi*), where *P*(*xi*) is the probability (softmax) that pixel *i* is correctly computed by the CNN, and the pairwise potential energy *θij*(*xi*, *xj*), which is determined by the relationship between two pixels. Following the authors of [26], *θij*(*xi*, *xj*) is defined as

$$\theta\_{ij}(\mathbf{x}\_i, \mathbf{x}\_j) = \mu(\mathbf{x}\_i, \mathbf{x}\_j) \left[ \omega\_1 \cdot \exp\left(-\frac{||p\_i - p\_j||^2}{2\sigma\_{\alpha}^2} - \frac{||I\_i - I\_j||^2}{2\sigma\_{\beta}^2}\right) + \omega\_2 \cdot \exp\left(-\frac{||p\_i - p\_j||^2}{2\sigma\_{\gamma}^2}\right) \right] \tag{3}$$

where the function *μ*(*xi*, *xj*) is defined to be equal to 1 in the case of *xi* = *xj* and zero otherwise, that is, the CRF only accounts for energy that needs to be minimized when the labels differ. The pairwise potential function utilizes two Gaussian kernels: the first depends on pixel positions *p* and the RGB color *I*; the second depends only on pixel positions. The Gaussian kernels are controlled by the hyperparameters *σα*, *σβ*, and *σγ*, which are determined through the iterations of the CRF, as well as the weights *ω*1 and *ω*2.

#### *2.4. Other Methods*

In contrast to the large scale of segmentation networks using Atrous Convolutions, the Efficient Neural Network (ENet) [18] produces real-time segmentation by trading off some of its accuracy for a significant reduction in processing time; ENet is up to 18× faster than other architectures.

During learning, CNN architectures have the tendency to learn information that is specific to the scale of the input image dataset. In an attempt to deal with this issue, a multiscale approach is used. For instance, the authors of [27] proposed a network with two paths containing the original resolution image and another with double the resolution. The former is processed through a short CNN and the latter through a fully convolutional VGG-16. The first path is then combined with the upsampled version of the second resulting in a network that can deal with larger variations in scale. A similar approach is applied in [28–30], expanding the structure to include a larger amount of networks and scales.

Other architectures achieved good results in semantic segmentation by using an encoder–decoder variant. For instance, SegNet [23] utilizes both an encoder and decoder phase, while relying on pooling indices from the encoder phase to aid the decoder phase. The Softmax classifier generates the final segmentation prediction map. The architecture presented by SegNet was further developed to include Bayesian techniques to model uncertainty in the network [31].

Contrasting with the work in [8], ParseNet [32] completes an early fusion in the network, by performing an early merge of the global features from previous layers with the current map of the posterior layer. In ParseNet, the previous layer is unpooled and concatenated to the following layers to generate the final classifier prediction with both having the same size. This approach differs from FCN where the skip connection concatenates maps of different sizes.

Recurrent Neural Networks (RNN) have been used to successfully combine pixel-level information with local region information, enabling the RNN to include global context in the construction of the segmented image. A limitation of RNN, when used for Semantic Segmentation, is that it has difficulty constructing a sequence based on the structure of natural images. ReSeg [33] is a network based on previous work by ReNet [34]. ReSeg presents an approach where RNN blocks from ReNet are applied after a few layers of a VGG structure, generating the final segmentation map by the use of upsampling by transposed convolutions. However, RNN-based architectures suffer from the vanishing gradient problem.

Networks using Long Short-Term Memory (LSTM) aim to tackle the issue of vanishing gradients. For instance, LSTM Context Fusion (LSTM-CF) [35] utilizes the concatenation of an architecture similar to DeepLab to process RGB and depth information. It uses three different scales for the RGB feature response and depth, similar to the work in [36]. Likewise, the authors of [37] used four different LSTM cells, each receiving distinct parts of the image. Recurrent Convolutional Neural Networks (rCNN) [38] recurrently train the network using different input window sizes fed into the RNN. This approach achieves better segmentation and avoids the loss of resolution encountered with fixed window fitting in RNN methods.

#### **3. Methodology**

We propose an efficient architecture for Semantic Segmentation making use of the large FOV generated by Atrous Convolutions combined with a cascade of convolutions in a "Waterfall" configuration. Our WASP architecture provides benefits due to its multiscale representations as well as the efficiency of the reduced size of the network.

The processing pipeline is shown in Figure 4. The input image is initially fed into a deep CNN (namely a ResNet-101 architecture) with the final layers replaced by a WASP module. The resultant score map with the probability distributions obtained from Softmax is processed by a decoder network that performs bilinear interpolation and generates a more efficient segmentation without the use of postprocessing with CRF. We provide a comparison of our WASP architecture with DeepLab's original ASPP architecture and with a modified architecture based on the Res2Net module.

**Figure 4.** WASPnet architecture for semantic segmentation.

#### *3.1. Res2Net-Seg Module*

Res2Net [11] is a recently developed architecture designed to improve upon ResNet [15]. Res2Net incorporates multiscale features with a Squeeze-and-Excitation (SE) block [39] to obtain better representations and achieves promising results. The Res2Net module divides the original bottleneck block into four parallel streams, each containing 25% of the layers that are fed to 4 different 3 × 3 convolutions. Simultaneously, it incorporates the output of the parallel convolution. The SE block is an adaptable architecture that can recalibrate the responses in the feature map channel by modeling the interdependencies between channels. This allows improvements in performance by exploiting the dependencies between feature maps without increase in the network size.

Inspired by the work in [11], we present a modified version of the Res2Net module that is suitable for segmentation, named Res2Net-Seg. The Res2Net-Seg module, shown in Figure 5, includes the main structure of Res2Net and, additionally, utilizes Atrous Convolutions at each scale for increased FOV and a fifth parallel branch that performs average pooling of all features, which incorporates the original scale in the feature map. The Res2Net-Seg module is utilized in the WASPnet architecture of Figure 4 in place of the WASP module. We next propose the WASP module, inspired by multiscale representations, which is an improvement over both the Res2Net-Seg and the ASPP configuration.

**Figure 5.** Res2Net-Seg block.

#### *3.2. WASP Module*

We propose the "Waterfall Atrous Spatial Pooling" module, shown in Figure 6. WASP is a novel architecture with Atrous Convolutions that is able to leverage both the larger FOV of the ASPP configuration and the reduced size of the cascade approach.

An important drawback of Atrous Convolution, applied in either the cascade fashion or the ASPP (parallel design), is that it requires a larger number of parameters and more memory for its implementation, compared to standard convolution. In [9], there was experimentation to replace convolutional layers of the network backbone architecture, namely, VGG-16 or ResNet-101, with Atrous Convolution modules, but it was too costly in terms of memory requirements. A compromise solution is to apply the cascade of Atrous Convolutions and ASPP modules starting after block 5 when ResNet-101 was utilized.

We overcome these limitations with our Waterfall architecture for improved performance and efficiency. The Waterfall approach is inspired by multiscale approaches [28,29], the parallel structures of ASPP [9], and Res2Net modules [11], as well as the cascade configuration [10]. It is designed with the goal of reducing the number of parameters and memory required, which are the main limitation of Atrous Convolutions. The WASP module is utilized in the WASPnet architecture shown in Figure 4.

A comparison between the proposed WASP module (Figure 6) and the ASPP and cascade configurations (Figure 7) visually highlights the differences. The WASP configuration consists of four branches of a Large-FOV being fed forward in a waterfall-like fashion. In contrast, the ASPP module uses parallel branches that use more parameters and are less efficient, while the cascade architecture uses sequential filtering operations lacking the larger FOV.

**Figure 6.** Proposed Waterfall Atrous Spatial Pooling (WASP) module.

**Figure 7.** Comparison for Atrous Spatial Pyramid Pooling (ASPP) [9] and Cascade configuration [10].
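To make the waterfall flow concrete, below is a minimal PyTorch sketch of a WASP-style module; the channel widths and exact branch wiring are assumptions based on Figure 6, with each branch feeding the next in sequence while all branch outputs are concatenated as in ASPP.

```
import torch
import torch.nn as nn

class WASP(nn.Module):
    # each atrous branch filters the output of the previous one
    # (waterfall flow); all branch outputs are then concatenated
    def __init__(self, in_ch=2048, ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch if i == 0 else ch, ch, 3,
                      padding=r, dilation=r)
            for i, r in enumerate(rates)])
        self.project = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        outs = []
        for conv in self.branches:
            x = torch.relu(conv(x))
            outs.append(x)
        return self.project(torch.cat(outs, dim=1))
```

The dilation rates default to the set {6, 12, 18, 24} selected in Section 4.3.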

#### *3.3. Decoder*

To process the score maps resulting from the WASP module, a short decoder stage was implemented containing the concatenation with low level features from the first block of the ResNet backbone, convolutional layers, dropout layers, and bilinear interpolations to generate output maps at the same resolution as the input image.

Figure 8 shows the decoder and the respective stage dimensions and number of layers. The representation considers an input image with dimensions of 1920 × 1080 × 3 for width, height, and RGB color, respectively. In this case, the decoder receives 256 maps of dimensions 240 × 135 and 256 low level features of dimension 480 × 270. After matching the dimensions for inputs of the decoder, the layers are concatenated and processed through convolutional layers, dropout, and a final bilinear interpolation to reach the original input size.

**Figure 8.** Decoder used in the WASPnet method.

#### **4. Experiments**

#### *4.1. Datasets*

We performed experiments on three datasets used for pre-training, training, validation, and testing. The Microsoft Common Objects in Context (COCO) dataset [40] was used, as in [9], for pre-training, as it includes a large amount of data, allowing a good balance of starting weights when training with different datasets and consequently increasing the precision of the segmentation.

Pascal Visual Object Class (VOC) 2012 [41] is a dataset containing objects in different scenarios including people, animals, vehicles, and indoor objects. It contains three different types of challenges: classification, detection, and segmentation; the latter was utilized in this paper. For the segmentation benchmark, the dataset contains 1464 images for training, 1449 images for validation, and 1456 images for testing annotated for 21 classes. Data augmentation was used to increase the training set size to 10,582.

Cityscapes [42] is a larger dataset containing urban scene images recorded in street scenes of 50 different cities with pixel annotations of 25,000 frames. In the Cityscapes dataset, 5000 images are finely annotated at pixel level divided into 2975 images for training, 500 for validation, and 1525 for testing. Cityscapes is annotated in 19 semantic classes divided into 7 categories (construction, ground, human, nature, object, sky, and vehicle).

#### *4.2. Evaluation Metrics*

We base our comparison of performance with other methods on the Mean Intersection over Union (mIOU), considered the most important and most widely used metric for semantic segmentation. A pixel-level analysis of detection is conducted, reporting the intersection of true positive (TP) pixel detections as a percentage of the union of TP with false negative (FN) and false positive (FP) pixels.
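A minimal sketch of the metric on label maps, following the TP/FP/FN definition above:

```
import numpy as np

def mean_iou(pred, target, num_classes):
    # per-class IOU = TP / (TP + FP + FN), averaged over classes
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```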

#### *4.3. Simulation Parameters*

We calculate the learning rate based on the polynomial method ("poly") [32], also adopted in [9]. The poly learning rate *LRpoly* results in more effective updating of the weights when compared to the traditional "step" learning rate, given as

$$LR\_{poly} = (1 - \frac{iter}{max\\_iter})^{power} \tag{4}$$

where *power*=0.9 was employed. We utilized a batch size of eight due to physical memory constraints in the hardware available, lower than the batch size of ten used by DeepLab. A subtle improvement in training with a larger batch size is expected for the architectures proposed.
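Equation (4) gives the decay factor; a minimal sketch, assuming the factor scales a base learning rate as in common implementations:

```
def poly_lr(base_lr, step, max_steps, power=0.9):
    # polynomial ("poly") learning-rate schedule of Equation (4)
    return base_lr * (1 - step / max_steps) ** power
```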

We experimented with different rates of dilation in WASP. We found that larger rates result in better mIOU. The set of rates *r* = {6, 12, 18, 24} was selected for the WASP module. In addition, we performed pre-training using the MS-COCO dataset [40], and data augmentation on randomly selected images scaled between (0.5, 1.5).

#### **5. Results**

Following training, validation, and testing procedures, the WASPnet architecture was implemented utilizing the WASP module, the Res2Net-Seg module, or the ASPP module. The validation mIOU results are presented in Table 1 for the Pascal VOC dataset. When following similar guidelines as in [9] for training and hyperparameters, and using the WASP module, an mIOU of 80.22% is achieved without the need for CRF postprocessing. Our WASPnet resulted in a gain of 5.07% on the validation set and reduced the number of parameters by 20.69%.

**Table 1.** Pascal Visual Object Class (VOC) validation set results.


The Res2Net-Seg approach results in an mIOU of 78.53% without CRF, achieves mIOU of 80.12% with CRF, and reduces the number of parameters by 14.99%. The Res2Net-Seg approach still shows benefits with the incorporation of CRF as postprocessing, similar to the cascade and ASPP methods.

Overall, the WASP architecture provides the best result and the highest reduction in parameters. Sample results for the WASPnet architecture are shown in Figure 9 for validation images from the Pascal VOC dataset [41]. Note, from the generated segmentation, that our method presents a better definition in the detection shape, being closer to the ground-truth when compared to previous methods utilizing ASPP (DeepLab).

We tested the effects of different dilation rates (in our WASP module) on the final segmentation. In our tests, all kernel sizes were set to 3 following procedures as in [9]. Table 2 reports the accuracy, in mIOU, for the Pascal VOC dataset for different dilation rates in the WASP module. The configuration with dilation rates of {6, 12, 18, 24} resulted in the best accuracy for the Pascal VOC dataset, therefore, the following tests were conducted using this dilation rate.

**Table 2.** Pascal VOC validation set results for different sets of dilation in the WASP module.


We also experimented with postprocessing using CRF. The application of CRF has the benefit of better defining the shapes of the segmented areas. Similarly to the procedures followed in [9], we performed parameter tuning for the parameters of Equation (3), varying *ω*1 between 3 and 6, *σα* from 30 to 100, and *σβ* from 3 to 6, while fixing both *ω*2 and *σγ* to 3.

**Figure 9.** Results sample for Pascal VOC dataset [41].

The addition of CRF postprocessing to our WASPnet method resulted in a modest increase of 0.2% in the mIOU for both the validation and test sets of the Pascal VOC dataset. The gains from using CRF are less significant than those in [9], due to more efficient use of FOV by WASPnet. The effects of CRF on accuracy were not consistent across different classes. Classes with objects that do not have extremities, such as bottle, car, bus, and train, benefited most, whereas there was a decrease in accuracy for classes with more delicate boundaries such as bicycle, plant, and motorcycle.

Results on the Pascal VOC test dataset are shown in Table 3. The additional training dataset column refers to DeepLabv3 types of models where a ResNet-101 model was pretrained on both ImageNet [21] and JFT-300M [22] when performing the test challenge for Pascal VOC. JFT-300M consists of Google's internal dataset of 300 million images labeled in 18,291 categories, and therefore these results cannot be compared directly to other external architectures, including this work. The addition of the JFT dataset for training allows the architecture to achieve performance improvements that are not possible without such a large number of training samples. Note that training of the WASPnet network was performed only on the training dataset provided by the challenge, consisting of 1464 images. Based on these results, WASPnet outperforms all of the other methods that are trained on the same dataset.


**Table 3.** Pascal VOC test set results.

WASPnet was also used with the Cityscapes dataset [42] following similar procedures. Table 4 shows the results obtained for Cityscapes, resulting in an mIOU of 74.0%, a gain of 4.2% from [9]. The Res2Net-Seg version of the network achieved 72.1% mIOU.


**Table 4.** Cityscapes validation set results.

For both WASP and Res2Net-Seg architectures tested on the Cityscapes dataset, the CRF postprocessing did not have much benefit. A similar result was found with DeepLab where CRF resulted in a small improvement of the mIOU. The higher resolution and shape of detected instances in the Cityscapes dataset likely affected the effectiveness of the CRF. With Cityscapes, we used a batch size of 4 due to hardware constraints during training; other architectures have used batch sizes of up to ten.

Table 5 shows the results of WASPnet on the Cityscapes testing dataset. WASPnet achieved an mIOU of 70.5% and outperformed other architectures trained on the same dataset. We only performed training on the finely annotated images from the Cityscapes dataset, containing 2975 images, whereas the DeepLabv3-style architectures used larger datasets for training, such as JFT-300M containing 300 million images for pre-training and the coarser dataset from Cityscapes containing 20,000 images.


**Table 5.** Cityscapes test set results.

Figure 10 shows examples of Cityscapes image segmentations with the WASPnet method. Like our observations from the Pascal VOC dataset, our method produces better defined shapes for the segmentation compared to DeepLab. Our results are closer to the ground-truth data, and show better segmentation of smaller objects that are further away from the camera.

**Figure 10.** Results sample for Cityscapes dataset [42].

Our results in Table 4 illustrate that postprocessing with CRF slightly decreased the mIOU by 0.8% in the Cityscapes dataset: CRF has difficulty dealing with delicate boundaries, which are common in the Cityscapes dataset. With WASPnet, the presence of a larger FOV due to the WASP module is able to offset the potential gains of the CRF module from previous networks. An additional limitation is that CRF requires substantial extra time for processing. For these reasons, we conclude that WASPnet can be used without CRF postprocessing.

#### *Fail Cases*

Classes that contain more delicate, and consequently harder to accurately detect, shapes contribute the most to segmentation errors. Particularly, tables, chairs, leaves, and bicycles present a bigger challenge to segmentation networks. These classes also resulted in a lower accuracy when applying CRF. Representative examples of fail cases are shown in Figure 11 for classes chair and bicycle, which are the most difficult to segment. Even in these cases, WASPnet (without CRF) is able to better detect the general shape compared to DeepLab.

**Figure 11.** Occurrence of fail cases in detecting more delicate boundaries.

#### **6. Conclusions**

We propose a "Waterfall" architecture based on the WASP module for efficient semantic segmentation that achieves high mIOU scores on the Pascal VOC and Cityscapes datasets. The smaller size of this efficient architecture improves its functionality and reduces the risk of overfitting without the need for postprocessing with the time consuming CRF. The results of WASPnet segmentation demonstrated superior performance compared to Res2Net-Seg and Deeplab. This work provides the foundation for further application of WASP in a broader range of applications for more efficient multiscale analysis.

**Author Contributions:** Conceptualization, B.A. and A.S.; methodology, B.A.; algorithm and experiments, B.A. and A.S.; original draft preparation, B.A. and A.S.; review and editing, B.A. and A.S.; supervision, A.S.; project administration, A.S.; funding acquisition, A.S.

**Funding:** This research was funded in part by National Science Foundation grant number 1749376.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

ASPP Atrous Spatial Pyramid Pooling



#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### **Vision-Based Tra**ffi**c Sign Detection and Recognition Systems: Current Trends and Challenges**

#### **Safat B. Wali 1, Majid A. Abdullah 2, Mahammad A. Hannan 2,\*, Aini Hussain 1, Salina A. Samad 1, Pin J. Ker 2 and Muhamad Bin Mansor 2**


Received: 2 April 2019; Accepted: 26 April 2019; Published: 6 May 2019

**Abstract:** The automatic traffic sign detection and recognition (TSDR) system is an important research area in the development of advanced driver assistance systems (ADAS). Investigations on vision-based TSDR have received substantial interest in the research community, mainly motivated by three factors: detection, tracking and classification. During the last decade, a substantial number of techniques have been reported for TSDR. This paper provides a comprehensive survey on traffic sign detection, tracking and classification. The details of the algorithms, methods and their specifications on detection, tracking and classification are investigated and summarized in tables along with the corresponding key references. A comparative study in each section evaluates the TSDR data, performance metrics and their availability. Current issues and challenges of the existing technologies are illustrated with brief suggestions and a discussion on the progress of driver assistance system research in the future. This review will hopefully lead to increasing efforts towards the development of future vision-based TSDR systems.

**Keywords:** Traffic sign detection and tracking (TSDR); advanced driver assistance system (ADAS); computer vision

#### **1. Introduction**

In all countries of the world, important information about road limitations and conditions is presented to drivers as visual signals, such as traffic signs and traffic lanes. Traffic signs are an important part of road infrastructure, providing information about the current state of the road, restrictions, prohibitions, warnings, and other helpful information for navigation [1,2]. This information is encoded in the traffic signs' visual traits: Shape, color and pictogram [1]. Disregarding or failing to notice these traffic signs may directly or indirectly contribute to a traffic accident. However, in adverse traffic conditions, the driver may accidentally or deliberately not notice traffic signs [3]. In these circumstances, an automatic detection and recognition system for traffic signs can compensate for a driver's possible inattention and decrease a driver's tiredness by helping him follow the traffic signs, thus making driving safer and easier. Traffic sign detection and recognition (TSDR) is an important application in the more recent technology referred to as advanced driver assistance systems (ADAS) [4], which is designed to provide drivers with vital information that would be difficult or impossible to come by through any other means [5]. The TSDR system has received increasing interest in recent years due to its potential use in various applications. Some of these applications have been well defined and summarized in [6]: checking the presence and condition of signs on highways, sign inventory in towns and cities, and re-localization of autonomous vehicles, as well as its use in the application relevant to this research, as a driver support system. However, a number of challenges remain for successful TSDR systems, as the performance of these systems is greatly affected by the surrounding conditions that affect road sign visibility [4]. Circumstances that affect road sign visibility are either temporary, because of illumination factors and bad weather conditions, or permanent, because of vandalism and bad placement of signs [7]. Figure 1 shows an example of some non-identical traffic signs. These non-identical traffic signs cause difficulties for TSDR.

**Figure 1.** Non-identical traffic signs: (**a**) Partially occluded traffic sign, (**b**) faded traffic sign, (**c**) damaged traffic sign, (**d**) multiple traffic signs appearing at a time.

This paper provides a comprehensive survey on traffic sign detection, tracking and classification. The details of the algorithms, methods and their specifications on detection, tracking and classification are investigated and summarized in tables along with the corresponding key references. A comparative study in each section evaluates the TSDR methods, performance metrics and their availability. Current issues and challenges of the existing technologies are illustrated with brief suggestions and a discussion on the progress of driver assistance system research in the future. The rest of this paper is organized as follows: In Section 2, an overview of traffic signs and recent trends of research in this field is presented. This is followed by a brief review of the available traffic sign databases in Section 3. The methods of detection, tracking, and classification are categorized, reviewed, and compared in Section 4. Section 5 reviews current issues and challenges facing researchers in TSDR. Section 6 summarizes the paper and draws the conclusions and suggestions.

#### **2. Traffic Signs and Research Trends**

Aiming at standardizing traffic signs across different countries, an international treaty, commonly known as the Vienna Convention on Road Signs and Signals [8], was agreed upon in 1968. To date, 52 countries have signed this treaty, among which 31 are in Europe. The Vienna convention classified the traffic signs into eight categories, designated with letters A–H: Danger/warning signs (A), priority signs (B), prohibitory or restrictive signs (C), mandatory signs (D), special regulation signs (E), information, facilities or service signs (F), direction, position or indication signs (G), and additional panels (H). Examples of traffic signs in the United Kingdom for each of the categories are shown in Figure 2.

**Figure 2.** Examples of traffic signs: (**a**) A danger warning sign, (**b**) a priority sign, (**c**) a prohibitory sign, (**d**) a mandatory sign, (**e**) a special regulation sign, (**f**) an information sign, (**g**) a direction sign and (**h**) an additional panel.

Despite the well-defined laws in the Vienna Treaty, variations in traffic sign designs still exist among the countries signatory to the treaty, and in some cases considerable variation in traffic sign design can exist within a single nation. These variations are easy for humans to handle; nevertheless, they may pose a major challenge to an automatic detection system. As an example, different designs of stop signs in different countries are shown in Table 1.

**Table 1.** Example of stop signs in different countries.

In terms of research, recently there has been a growing interest in developing efficient and reliable TSDR systems. To show the current state of scientific research regarding this development, a simple search of the term "traffic sign detection and recognition" in the Scopus database has been carried out, with the aim of locating articles published in journals indexed in this database. To focus on the recent and most relevant research, the search has been restricted to the past decade (2009–2018) and only in the subjects of computer science and engineering. In this way, a set of 674 articles and 5414 citations were obtained. The publication and citation trends are shown in Figures 3 and 4, respectively. Generally, the figures indicate a relatively fast growth rate in publications and a rapid increase in citation impact. More importantly, it is clear from the figures that TSDR research has grown remarkably in the last three years (2016–2018), with the highest number of publications and citations representing 41.69% and 60.34%, respectively.

**Figure 3.** Trends of research for a traffic sign detection and recognition (TSDR) topic based on Scopus analysis tools.

**Figure 4.** Trends of citations for a TSDR topic based on Scopus analysis tools.

#### **3. Traffic Sign Database**

A traffic sign database is an essential requirement in developing any TSDR system. It is used for training and testing the detection and recognition techniques. A traffic sign database contains a large number of traffic sign scenes and images representing samples of all available types of traffic signs: Guide, regulatory, temporary and warning signs. During the past few years, a number of research groups have worked on creating traffic sign datasets for the tasks of detection, recognition and tracking. Some of these datasets are publicly available for use by the research community. Detailed information regarding the publicly available databases is summarized in Table 2. According to [1,9], the first and most widely used dataset is the German traffic sign dataset, which comprises two benchmarks: The German Traffic Signs Detection Benchmark (GTSDB) [10] and the German Traffic Signs Recognition Benchmark (GTSRB) [11]. This dataset collects three important categories of road signs (prohibitory, danger and mandatory) from various traffic scenes. All traffic signs have been fully annotated with rectangular regions of interest (ROIs). Examples of traffic scenes in the GTSDB database are shown in Figure 5 [12].


**Table 2.** Publicly available traffic sign databases [13].

**Figure 5.** Examples of traffic scenes in the German Traffic Signs Detection Benchmark (GTSDB) database [12].

#### **4. Traffic Sign Detection, Tracking and Classification Methods**

As aforementioned, a TSDR system is a driver support system that can be used to notify and warn the driver in adverse conditions. It is a vision-based system that usually has the capability to detect and recognize all traffic signs, even those that may be partially occluded or somewhat distorted [14]. Its main tasks are locating the sign, identifying it and distinguishing one sign from another [15,16]. Thus, the procedure of the TSDR system can be divided into three stages: detection, tracking and classification. Detection is concerned with locating traffic signs in the input scene images, whereas classification is about determining what type of sign the system is looking at [17,18]. In other words, traffic sign detection involves generating candidate regions of interest (ROIs) that are likely to contain traffic signs, while traffic sign classification takes each candidate ROI and tries to identify the exact type of sign or rejects the identified ROI as a false detection [4,19]. Detection and classification together usually constitute recognition in the scientific literature. Figure 6 illustrates the main stages of the traffic sign recognition system. As indicated in the figure, the system is able to work in two modes: a training mode, in which a database is built by collecting a set of traffic signs for training and validation, and a testing mode, in which the system recognizes a traffic sign it has not seen before. In the training mode, a traffic sign image is collected by the camera and stored in the raw image database to be classified and used for training the system. The collected image is then sent to a color segmentation process, where all background objects and unimportant information in the image are eliminated. The image generated in this step is a binary image containing the traffic sign and any other objects similar in color to the traffic sign. The noise and small objects in the binary image are cleaned by the object selector process, and the resulting image is then used to create or update the training image database. According to [20], feature selection has two functions in enhancing the performance of learning tasks. The first is to eliminate noisy and redundant information, thus obtaining a better representation and facilitating the classification task. The second is to make the subsequent computation more efficient by lowering the dimensionality of the feature space. In the block diagram, features are then extracted from the image and used to train the classifier in the subsequent step. In the testing mode, the same procedure is followed, but the extracted features are used to directly classify the traffic sign using a pre-trained classifier. A minimal sketch of this segmentation-and-selection front end is given below.

**Figure 6.** Block diagram of the traffic sign recognition system.
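To make the front end of Figure 6 concrete, the following is a minimal sketch in Python, assuming OpenCV and NumPy; the HSV color bounds, the minimum object area, and the toy feature vector are illustrative assumptions, not values from any of the surveyed systems.

```
import cv2
import numpy as np

def segment_by_color(bgr_image):
    """Color segmentation: keep red-ish pixels, return a binary mask."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Red wraps around hue 0, so two hue ranges are combined (assumed bounds).
    low = cv2.inRange(hsv, (0, 70, 50), (10, 255, 255))
    high = cv2.inRange(hsv, (170, 70, 50), (180, 255, 255))
    return low | high

def object_selector(mask, min_area=200):
    """Clean noise and small objects from the binary image."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    cleaned = np.zeros_like(mask)
    for i in range(1, n):  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == i] = 255
    return cleaned

def extract_features(mask):
    """Toy feature vector: normalized area and bounding-box aspect ratio."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros(2)
    area = len(xs) / mask.size
    aspect = (xs.max() - xs.min() + 1) / (ys.max() - ys.min() + 1)
    return np.array([area, aspect])
```

In a real system, the toy feature extractor would be replaced by HOG or learned features before the classifier stage.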

Tracking is used in some research to improve recognition performance [21]. The three stages of a TSDR system are shown in Figure 7 and further discussed in the subsequent sections.

**Figure 7.** General procedure of TSDR system [22].

#### *4.1. Detection Phase*

The initial stage in any TSDR system is locating potential sign regions in a natural scene image input. This initial stage is called the detection stage, in which an ROI containing a traffic sign is actually localized [17,23,24]. Traffic signs usually have a strict color scheme (red, blue, and white) and specific shapes (round, square, and triangular). These inherent characteristics distinguish them from other outdoor objects, making them suitable to be processed automatically by a computer vision system and thus allowing the TSDR system to distinguish traffic signs from the background scene [21,25]. Therefore, traffic sign detection methods have traditionally been classified into color-based, shape-based and hybrid (color–shape-based) methods [23,26]. Detection methods are outlined in Figure 8 and compared in the following subsections.

**Figure 8.** Different methods applied for traffic sign detection.

#### 4.1.1. Color-Based Methods

Color-based methods take advantage of the fact that traffic signs are designed to be easily distinguished from their surroundings, often colored in highly visible contrasting colors [17]. These colors are extracted to detect ROIs within an input image based on different image-processing methods. Detection methods based on color characteristics have a low computational cost and good robustness, which can improve detection performance to a certain extent [25]. However, methods based on color information can be used with a high-resolution dataset but not with grayscale images [23]. In addition, the main problem with using the color parameter is its sensitivity to various factors, such as the distance to the target, weather conditions, time of day, as well as reflection, age and condition of the signs [17,23].

In color-based approaches, the captured images are partitioned into subsets of connected pixels that share similar color properties [26]. Then the traffic signs are extracted by color thresholding segmentation based on smart data processing. The choice of color space is important during the detection phase, hence, the captured images are usually transformed into a specific color space where the signs are more distinct [9]. According to [27], the developed color-based detection methods are based on the red, green, blue (RGB) color space [28–30], the hue, saturation, and value (HSV) color space [31,32], the hue, saturation, and intensity (HSI) color space [33] and various other color spaces [34,35]. The most common color-based detection methods are represented in Figure 9 and reviewed respectively in Table 3.

**Figure 9.** Most popular color-based detection methods.

Color thresholding segmentation is one of the earliest techniques used to segment digital images [26]. Generally, it is based on the assumption that adjacent pixels whose value (grey level, color value, texture, etc.) lies within a certain range belong to the same class [36]. Normal color segmentation was used for traffic sign detection by Varun et al. [37] with their own created dataset, containing 2000 test images, resulting in an accuracy level of 82%. The efficiency was improved in [38] by using color segmentation followed by a color enhancement method. In recent research, color thresholding has commonly been used for pre-processing purposes [39,40]. In [39], pre-filtering was used to train a color classifier, which created a regression problem, whose core was to find a linear function, as shown in (1).

$$f(\mathbf{x}) = (\mathbf{w}, \mathbf{x}) + b, \qquad \mathbf{x} = (v\_1, v\_2, v\_3)^{T} \tag{1}$$

where *vi* is the intensity value of the *i*-th channel (*i* = 1, 2, 3 for a three-channel RGB image), (*w*, *b*) ∈ ℝ³ × ℝ are parameters that control the function, and the decision rule is given by sgn(*f*(*x*)). In [40], Vazquez-Reina et al. used RGB to HSI color space conversion with the additional feature of white sign detection; the main advantage of this feature is its detection of illuminated signs. In Refs. [33,41–45], the HSI/HSV transformation approach was used for detection. The major advantages of the HSI color space over the RGB color space are that its chromatic information is carried by only two components, hue and saturation, which correspond closely to human perception, and that it is more immune to lighting conditions. In [33], a simple RGB to HSI color space transformation is used for the TSDR purpose. In [44], the HSI color space was used for detection, and the detected sign was then passed to the distance to borders (DtBs) feature for shape detection to increase the accuracy level; the average accuracy was approximately 88.4% on the GRAM database. The main limitation of using the HSV transformation is the strong dependency of brightness on hue: hue is only a measurement of the physical lightness of a color, not the perceived brightness, so a fully saturated yellow and a fully saturated blue have the same value. A minimal HSI segmentation sketch follows.
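The sketch below uses the standard RGB-to-HSI conversion formulas; the hue and saturation bounds for a red sign are illustrative assumptions, not the values used in [33] or [44].

```
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an RGB float image in [0, 1] to H (degrees), S, I channels."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0
    s = 1.0 - np.minimum(np.minimum(r, g), b) / np.maximum(i, 1e-8)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-8
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    h = np.where(b <= g, theta, 360.0 - theta)  # hue in [0, 360)
    return h, s, i

def red_sign_mask(rgb, h_lo=340.0, h_hi=20.0, s_min=0.3):
    """Keep strongly saturated red hues (hue wraps around 0 degrees)."""
    h, s, _ = rgb_to_hsi(rgb)
    return ((h >= h_lo) | (h <= h_hi)) & (s >= s_min)
```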

Region growing is another simple and popular technique used for detection in TSDR systems. Region growing is a pixel-based image segmentation method that starts by selecting a starting point or seed pixel. The region then develops by adding neighboring pixels that are uniform according to a certain match criterion, increasing the size of the region step by step [46]. This method was used by Nicchiotti et al. [47] and Priese et al. [48] for TSDR. Its efficiency was not very high, approximately 84%. Because this method depends on the seed values, problems can occur when the seed points lie on edges, and, if the growth process is dominated by one of the regions, uncertainty around the edges of adjacent regions may not be resolved correctly. A minimal sketch is shown below.
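A minimal region-growing sketch following the description above: start from a seed pixel and absorb 4-connected neighbors whose gray value is within a tolerance of the running region mean. The uniformity criterion and tolerance are assumptions for illustration.

```
import numpy as np
from collections import deque

def region_grow(gray, seed, tol=10.0):
    """gray: 2-D array; seed: (row, col) tuple; returns a boolean mask."""
    h, w = gray.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    total, count = float(gray[seed]), 1  # running sum for the region mean
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(float(gray[ny, nx]) - total / count) <= tol:
                    mask[ny, nx] = True
                    total += float(gray[ny, nx])
                    count += 1
                    queue.append((ny, nx))
    return mask
```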

The color indexing method is another simple method that identifies objects entirely on the basis of color [49]. It was developed by Swain and Ballard [50] and was used by researchers in the early 1990s. In this method, a comparison of any two colored images is done by comparing their color histogram. For a given pair of histograms, *I* and *M*, each containing *n* bins, the histogram intersections are defined as [50]:

$$\sum\_{j=1}^{n} \min\{I\_j, M\_j\}. \tag{2}$$

The match value is then,

$$H(I,M) = \frac{\sum\_{j=1}^{n} \min\{I\_j, M\_j\}}{\sum\_{j=1}^{n} M\_j}.\tag{3}$$

The advantage of using color histograms is their robustness with respect to geometric changes of the projected objects [51]. However, color indexing is segmentation-dependent, and complete, efficient and reliable segmentation cannot be performed prior to recognition. Thus, color indexing is characterized as an unreliable method. A direct implementation of Equations (2) and (3) is sketched below.
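Equations (2) and (3) translate directly into NumPy; the choice of 16 bins per channel is an assumption for illustration.

```
import numpy as np

def match_value(image_i, model_m, bins=16):
    """H(I, M) from Equation (3) for two RGB uint8 images."""
    hist_i, _ = np.histogramdd(image_i.reshape(-1, 3),
                               bins=(bins, bins, bins), range=[(0, 256)] * 3)
    hist_m, _ = np.histogramdd(model_m.reshape(-1, 3),
                               bins=(bins, bins, bins), range=[(0, 256)] * 3)
    intersection = np.minimum(hist_i, hist_m).sum()  # Equation (2)
    return intersection / hist_m.sum()               # Equation (3)
```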

Another approach to color segmentation is called dynamic pixel aggregation [52]. In this method, the segmentation process is accomplished by introducing dynamic thresholding into the pixel aggregation process in the HSV color space. The applied threshold is non-linear, and its value is defined as [52]:

$$a = k - \sin(S\_{\text{seed}}) \tag{4}$$

where *k* is the normalization parameter and *Sseed* is the seed pixel saturation. The main advantage of this approach is hue instability reduction. However, it fails to reduce other segmentation-related problems, such as fading and illumination. This method was tested in [52] on their own database of 620 outdoor images, resulting in accuracy levels of approximately 86.3% to 95.7%.

The International Commission on Illumination 1997 Interim Color Appearance Model (CIECAM97) is another method that has been used to detect and extract color information and to segment and classify traffic signs. Generally, color appearance models are capable of predicting color appearance under a variety of viewing conditions, including different light sources, luminance levels, surrounds, and lightness of backgrounds [53]. This model was used by Gao et al. [54] to transform the image from RGB to CIE (International Commission on Illumination) XYZ values. The main drawback of this model is its chromatic-adaptation transform, called the Bradford transform, in which chromatic blues appear purple as the chroma is reduced at a constant hue angle.


**Table 3.** Color-based approaches for the TSDR system.

The YCbCr color space, with one luma component (Y) and two chroma components (blue-difference Cb and red-difference Cr), has been considered in recent approaches. Different from the most common color space, RGB, which represents color as red, green and blue, YCbCr represents color as brightness and two color-difference signals. It was used for detection in [55], showing an accuracy level over 93% on their own collected database. The efficiency was improved to approximately 97.6% in [56] by first transforming the RGB color space to the YCbCr color space, then segmenting the image and performing shape-based analysis. A hedged conversion sketch is given below.
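The following sketch performs the RGB-to-YCbCr conversion with the standard ITU-R BT.601 full-range coefficients, followed by a crude red-sign mask on the Cr channel; the Cr threshold is an illustrative assumption, not the value used in [55,56].

```
import numpy as np

def rgb_to_ycbcr(rgb):
    """rgb: float image in [0, 255]; returns Y, Cb, Cr channels."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def red_mask_ycbcr(rgb, cr_min=150.0):
    """Flag pixels whose red-difference channel exceeds an assumed bound."""
    _, _, cr = rgb_to_ycbcr(rgb.astype(np.float64))
    return cr >= cr_min
```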

#### 4.1.2. Shape-Based Methods

Just as traffic signs have specific colors, they also have very well-defined shapes that can be searched for. Shape-based methods ignore color in favor of the characteristic shape of signs [17]. Detection of a traffic sign via its shape follows the defining algorithm of shape detection, i.e., finding the contours and approximating them to reach a final decision based on the number of contours [15,23]. Shape detection is attractive for traffic sign recognition because the colors found on traffic signs change according to illumination. In addition, shape detection reduces the search for road sign regions from the whole image to a small number of pixels [57]. However, for this method the memory and computational requirements are quite high for large images [58]. In addition, damaged, partially obscured, faded and blurred traffic signs may cause difficulties in detecting traffic signs accurately, leading to a low accuracy rate. In these methods, traffic signs are detected from the edges of the image, analyzed by structural or comprehensive approaches [23]. Many shape-based methods are popular in TSDR systems. These methods are represented in Figure 10 and reviewed respectively in Table 4.

**Figure 10.** Most popular shape-based detection methods.

The most common shape-based approach is the Hough transform. The Hough transform isolates features of a particular shape within a given frame or image [15]. It was applied by Zaklouta et al. in [59] to detect triangular and circular signs. Their own test datasets contained 14,763 and 1584 signs, and the accuracy rate was approximately 90%. The main advantage of the Hough transform technique is that it is tolerant of gaps in feature boundary descriptions and is relatively unaffected by image noise [60]. However, its main disadvantage is its dependency on the input data. In addition, it is only efficient when a high number of votes falls in the correct bin. When the parameter space is large, the average number of votes cast for a single bin becomes low, and thus the detection rate decreases. A hedged OpenCV sketch for circular signs follows.
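A hedged sketch of circular sign detection with the circle Hough transform, using OpenCV's `HoughCircles`; all parameter values are illustrative and would need tuning per dataset, as they are not taken from [59].

```
import cv2

def detect_circular_signs(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)  # suppress noise before voting
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2,
                               minDist=40, param1=100, param2=40,
                               minRadius=10, maxRadius=120)
    # Returns a list of (x, y, r) candidates, empty if nothing is found.
    return [] if circles is None else circles[0].tolist()
```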

Another shape-based detection method is similarity detection. In this method, detection is performed by computing a similarity factor between a segmented region and a set of binary image samples representing each road sign shape [57]. This method was used by Vitabile et al. [52] on their collected dataset, with an accuracy level over 86.3%. The main advantage of this method is its straightforwardness, whilst its main drawback is that the input image must be well segmented and the dimensions have to be the same. In [52], the images were initially converted from RGB to HSV, then segmented and resized to 36 × 36 pixels. The similarity detection equation is:

$$x' = \frac{x - x\_{\min}}{x\_{\max} - x\_{\min}}, \qquad y' = \frac{y - y\_{\min}}{y\_{\max} - y\_{\min}} \tag{5}$$

where, *x*max, *y*max, *x*min and *y*min are the coordinates of the rectangle vertices.

Distance transform matching (DTM) is another type of shape-based detection method. In this method, the distance transform of the image is formed by assigning to each non-edge pixel a value that is a measure of the distance to the nearest edge pixel. It was used by Gavrila [61] to capture large variations in object shape by matching template features against the image's distance transform. This distance is inversely proportional to how well the image matches the template. The chamfer distance equation is:

$$D\_{\text{chamfer}}(T, I) \equiv \frac{1}{|T|} \sum\_{t \in T} d\_I(t) \tag{6}$$

where |*T*| and *dI*(*t*) denote the number of template features and the distance between feature *t* in *T* and the closest feature in *I*, respectively. In his experiment, Gavrila [61] used DTM to examine 1000 collected test images, and the accuracy was approximately 95%. The DTM technique is efficient for detecting arbitrary shapes within images. However, its main disadvantage is its vulnerability to cluttered images. A minimal implementation of Equation (6) is sketched below.
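The sketch implements Equation (6) directly: the distance transform assigns each pixel its distance to the nearest edge, and the chamfer score averages that distance over the template edge points. SciPy's Euclidean distance transform is used, and the edge maps are assumed to be boolean arrays of the same size.

```
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(template_edges, image_edges):
    """D_chamfer(T, I) for two boolean edge maps of identical shape."""
    # distance_transform_edt measures distance to the nearest zero element,
    # so invert the image edge map first: edges become the zeros.
    d_i = distance_transform_edt(~image_edges)
    t_points = np.argwhere(template_edges)
    # (1 / |T|) * sum over template edge points of d_I(t)
    return d_i[t_points[:, 0], t_points[:, 1]].mean()
```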

Two other popular colorless traffic sign detection methods are edge detection features and Haar-like features. Edge detection refers to the process of identifying and locating sharp discontinuities in an image [62]. Using this method, the image data are simplified so as to minimize the amount of data to be processed. This method was used in [63–67] to indicate the boundaries of objects within the image by finding a set of connected curves. The Haar-like features method was proposed by Paul Viola and Michael Jones [68], based on the Haar wavelet, to recognize target objects. As indicated in Table 4, the Haar-like feature-based detection method was used in [69,70] for traffic sign detection. Its main advantage is its calculation speed: thanks to integral images, a feature of any size can be computed in constant time. However, its weakness is the requirement for a large number of training images and its high false positive rate [23].

#### 4.1.3. Hybrid Methods

As previously discussed, both color-based and shape-based methods have advantages and disadvantages. Therefore, researchers have recently tried to improve the efficiency of the TSDR system using a combination of color- and shape-based features. In hybrid methods, either color-based approaches take shape into account after having looked at colors, or shape detection is used as the main method but integrates some color aspects as well. In color-based approaches, a two-stage strategy is usually employed: first, segmentation is performed to narrow the search space; subsequently, shape detection is applied only to the segmented regions [58]. Color and shape features were combined into traffic sign detection algorithms in studies [71–76]. In these studies, different signs with various colors and shapes were detected using different datasets.

**Table 4.** Shape-based methods for TSDR system.


#### *4.2. Tracking Phase*

For robust detection and in order to increase the accuracy of the information used in identifying traffic signs, the signs are tracked using a simple motion model and temporal information propagation. This tracking process is very important for real-time applications, whereby the TSDR system verifies the correctness of the traffic sign and keeps tracking the sign to avoid handling the same detected sign more than once [21,83]. The tracking process is performed by feeding the TSDR system with a video recorded by a camera fixed on the vehicle and monitoring the sign candidates over a number of consecutive frames. The accepted sign candidates are only those that show up more than once. If the object is not a traffic sign, or is a sign that only shows up once, it can be eliminated as soon as possible, and thus the computation time of the detection task can be reduced [84]. According to [85], and as shown in Table 5, the most commonly adopted tracker is the Kalman filter, as in [82,85–88]. The block diagram of a TSDR system with a tracking process based on the Kalman filter, as proposed in [82], is shown in Figure 11. In the figure, SIFT, CCD and MLP are abbreviations of scale-invariant feature transform, contracting curve density and multi-layer perceptrons, respectively. A minimal Kalman tracking sketch is given below.

**Table 5.** Sign tracking based on Kalman Filter approaches.


**Figure 11.** An example of a TSDR system includes tracking process based on Kalman filter [81].
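The following is a minimal constant-velocity Kalman filter for tracking a sign's bounding-box center across frames, in the spirit of the trackers in Table 5; the matrix values are generic textbook choices, not those of the cited systems.

```
import numpy as np

class SignTracker:
    def __init__(self, cx, cy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])   # state: x, y, vx, vy
        self.P = np.eye(4)                       # state covariance
        self.F = np.eye(4)                       # constant-velocity motion model
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.zeros((2, 4))                # we only measure position
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)                   # process noise
        self.R = r * np.eye(2)                   # measurement noise

    def predict(self):
        """Propagate the state to the next frame; returns predicted center."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        """Correct the prediction with a detected center (zx, zy)."""
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Candidates whose predicted position is matched by detections in several consecutive frames would be accepted; one-frame candidates would be discarded, as described above.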

#### *4.3. Classification Phase*

After the localization of ROIs, classification techniques are employed to determine the content of the detected traffic signs [1]. Understanding the traffic rule enforced by the sign is achieved by reading the inner part of the detected traffic sign using a classifier method. Classification algorithms are neither color-based nor shape-based. The classifier usually takes a certain set of features as the input, which distinguishes the candidates from each other. Different algorithms are used to classify the traffic signs swiftly and accurately. Some conventional methods used for classification of traffic signs are outlined in Figure 12 and reviewed respectively in Tables 6–13.


**Figure 12.** Most popular classification methods.

**Table 6.** Examples of TSDR systems using a template matching method.


Template matching is a common method in image processing and pattern recognition. It is a low-level approach that uses pre-defined templates to search the whole image pixel by pixel or to perform small-window matching [15]. It was used for TSDR by Ohara et al. [90] and Torresen et al. [91]. It has the advantages of being fast, straightforward and accurate (with a hit rate of approximately 90% on their own photographed image dataset). However, the drawback of this method is that it is very sensitive to noise and occlusions. In addition, it requires a separate template for each scale and orientation. Examples of TSDR systems using a template matching method are shown in Table 6.

Another common classification method is the random forest. It is a machine learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. This method was compared in [92,93] with SVM, MLP and Histogram of Oriented Gradients (HOG)-based classifiers, showing the highest accuracy rate and the lowest computational time. On their own dataset, the accuracy was approximately 94.2%, whereas the accuracy of the SVM was 87.8% and that of the MLP was 89.2%. In terms of computational time for a single classification, the SVM takes 115.87 ms, the MLP takes 1.45 ms, and a decision tree takes 0.15 ms. Despite its high accuracy and low computation time, the main limitation of a random forest is that a large number of trees can make the algorithm slow and ineffective for real-time predictions. Examples of TSDR systems using a decision tree method are shown in Table 7, and a small scikit-learn sketch follows.
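A sketch of a random-forest sign classifier over HOG features with scikit-learn and scikit-image, mirroring the kind of comparison described above; dataset variables and hyperparameters are placeholders, not those of [92,93].

```
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

def hog_features(gray_patches):
    """gray_patches: list of equally sized grayscale sign crops."""
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in gray_patches])

# X_train/y_train are assumed to be prepared sign crops and class labels:
# clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# clf.fit(hog_features(X_train), y_train)
# predictions = clf.predict(hog_features(X_test))
```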


**Table 7.** Examples of TSDR systems using a decision tree method.

**Table 8.** Examples of TSDR systems using a genetic algorithm.


The genetic algorithm is another classification method. It is based on a natural selection process that mimics biological evolution, and it was used early in this research field. This method was used for traffic sign recognition by Aoyagi et al. [98] and Escalera et al. [99]. These studies showed that the method is effective in detecting a traffic sign even if the sign has some shape loss or illumination problems. The disadvantages of the genetic algorithm are its non-deterministic running time and the lack of a guarantee of finding the best solution [57]. Examples of TSDR systems using a genetic algorithm method are shown in Table 8.

The other most common method for classification is the artificial neural network (ANN). This method has gained increasing popularity in recent years due to advancements in general-purpose computing on graphics processing units (GPGPU) [2]. In addition, it is popular due to its robustness, greater adaptability to changes, flexibility and high accuracy rate [100]. Another key advantage of this method is its ability to recognize and classify objects at the same time, while maintaining high speed and accuracy [2]. ANN-based classifiers were used in [56,99,101–108] for TSDR. In the experiment conducted in [56], the hit rate was 97.6%, and the computational time was 0.2 s. However, in [107] ANN-based methods were described as having some limitations, such as slowness and instability in NN training when the training step is too large. This method was compared with a template matching method in [108], concluding that NNs require a large number of training samples for real-world applications. Examples of TSDR systems using an ANN method are shown in Table 9.


**Table 9.** Examples of TSDR systems using an ANN method.

Another increasingly popular method in vision-based object recognition is deep learning. This method has acquired general interest in recent years owing to its high classification performance and the power of representation learning from raw data [109,110]. Deep learning is part of a broader family of machine learning methods. In contrast to task-specific methods, deep learning focuses on data representations learned with supervised, weakly supervised or unsupervised learning. Deep learning methods use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input, and higher-level features are derived from lower-level features to form a hierarchical representation [110]. Among the deep learning models, convolutional neural networks (CNNs) have acquired unique noteworthiness from their repeatedly confirmed superiority [111]. According to [112], CNN models are the most widely used deep learning algorithms for traffic sign classification to date. Examples applied to traffic sign classification include committee CNNs [113], multi-scale CNNs [114], multi-column CNNs [102], multi-task CNNs [111,115], hinge-loss CNNs [116], deep CNNs [46,117], a CNN with dilated convolutions [118], a CNN with a generative adversarial network (GAN) [119], and a CNN with an SVM [120]. Based on these studies, simultaneous detection and classification can be achieved using deep learning-based methods, resulting in improved performance and faster training and testing. Examples of TSDR systems using a deep learning method are shown in Table 10.


**Table 10.** Examples of TSDR systems using a deep learning method.

Adaptive boosting, or AdaBoost, is a combination of multiple learning algorithms that can be utilized for regression or classification [15]. It is a cascade algorithm introduced by Freund and Schapire [122]. Its working concept is based on constructing multiple weak classifiers and assembling them into a single strong classifier for the overall task. As indicated in Table 11, the AdaBoost method was used for TSDR in [123–127]. Based on these studies, it can be concluded that the main advantages of AdaBoost are its simplicity, high prediction power and the ability to cascade an architecture for improved computational efficiency. However, its main disadvantage is that if the input data have wide variations or abrupt changes in the background, the training time increases and the classifier accuracy decreases [121]. In addition, an AdaBoost-trained classifier cannot be dynamically adjusted with newly arriving samples unless retrained from the beginning, which is time consuming and demands storing all historical samples [128]. Examples of TSDR systems using an AdaBoost method are shown in Table 11.

**Table 11.** Examples of TSDR systems using an AdaBoost method.


The support vector machine (SVM) is another classification method, which constructs an N-dimensional hyperplane that optimally separates the data into two categories. More precisely, the SVM is a binary classifier that separates two different classes by a subset of data samples called support vectors. It was implemented as a classifier for traffic sign recognition in [44,55,88,129–136]. This classification method is robust, highly accurate and extremely fast, which makes it a good choice for large amounts of training data. In [129], an SVM-based classifier was applied for detecting speed limit signs and was compared with classifiers based on the artificial neural network multilayer perceptron (MLP), k-nearest neighbors (kNN), least mean squares (LMS), least squares (LS) and the extreme learning machine (ELM). The comparison demonstrated that the SVM-based classifier obtained the highest accuracy and lowest standard deviation among all classifiers. Similarly, in a recent study [3], a cascaded linear SVM classifier was used for detecting speed limit signs, achieving a recall of 99.81% with a precision of 99.08% on the GTSRB dataset. In [55], an SVM-based classifier was used to detect and classify red road signs in 1000 test images, and the accuracy rate was over 95%. In [88,131], the SVM was used with Gaussian kernels for the recognition of traffic signs, and the success rates were 92.3% and 92.6%, respectively. In [136], an advanced SVM method was proposed and tested with binary pictogram and grayscale images, achieving high accuracy rates of approximately 99.2% and 95.9%, respectively. The SVM has also shown great effectiveness in extracting the most relevant shots of an event of interest in a video: a new SVM-based classifier called the nearly-isotonic SVM (NI-SVM) was proposed in [137] for prioritizing video shots using a novel notion of semantic saliency, and the proposed classifier exhibited higher discriminative power in event analysis tasks. The main disadvantage of the SVM is the lack of transparency of its results, i.e., how the results were obtained by the kernel and how they should be interpreted; in an SVM such things cannot be known due to the high-dimensional vector space. Examples of TSDR systems using an SVM method are shown in Table 12, and a small sketch follows.
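A hedged SVM classification sketch with a Gaussian (RBF) kernel, as in [88,131]; the features and labels are placeholders, and the hyperparameters are scikit-learn defaults rather than values from the cited studies.

```
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_sign_svm():
    # Feature scaling matters for RBF kernels, hence the pipeline.
    return make_pipeline(StandardScaler(),
                         SVC(kernel="rbf", C=1.0, gamma="scale"))

# svm = build_sign_svm()
# svm.fit(train_features, train_labels)   # e.g., HOG vectors per candidate ROI
# predicted = svm.predict(test_features)
```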


**Table 12.** Examples of TSDR systems using a SVM method.



In addition to these conventional methods, researchers have used other methods for recognition. In [138], the SIFT matching method was used for recognizing broken areas of a traffic sign. This method adjusts the traffic sign to a standard camera axis and then compares it with a reference image. Sebanja et al. in [139] used principal component analysis (PCA) for both detection and recognition, and the accuracy rate was approximately 99.2%. In [140], the researchers used improved fast radial symmetry (IFRS) for detection and a pictogram distribution histogram (PDH) for recognition. Soheilian et al. in [141] used template matching followed by a three-dimensional (3D) reconstruction algorithm to reconstruct the traffic signs obtained from video data and to improve the visual angle for detecting traffic signs. In [142], Pei et al. used low-rank matrix recovery (LRMR) to recover the correlation for classification, with a hit rate of 97.51% in less than 0.2 s. Gonzalez-Reyna et al. [143] used oriented gradient maps for feature extraction, which are invariant to illumination and variable lighting. For classification, they used the Karhunen–Loeve transform and an MLP, reporting an accuracy of 95.9% and a processing time of 0.0054 s per image. In [35], Miguel et al. used a self-organizing map (SOM) for recognition, where at every level a pre-processor extracts a feature vector characterizing the ROI and passes it to the SOM. The accuracy rate was very high, approximately 99%. Examples of TSDR systems using these other methods are shown in Table 13.

#### **5. Current Issues and Challenges**

TSDR is an essential part of ADAS. It is mainly designed to operate in a real-time environment, enhancing driver safety through the fast acquisition and interpretation of traffic signs. However, there are a number of external, non-technical challenges that may face this system in the real environment, degrading its performance significantly. Among the many issues that need to be addressed while developing a TSDR system are the following, outlined in Figure 13.

**Figure 13.** Some of TSDR challenges.

Variable lighting conditions: Variable lighting is one of the key issues to be considered during TSDR system development. As aforementioned, one of the main distinguishing features of a traffic sign is its unique colors, which discriminate it from the background information, thus facilitating its detection. However, in outdoor environments, illumination changes greatly affect the color of a traffic sign, making color information unreliable as a main feature for traffic sign detection. To cope with this challenge, a method based on adaptive color threshold segmentation and highly efficient shape symmetry algorithms was recently proposed by Xu et al. [26]. This method is claimed to be robust in complex illumination environments, exceeding a detection rate of 94% on the GTSDB dataset.

Fading and blurring effects: Another important difficulty for a TSDR system is the fading and blurring of traffic signs caused by illumination through rain or snow. These conditions can lead to an increase in false detections and reduce the effectiveness of a TSDR system. Using a hybrid shape-based detection and recognition method in such conditions can be very useful and may give superior performance [146].

Affected visibility: Light emitted by the headlamps of oncoming vehicles, shadows, and other weather-related factors such as rain, clouds, snow and fog can lead to poor visibility. Recognizing traffic signs from a road image taken in such cases is a challenging task, and a simple detector may fail to detect these traffic signs. To resolve this problem, it is necessary to enhance the quality of the captured images and make them clear using an image pre-processing technique. Pre-processing filters the image and converts the input information into a usable format for further analysis and detection [147].

Multiple appearances of signs: When detecting traffic signs, mainly in city areas that are more crowded with signs, multiple traffic signs appearing at a time and similarly shaped man-made objects can cause overlapping of signs and lead to false detections. The detection process can also be affected by rotation, translation, scaling and partial occlusion. Li et al. in [33] used an HSI transform and a fuzzy shape recognizer, which is robust and unaffected by these problems; its accuracy rates in different weather conditions are: sunny 94.66%, cloudy 92.05%, rainy 90.72%.

Motion artifacts: In the ADAS application, the images are captured from a moving vehicle, sometimes with a low-resolution camera, so these images often appear blurry. Recognition of blurred images is a challenging task and may lead to false results. In this respect, a TSDR system that integrates color, shape, and motion information could be a possible solution. In such a system, the robustness of recognition is improved by incorporating detection and classification with tracking using temporal information fusion [73]. The detected traffic signs are tracked, and individual detections from sequential frames (t−t0, ... , t) are temporally fused for a robust overall recognition.

Damaged or partially obscured signs: The other distinctive feature of a traffic sign is its unique shape. However, traffic signs can appear in various conditions, including damaged, partly occluded and/or clustered. These conditions can be very problematic for detection systems, particularly shape-based detection systems. To overcome these problems, hybrid methods based on color segmentation and shape analysis are recommended [15].

Unavailability of public databases: A database is a crucial requirement for developing any TSDR system. It is used for training and testing the detection and recognition methods. One of the obstacles facing this research area is the lack of large, properly organized, and freely available public image databases. According to [12], for example, the most commonly used database (the GTSDB database) contains only 600 training images and 300 evaluation images. Of the eight categories classified in the Vienna Convention, GTSDB covers only three categories of traffic signs for detection: prohibitory, mandatory and danger. All included images are of German traffic signs only, which are substantially different from those in other parts of the world. To resolve the database scarcity problem, one idea is to create a unified global database containing a large number of images and videos of road scenes in various countries around the world. These scenes must contain all categories of traffic signs under all possible weather conditions and physical states of the signs.

Real-time application: The detection and recognition of traffic signs are bound up with the real-time performance of the system. Accuracy and speed are surely the two main requirements in practical applications, and achieving them requires a system with efficient algorithms and powerful hardware. A good choice is convolutional neural network-based learning methods with GPGPU technologies [2].

In brief, although many relevant approaches have been presented in the literature, no single approach can solve the traffic sign recognition problem well under conditions of varying illumination, motion blur, occlusion and so on. Therefore, more effective and more robust approaches need to be developed [12].

#### **6. Conclusions and Suggestion**

The major objective of this paper was to analyze the main directions of research in the field of automatic TSDR and to categorize the main approaches into particular sections, making the topics easy to understand and visualizing the overall research for future directions. Unlike most of the available review papers, the scope of this paper has been broadened to cover all recognition phases: Detection, tracking and classification. In addition, this paper has tried to discuss as many studies as possible, in an attempt to provide a comprehensive review of the various alternative methods available for traffic sign detection and recognition, including methods categorization, current trends and the research challenges associated with TSDR systems. The overall summary is presented in Figure 14.

**Figure 14.** Summary of the paper.

The conducted review reveals that research in traffic sign detection and recognition has grown rapidly: the number of papers published during the last three years was approximately 280, which represents about 41.69% of the total number of papers published during the last decade as a whole. With regard to the methods used, it was observed that the subject of traffic sign detection and recognition incorporates three main steps: Detection, tracking and classification; in each step, many methods and algorithms have been applied, each with its own merits and demerits. In general, the methods applied in detection and recognition consider either the color or the shape information of the traffic sign. However, it is well known that image quality in real-world traffic scenarios is usually poor, due to low resolution, weather conditions, varying lighting, motion blur, occlusion, scale, rotation and so on. In addition, traffic signs appear in a variety of forms, with high inter-class similarity and complicated backgrounds. Thus, the proper integration of color and shape information in both the detection and classification phases is a very promising and exciting task that needs much more attention. For tracking, the Kalman filter and its variations are the most common methods. For classification, artificial neural network and support vector machine-based methods were found to be the most popular, with high detection rates, high flexibility and easy adoptability. Despite the recent improvements in the overall performance of TSDR systems, more research is still needed to achieve a rigorous, robust and reliable TSDR system. It is believed that TSDR system performance can be enhanced by merging the detection and classification tasks into one step rather than performing them separately; by doing so, classification can improve detection and vice versa. Another idea for further improvement of TSDR is to use standard, sufficient and large databases for learning, testing and evaluation of the proposed algorithms. In this way, the TSDR system will be able to recognize the eight different categories of traffic signs in the real environment under different conditions. This paper will be a useful reference for researchers looking for an understanding of the current status of research in the field of TSDR and for finding the related research problems in need of solutions.

**Author Contributions:** Conceptualization, M.A.H. and A.H., S.A.S.; methodology, M.A.H., M.A.A. and S.B.W.; software, M.A.A.; validation, S.B.W.; formal analysis, M.A.A.; investigation, M.A.H., and M.B.M.; resources, M.A.H., S.B.W.; data curation, M.A.A. and S.B.W.; writing—original draft preparation, M.A.H., M.A.A., S.B.W., P.J.K.; writing—review and editing, M.A.H., P.J.K., A.H., S.A.S. and M.B.M.; visualization, M.A.A.; supervision, M.A.H., S.A.S.; project administration, M.A.H.; funding acquisition, M.A.H.

**Funding:** This research was funded by the TNB Bold Strategic Grant J510050797 under the Universiti Tenaga Nasional and the Universiti Kebangsaan Malaysia under Grant DIP-2018-020.

**Acknowledgments:** This work is supported by the collaboration between Universiti Tenaga Nasional and Universiti Kebangsaan Malaysia.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Liver Tumor Segmentation in CT Scans Using Modified SegNet**

#### **Sultan Almotairi 1, Ghada Kareem 2, Mohamed Aouf 2, Badr Almutairi 3 and Mohammed A.-M. Salem 4,5,\***


Received: 5 December 2019; Accepted: 4 March 2020; Published: 10 March 2020

**Abstract:** The main cause of cancer-related death worldwide is hepatic cancer. Early detection of hepatic cancer using computed tomography (CT) could prevent millions of patients' deaths every year. However, reading tens or even hundreds of those CT scans is an enormous burden for radiologists. Therefore, there is an immediate need to read, detect, and evaluate CT scans automatically, quickly, and accurately. However, liver segmentation and extraction from CT scans is a bottleneck for any such system and is still a challenging problem. In this work, a deep learning-based technique that was proposed for the semantic pixel-wise classification of road scenes is adopted and modified to fit liver CT segmentation and classification. The architecture of the deep convolutional encoder–decoder is named SegNet and consists of a hierarchical correspondence of encoder–decoder layers. The proposed architecture was tested on a standard dataset of liver CT scans and achieved a tumor accuracy of up to 99.9% in the training phase.

**Keywords:** deep learning; CT images; convolutional neural networks; hepatic cancer

#### **1. Introduction**

The liver is the largest organ, located underneath the right ribs and below the lung base. It has a role in digesting food [1,2]. It is responsible for filtering blood cells, processing and storing nutrients, and converting some of these nutrients into energy; it also breaks down toxic agents [3,4]. There are two main hepatic lobes, the left and right lobes. When the liver is viewed from the undersurface, there are two more lobes, the quadrate and caudate lobes [5].

Hepatocellular carcinoma (HCC) [6] may occur when the liver cells begin to grow out of control and can spread to other areas in the body. Primary hepatic malignancies develop when there is abnormal behavior of the cells [7].

Liver cancer has been reported to be the second most frequent cancer to cause death in men, and the sixth for women. About 750,000 people were diagnosed with liver cancer in 2008, 696,000 of whom died from it. Globally, the rate of infection among males is twice that among females [8]. Liver cancer can develop from viral hepatitis, which is much more problematic. According to the World Health Organization (WHO), about 1.45 million deaths a year occur because of this infection [9]. In 2015, Egypt was named the country with the highest rate of adults infected by viral hepatitis C (HCV), at 7% [9]. Because treatment could not reach all infected people, the Egyptian government launched the "100 Million *Seha*" (*seha* is an Arabic word meaning "health") national campaign between October 2018 and April 2019. By the end of March 2019, around 35 million people had been examined for HCV [10].

Primary hepatic malignancy is more prevalent in Southeast Asia and Africa than in the United States [11,12]. The reported survival rate is generally 18%. However, survival rates rely on the stage of disease at the time of diagnosis [13].

Primary hepatic malignancy is diagnosed by clinical, laboratory, and imaging tests, including ultrasound scans, magnetic resonance imaging (MRI) scans, and computed tomography (CT) scans [14]. A CT scan utilizes radiation to capture detailed images around the body from different angles, including sagittal, coronal, and axial images. It shows organs, bones, and soft tissues; the information is then processed by the computer to create images, usually in DICOM format [15]. Quite often, the examination requires intravenous injection of contrast material. The scans can help to differentiate malignant lesions from acute infection, chronic inflammation, fibrosis, and cirrhosis [16].

Staging of hepatic malignancies depends on the size and location of the malignancy [16,17]. Hence, it is important to develop an automatic procedure to detect and extract the cancer region from the CT scan accurately. Image segmentation is the process of partitioning the liver region in the CT scan into regions, where each region represents a semantic part of the liver [18,19]. This is a fundamental step to support the diagnosis by radiologists, and a fundamental step in creating automatic computer-aided diagnosis (CAD) systems [20–22]. CT scans of the liver are usually interpreted by manual or semi-manual techniques, but these techniques are subjective, expensive, time-consuming, and highly error-prone. Figure 1 shows an example where the gray-level intensities of the liver and the spleen are too similar to be differentiated by the naked eye. To overcome these obstacles and improve the quality of liver tumor diagnosis, multiple computer-aided methods have been developed. However, these systems have not performed well in segmenting the liver and its lesions due to multiple challenges, such as the low contrast between the liver and neighboring organs and between the liver and tumors, different contrast levels within tumors, variation in the number and size of tumors, tissue abnormalities, and irregular tumor growth in response to medical treatment. Therefore, a new approach must be used to overcome these obstacles [23].

**Figure 1.** Example of the similarity in gray levels between the liver and the spleen in computed tomography (CT) images. Imported from the Medical Image Computing and Computer Assisted Intervention (MICCAI) SLIVER07 workshop datasets [24,25].

In this work, a review of a wide variety of recent publications on image analysis for liver malignancy segmentation is introduced. In recent years, extensive research has depended on supervised learning methods. Supervised methods use labeled inputs to train a model for a specific task, in this case liver or tumor segmentation. Foremost among these learning methods are the deep learning methods [26,27]. Many different deep learning models have been introduced, such as stacked auto-encoders (SAE), deep belief nets (DBN), convolutional neural networks (CNNs), and deep Boltzmann machines (DBM) [28–31]. The superiority of deep learning models in terms of accuracy has been established. However, it is still a challenge to find a proper training dataset, which should be huge in size and prepared by experts.

CNNs are considered the most effective of the deep learning methods used. Elshaer et al. [13] reduced the computation time over a large number of slices by using two trained deep CNN models. The first model was used to extract the liver region, and the second model was used to avoid blurring from image re-sampling and to avoid missing small lesions.

Wen Li et al. [28] utilized a convolutional neural network (CNN) that uses image patches. It considers an image patch for each pixel, such that the pixel of interest is at the center of that patch. The patches are divided into normal or tumor liver tissue: if a patch contains at least 50 percent tumor tissue, it is labeled as a positive sample. The reported accuracy reached about 80.6%. The work presented in [12,13] reported a more than 94% accuracy rate for classifying images as either normal or abnormal, where an abnormal image shows a liver with tumor regions. The CNN model has different architectures, i.e., AlexNet, VGG-Net, ResNet, etc. [32–34]. The work presented by Bellver et al. [5] used the VGG-16 architecture as the base network. Other works [11,16,29,32,35,36] have used the two-dimensional (2D) U-Net, which is designed mainly for medical image segmentation.

The main objective of this work is to present a novel segmentation technique for liver cross-sectional CT scans based on a deep learning model that has proven successful in image segmentation for scene understanding, namely SegNet [37]. Memory and performance efficiency are the main advantages of this architecture over the other models. The model has been modified to fit two-class classification tasks.

The paper is organized as follows. In the next section, a review is presented on recent segmentation approaches for the liver and lesions in CT images, as well as a short introduction to the basic concepts addressed in this work. Section 3 presents the proposed method and the experimental dataset. Experimental results are presented in Section 4. Finally, conclusions are presented and discussed in Section 5.

#### **2. Basic Concepts**

Convolutional neural networks are similar to traditional neural networks [20,38,39]. A convolutional neural network (CNN) includes one or more convolutional, pooling, fully connected, and rectified linear unit (ReLU) layers. Generally, as the network becomes deeper with many more parameters, the accuracy of the results increases, but the network also becomes more computationally complex.

Recently, CNN models have been widely used in image classification for different applications [20,34,40–42] or to extract features from the convolutional layers before or after the down-sampling layers [41,43]. However, the architectures discussed above are not suitable for image segmentation or pixel-wise classification. The VGG-16 network architecture [44] is a type of CNN model. The network includes 41 layers, 16 of which have learnable weights: 13 convolutional layers and three fully connected layers. Figure 2 shows the architecture of VGG-16 as introduced by Simonyan and Zisserman [44].

**Figure 2.** VGG-16 network architecture [44].

Most pixel-wise classification network architectures are of an encoder–decoder design, where the encoder part is the VGG-16 model. The encoder gradually decreases the spatial dimension of the images with pooling layers, while the decoder retrieves the object details and spatial dimensions for fast and precise segmentation of images. U-Net [45,46] is a convolutional encoder–decoder network used widely for semantic image segmentation. It is interesting because it applies a fully convolutional network architecture to medical images. However, it is very time- and memory-consuming.

The semantic image segmentation approach uses the predetermined weights of the pretrained VGG-16 network [45]. Badrinarayanan et al. [37] proposed an encoder–decoder deep network, named SegNet, for scene understanding applications, tested on road and indoor scenes. The main parts of the core trainable segmentation engine are an encoder network, a decoder network, and a pixel-wise classification layer. The architecture of the encoder network is similar to the 13 convolutional layers of the VGG-16 network. The function of the decoder network is to map the low-resolution encoder feature maps to full-input-resolution feature maps for pixel-wise classification. Figure 3 shows a simple illustration of the SegNet model during the down-sampling (max-pooling or subsampling) layers of the encoder part. Instead of transferring the pixel values to the decoder, the indices of the chosen pixels are saved and synchronized with the decoder for the up-sampling process. In SegNet, more shortcut connections are present; since the max-pooling indices are copied instead of the encoder features themselves, as in FCN [47], SegNet is much more memory- and compute-efficient than FCN and U-Net. A toy sketch of this index-sharing mechanism is shown below.
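The following is a minimal PyTorch illustration of the idea described above: the encoder's max-pooling layer returns the indices of the chosen pixels, and the decoder uses max-unpooling with those saved indices instead of learned upsampling. This is a toy one-stage sketch with an assumed channel count, not the full SegNet.

```
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 360, 480)   # an encoder feature map (assumed shape)
y, indices = pool(x)               # downsample; remember the argmax positions
# ... decoder convolutions would run on y here ...
z = unpool(y, indices, output_size=x.size())  # sparse upsampling at saved indices
```

Only the integer indices are passed to the decoder, which is why SegNet's memory footprint is smaller than that of architectures copying whole encoder feature maps.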

**Figure 3.** SegNet network architecture [37].
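The index-passing mechanism that distinguishes SegNet maps directly onto standard deep learning primitives. Below is a minimal PyTorch sketch of pooling with saved indices and the corresponding sparse unpooling (an illustration of the idea, not the authors' implementation):

```
import torch
import torch.nn as nn

# Encoder-side pooling records the argmax index of each pooling window;
# the decoder uses those indices for sparse upsampling instead of
# receiving full feature maps.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 360, 480)        # a batch of encoder feature maps
pooled, indices = pool(x)               # downsample, keep indices
restored = unpool(pooled, indices)      # decoder-side sparse upsampling
print(pooled.shape, restored.shape)     # (1, 64, 180, 240), (1, 64, 360, 480)
```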

#### **3. Materials and Method**

This section discusses the steps and the implementation of the proposed method for segmentation of a liver tumor. The proposed method follows the conventional pattern recognition scheme: preprocessing, feature extraction and classification, and post-processing.

#### *3.1. Dataset*

The 3D-IRCADb-01 database is composed of three-dimensional (3D) CT scans of 20 different patients (10 female and 10 male), with hepatic tumors in 15 of those cases. Each slice has a resolution of 512 × 512 pixels. The depth, i.e., the number of slices per patient, ranges between 74 and 260. Along with the patient images in DICOM format, labeled images and mask images are given that can be used as ground truth for the segmentation process. The location of the tumors is given according to the Couinaud segmentation [48]. The dataset also documents the main difficulties of segmenting the liver via software [49].

#### *3.2. Image Preprocessing*

In the preprocessing step, the DICOM CT images were converted to the portable network graphics (PNG) file format. PNG was chosen because it is lossless and therefore preserves image quality. In DICOM format, the pixel values are in Hounsfield units, in the range [−1000, 4000]. In this format, the images cannot be displayed, and many image processing operations fail. Therefore, a color depth conversion, mapping the range of pixel values to positive 1-byte integers, is necessary. The mapping is done according to the following formula:

$$g = \frac{h - m\_1}{m\_2 - m\_1} \ast 255 \tag{1}$$

where *h* is the pixel value in Hounsfield units, *g* is the corresponding gray level value, and *m*<sub>1</sub> and *m*<sub>2</sub> are the minimum and maximum of the Hounsfield range, respectively.

The second step is to put the images into a format accepted by the SegNet model [37]. The images were converted to three channels, analogous to the RGB color space, by simply duplicating the slice in each channel and resizing to the dimensions 360 × 480 × 3. Figure 4 shows three samples of the input images before color depth correction; in this format, the images have too low contrast and are not suitable for use by the deep learning model.

**Figure 4.** Raw CT slices in DICOM format for three different patients imported from the IRCAD dataset [50].
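Both preprocessing steps can be expressed compactly. The sketch below (assuming pydicom and Pillow as reader and resizer; the paper does not name its tooling) applies the mapping of Equation (1) and the three-channel stacking described above:

```
import numpy as np
import pydicom
from PIL import Image

def preprocess_slice(path, m1=-1000.0, m2=4000.0):
    """Map Hounsfield values to [0, 255] per Equation (1), then stack the
    slice into three identical channels at 360 x 480."""
    h = pydicom.dcmread(path).pixel_array.astype(np.float32)
    g = (h - m1) / (m2 - m1) * 255.0              # Equation (1)
    g = np.clip(g, 0, 255).astype(np.uint8)       # positive 1-byte integers
    img = Image.fromarray(g).resize((480, 360))   # width x height
    return np.stack([np.asarray(img)] * 3, axis=-1)  # 360 x 480 x 3
```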

In order to increase the performance of the system, the training images were subject to data augmentation, in which the images are transformed by a set of affine transformations, such as flipping, rotation, and mirroring, as well as by augmenting the color values [38,51,52]. Perez et al. [53] discussed the effectiveness of data augmentation on classification results when deep learning is used, and showed that traditional augmentation techniques can improve the results by about 7%.
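For illustration, a minimal sketch of such affine augmentations (flips and right-angle rotations only; the exact transformation set and parameters used in the experiments are not specified here):

```
import numpy as np

def augment(image, label, rng=np.random):
    """Apply random flips and 90-degree rotations identically to a slice
    and its label mask, so image and ground truth stay aligned."""
    if rng.rand() < 0.5:
        image, label = np.fliplr(image), np.fliplr(label)   # mirror
    if rng.rand() < 0.5:
        image, label = np.flipud(image), np.flipud(label)   # vertical flip
    k = rng.randint(4)                                      # k * 90 degrees
    return np.rot90(image, k), np.rot90(label, k)
```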

#### *3.3. Training and Classification*

The quality of CNN features was compared with that of traditional feature extraction methods, such as LBP, GLCM, wavelet, and spectral features; among these texture extractors, the CNN features gave the best performance. CNN training consumes some time; however, once the convolutional network is trained, features can be extracted from it cheaply compared with other complex textural methods. CNNs have proven to be effective in classification tasks [26]. The training data and data augmentation are combined by reading batches of training data, applying data augmentation, and sending the augmented data to the training algorithm. The training starts from the data source, which contains the training images, the pixel labels, and their augmented forms.

#### **4. Experimental Results**

#### *4.1. Evaluation Metrics*

The output results of the classification were compared against the ground truth given by the dataset. The comparison was done on a pixel-to-pixel basis. To evaluate the results, we applied the evaluation metrics given below. Table 1 presents the confusion matrix for binary classification.


**Table 1.** Terms used to define sensitivity, specificity, and accuracy.

From the confusion matrix, several important metrics were computed, as listed below (a sketch implementing them follows the list):

1. Overall Accuracy: this represents the percentage of correctly classified pixels relative to the total number of pixels. This can be formulated as in Equation (2):

$$accuracy = \frac{TN + TP}{TN + TP + FN + FP} \tag{2}$$

while the mean accuracy is the mean of accuracies reported across the different testing folds.

2. Recall (Re) or true positive rate (TPR): this represents the capability of the system to correctly detect tumor pixels relative to the total number of true tumor pixels, as formulated in Equation (3):

$$Re = \frac{TP}{TP + FN} \tag{3}$$

3. Specificity or true negative rate (TNR): this represents the rate of correctly detected background or normal tissue, as formulated in Equation (4):

$$Specificity = \frac{TN}{TN + FP} \tag{4}$$

Since most of the image is normal tissue or background, the overall accuracy is strongly influenced by the TNR. Therefore, some additional measures for the tumor class are computed.

4. Intersection over union (IoU): this is the ratio of correctly classified pixels relative to the union of predicted and actual number of pixels for the same class. Equation (5) shows the formulation of the IoU:

$$IoU = \frac{TP}{TP + FP + FN} \tag{5}$$

5. Precision (Pr): this measures the trust in the predicted positive class, i.e., prediction of a tumor. It is formulated as in Equation (6):

$$Pr = \frac{TP}{TP + FP} \tag{6}$$

6. F1 score (F1): this is a harmonic mean of recall (true positive rate) and precision, as formulated in Equation (7). It measures whether a point on the predicted boundary has a match on the ground truth boundary or not:

$$F1 = \frac{2 \left( Pr \ast Re \right)}{Pr + Re} \tag{7}$$
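The following sketch computes all six metrics from the entries of the binary confusion matrix, exactly as defined in Equations (2)–(7):

```
def segmentation_metrics(tp, tn, fp, fn):
    """Pixel-level metrics from the binary confusion matrix counts."""
    accuracy    = (tn + tp) / (tn + tp + fn + fp)       # Eq. (2)
    recall      = tp / (tp + fn)                        # Eq. (3)
    specificity = tn / (tn + fp)                        # Eq. (4)
    iou         = tp / (tp + fp + fn)                   # Eq. (5)
    precision   = tp / (tp + fp)                        # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (7)
    return dict(accuracy=accuracy, recall=recall, specificity=specificity,
                iou=iou, precision=precision, f1=f1)
```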

#### *4.2. Data Set and Preprocessing*

As mentioned before, the dataset used to test the proposed algorithm is 3D-IRCADb, offered by the French Research Institute against Digestive Cancer (IRCAD) [50]. It has two subsets; the first one, 3D-IRCADb-01, is the one appropriate for liver tumor segmentation. This subset consists of publicly available 3D CT scans of 20 patients, half of them women and half men, with hepatic tumors in 75% of the cases. All the scans are available in DICOM format with axial dimensions of 512 × 512. For each case, tens of 2D images are available, together with labeled and mask images prepared by radiologists. In this work, we have considered all 15 tumor cases, with a total of 2063 images for training and testing. The dataset has been used widely in recent work, as in [54–57].

All image slices were subject to preprocessing, as discussed above. The labeled images provided by the dataset are preprocessed by the same procedure, except for the range-mapping step, since they are given as binary images in the range [0, 255]. Figures 5 and 6 show examples of the preprocessing steps on input images. Associated with the input (patient) images are the labeled images, which are annotated by experts and fed to the system as ground truth for the segmentation process.

**Figure 5.** Samples of the slices after color range mapping to [0, 255]. The three images correspond respectively to the images in Figure 4.

**Figure 6.** Liver tumor labeled images from the ground truth given by the dataset. The three images correspond respectively to the images in Figures 4 and 5.

#### *4.3. Training and Classification*

Three of the 15 cases of the dataset were used for testing and evaluation, with a total of 475 images. Among these, 454 images were used for training and validation, and 45 images were used for testing.

The first training and testing experiments were carried out using the U-Net model in [45]. The U-Net model is trained to perform semantic segmentation on medical images and is based on VGG-16, as discussed before. The results were near perfect for extracting the liver region. However, the model failed completely when used to extract tumor regions from the image: the tumor region was almost always missed or predicted as another class.

The proposed architecture is based on the SegNet model [37], which consists of an encoder network and a corresponding decoder network connected to a 2D multi-class classification layer for pixel-based semantic segmentation. Here, the final classification layer was replaced by a 2D binary classification layer. The trained VGG-16 model was imported for the encoder part. Figure 7 shows an illustration of the proposed network architecture. To improve training, class weighting based on median frequency class weights was used to balance the classes.

**Figure 7.** The proposed architecture.
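Median frequency balancing weighs each class by the ratio of the median class frequency to that class's own pixel frequency, which compensates for the dominance of background pixels. A simplified sketch (frequencies computed over all masks at once, a slight simplification of the usual per-image definition):

```
import numpy as np

def median_frequency_weights(label_masks):
    """Class weights for {background: 0, tumor: 1} from binary masks."""
    masks = np.asarray(label_masks)                   # shape (N, H, W)
    freq = np.array([(masks == c).mean() for c in (0, 1)])
    return np.median(freq) / freq                     # weight per class
```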

For testing, the network returns a semantic segmentation of the input image, together with the classification scores for each categorical label, when run on a single image from the test set.

#### *4.4. Testing and Evaluation*

The proposed method was trained on a machine with an NVIDIA GTX 1050 GPU (4 GB RAM) and an Intel Core i7-7700HQ 2.20 GHz CPU (16 GB RAM), and was developed with MATLAB 2018b, which offers a Neural Network Toolbox and an Image Processing Toolbox.

The images of the tested cases were divided randomly into two groups for training and testing at a ratio of 9:1. The results on the training data are normally higher than those achieved in testing. Figure 8 shows three samples of the testing output, where the resulting binary segmentation is overlaid on the input gray-level images. At this stage, an almost perfect segmentation was achieved. Table 2 lists the evaluation metrics for the three cases. The network was trained for 100 epochs of 1000 iterations each on a single GPU, with a constant learning rate of 0.001. It is clear from Table 2 that as the number of training images increases, the segmentation quality increases, up to the perfect results of case 3.

**Figure 8.** Samples of the results of testing. The predicted tumor position is highlighted with the red color.

**Table 2.** Evaluation metrics for the training of the three test cases.


For testing, a semantic segmentation is returned for the input image, with the classification scores for each categorical label. Figure 9 illustrates the evaluation method, where the resulting segmented images are superimposed over the ground truth image. The correctly classified tumor pixels, known as true positives, are colored in white. It is clear from this figure that the results of case 1 are the least accurate, while the results of case 3 are perfect in terms of tumor detection; however, the tumor appears larger than it actually is.

**Figure 9.** Samples of the resulting segmented image superimposed over the ground truth image. The correctly classified tumor pixels (known as true positive (TP)) are colored in white. The missed tumor pixels are colored in purple. The pixels that are predicted to belong to the tumor, but actually are pixels representing normal tissue or the background, are colored in green. The black color represents pixels that are correctly classified as normal or background.

The experimental results are presented as confusion matrices in Tables 3–5 for test cases 1, 2, and 3, respectively. The results displayed are normalized.


**Table 3.** Normalized confusion matrix for test case 1.

**Table 4.** Normalized confusion matrix for test case 2.



**Table 5.** Normalized confusion matrix for test case 3.


To put the presented results in context, Table 6 compares the overall accuracy of the proposed method with selected works from the literature, according to the results reported in their papers. The proposed method achieved higher accuracy than the works in this comparison.



#### **5. Conclusions**

This paper presented experimental work adapting a deep learning model originally used for semantic segmentation in road-scene understanding to tumor segmentation in liver CT scans in DICOM format.

SegNet is a recent encoder–decoder network architecture that employs the trained VGG-16 image classification network as the encoder, and a corresponding decoder architecture to transform the features back into the image domain, reaching a pixel-wise classification at the end. The advantage of SegNet over a standard auto-encoder architecture lies in a simple yet very efficient modification: the max-pooling indices of the feature maps are saved, instead of the feature maps in full. As a result, the architecture is much more efficient in training time, memory requirements, and accuracy.

To facilitate binary segmentation of medical images, the classification layer was replaced with a binary pixel classification layer. For training and testing, the standard 3D-IRCADb-01 dataset was used. The proposed method correctly detects most parts of the tumor, with accuracy above 86% for tumor classification. However, examining the results revealed a few false positives, which could be reduced by applying false positive filters or by training the model on a larger dataset.

As a future work, we propose using a new deep learning model as an additional level to increase the localization accuracy of the tumor, and hence reduce the FN rate and increase the IoU metric, like the work introduced in [20].

**Author Contributions:** Conceptualization and methodology, M.A.-M.S. and S.A.; software and validation, G.K. and M.A.; formal analysis, M.A. and M.A.-M.S.; writing—original draft preparation, M.A.-M.S.; writing—review and editing, G.K., M.A., and B.A.; supervision, M.A.-M.S.; project administration, S.A.; funding acquisition, S.A. and B.A. Reviewed by all authors. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Deanship of Scientific Research at Majmaah University under project RGP-2019-29.

**Acknowledgments:** The authors would like to thank Mamdouh Mahfouz, Professor of Radiology at Kasr-El-Aini School of Medicine, Cairo University, Egypt, for his help in revising the introduction in order to correct any misused medical terms.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. Al-Shaikhli, S.D.S.; Yang, M.Y.; Rosenhahn, B. Automatic 3d liver segmentation using sparse representation of global and local image information via level set formulation. *arXiv* **2015**, arXiv:abs/1508.01521.



### *Article* **HEMIGEN: Human Embryo Image Generator Based on Generative Adversarial Networks**

#### **Darius Dirvanauskas 1, Rytis Maskeliūnas 1, Vidas Raudonis 2, Robertas Damaševičius 3,4,\* and Rafal Scherer 5**


Received: 26 July 2019; Accepted: 14 August 2019; Published: 16 August 2019

**Abstract:** We propose a method for generating synthetic images of human embryo cells that could later be used for classification, analysis, and training, thus resulting in the creation of new synthetic image datasets for research areas lacking real-world data. Our focus was not only to generate a generic image of a cell as such, but to make sure that it has all the necessary attributes of a real cell image to provide a fully realistic synthetic version. We use human embryo images obtained during cell development processes for training a deep neural network (DNN). The proposed algorithm uses a generative adversarial network (GAN) to generate one-, two-, and four-cell stage images. We achieved a misclassification rate of 12.3% for the generated images, while the expert evaluation showed a true recognition rate (TRR) of 80.0% (for four-cell images), 86.8% (for two-cell images), and 96.2% (for one-cell images). Texture-based comparison using the Haralick features showed that there are no statistically significant (Student's t-test, *p* < 0.01) differences between the real and synthetic embryo images, except for the sum of variance (for one-cell and four-cell images) and the variance and sum of average (for two-cell images) features. The obtained synthetic images can later be adapted to facilitate the development, training, and evaluation of new algorithms for embryo image processing tasks.

**Keywords:** deep learning; neural network; generative adversarial network; synthetic images

#### **1. Introduction**

Deep neural networks (DNN) have become one of the most popular modern tools for image analysis and classification [1]. One of the first accurate implementations of DNN, AlexNet [2], was quickly bested by the deep convolutional activation feature (DeCAF) network, which extracted features from AlexNet and evaluated the efficacy of these features on generic vision tasks [3]. VGGNet demonstrated that increasing the depth of a convolutional neural network (CNN) is beneficial for classification accuracy [4]. However, deeper neural networks are more difficult to train. ResNet presented a residual learning framework to ease the training of very deep networks [5]. The ResNet architecture, which combined semantic information from a deep, coarse layer with appearance information from a shallow network, was adapted for image segmentation [6]. Region CNN (R-CNN) was proposed as a combination of high-capacity CNNs with bottom-up region proposals in order to localize and segment images [7].

The breakthrough of DNNs in quality led to numerous adaptations in solving medical image processing problems such as the analysis of cancer cells [8] and cancer type analysis [9]. Deep max-pooling CNNs were used to detect mitosis in breast histology images using a patch centered on the pixel as context [10]. The U-Net architecture with data augmentation and elastic deformations achieved very good performance on different biomedical segmentation applications [11]. A supervised max-pooling CNN was trained to detect cell pixels in regions that are preselected by a support vector machine (SVM) classifier [12]. After a pre-processing step to remove artefacts from the input images, fully CNNs were used to produce the embryo inner cell mass segmentation [13]. A set of Levenberg–Marquardt NNs trained using textural descriptors allowed for predicting the quality of embryos [14].

Image analysis methods were also applied to embryo image analysis. Techniques for extracting, classifying, and grouping properties were used to measure the quality of embryos. Real-time grading techniques for determining the number of embryonic cells from time-lapse microscope images help embryologists to monitor the dividing cells [15]. Conditional random field (CRF) models [16] were used for cell counting, assessing various aspects of the developing embryo, and predicting the stage of embryonic development. In addition to grading, embryonic positioning was achieved using a linear chain Markov model [17]. Different solutions were developed for segmenting and counting cells: ImageJ, MIPAV, VisSeg [18]. Moreover, segmenting cells by marking their centers and edges makes it possible to determine their shapes and quantities more quickly and accurately [19]. A two-stage classifier for embryo image classification was proposed in [20].

The rising number of generative adversarial network (GAN) applications in medical imaging, where most research focuses on synthetic imaging, reconstruction, segmentation, and classification, proves the importance of the sector. Wasserstein-based GANs were applied for the synthesis of cells imaged by fluorescence microscopy, capturing relationships relevant for biological applications [21]. The quality of generated artificial images has also improved due to improvements in deep neural networks [22], as well as to the progressive growing of GANs for image data augmentation [23]. Variational autoencoders (VAE) and generative adversarial networks (GANs) are currently the most distinguished in terms of the quality of their results. The VAE models are more often used for image compression and recovery. Both methods have drawbacks, as recovering part of the information loses some data and often introduces image fading or blurring. This effect can be reduced by matching the data as well as the loss distributions of the real and fake images, using a pair of autoencoders as the generator and the discriminator in adversarial training [24]. Growing both the generator and the discriminator progressively from a low resolution and adding new layers for increasingly fine details achieved a state-of-the-art inception score of 8.80 on the CIFAR10 dataset [25]. Network modifications can be applied to minimize fading of the resulting images [24]. Generative stochastic networks (GSN) were used to learn the transition operator of a Markov chain whose stationary distribution estimates the data distribution [26]. The deep recurrent attentive writer (DRAW) neural network architecture [27] was used to generate highly realistic natural images, such as photographs of house numbers or other digits in the MNIST database. Dosovitskiy et al. [28] introduced an algorithm to find relevant information from existing 3D chair models and to generate new chair images using that information. Denton et al. [29] introduced a generative parametric model capable of producing high-quality samples of natural images, based on a cascade of convolutional networks within a Laplacian pyramid framework. Radford and Metz [30] introduced a simplification of training through a modification called deep convolutional GANs (DCGANs). GANs can also recover images by bridging text and image modelling, thus translating visual concepts from characters to pixels [31]. Zhang et al. [32] proposed stacked GANs (StackGAN) with conditioning augmentation for synthesizing photo-realistic images. GANs can also be applied to generating motion images, e.g., the motion and content decomposed GAN (MoCoGAN) framework for video generation, which maps a sequence of random vectors to a sequence of video frames [33]. The GAN for video (VGAN) model was based on a GAN with a spatial-temporal convolutional architecture that untangles the scene's foreground from the background, and can be used to predict plausible futures of static images [34]. Temporal GAN (TGAN) was used to learn a semantic representation of unlabeled videos by using different types of generators via Wasserstein GAN, together with a method to train it stably in an end-to-end manner [35]. MIT presented a 3D generative adversarial network (3D-GAN) model to recreate 3D objects from a probabilistic space by leveraging recent advances in volumetric CNNs and GANs [36]. Li et al. [37] used a multiscale GAN (DR-Net) to remove rain streaks from a single image. Zhu [38] used GANs for generating synthetic saliency maps for given natural images. Ma et al. [39] used background augmentation GANs (BAGANs) for synthesizing background images for augmented reality (AR) applications. Han et al. presented a system based on Wasserstein GANs for generating realistic synthetic brain MR images [40].

All these results were achieved using large, expert-annotated, ready-made databases, which exposes the problem of lacking good core training datasets. In this work, we aimed to develop a method to generate realistic synthetic images that could later be used for classification, analysis, and training, thus resulting in the creation of novel synthetic datasets for research areas lacking data, such as human embryo images. Here we used human embryo images obtained during cell development processes for training a DNN. We propose an algorithm for generating one-, two-, and four-cell images (the selection was based on the initial dataset provided by our medical partners) to increase the overall number of unique images available for training. For generating images, we have developed a generative adversarial network for image recovery, filling, and improvement. The significance of our approach is that the method was applied specifically to embryonic cell images. It was very important for the GAN to accurately reproduce the outline of the cell (often poorly visible even in a microscopy picture), since the whole image is almost uniformly gray and the cell itself is translucent. Our focus was not only to generate a generic image of a cell as such, but to make sure that it has all the necessary attributes of a real cell image to provide a fully realistic synthetic version. We believe that, as the large number of real embryo images required for training neural networks is difficult to obtain due to ethics requirements, the synthetic images generated by the GAN can later be adapted to facilitate the development, training, and evaluation of new algorithms for embryo image-processing tasks.

#### **2. Materials and Methods**

#### *2.1. Architecture of the Generative Adversarial Network (GAN)*

The generative adversarial network (GAN) consists of two parts: A generator *G* and a discriminator *D* (Figure 1).

**Figure 1.** A pipeline of a generative adversarial network (GAN).

The discriminator tries to distinguish true images from synthetic, generator-generated images. The generator tries to create images with which the discriminator can be deceived. The discriminator is implemented as a fully connected neural network with dense layers that represents an image as a probability vector and classifies it into two classes, a real or a fake image. The generator is a reverse model that restores an image from a random noise vector. During training, the discriminator is trained to maximize the probability of assigning the correct class to the training images and to the images produced by the generator. At the same time, the generator is trained to minimize the classification error log(1 − *D*(*G*(*z*))). This process can be expressed in terms of game theory as a two-player (discriminator and generator) minimax game with cost function *V*(*G*, *D*) [41]:

$$\min\_{G} \max\_{D} V(D, G) = E\_{x \sim P(x)}[\log(D(x))] + E\_{z \sim P(z)}[\log(1 - D(G(z)))].\tag{1}$$
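A compact sketch of one training step implementing this objective (the names `generator`, `discriminator`, and `combined` — the generator stacked with a frozen discriminator — are our assumptions; the authors' code is not shown):

```
import numpy as np

def train_step(real_images, batch_size=256, latent_dim=100):
    """One alternating update of Equation (1) with Keras-style models."""
    z = np.random.normal(size=(batch_size, latent_dim))
    fake_images = generator.predict(z)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    d_real = discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    d_fake = discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))

    # Generator step: the usual non-saturating surrogate, i.e., train G so
    # that D labels its samples as real.
    g_loss = combined.train_on_batch(z, np.ones((batch_size, 1)))
    return d_real, d_fake, g_loss
```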

The discriminator evaluates the images coming from the training database and from the generator. The architecture of the discriminator network is shown in Figure 2. The network consists of six layers. In the first layer, the monochrome 200 × 200 pixel image is flattened into a single vector. Dense layers are used as the second, fourth, and sixth layers. The LeakyReLU [42] function with α = 0.2 is used in the third and fifth layers, as it allows a small gradient when the unit is not active:

$$\text{LReLU}(x) = \max(x, 0) + \alpha \min(x, 0). \tag{2}$$

**Figure 2.** Architecture of the discriminator network.

At the network output, we get a one if the network judges the image to be real, and a zero if it judges the image to be fake.
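A Keras sketch of this six-layer discriminator (the dense layer widths are our assumption, since they are not listed above):

```
from keras.models import Sequential
from keras.layers import Dense, Flatten, LeakyReLU

discriminator = Sequential([
    Flatten(input_shape=(200, 200, 1)),  # layer 1: image to vector
    Dense(512),                          # layer 2
    LeakyReLU(alpha=0.2),                # layer 3
    Dense(256),                          # layer 4
    LeakyReLU(alpha=0.2),                # layer 5
    Dense(1, activation='sigmoid'),      # layer 6: real (1) vs. fake (0)
])
```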

The generator is composed of 11 layers (Figure 3). The input layer contains a 1 × 100 vector of randomly generated data. These data are transformed using dense, LeakyReLU, and batch normalization layers. The LeakyReLU layers use α = 0.2. The batch normalization layers normalize the activations of the previous layer at each batch, applying a transformation that keeps the mean activation close to 0 and the activation standard deviation close to 1 [43]:

$$\hat{x}\_{i} \leftarrow \frac{x\_{i} - \mu\_{\beta}}{\sqrt{\sigma\_{\beta}^{2} + \varepsilon}}.\tag{3}$$

**Figure 3.** Architecture of the generator network.

The batch normalization momentum value was set to 0.8, while all other values were kept at the Keras defaults. The output is a 200 × 200 pixel black-and-white image.
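A matching Keras sketch of the 11-layer generator (the intermediate widths and the output activation are our assumptions):

```
from keras.models import Sequential
from keras.layers import Dense, LeakyReLU, BatchNormalization, Reshape

generator = Sequential([
    Dense(256, input_dim=100),             # latent 1 x 100 vector in
    LeakyReLU(alpha=0.2),
    BatchNormalization(momentum=0.8),
    Dense(512),
    LeakyReLU(alpha=0.2),
    BatchNormalization(momentum=0.8),
    Dense(1024),
    LeakyReLU(alpha=0.2),
    BatchNormalization(momentum=0.8),
    Dense(200 * 200, activation='tanh'),   # pixel vector
    Reshape((200, 200, 1)),                # 200 x 200 single-channel image
])
```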

#### *2.2. Training*

We trained one (single) network to generate all types of cell images. For the training, we used the Adam optimization algorithm [44] with a fixed learning rate of 0.00002 and β<sub>1</sub> = 0.7 for training the generator. During training, the Adam algorithm calculates the gradient as follows:

$$g\_t \leftarrow \nabla\_\theta f\_t(\theta\_{t-1}).\tag{4}$$

The biased first moment estimate is updated as follows:

$$m\_t \leftarrow \beta\_1 m\_{t-1} + (1 - \beta\_1) g\_t. \tag{5}$$

The biased second moment estimate is updated as follows:

$$v\_t \leftarrow \beta\_2 v\_{t-1} + (1 - \beta\_2) g\_t^2. \tag{6}$$

Then we compute the bias corrected first moment estimate as follows:

$$\hat{m}\_t \leftarrow \frac{m\_t}{1 - \beta\_1^t}.\tag{7}$$

The bias corrected second moment estimate is computed as follows:

$$\hat{v}\_t \leftarrow \frac{v\_t}{1 - \beta\_2^t}.\tag{8}$$

This gives the update parameters as follows:

$$\theta\_t \leftarrow \theta\_{t-1} - \frac{\alpha \hat{m}\_t}{\sqrt{\hat{v}\_t} + \varepsilon}.\tag{9}$$

The binary cross-entropy function is used to evaluate the error as follows:

$$-\frac{1}{N}\sum\_{i=1}^{N} \left[ y\_i \log(\hat{y}\_i) + (1 - y\_i)\log(1 - \hat{y}\_i) \right].\tag{10}$$
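In Keras, the optimizer and loss described by Equations (4)–(10) reduce to a single compile call (older Keras versions spell the learning-rate argument `lr`; newer ones use `learning_rate`):

```
from keras.optimizers import Adam

# Adam with the fixed learning rate 0.00002 and beta_1 = 0.7 given above;
# binary cross-entropy is Equation (10).
opt = Adam(learning_rate=0.00002, beta_1=0.7)
discriminator.compile(loss='binary_crossentropy', optimizer=opt,
                      metrics=['accuracy'])
```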

#### *2.3. Evaluation of Image Quality*

A problem with GAN models is that there is no clear way to measure their quality. A few criteria are most commonly used to evaluate a GAN: average log-likelihood, classifier-based evaluation [41], and visual fidelity of samples [45]. These methods have both advantages and disadvantages. Log-likelihood and classifiers attribute an image to a certain class with a matching factor; however, they only indicate whether a generated image resembles the average of one of the classes, and do not assess whether the image itself is qualitatively well generated. Such a qualitative assessment requires a human evaluator, but with a large amount of data, such estimates can vary and be biased [46]. Likewise, different experts could evaluate the same data somewhat differently.

We use a combined method to evaluate the results obtained in our study, which includes: (1) Human expert evaluation, (2) histogram comparison, and (3) texture feature comparison. The generated images are evaluated by human experts (three medical research professionals) to determine whether they are reproduced qualitatively: if the class of a generated image can be determined, the image is considered well restored. The expert-based scoring was calculated using a visual Turing test [47], as this method has already proven effective in the evaluation of synthetic images generated by GANs [48].

The histogram comparison method checks whether the histogram of the generated images corresponds to the histogram of the images in the training database. For comparison, we used the normalized average histogram of the training data, *H*<sub>1</sub>, and the normalized average histogram of the generated images, *H*<sub>2</sub>. We then apply four different comparison methods, as recommended in [49]. The correlation function measures the similarity of two image histograms; its value for two identical histograms equals one:

$$\mathbb{C}(H\_1, H\_2) = \frac{\sum\_{I} \left( H\_1(I) - \overline{H}\_1 \right) \left( H\_2(I) - \overline{H}\_2 \right)}{\sqrt{\sum\_{I} \left( H\_1(I) - \overline{H}\_1 \right)^2 \sum\_{I} \left( H\_2(I) - \overline{H}\_2 \right)^2}}. \tag{11}$$

The Chi-square function takes the squared difference between the two histograms at each bin, divides it by the corresponding bin value of the first histogram, and sums these weighted squared differences into the likelihood value:

$$\chi^2(H\_1, H\_2) = \sum\_{I} \frac{\left(H\_1(I) - H\_2(I)\right)^2}{H\_1(I)}.\tag{12}$$

The histogram intersection calculates the similarity of two discretized histograms, with a possible value of the intersection lying between no overlap and identical distributions. It works well on categorical data and deals with null values by making them part of the distribution.

$$I(H\_1, H\_2) = \sum\_{I} \min(H\_1(I), H\_2(I))\tag{13}$$

The Bhattacharyya distance approximates the normalized distance between the histograms using the maximum likelihood of two object histograms as follows:

$$B(H\_1, H\_2) = \sqrt{1 - \frac{1}{\sqrt{\overline{H}\_1 \overline{H}\_2 N^2}} \sum\_I \sqrt{H\_1(I) H\_2(I)}}.\tag{14}$$
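These four measures correspond to the histogram comparison modes available in OpenCV, which offers one way (our choice of library) to compute them:

```
import cv2
import numpy as np

def compare_histograms(H1, H2):
    """Equations (11)-(14) via OpenCV; H1, H2 are normalized histograms."""
    H1, H2 = np.float32(H1), np.float32(H2)
    return {
        'correlation':   cv2.compareHist(H1, H2, cv2.HISTCMP_CORREL),
        'chi_square':    cv2.compareHist(H1, H2, cv2.HISTCMP_CHISQR),
        'intersection':  cv2.compareHist(H1, H2, cv2.HISTCMP_INTERSECT),
        'bhattacharyya': cv2.compareHist(H1, H2, cv2.HISTCMP_BHATTACHARYYA),
    }
```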

The texture-based comparison uses texture analysis features based on the grey level co-occurrence matrix (GLCM), which are related to second-order image statistics and were introduced by Haralick [50]: (1) Angular second moment (energy), (2) contrast, (3) correlation, (4) variance, (5) inverse difference moment (homogeneity), (6) sum average, (7) sum variance, (8) sum entropy, (9) entropy, (10) difference variance, (11) difference entropy, (12) information measure of correlation I, (13) information measure of correlation II, and (14) maximal correlation coefficient.
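For reference, the mahotas library (one possible implementation choice; the paper does not name its tooling) computes 13 of these 14 Haralick descriptors per co-occurrence direction:

```
import numpy as np
import mahotas

def haralick_features(gray_image):
    """Return a 13-dimensional Haralick feature vector, averaged over
    the four 2D co-occurrence directions."""
    return mahotas.features.haralick(gray_image.astype(np.uint8)).mean(axis=0)
```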

#### *2.4. Ethics Declaration*

The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Faculty of Informatics, Kaunas University of Technology (No. IFEP201809-1, 2018-09-07).

#### **3. Experiment and Results**

#### *3.1. Dataset and Equipment*

We used embryo photos taken with the Esco Global incubator series called Miri TL (time lapse). The embryo image sets were registered in German, Chinese, and Singaporean clinics. No identity data were ever provided to the authors of this paper. Esco's embryo database consists of three image classes that together contain 5000 images with a resolution of 600 × 600 pixels. The numbers of images in the classes are: one-cell—1764 images, two-cell—1938 images, four-cell—1298 images. The images were obtained from 22 different growing embryos in up to five stages of cell evolution. The database was then magnified by rotating each photo by 90, 180, and 270 degrees, giving the following numbers of images per class: one-cell—1764 × 4 = 7056; two-cell—1938 × 4 = 7752; four-cell—1298 × 4 = 5192. The images were taken under the following culturing conditions: temperature of 37 °C, a stable level of 5% CO2, and controllable values of the nitrogen and oxygen mixture. The embryos were photographed in a culture coin dish made from polypropylene with a neutral medium, pH = 7. The inverted microscopy principle was used, with 20× lenses, without zoom, with focusing, and with a field of view of 350 µm. The camera sensor used was an IDS UI-3260CP-M-GL (IDS Imaging Development Systems GmbH, Obersulm, Germany). An example of the embryo images in the Esco dataset is presented in Figure 4. The evaluation and training of the GANs was done on an Intel i5-4570 CPU with a GeForce 1060 GPU and 8 GB of RAM.

**Figure 4.** Sample images from Esco embryo image dataset.

#### *3.2. Results*

The training of GAN was carried out in 200,000 iterations, with a sample size of 256 images per batch. The duration of training using GeForce 1060 GPU was about 15 h 45 min.

From the final generated images, one can easily count the number of embryonic cells and determine the class (Figure 5). During subsequent iterations, the generator restored the embryo image and reduced the amount of noise in the generated images.


**Table 1.** Evolution of embryo cell images during GAN training.

**Figure 5.** Example of final generated embryo cell images using the proposed algorithm. From left to right: One, two, and four cells. Images were filtered from "salt and pepper" noise using a median filter. See Table 1 for raw outputs.

During the training, embryo cell images were generated at every 25,000 iterations. An example of these images is shown in Table 1.

After 25,000 iterations, we can see that the generator managed to restore the flare of the culture dish and separate the background, but the embryo itself was not yet restored. After 100,000 iterations, the generator was able to clearly reproduce the image of the embryo cell.

The error function values for both the generator and the discriminator are shown in Figures 6 and 7. During the initial training, up to 20,000 iterations, the error function value of the generator increases rapidly; afterwards it increases at a slower pace. Once training reaches 75,000 iterations, the generator error function begins to increase again while training on two-cell and four-cell images, as the generator becomes able to produce more complex images, which is also visible in the generated samples (see Table 1). The one-cell image can be identified from 50,000 iterations, whereas two-cell and four-cell images can be recognized from 75,000–125,000 iterations. From the discriminator error function, we can see that starting from 20,000 iterations it becomes difficult to separate real images from artificially generated ones, and the value of the error function falls below 0.15. The discriminator log loss increases as the discriminator becomes unable to differentiate between a real image and a generated image.

**Figure 6.** Loss log graph for the generator network.

**Figure 7.** Loss log graph for the discriminator network.

For the expert-based evaluation (see Table 2), 1500 images (500 in each class) were generated. The percentage of artificial one-cell images in which one cell was clearly recognized was 96.2%. The generation quality deteriorated for the more complex two-cell and four-cell images. For the two-cell images, the number of cells could be clearly determined in 86.8% of the images. For the four-cell images, 80% accuracy was obtained, i.e., one out of five images was generated inaccurately (as decided by an expert).

**Table 2.** Evaluation of the generated embryo image cells by human experts.


Additionally, the image similarity was evaluated by comparing histograms. Figure 8 shows the comparison of image histograms, where the blue curve is the average histogram of the training image dataset and the red curve is the average histogram of the artificially generated images. In all three classes, we can see that the generated images are brighter than the real ones. This can be explained by the slight "salt-and-pepper"-type noise in the generated images. The highest histogram match was obtained for the single-cell generated images, showing that the generator was better able to reproduce images of a simpler structure. Note that Figure 8 shows an evaluation of generated cell images at the maximum of 200,000 training iterations. Individual histograms cannot be identical, as a single network generates images of different cells; the comparison was therefore done by evaluating the average histogram over all (training and generated) image histograms.

**Figure 8.** Comparison of histograms of real vs. generated images (real images—blue, generated images—red): (**a**) one-cell images, (**b**) two-cell images, (**c**) four-cell images.

Table 3 shows the results of a comparison of normalized histograms using correlation, Chi-square, intersection, and Bhattacharyya distance formulas. In all four cases, the largest histogram coincidence was shown by the one-cell generated images, while the two-cell and four-cell images performed worse.

**Table 3.** Evaluation of generated artificial embryo cell images using histogram comparison criteria.


We also compared the values of the textural (Haralick) features for the original and generated embryo images and evaluated the statistical difference between the feature values using the Student's t-test. The differences were not statistically significant (*p* < 0.01) for any of the Haralick features, except for the sum variance feature (one-cell images), the variance and sum average features (two-cell images), and the sum variance feature (four-cell images).

To compare the Haralick features from both (original and generated) image datasets, we also used the Pearson correlation between the principal components of feature sets extracted using principal component analysis (PCA). For both feature sets, the first principal component (PC1) explains more than 98% of variance, so we use only PC1 for further comparison. The results are presented in Table 4 and show a very high correlation between the PC vectors.

**Table 4.** Results of principal component analysis (PCA) of Haralick features from original and synthetic embryo images.
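A sketch of this comparison (assuming equal numbers of real and synthetic images, so that the two PC1 projections can be correlated element-wise):

```
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

def pc1_correlation(features_real, features_synthetic):
    """Correlate the first principal components of two Haralick feature
    matrices (rows = images, columns = features)."""
    pc1_real = PCA(n_components=1).fit_transform(features_real).ravel()
    pc1_synt = PCA(n_components=1).fit_transform(features_synthetic).ravel()
    return pearsonr(pc1_real, pc1_synt)   # (correlation, p-value)
```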


#### **4. Discussion**

As a direct comparison with other algorithms is not possible due to the very different source datasets used, we compared misclassification rates, i.e., the percentage of generated images across all groups that were not assigned to the correct class, or to any class at all. Our HEMIGEN method achieved a misclassification rate of 12.3%.

Table 5 provides a comparison of the misclassification rate of synthetic images with the results obtained by other authors. A Gaussian mixture deep generative network (DGN) demonstrated a 36.02% misclassification rate [51]. A DGN with auxiliary variables, two stochastic layers, and skip connections achieved a 16.61% misclassification rate [52]. Semi-supervised classification and image generation with a four-layer generator demonstrated an 8.11% misclassification rate on house number image generation [48]. The adversarially learned inference (ALI) model, which jointly learns a generation network and an inference network using an adversarial process, reached a misclassification rate of 7.42% on the CIFAR10 test set (tiny images) [53]. The WGAN-based approaches ranged from 6.9% [21] to 50% [40], depending on the application. The methods indicated are only loosely comparable, given the differences in the targeted features of the synthetic images and in the scope of the other researchers' work.


We also checked for the problem of mode collapse, in which the generator produces the same image over and over, fooling the discriminator while generating no genuinely new images. We did not observe this in our approach, possibly due to an adequate number of different embryos used. Some generated images contained embryos that overlap one another, but this was not considered a failure; rather, it is a realistic case of cell division.

#### **5. Conclusions**

We used generative adversarial networks trained on real human embryo cell images to generate a dataset of synthetic one-, two-, and four-cell embryo images. We have achieved the highest quality of generated images for single-cell embryo images, where 96.2% of the synthetic embryo images were recognized as accurate and usable by human experts. The worst accuracy was achieved for the synthetic four-cell images, of which only 80% could be identified correctly. These results were confirmed by the histogram comparison, which achieved the highest scores for synthetic single-cell images (an average correlation of 0.995 was achieved when comparing histograms of real and synthetic one-cell embryo images), as well as by comparison of image textures analyzed using the Haralick features.

As our algorithm allows us to manipulate the size, position, and number of the artificially generated embryo cell images, these images can then be used to train and validate other embryo image processing algorithms, when the real embryo images are not available, or the number of available real embryo images is too small for training neural networks.

**Author Contributions:** Conceptualization, R.M. and V.R.; data curation, D.D.; funding acquisition, R.S.; investigation, D.D., R.M., and V.R.; methodology, R.M.; software, D.D.; supervision, R.M.; validation, D.D. and R.M.; visualization, D.D.; writing—original draft, D.D., R.M., and R.D.; writing—review and editing, R.M. and R.S.

**Funding:** The project was partially financed under the program of the Polish Minister of Science and Higher Education under the name "Regional Initiative of Excellence" in the years 2019–2022, project number 020/RID/2018/19, with financing of 12,000,000.00 PLN.

**Acknowledgments:** The authors would also like to thank Esco Global for kindly providing the embryo image dataset.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**



### **Image Thresholding Improves 3-Dimensional Convolutional Neural Network Diagnosis of Different Acute Brain Hemorrhages on Computed Tomography Scans**

#### **Justin Ker 1, Satya P. Singh 2, Yeqi Bai 2, Jai Rao 1, Tchoyoson Lim 3 and Lipo Wang 2,\***


Received: 1 April 2019; Accepted: 4 May 2019; Published: 10 May 2019

**Abstract:** Intracranial hemorrhage is a medical emergency that requires urgent diagnosis and immediate treatment to improve patient outcome. Machine learning algorithms can be used to perform medical image classification and assist clinicians in diagnosing radiological scans. In this paper, we apply 3-dimensional convolutional neural networks (3D CNN) to classify computed tomography (CT) brain scans into normal scans (N) and abnormal scans containing subarachnoid hemorrhage (SAH), intraparenchymal hemorrhage (IPH), acute subdural hemorrhage (ASDH), and brain polytrauma hemorrhage (BPH). The dataset used consists of 399 volumetric CT brain scans, representing approximately 12,000 images, from the National Neuroscience Institute, Singapore. We used a 3D CNN to perform both 2-class classification (normal versus a specific abnormal class) and 4-class classification (between normal, SAH, IPH, and ASDH). We apply image thresholding at the image pre-processing step, which improves 3D CNN classification accuracy and performance by accentuating the pixel intensities that contribute most to feature discrimination. For 2-class classification, the F1 scores for various pairs of medical diagnoses ranged from 0.706 to 0.902 without thresholding. With thresholding implemented, the F1 scores improved and ranged from 0.919 to 0.952. Our results are comparable to, and in some cases exceed, the results published in other work applying 3D CNNs to CT or magnetic resonance imaging (MRI) brain scan classification. This work represents a direct application of a 3D CNN to a real hospital scenario involving a medically emergent CT brain diagnosis.

**Keywords:** 3D convolutional neural networks; machine learning; CT brain; brain hemorrhage

#### **1. Introduction**

Intracranial hemorrhage is a medical emergency that can have high morbidity and mortality if not diagnosed and treated immediately. This condition affects 40,000 to 67,000 patients in the United States annually, and up to 52% of patients die within one month [1]. Three commonly encountered sub-types of intracranial hemorrhage are subarachnoid hemorrhage (SAH), intraparenchymal hemorrhage (IPH), and acute subdural hemorrhage (ASDH). In severe brain trauma, various permutations of SAH, IPH, and ASDH can be seen, which we have termed brain polytrauma hemorrhage (BPH) in this work. The common causes of SAH are trauma and cerebral aneurysmal rupture, while IPH can be caused by hypertension, amyloid angiopathy, brain tumor hemorrhage, or trauma. ASDH and BPH appear because of head trauma.

When patients with intracranial hemorrhage present to the emergency department, a computed tomography (CT) scan of the brain is done to diagnose intracranial hemorrhage, so that medical or surgical treatment can follow. CT scans work by exploiting the differential absorptive properties of body tissues placed between an x-ray emitter and detector. Brain tissue, blood, muscle, and bone give rise to different levels of x-ray attenuation, expressed in Hounsfield units. By moving the x-ray emitter and detector circumferentially around a subject, a three-dimensional image of the subject's internal tissues is obtained.

Limitations in the availability or experience of clinicians able to read CT brain scans quickly, especially in rural or resource-strapped health systems, can cause treatment delays. Automating the diagnosis of CT brain scans, or assisting the clinician in triaging critical from normal scans, would help patients by expediting their treatment and improving outcomes. Figure 1 shows axial slices of five CT brain scans showing a normal brain, SAH, IPH, ASDH, and BPH. The outer rim of uniform white skull bone surrounds the dark grey brain tissue, and areas of acute hemorrhage appear as patchy, light-grey areas of varying shapes.

**Figure 1.** Computed tomography (CT) brain scans. From left: normal (N), subarachnoid hemorrhage (SAH), intraparenchymal hemorrhage (IPH), acute subdural hemorrhage (ASDH), brain polytrauma hemorrhage (BPH). Each image represents an individual image slice. One patient's complete stack of CT images contained between 24 and 34 image slices in our dataset.

The artificial neuron was first described by McCulloch and Pitts in 1943 [2]. This has evolved through the symbolic, rule-based artificial intelligence (AI) paradigms, to manual feature-handcrafting algorithms, and to modern multi-layered or "deep" neural networks, which perform feature detection and classification automatically. Convolutional neural networks (CNN) owe their inception to Fukushima's Neocognitron model in 1982 [3], and their popularity to Lecun et al. [4] and Krizhevsky et al. [5] The latter employed a CNN to win the 2012 Imagenet Large Scale Visual Recognition Challenge, and since then CNNs have been used for many image classification tasks. The advantage of CNNs in image classification is the ability to perform feature-extraction and learn high-level image features automatically without feature-handcrafting, leading CNNs to become the dominant machine learning architecture in image recognition tasks. CNNs have been widely used in machine vision to perform a variety of tasks, such as image classification, object detection, and semantic segmentation. In the medical image analysis space, CNNs have been used for the classification and diagnosis of 2-dimensional medical images, such as chest x-rays, retinal photographs, skin dermoscopic images, and histology images, with performance comparable to or exceeding human clinicians [6–9].

In analyzing volumetric magnetic resonance imaging (MRI) or CT data, various machine learning approaches, including 2D CNNs, have been attempted. These efforts have involved manual slice selection, extensive manual pre-processing, feature hand-crafting, and segmentation before classification [10–12]. A number of these approaches classify individual images from a volumetric image stack one image at a time, turning a 3D classification problem into a 2D task. Other authors have employed a combination of 2D CNNs in the three planes (axial, coronal, sagittal) that define a 3D volume [13–15]. Roth et al. [14] detected abnormal lymph nodes on thoracic CT scans by decomposing a 3D volume of interest into re-sampled 2D views, and then training their CNN on augmented variations of these 2D views. Their CNN achieved a satisfactory sensitivity of 90%, at six false positives per patient. The disadvantages of these previous strategies include the manual time and effort spent on feature hand-crafting, segmentation, and stripping, and the potential loss of spatial contextual information when a 3D volume is analyzed as 2D slices.

Three-dimensional CNNs are an emerging architecture, and have been used mainly in analyzing video or 3-dimensional volumetric medical images. Previously, the use of 3D CNN was limited as it was computationally expensive and lengthy to process 3-dimensional kernels and entire volumes of images. However, more papers on 3D CNN have appeared in the scientific literature with their adoption likely aided by decreasing computational hardware costs. In medical image analysis, 3D CNNs have been applied to detecting abnormalities (tumors, hemorrhage, ischemia) in brain, heart, lung, and liver organs on CT or MRI imaging [16–18]. As a measure of the interest in the clinical problem of automatically detecting intracerebral hemorrhage on MRI or CT, there is also a growing number of publications on this topic using 3D CNNs [19,20]. Dou et al. [20] analyzed cerebral microbleeds on MRI brain scans using a two-stage 3D CNN that screened and then detected microbleeds. Their method achieved 93% sensitivity, at a cost of 44% precision and 2.7 false positives per patient.

We propose a 3D CNN that classifies CT brain scans into normal (N), subarachnoid hemorrhage (SAH), intra-parenchymal hemorrhage (IPH), acute subdural hemorrhage (ASDH), and brain polytrauma hemorrhage (BPH).

In this work, we aim to create a 3D CNN that can automatically detect and diagnose SAH, ASDH, and IPH on CT brain scans, and distinguish them from normal scans. This work would have direct clinical application in emergency departments and acute-care hospitals worldwide. We propose a simple and fast 3D CNN that is effective and accurate. It is hoped that the straightforward implementation of this 3D CNN will lead to its widespread adoption. Specifically, we modify well-known 2D CNN architectures into 3D CNNs. The novel aspects of our work are as follows:

(1) To our knowledge, this is the first demonstration of a 3D CNN on volumetric CT brain data that classifies patient scans into different acute hemorrhagic variants, which impacts subsequent medical treatment. Previous work has been limited to detecting normal versus abnormal scans. We also demonstrate that our network performance gives state-of-the-art results in classification accuracy.

(2) We present a novel image thresholding method optimized for the detection of acute hemorrhage on CT brain scans, which mimics the visual analysis of radiological images by human radiologists. We demonstrate that the application of our method improves the classifier performance.

This paper is organized into the following sections. Section 2 describes our methods, with details on the dataset, network architectures, and training protocols. Section 3 reports our results and experimentation with various network parameters. Section 4 analyzes the impact, limitations and future work stemming from our results. Section 5 summarizes and concludes this paper.

#### **2. Methods**

#### *2.1. 3-Dimensional Convolutional Neural Networks*

The feature map of a convolution layer is formed by convolving the feature map of the previous layer with learnable kernels, adding a bias term, and then applying an activation function [21]. Specifically, we can define *h*<sub>k</sub><sup>p</sup> as the *k*th feature map of the *p*th layer, and *h*<sub>j</sub><sup>p−1</sup> as the *j*th feature map of the previous layer, *p* − 1. *W*<sub>j,k</sub><sup>p</sup> is a learnable kernel, and *b*<sub>k</sub><sup>p</sup> is the bias term. σ is the activation function, commonly a rectified linear unit (ReLU). This is written as:

$$h\_k^p = \sigma\left( \sum\_j h\_j^{p-1} \ast W\_{j,k}^p + b\_k^p \right) \tag{1}$$

A 2-dimensional convolution can be defined for convolving an input image *I* with a kernel *K*. Extending the equation for a 2-dimensional convolution [22] into a 3-dimensional convolution, we obtain:

$$S(l, m, n) = (K \ast I)(l, m, n) = \sum\_{a} \sum\_{b} \sum\_{c} I(l - a, m - b, n - c)\, K(a, b, c) \tag{2}$$

Equation (2) may be expressed as:

$$S(l, m, n) = \sum\_{a, b, c} h\_j^{p-1}(l - a, m - b, n - c)\, W\_{j,k}^{p}(a, b, c) \tag{3}$$

In Equation (3), *h*<sub>j</sub><sup>p−1</sup> is the *j*th 3-dimensional feature map of the previous layer, *p* − 1, and *W*<sub>j,k</sub><sup>p</sup> is the 3-dimensional kernel. Substituting Equation (3) into Equation (1), we obtain the expression for the 3-dimensional feature map *h*<sub>k</sub><sup>p</sup>:

$$h\_k^p = \sigma\left( S\_{j,k}^p + b\_k^p \right) \tag{4}$$

These 3-dimensional convolution layers were stacked with max-pooling and fully connected layers. Kernels were initialized from a Gaussian distribution, and parameters were tuned with standard stochastic gradient descent minimizing a cross-entropy loss.
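A minimal PyTorch sketch (our framework choice for illustration) of one such stacked block, with Conv3d implementing the summed 3D convolutions plus bias of Equations (1)–(4), ReLU as the activation σ, and MaxPool3d downsampling the volume:

```
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),
)
volume = torch.randn(1, 1, 32, 128, 128)   # (batch, channel, depth, H, W)
features = block(volume)                   # -> (1, 16, 16, 64, 64)
```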

#### *2.2. Image Thresholding to Detect Acute Hemorrhage*

A CNN extracts features from the input images, most of which are associated with the edges, shapes, and curves present in those images. Visualizing the different layers of a CNN shows that the first layer mostly picks up edges, the second layer curves, and the third layer shapes. The human eye is similar, in that it also parses an image into its constituent edges, curves, and shapes, as suggested by the presence of ocular dominance and orientation columns in the primary visual cortex of the mammalian brain. In detecting anomalies on medical images, a region of interest (ROI) may be subtle or not apparent to visual inspection by either a human or a CNN. Just as thresholding helps a human radiologist emphasize possibly abnormal areas, we propose a threshold operator that accentuates individual sharp edges, improving the likelihood of the CNN detecting a feature and, therefore, performing a correct classification.

To improve the discriminatory ability of our model, we propose a novel thresholding method to detect acute hemorrhage. We build on the work of Zhang et al. [23], who used spatial histograms to detect cars in images. There is a range of pixel intensities that is common to both normal and abnormal CT scans. These intensities can therefore be discarded before feeding the image to the CNN without any loss of information, similar to dimensionality reduction. In our proposed method, we generate the average pixel intensity histograms of two classes of CT scans (such as Normal and SAH), overlap them, and search for an optimal pixel intensity threshold that accentuates their difference. Pixel intensities below this threshold are then disregarded. This process creates sharp edges and shapes around an ROI, which helps the CNN in feature extraction. An added benefit is that the CNN requires less time for training.
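As a rough sketch of this procedure (our reading of Section 2.2, not the authors' released code), the threshold can be found by overlapping the class-average histograms and discarding the intensity range where the classes agree; the `tol` criterion and the 8-bit intensity range are assumptions:

```
import numpy as np

def find_threshold(normal_imgs, abnormal_imgs, bins=256, tol=0.05):
    # Class-average intensity histograms (8-bit greyscale range assumed).
    h_n = np.mean([np.histogram(im, bins, (0, 255))[0] for im in normal_imgs], axis=0)
    h_a = np.mean([np.histogram(im, bins, (0, 255))[0] for im in abnormal_imgs], axis=0)
    diff = np.abs(h_a - h_n)
    # Assumption: the threshold is the first intensity at which the class
    # difference becomes non-negligible; everything below is "common".
    t_bin = int(np.argmax(diff > tol * diff.max()))
    return t_bin * (256 / bins)

def apply_threshold(img, t):
    out = img.copy()
    out[out < t] = 0  # disregard intensities common to both classes
    return out
```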

The intuition underlying this approach is two-fold. First, we observed that normal CT brain scans, even across different patients, are largely homogeneous, which makes a histogram representation of the class meaningful. Abnormal scans can be thought of as the addition of extraneous blood signal to normal scans. Second, we also observed that radiologists adjust image contrast levels when reading CT scans, to accentuate subtle amounts of blood and to downplay the appearance of normal surrounding tissue. Our method attempts to model this behavior in human visual cognition. Using our method improves the overall performance of the classifier across all classes of acute brain hemorrhage.

#### *2.3. Dataset and Pre-Processing*

The dataset consists of non-contrast CT brain images from the National Neuroscience Institute at Tan Tock Seng Hospital, Singapore. After institutional review board approval, a search of electronic discharge summaries with diagnoses matching head injury or intracranial hemorrhage was performed. The resultant list of patient identifiers was used to query and retrieve relevant scans from the hospital's picture archiving and communication system (PACS) servers. Each scan was anonymized and manually checked by the authors to ensure ground truth. The final dataset consisted of 399 unique patient volumetric CT brain scans representing approximately 12,000 images, summarized by the different classes in Table 1. These scans had varying numbers of image slices and slice thicknesses, owing to variability in CT scanner models and scanning protocols. We prepared the data for five-fold cross-validation. Training, validation, and test CT scan images were augmented eight-fold by flipping along the vertical axis and rotating up to 45 degrees.
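A minimal sketch of the eight-fold augmentation, assuming one vertical flip combined with four in-plane rotation angles (the paper states only "up to 45 degrees", so the exact angle set is hypothetical):

```
from scipy.ndimage import rotate

def augment_eightfold(vol):
    # vol: (depth, height, width) CT volume.
    # Original + vertical flip, each at four rotation angles = 8 variants.
    out = []
    for v in (vol, vol[:, ::-1, :]):        # identity and vertical flip
        for angle in (0, 15, 30, 45):       # hypothetical angle set, <= 45 deg
            out.append(rotate(v, angle, axes=(1, 2), reshape=False, order=1))
    return out
```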

**Table 1.** Number of original unique patient computed tomography (CT) scans.


#### *2.4. Network Architecture*

Table 2 shows the model architecture used in our 3D CNN. We experimented with various model architectures, including VGGNet and GoogLeNet, to optimize for the trade-off between classification accuracy and computational time. We aimed to have a model with a straightforward design for easy trouble-shooting and to facilitate real-world implementation. Like Dou et al. [20], we were concerned with the impact of processing large files of volumetric brain images on computation time. However, while they opted for a two-part ensemble of screening and discrimination stages, we opted for a single throughput architecture for simplicity and performance. Figure 2 is a pictorial diagram of our proposed 3D CNN. After the necessary preprocessing steps, the input volumetric data of 3D CT scans are passed through a pre-defined threshold operator, as discussed in Section 2.2, and become the input to the 3D CNN.

**Table 2.** Model architecture of the 3-dimensional convolutional neural networks (3D CNN) used in this work.


**Figure 2.** Proposed architecture for binary and multi-class classification of CT scans. The features are visualized using 3D deconvolution visualization methods at each pooling layer.

#### *2.5. Training and Implementation*

Training was performed on a computer with two Intel Xeon E5-2630 CPUs, four NVIDIA GTX 1080 Ti graphics processing units, and 128 GB of DDR4 random-access memory. The project was implemented in the Python programming language using the Google TensorFlow library. We used the rectified linear unit (ReLU) as the activation function, the Adam optimizer, and cross-entropy as the loss function. We used a grid search to optimize the learning rate, dropout, and kernel sizes of the convolution and pooling layers. We varied the learning rate over (0.1, 0.001, 0.0001) and the dropout over (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6) after each convolution layer and fully connected layer. The kernel sizes of the convolution layers were varied over (1 × 1 × 1, 2 × 2 × 2, 3 × 3 × 3, 5 × 5 × 5), and the pooling kernel sizes over (1 × 1 × 1, 2 × 2 × 2). Eventually, we set the learning rate at 0.001 and the dropout after the fully connected layer at 0.2. We employed a convolutional kernel size of 3 × 3 × 3 at each layer, with a 2 × 2 × 2 pooling kernel after each convolutional layer. We set β1 = 0.001, β2 = 0.999, and ε = 10<sup>−8</sup>.
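The grid search can be pictured with the following skeleton, where `train_and_eval` is a hypothetical stand-in for one training and validation run of the 3D CNN:

```
from itertools import product

# `train_and_eval` is hypothetical: one cross-validated run that
# returns a validation score for the given hyperparameter setting.
grid = {
    "lr": [0.1, 0.001, 0.0001],
    "dropout": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "conv_kernel": [1, 2, 3, 5],   # cubic k x k x k convolution kernels
    "pool_kernel": [1, 2],         # cubic pooling kernels
}
best = max(
    (dict(zip(grid, vals)) for vals in product(*grid.values())),
    key=lambda cfg: train_and_eval(**cfg),
)
```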

#### Metrics

To evaluate our network performance, we measured the Sensitivity (S), Precision (P) and F1 scores across each classification task. TP, FP, FN refer to true positive, false positive, and false negative, respectively. The F1 score is defined as the harmonic average of Sensitivity and Precision and is a measure of a test's accuracy. An F1 score of 1 indicates perfect Sensitivity and Precision, while a score of 0 indicates the opposite.

$$S = \frac{TP}{TP + FN}, \quad P = \frac{TP}{TP + FP} \tag{5}$$

$$F\_1 = \frac{2}{\frac{1}{S} + \frac{1}{P}} \tag{6}$$
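These definitions translate directly into code; a small helper such as the following computes all three scores from raw counts:

```
def scores(tp, fp, fn):
    """Sensitivity, Precision, and F1 as in Equations (5) and (6)."""
    s = tp / (tp + fn)   # Sensitivity (recall)
    p = tp / (tp + fp)   # Precision
    return s, p, 2 / (1 / s + 1 / p)
```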

The confusion matrix for this 4-class classification problem is tabulated in Table 3. Table 4 summarizes the Sensitivity, Precision, and F1 scores, re-casting the multi-class problem into a 2-class problem (Normal versus a specific class) to calculate these metrics.


**Table 3.** Multi-Class Classification for Normal and Abnormal CT Scans.



Highest F1 scores are in bold.

#### **3. Results**

We performed experiments involving binary classification (Normal versus SAH, IPH, ASDH, BPH) and multi-class classification. In the latter, BPH was left out as BPH contains features of Normal, SAH, IPH, and ASDH. For the binary classification experiments, we implemented the experiments with and without the thresholding method described in Section 2.2.

For the multi-class classification experiments, the model discriminated between four classes with an overall F1 score of 0.684. Table 3 presents the confusion matrix for the multi-class classification. The actual number for each class represents the augmented test set after the original test set was augmented eight-fold. The respective F1 scores for each class are Normal: 0.819, SAH: 0.639, IPH: 0.427, and ASDH: 0.829. Thresholding was not implemented for the multi-class classification, as it is optimized for binary classification. The model performed well for Normal and ASDH scans, but only moderately well for SAH. Interestingly, although the training dataset for ASDH was the smallest, the model discriminated this class best. This may be because ASDH images are visually grossly asymmetrical compared to the other classes, owing to brain compression from the significant subdural hemorrhage (see Figure 1), which may be a strongly activated discriminatory feature. Surprisingly, a significant number of IPH scans were misinterpreted as SAH; we hypothesize that this may be due to subtle SAH traces appearing on some of the IPH scans.

Table 4 summarizes the results from the 2-class classification experiments. Overall, our model was able to discriminate between normal scans and each of the classes with satisfactory results, matching or exceeding previously published results for similar work (Table 5). We demonstrate that for every class, the implementation of the thresholding technique improves all the evaluation metrics. The largest increase was seen in the ASDH class, where the F1 score increased from 0.706 to 0.952, despite a small training set. Various model architectures, input image sizes, and filter sizes were modified to optimize for accuracy and computational time. The original image size of the CT scans was 512 × 512 × 28 pixels (at 5 mm slice thickness), and the input size to our model was 50 × 50 × 28 pixels. Further decreasing the input size resulted in deteriorating model performance. We posit that the thresholding technique improves the signal-to-noise ratio of each input image by downplaying extraneous image features, thereby accentuating the presence of acute hemorrhage. We also found that the thresholding technique decreased training time for the respective models significantly. For example, in the Normal versus SAH classification task, model training time decreased from over 10 h to 1 h 32 min. 3D CNNs are computationally intensive to train, and any decrease in computational cost and time is advantageous in clinical deployment.


**Table 5.** Comparison of results involving brain hemorrhage detection in volumetric brain scans.

Dou et al. expressed their evaluation metric as Sensitivity, Precision, and False positives per subject.

#### **4. Discussion**

Acute brain hemorrhage is a common neurosurgical emergency which can result in severe patient morbidity and mortality. It is the result of myriad causes, including trauma, hypertension, cerebral aneurysm rupture, and the treatment of each is different. Depending on the clinical condition, the patient may require close observation in a high dependency or intensive care setting, or immediate neurosurgical operation. It is imperative to minimize the time from diagnosis to treatment, to give the patient the best chance of recovery.

We propose an automated 3D CNN to classify volumetric CT brain data into various hemorrhagic variants, to assist doctors in expediting patient treatment. We trained and tested our model on 399 CT brain scans from our hospital to classify CT brain scans into normal, SAH, IPH, ASDH and BPH. These classes were chosen as the neurosurgical treatment for each class is different.

We also proposed and implemented a novel pixel thresholding method to detect acute hemorrhage on CT brain scans. This method improved classifier performance on our dataset and can conceivably be exported for use on other datasets and other anatomical regions where acute hemorrhage detection is required. Aside from detecting acute blood, this method can potentially also be generalized to the detection of other abnormalities, such as tumors. Figure 3 demonstrates an image slice of a CT brain with acute subdural hemorrhage. The hemorrhage is the white crescent on the left of the image, which is putting pressure on the grey areas of the brain and pushing it to the right of the image. To better visualize the activations in the 3D CNN, we used the deconvolution technique described by Yosinski et al. [25] on this single image. The top row of the image (boxes B and C) represents convolution layer 1 and pooling layer 1, respectively. The bottom row (boxes D and E) represents the same layers with thresholding applied. With thresholding, the edges of the target hemorrhagic lesion appear more distinct and sharper, which may account for why thresholding improves classification accuracy.

**Figure 3.** To demonstrate the effect of thresholding, a single slice of an ASDH CT Brain scan is shown, with the corresponding activations of convolution and pooling layers. **A**, original image. **B**, 1st convolution layer. **C**, 1st pooling layer. **D**, 1st convolution layer with thresholding applied. **E**, 1st pooling layer with thresholding applied. D and E appear sharper than B and C, demonstrating how thresholding can accentuate abnormal areas, and improve classifier performance.

In this paper, we demonstrate state-of-the-art 3D CNN classifier performance for different classes of acute brain hemorrhage. In addition, the application of our thresholding method for acute hemorrhage enables our model to achieve further improvement in classification. In the 4-class classification task, our model achieved an overall F1 score of 0.684. To our knowledge, there has been no published work involving multi-class classification of different classes of brain hemorrhage for comparison. In the 2-class classification task, the best performance was achieved in differentiating normal from ASDH scans, with an F1 score of 0.952 using our proposed thresholding method. The largest improvement with thresholding was also seen in the ASDH class, as the initial F1 score was only 0.706.

There has been a long history of attempts to analyze hemorrhage in volumetric CT and MRI data. Before the widespread use of CNNs in image analysis, methods included Hopfield networks [26], support vector machines [27], and segmentation masking with logistic classifiers [28]. Although these simple classifiers performed well, they were often applied to single image slices with obvious pathology that were often manually chosen, which, therefore, represents an unrealistic problem scenario. Hybrid 2D CNN methods exemplified by Roth et al. [14] were a bridge to the training of fully 3D CNNs. Fully 3D CNNs model local and contextual spatial dependencies and extract features in all three dimensions of image voxels. Kamnitsas et al. [16] exploited the dense inference technique and small kernels to segment lesional areas on brain MRI scans. Of note, they used a dual 11-layer 3D CNN pathway to process images at multiple scales, and Conditional Random Fields to decrease their false positive rate.

There are two points worth noting in our network architecture design. First, we deliberately kept our network architecture simple, with relatively few layers, and did not leverage other techniques such as ensembling and transfer learning. This was for both practical and theoretical considerations. The application of fully 3D CNNs to volumetric medical images is nascent, and by purposely keeping the model architecture straightforward, we are able to assess the effects of various hyperparameter changes quickly, experiment with various architectures efficiently, and troubleshoot network errors expediently. Running a 3D CNN is computationally intensive, and a simpler network mitigates lengthy run-times.

From a more theoretical perspective, a straightforward architecture also allows us to grasp a sense of the baseline performance of a 3D CNN and to describe our experimentation with a clear mathematical description. We believe that a firm theoretical framework will help in directing further areas for exploration, rather than blindly tuning hyperparameters or implementing boosting and ensembling architectures without understanding them.

The second point is that with a straightforward network architecture design, we were able to demonstrate the clear improvement that our thresholding method brought for detecting acute hemorrhage. We arrived at this thresholding method by observing radiologists as they scrolled through various patient CT scans. We noted that in situations where areas of acute brain hemorrhage were subtle, the radiologists would increase and decrease the image contrast to accentuate the appearance of the abnormality, which in this case was blood. We did the same thing while studying our dataset, and, therefore, explored how this optimization of human visual cognition or analysis could be implemented algorithmically. We took inspiration from the work of Zhang et al. [23] who used pixel intensity 'spatial histograms' for object detection within an image. Our underlying assumption was similar, that objects of a certain class would have similar patterns of pixel intensities. Where we differ is that while Zhang et al. were concerned with the spatial location of the object, we are concerned more with the actual presence of a particular pattern of pixel intensity, and the point of divergence from the pattern denoting a normal scan. This is because in CT brain scans depicting a hemorrhage, the blood even within the same class of brain hemorrhage can appear in several different areas of the brain.

The proposed architecture in this work has important clinical significance. The different abnormal diagnoses studied in this work all exert a significant epidemiological and socioeconomic toll on healthcare systems and societies. SAH, IPH, ASDH, and BPH are all neurosurgical emergencies that require immediate but different treatments to maximize the likelihood of patient survival and to achieve a good long-term functional outcome. Our work has a role in helping clinicians minimize the time between diagnosis and treatment, especially in hospitals that may not have a radiologist after hours, or in remote rural settings where no clinicians are available. Even in large tertiary care hospitals with 24-hour radiologist coverage, our proposed architecture can assist radiologists by triaging important abnormal scans from the large numbers of normal scans that are read sequentially as patients are scanned.

One limitation of this work is the relatively small and unbalanced dataset that we worked with, which is a common issue in medical image analysis, with a bias towards normal samples. At 399 scans, our dataset size is comparable to those used in many other works [20,24] but smaller than the almost 40,000 CT scans used by Jnawali et al. [19]. Despite this, data augmentation techniques have resulted in performance comparable to Jnawali et al.'s much larger dataset. This may be because medical data is relatively homogeneous in appearance compared to natural image processing tasks. It would be interesting to study the optimal dataset size for specific volumetric image classification tasks in CT, MRI, and 3D ultrasound images for different lesions. We addressed concerns of overfitting with known mitigating techniques, such as dropout, which were applied to the early layers in our network.

There are many avenues for further exploration in volumetric medical image analysis. CNNs and 3D CNNs have been the dominant network architectures in image analysis, but unsupervised learning methods are emerging for 3D object generation [29,30] and remain largely unexplored in the context of 3D medical image analysis. Generative architectures, such as variational autoencoders and generative adversarial networks, have not been applied to volumetric medical data, and these techniques may potentially mitigate the need for large, well-labelled datasets. Specific to our work, we intend to explore whether adding a screening stage to a 3D CNN, or multi-scale receptive fields, can improve performance, as some authors have demonstrated [16,20] for MRI brain scans. The addition of a memory or attention-based component to model long-term dependencies in 3D medical image analysis [24,31,32] is also interesting for further investigation, as there is evidence for a strong biological correlate in the mammalian visual system [33].

#### **5. Conclusions**

This work presented the implementation of a 3D CNN to classify and diagnose volumetric CT brain data. Normal CT brains and a variety of abnormal scans constituting different types of brain hemorrhage were classified by our 3D CNN. We also implemented a novel optimization method to detect acute hemorrhage on CT scans. The proposed 3D CNN can automatically detect important normal and abnormal features of cerebral anatomy without handcrafting or significant data pre-processing. Computational costs were also modest, which facilitates straightforward implementation. Results from the classification experiments demonstrated that the 3D CNN outperforms previously published methods in detecting abnormal brain scans with hemorrhage, with higher sensitivity and precision. Our 3D CNN can be applied to other volumetric medical data and can be used to expedite and improve patient care.

**Author Contributions:** Conceptualization, L.W., J.R., T.L., J.K.; software S.S. and Y.B.; method J.K. and S.S.; validation, S.S., J.K. and Y.B.; writing J.K.; supervision L.W., J.R. and T.L.

**Funding:** This research was funded by the National Neuroscience Institute-Nanyang Technological University Neurotechnology Fellowship Grant.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**



### **Rapid Multi-Sensor Feature Fusion Based on Non-Stationary Kernel JADE for the Small-Amplitude Hunting Monitoring of High-Speed Trains**

#### **Jing Ning 1,2,\*, Mingkuan Fang 1, Wei Ran 1, Chunjun Chen <sup>1</sup> and Yanping Li <sup>1</sup>**


Received: 29 March 2020; Accepted: 16 June 2020; Published: 18 June 2020

**Abstract:** Joint Approximate Diagonalization of Eigen-matrices (JADE) cannot deal with non-stationary data. Therefore, in this paper, a method called Non-stationary Kernel JADE (NKJADE) is proposed, which can extract non-stationary features and fuse multi-sensor features precisely and rapidly. In this method, the non-stationarity of the data is considered, and data from multiple sensors are used to fuse the features efficiently. The method is compared with EEMD-SVD-LTSA and EEMD-JADE using the bearing fault data from CWRU, and its validity is verified. Considering that the vibration signals of high-speed trains are typically non-stationary, it is necessary to utilize a rapid feature fusion method to identify the evolutionary trends of hunting motions quickly, before the phenomenon is fully manifested. In this paper, the proposed method is applied to identify the evolutionary trend of hunting motions quickly and accurately. The results verify that the accuracy of this method is much higher than that of the EEMD-JADE and EEMD-SVD-LTSA methods. This method can also be used to rapidly fuse multi-sensor features of non-stationary data.

**Keywords:** high-speed trains; hunting; non-stationary; feature fusion; multi-sensor fusion

#### **1. Introduction**

Hunting motion is a self-excited vibration that is a serious obstacle to the safety of high-speed trains [1]. Existing monitoring systems are designed to detect hunting only after it has developed to a specific degree. Moreover, in most cases, the recognition result is obtained from only a single observation. Therefore, the accuracy and real-time performance of monitoring systems need to be further improved [2]. With the increasing performance requirements of high-speed trains, it is important to establish an accurate and rapid feature extraction method that detects hunting through multiple characterizations before the phenomenon has developed to any significant degree.

The structure of high-speed trains is very complicated and their working conditions are very poor, resulting in non-stationary vibration signals [3]. In existing feature extraction research, a variety of extraction algorithms are utilized. These algorithms can be roughly divided into two categories: those designed for stationary data and those for non-stationary data. Feature extraction methods for stationary data include Singular Value Decomposition (SVD), Linear Discriminant Analysis (LDA) [4,5], Principal Component Analysis (PCA) [6], Locality Preserving Projection (LPP) [7], and so on. However, it is hard to extract features of non-stationary data using these methods. For non-stationary data, manifold learning is a good feature extraction method [8–11]; however, it incurs a high computational cost and extremely long calculation times, which means that diagnostic information cannot be fed back to the system in time. Therefore, a rapid yet precise method to extract non-stationary signal features in practical engineering applications is needed. Cardoso [12,13] proposed a method named Joint Approximate Diagonalization of Eigen-matrices (JADE) in the field of blind source separation, which is used to quickly separate multiple features. Because the method is simple and effective, it is also widely used in pattern recognition. Liu [14] improved the JADE method and achieved good feature fusion performance by simplifying the calculations, and then developed a method based on kernel JADE to identify rolling bearing faults [15]. A method based on JADE was adopted to diagnose the faulty parts of rolling bearings in [16], and a similar method was applied to predict coaxial bearing performance degradation in [17]. However, in these applications, the non-stationary and rapidly changing character of the signal is not considered.

The structure of high-speed trains is complex. The vibration signals from high-speed trains are affected by many factors, such as the propagation path and the measuring point location, which cause the signal to couple with different characteristic information. Pattern recognition based on single-sensor vibration signals has limitations: it cannot completely capture the evolutionary characteristics of high-speed train vibrations [18]. Therefore, it is necessary to utilize a multi-sensor fusion method to extract the features of the train's operating state.

Furthermore, the evaluation parameters for the lateral stability of railway passenger trains differ between countries. Lateral force on the rail and on the wheel axle, lateral acceleration of the bogie frame, acceleration of the axle box, and lateral acceleration of the vehicle body can each be used as evaluation parameters [19–22]. However, only one parameter is used in each standard, and each parameter has different limits. For example, the influence of frequency is not considered when the lateral acceleration of the bogie frame is used. In fact, the acceleration of the axle box is a particularly important index in hunting monitoring. Therefore, in this paper, we try to fuse the signals from the bogie frame and axle box to extract the hunting motion features. To monitor the hunting state of high-speed trains, four basic states are classified: normal, small convergence, small divergence, and hunting. Our aim is to identify the small divergence state before hunting occurs. Thus, in this paper, the monitoring of small amplitude hunting enables us to rapidly distinguish between these four basic states online [23].

Moreover, for real-time classification, it is extremely important to ensure that once the small amplitude appears to diverge, the entire calculation process can be completed immediately. Therefore, the time of the whole calculation process must be short enough for real-time classification.

Considering the above problems, a rapid multi-sensor feature fusion method based on Non-stationary Kernel JADE is proposed. The JADE method is a fast and accurate feature fusion algorithm, but it is generally used in stationary environments. In order to use the algorithm in a non-stationary environment, the whole time series is divided into *M* time periods [24] and the kernel function is introduced. Then, *M* kernel matrices are obtained from the *M* time periods and decomposed jointly. After this, Jacobi rotation is used to obtain the unitary matrix by diagonalizing the multiple kernel matrices simultaneously, extracting the non-stationary fusion features. In addition, in order to visualize the data features, the extracted fusion features are expressed in three dimensions. The between-class and within-class indicators [25] are also employed to describe the clustering performance of the features quantitatively.

In this paper, a multi-sensor data feature extraction framework is provided, in which a rapid feature fusion method using Non-stationary Kernel JADE (NKJADE) is proposed. This framework consists of the following series of steps. First, the Ensemble Empirical Mode Decomposition (EEMD) method is utilized to decompose the preprocessed signals to Intrinsic Mode Functions (IMFs). Then, the energy matrices are obtained using the IMFs and the fusion features are obtained through NKJADE, followed by inputting the extracted features to an LSSVM [26,27] for training and recognition.

Data from Case Western Reserve University (CWRU) are used to verify the performance of this method against that of the SVD and JADE methods. In this paper, the proposed method is also applied to identify the evolutionary trend of the hunting motion quickly and accurately. The results verify that the accuracy of this method is much higher than that of the JADE method.

The remainder of this paper is organized as follows: In Section 2, the theoretical backgrounds of Ensemble Empirical Mode Decomposition (EEMD) [28] and a separability metric are introduced. The proposed method of non-stationary Kernel JADE is also described. The framework of the proposed method is introduced in Section 3. In Section 4, data from Case Western Reserve University (CWRU) are used to test the proposed method, and the operational data of multiple states of high-speed trains are used to verify the accuracy and rapidity of the proposed method. Finally, the conclusion is presented in Section 5.

#### **2. Theoretical Background**

#### *2.1. EEMD*

The basic idea of the EEMD method is to use the sifting process to decompose the signal into several intrinsic mode functions after Gaussian white noise has been added. For an original signal, the specific steps to decompose it into IMFs using EEMD are as follows:

(1) Obtain the overall signal by adding Gaussian white noise to the original signal:

$$a'(t) = a(t) + n(t) \tag{1}$$

where $n(t)$ is the added Gaussian white noise.

(2) The overall signal is decomposed to obtain the IMF components of each order, where *i* represents the *i*-th component, *r* is the residual term, and *n* is the number of IMFs:

$$a'(t) = \sum\_{i=1}^{n} c\_i + r \tag{2}$$

(3) Repeat step (1) and (2), each time adding different white noise sequences of the same amplitude:

$$a\_j'(t) = \sum\_{i=1}^n c\_{ij} + r\_j \tag{3}$$

In Equation (3), $c\_{ij}$ is the $i$-th IMF component obtained from the decomposition when white noise is added for the $j$-th time, while $r\_j$ is the residual of that decomposition.

(4) Using the zero-mean principle of the Gaussian white noise frequency, the effect of white noise can be eliminated, and the IMF component corresponding to the original signal can be expressed as:

$$c\_i(t) = \frac{1}{N} \sum\_{j=1}^{N} c\_{ij}(t) \tag{4}$$

where $N$ represents the number of times the white noise is added and $c\_i$ represents the $i$-th IMF component obtained by the EEMD decomposition of the original signal.

*2.2. IMF Energy Matrix*

For the IMF component $c\_i = [c\_i(1), \ldots, c\_i(M)]$, the energy feature can be expressed as:

$$e\_i = \sum\_{j=1}^{M} \left[ c\_i(j) \right]^2 \tag{5}$$

All IMF components of a vibration signal form a feature energy vector $e = [e\_1, e\_2, \ldots, e\_n]$, where $n$ is the dimension of vector $e$ and represents the number of IMF components.
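A compact sketch of Equations (1)–(5), assuming the PyEMD package for the EEMD step:

```
import numpy as np
from PyEMD import EEMD   # assumption: PyEMD provides the EEMD implementation

def energy_vector(sig, trials=100):
    """Decompose a signal with EEMD and form the IMF energy
    vector e = [e_1, ..., e_n] of Equation (5)."""
    imfs = EEMD(trials=trials).eemd(sig)       # rows are IMFs (plus residue)
    return np.array([np.sum(c ** 2) for c in imfs])
```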

#### *2.3. Proposed Non-Stationary Kernel JADE*

Since the original JADE algorithm is based on stationary signal analysis, and considering the non-stationary nature of high-speed train signals, the kernel JADE method [15] is extended here to a non-stationary environment.

The main idea of the kernel is to map the input matrix into a nonlinear space $\Phi$. Suppose $X = \{x\_1, x\_2, \ldots, x\_m\}$; then, the mapping can be defined as follows:

$$\{\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_m\} \to \{\Phi(\mathbf{x}\_1), \Phi(\mathbf{x}\_2), \dots, \Phi(\mathbf{x}\_m)\} \tag{6}$$

During implementation, we need to calculate the inner product of two eigenvectors which have been mapped into a nonlinear space using a kernel function, and a kernel matrix will be calculated using Equation (7):

$$K\_{ij} = k(\mathbf{x}\_i, \mathbf{x}\_j) = \langle \Phi(\mathbf{x}\_i), \Phi(\mathbf{x}\_j) \rangle \tag{7}$$

where $\mathbf{x}\_i$ and $\mathbf{x}\_j$ are vectors. Commonly used kernel functions include the following [29]:

$$k(\mathbf{x}\_i, \mathbf{x}\_j) = \exp\left(-\frac{\left\|\mathbf{x}\_i - \mathbf{x}\_j\right\|^2}{2\sigma^2}\right) \tag{8}$$

$$k(\mathbf{x}\_i, \mathbf{x}\_j) = \left(\alpha \mathbf{x}\_i^T \mathbf{x}\_j + c\right)^{d} \tag{9}$$

$$k(\mathbf{x}\_i, \mathbf{x}\_j) = \tanh(a\mathbf{x}\_i^T \mathbf{x}\_j + \mathbf{c}) \tag{10}$$

Similar to KPCA [30], the centered kernel matrix *K* can be calculated through Equation (11):

$$\tilde{K}\_{ij} = K\_{ij} - \frac{1}{M} \sum\_{r=1}^{M} K\_{ir} - \frac{1}{M} \sum\_{r=1}^{M} K\_{rj} + \frac{1}{M^2} \sum\_{r,s=1}^{M} K\_{rs} \tag{11}$$
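As a small sketch of Equations (8) and (11), the following computes a Gaussian kernel matrix and centers it (σ = 0.6 anticipates the value selected in Section 4):

```
import numpy as np

def centered_gaussian_kernel(X, sigma=0.6):
    """Gaussian kernel of Equation (8), centered via Equation (11).
    X is an (M, dim) matrix of feature vectors."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise ||xi - xj||^2
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    one = np.full((len(X), len(X)), 1.0 / len(X))
    return K - one @ K - K @ one + one @ K @ one     # Equation (11)
```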

The whole time series is divided into $M$ segment intervals $T\_1, \ldots, T\_M$, and consequently $M$ covariance matrices $\mathbf{S}\_{T\_1}, \ldots, \mathbf{S}\_{T\_M}$ can be generated. These $M$ covariance matrices are jointly diagonalized to find a unitary matrix $U$ which diagonalizes all of them simultaneously; the energy features can then be extracted.

**Step 1:** For the $M$ segment intervals $T\_1, \ldots, T\_M$, the covariance matrices of the signal $x(t)$ can be expressed as:

$$\mathrm{Cov}\_{T\_m} = \mathbf{S}\_{T\_m} = \frac{1}{|T\_m|} \sum\_{t \in T\_m} k(\mathbf{s}\_t, \mathbf{s}\_t) \tag{12}$$

where $\mathbf{s}\_t = \mathbf{x}\_t - E(\mathbf{x}\_t)$, and $E(\mathbf{x}\_t)$ denotes the mean of $\mathbf{x}\_t$.

**Step 2:** The most common method to jointly diagonalize the matrices $\mathbf{S}\_{T\_1}, \ldots, \mathbf{S}\_{T\_M}$ is to diagonalize the first matrix and then transform the remaining $M-1$ matrices accordingly. $\mathbf{W}$ is the diagonalizing matrix of the covariance matrix $\mathbf{S}\_{T\_1}$:

$$\mathbf{W} = \mathbf{S}\_{T\_1}^{-1/2} = \mathbf{V}^H \boldsymbol{\Lambda}^{-1/2} \mathbf{V} \tag{13}$$

In Equation (13), $\mathbf{V}$ is the eigenvector matrix of $\mathbf{S}\_{T\_1}$, while $\boldsymbol{\Lambda}$ is the diagonal matrix of its eigenvalues. For the remaining $M-1$ matrices, the diagonalization matrices can be defined respectively as:

$$\mathbf{S}\_{T\_m}^\* = \mathbf{S}\_{T\_1}^{-1/2} \mathbf{S}\_{T\_m} (\mathbf{S}\_{T\_1}^{-1/2})^H, m = 2, \dots, M \tag{14}$$

**Step 3:** The approximate joint diagonalization problem is equivalent to finding an orthogonal matrix **U** that minimizes:

$$\sum\_{m=2}^{M} \left\| \text{off}(\mathbf{U} \mathbf{S}\_{T\_m}^\* \mathbf{U}^H) \right\|^2 = \sum\_{m=2}^{M} \sum\_{b \neq d} \left( \mathbf{U} \mathbf{S}\_{T\_m}^\* \mathbf{U}^H \right)\_{bd}^2 \tag{15}$$

where $\mathrm{off}(\mathbf{U}\mathbf{S}\_{T\_m}^\*\mathbf{U}^H)$ has the same off-diagonal elements as $\mathbf{U}\mathbf{S}\_{T\_m}^\*\mathbf{U}^H$ but zero diagonal elements, while $b$ and $d$ represent the $b$-th row and $d$-th column of the matrix, respectively.

Since the sum of squares remains the same when multiplied by an orthogonal matrix, the problem is equivalent to maximizing the sum of squares of the diagonal elements:

$$\sum\_{m=2}^{M} \left\| \mathrm{diag}(\mathbf{U} \mathbf{S}\_{T\_m}^\* \mathbf{U}^H) \right\|^2 = \sum\_{m=2}^{M} \sum\_{b=1}^{p} \left(\mathbf{U} \mathbf{S}\_{T\_m}^\* \mathbf{U}^H\right)\_{bb}^2 \tag{16}$$

where *p* represents the dimension to which the feature is extracted.

**Step 4:** Givens rotation is used to transform the set of matrices to a more diagonal form, two rows and two columns at a time. The Givens rotation matrix is given by:

$$G(i,j,\theta) = \begin{pmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & \cos(\theta) & \cdots & -\sin(\theta) & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & \sin(\theta) & \cdots & \cos(\theta) & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix} \tag{17}$$

where

$$\theta = \frac{1}{2} \text{arccot} \left[ \frac{(\mathbf{S}\_{T\_m}^\*)\_{22} - (\mathbf{S}\_{T\_m}^\*)\_{11}}{2(\mathbf{S}\_{T\_m}^\*)\_{12}} \right] \tag{18}$$

The initial value for the orthogonal matrix $U$ is the identity matrix $I$. The matrices $\mathbf{S}\_{T\_2}^\*, \ldots, \mathbf{S}\_{T\_M}^\*$ are then updated using:

$$\begin{array}{c} \mathbf{S}\_{T\_m}^\* \leftarrow \mathbf{G}(1,2,\theta)^T \mathbf{S}\_{T\_m}^\* \mathbf{G}(1,2,\theta), \quad m = 2, \dots, M\\ \mathbf{U} \leftarrow \mathbf{U} \mathbf{G}(1,2,\theta) \end{array} \tag{19}$$

When the values of all non-diagonal elements are less than a given threshold $\varepsilon$, the iteration is complete, and joint approximate diagonalization is achieved. The unitary matrix $\hat{\mathbf{U}}$ is obtained such that the multiple matrices are jointly diagonalized.

**Step 5**: The transform matrix $\mathbf{A}$ can be calculated as $\mathbf{A} = (\hat{\mathbf{U}}\mathbf{W})^{\#}$, where the superscript $\#$ denotes the pseudo-inverse.

As such, the original signal *s*(*t*) can be expressed as:

$$\mathbf{s}(t) = \mathbf{A} \cdot \mathbf{K} \tag{20}$$

Since the NKJADE method can extract the nonlinear relationships hidden in the high-dimensional feature space, the fusion features can be estimated through joint feature decomposition of multiple inputs. The fusion features express the nonlinear and non-stationary relationships hidden in the inputs well, so they can represent the characteristic relationships of different states and thus distinguish different feature states quickly and accurately.
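The joint diagonalization at the heart of Steps 2–4 can be sketched as follows; this follows the standard Jacobi-rotation scheme of Cardoso and Souloumiac for symmetric matrices rather than reproducing the authors' exact implementation:

```
import numpy as np

def joint_diagonalize(mats, eps=1e-8, max_sweeps=50):
    """Jacobi-rotation joint diagonalization of symmetric matrices `mats`.
    Returns an orthogonal U such that U.T @ M @ U is nearly diagonal for
    every M in `mats` (sketch after Cardoso & Souloumiac, cf. Steps 2-4)."""
    mats = [m.astype(float).copy() for m in mats]
    p = mats[0].shape[0]
    U = np.eye(p)
    for _ in range(max_sweeps):
        rotated = False
        for i in range(p - 1):
            for j in range(i + 1, p):
                # 2x2 criterion accumulated over all matrices (cf. Equation (16)).
                h = np.array([[M[i, i] - M[j, j], M[i, j] + M[j, i]] for M in mats])
                G = h.T @ h
                w, V = np.linalg.eigh(G)
                x, y = V[:, -1]                    # principal eigenvector
                if x < 0:
                    x, y = -x, -y
                r = np.hypot(x, y)
                if r < eps or abs(y) < eps * r:    # pair already diagonal enough
                    continue
                c = np.sqrt((x + r) / (2 * r))     # cos(theta)
                s = y / np.sqrt(2 * r * (x + r))   # sin(theta)
                R = np.eye(p)                      # Givens rotation, Equation (17)
                R[i, i] = R[j, j] = c
                R[i, j], R[j, i] = -s, s
                mats = [R.T @ M @ R for M in mats]
                U = U @ R
                rotated = True
        if not rotated:
            break
    return U
```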

#### *2.4. Separability Evaluation*

To illustrate the merit of our proposed algorithm, the separability $J$ is utilized to demonstrate the algorithm's ability to form distinct classes. The capability of feature extraction in pattern classification can be described quantitatively using the between-class scatter $S\_b$, the within-class scatter $S\_w$ [31], and the separability [32]. Assuming that the data comprise a total of $C$ classes, the vector of the $i$-th class is:

$$\mathbf{x}^{i} = \left(\mathbf{x}\_{1}^{i}, \mathbf{x}\_{2}^{i}, \dots, \mathbf{x}\_{n\_i}^{i}\right) \tag{21}$$

where $n\_i$ is the number of samples in the $i$-th class.

Between-class scatter *Sb* and within-class scatter *Sw* can be calculated as follows:

$$\mathcal{S}\_{\mathbf{b}} = \sum\_{i=1}^{\mathcal{C}} p\_i (m\_i - m) \left( m\_i - m \right)^T \tag{22}$$

$$\mathbf{S}\_{w} = \sum\_{i=1}^{C} \left[ \frac{p\_i}{n\_i} \sum\_{k=1}^{n\_i} \left( \mathbf{x}\_k^i - m\_i \right) \left( \mathbf{x}\_k^i - m\_i \right)^T \right] \tag{23}$$

where $p\_i = n\_i / \sum\_{j=1}^{C} n\_j$, $m\_i = \mathrm{mean}(\mathbf{x}^i)$, and $m = \sum\_{i=1}^{C} p\_i m\_i$.

The between-class scatter $S\_b$ describes how far different classes are separated, and the within-class scatter $S\_w$ indicates how compactly each class of samples is distributed. Based on the between-class and within-class scatter, the separability evaluation $J$ is introduced to describe the clustering ability of different methods. The separability $J$ can be calculated as follows:

$$J = \mathrm{trace}(\mathbf{S}\_b / \mathbf{S}\_w) \tag{24}$$

where the *trace* function refers to the sum of the diagonal elements.
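A direct implementation of Equations (21)–(24) might look like the following, interpreting the matrix ratio in Equation (24) as multiplication by the pseudo-inverse of $\mathbf{S}\_w$:

```
import numpy as np

def separability(features, labels):
    """Separability J from Equations (22)-(24).
    `features` is (n_samples, dim); `labels` holds class ids."""
    dim = features.shape[1]
    m = features.mean(axis=0)              # global mean = sum_i p_i * m_i
    Sb = np.zeros((dim, dim))
    Sw = np.zeros((dim, dim))
    for c in np.unique(labels):
        X = features[labels == c]
        p = len(X) / len(features)         # class prior p_i
        mi = X.mean(axis=0)
        d = (mi - m)[:, None]
        Sb += p * (d @ d.T)                # Equation (22)
        Xc = X - mi
        Sw += (p / len(X)) * (Xc.T @ Xc)   # Equation (23)
    return float(np.trace(np.linalg.pinv(Sw) @ Sb))
```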

#### **3. Methodology**

To handle the different characteristics of vibration signals, a multi-sensor data feature extraction framework is provided, in which a rapid feature fusion method using NKJADE is proposed. The framework of the feature extraction method is shown in Figure 1. The method can quickly and accurately extract the different features of information contained in non-stationary vibration signals. The main steps of this framework are as follows: (1) preprocess the multi-sensor vibration signals; (2) decompose each signal into IMFs using EEMD; (3) form the energy matrices from the IMFs; (4) fuse the features with NKJADE; and (5) input the fused features to an LSSVM for training and recognition.


**Figure 1.** Method framework.

#### **4. Experiment Results and Analysis**

In order to verify the validity of the method, we applied it to bearing fault identification (Case I) and the small amplitude hunting monitoring of high-speed trains (Case II). Some traditional approaches, namely Ensemble Empirical Mode Decomposition with Joint Approximate Diagonalization of Eigen-matrices (EEMD-JADE) and Ensemble Empirical Mode Decomposition with Singular Value Decomposition and Local Tangent Space Alignment (EEMD-SVD-LTSA), were compared with the proposed method.

#### *4.1. Case I—Case Western Reserve University Data*

#### 4.1.1. Data Description

The bearing test data for normal and faulty bearings were from the Case Western Reserve University (CWRU). The test signal contained normal state, ball fault, and inner and outer race faults; for the latter three, the fault diameters were 0.07, 0.14 and 0.21 inches, respectively.

The data used in this paper were collected from the drive end. The sampling frequency was 12 kHz and the shaft speed was 1725 r/min, corresponding to roughly 400 points collected per revolution. In order to reduce the influence of equipment fluctuations, each sample contained 800 points.

As shown in Table 1, two datasets were selected. Dataset A had the same fault location (inner race fault), but the fault size was different. Dataset B had different fault locations, but the fault size was the same, and the outer race fault was at the location of 3 o'clock. Each of the datasets was divided into three categories.



#### 4.1.2. Signal Processing Results

The scatter plots obtained by applying the EEMD-SVD-LTSA, EEMD-JADE, and EEMD-NKJADE methods to dataset A are shown in Figure 2a–c, while the corresponding scatter plots for dataset B are shown in Figure 2d–f. The results of selecting the kernel parameters are shown in Figure 3. The results of applying the three algorithms to dataset A are shown in Table 2, while the results for dataset B are shown in Table 3.

**Figure 2.** Scatter plots of faults. (**a**) EEMD-SVD-LTSA on dataset A; (**b**) EEMD-JADE on dataset A; (**c**) EEMD-NKJADE on dataset A; (**d**) EEMD-SVD-LTSA on dataset B; (**e**) EEMD-JADE on dataset B; (**f**) EEMD-NKJADE on dataset B.

**Figure 3.** The results of selecting parameters for the Gaussian kernel. (**a**) Parameter selection results of Gaussian kernel of EEMD-NKJADE method on dataset A; (**b**) Parameter selection results of Gaussian kernel of EEMD-NKJADE method on dataset B.


**Table 2.** Results from dataset A.

#### **Table 3.** Results from dataset B.


#### 4.1.3. Discussion

We found that the generalization ability of the Gaussian kernel is better than that of the other kernels. Therefore, we focused on the parameters of the Gaussian kernel.

According to the parameter selection results in [15], we set the range of σ to (1, 10). The parameter was incremented step by step (with a step size of 0.5, shown on the X-axis), and the separability $J$ was calculated (shown on the Y-axis). The larger $J$ is, the better the classification result. Therefore, the optimal Gaussian kernel parameter for the EEMD-NKJADE method on both dataset A and dataset B is 2.

From Figure 2 and Table 2, we see that all three methods could extract the features with satisfactory results. The accuracy of the three methods was nearly 100% in all cases. This result may be attributed to the large between-class differences in the CWRU bearing fault data, which make the classes quite easy to separate.

However, the separability $J$ values in Table 2 differ, consistent with Figure 2a–c. In this respect, the classification ability of the EEMD-NKJADE algorithm is obviously better than that of the other algorithms. In addition, the running times in Table 2 show that the EEMD-JADE calculation was relatively short (the PC configuration was as follows: Intel Core i5-4460, 12 GB of memory, NVIDIA GeForce GT720).

In order to further verify the robustness of the algorithm, we also tested the results of the algorithm on dataset B.

In Table 3, the accuracy of the three methods on dataset B was almost the same as on dataset A. However, the separability on dataset B was the highest after processing with the EEMD-SVD-LTSA algorithm, while on dataset A the separability achieved using EEMD-SVD-LTSA was the lowest, which indicates that the choice of dataset has a greater impact on EEMD-SVD-LTSA. Compared with EEMD-SVD-LTSA, the results from the EEMD-NKJADE and EEMD-JADE algorithms were less affected by the different datasets, and therefore appear more robust. In summary, the EEMD-NKJADE method offers superior performance with respect to classification effect and robustness, but its calculation time is longer than that of EEMD-JADE.

#### *4.2. Case II—Small Amplitude Hunting Monitoring of High-Speed Trains*

#### 4.2.1. Problem Description

The stability of hunting has always been a key problem in the study of vehicle lateral motion stability [33]. Small amplitude hunting is a sign of hunting instability. In China, hunting phenomena are considered to occur when the amplitude of the lateral acceleration signal from the bogie frame reaches or exceeds 8–10 m/s<sup>2</sup> more than 5 times continuously after a 10 Hz low-pass filter [34]. As shown in Figure 4, the lateral acceleration signal of the bogie frame will sometimes go through a normal operation/small amplitude hunting/normal operation periodic cycle, which is a gradually convergent process; the hunting amplitude in this situation is small and convergent. At other times, the signal will go through a normal operation/small amplitude hunting/critical hunting process, which is a gradually divergent process; the hunting amplitude in this situation starts small and then diverges.

**Figure 4.** Lateral acceleration signal of the bogie frame when small-amplitude hunting motion occurs during an online test. (**a**) Bogie frame acceleration; (**b**) Small amplitude convergent hunting; (**c**) Small amplitude divergent hunting.

Therefore, it is necessary to extract the features of different states rapidly and accurately, especially the small amplitude divergent hunting states, to guarantee the system is alerted in time to ensure the safe operation of the train.

#### 4.2.2. Data Acquisition

The data used in this paper were lateral acceleration signals of the bogie frame and axle box from an online tracking experiment. The CRTS II ballastless track and seamless rail were used on the whole line. The speed of the train was 320–350 km/h. The sampling frequency was 2500 Hz. All the data were acquired in accordance with China's Railway Passenger Traffic Safety Monitoring Standard [35]. Although in China the amplitude of lateral acceleration signals from the bogie frame is used as the testing parameter to monitor hunting motion, research has proven that many other testing parameters are also important for hunting monitoring. As such, in this paper, acceleration signals of the bogie frame and the axle box were used.

The installation locations of the accelerometers are shown in Figure 5, where two accelerometers are installed in the diagonal direction on the H-shaped bogie frame. The lateral accelerometers on the bogie are denoted as S1 and S2, respectively. Also, considering that the vibration of the axle box is important for hunting [36], a sensor located on the axle box was used; this lateral accelerometer is denoted as S3 in Figure 5. Figure 6 shows a photograph of the installation site.

**Figure 5.** Installation locations of the accelerometers. S1, S2: Accelerometer on the bogie frame; S3: Accelerometer on the axle box.

**Figure 6.** Site installation drawing.

Considering the practice of other researchers, the sampling frequency of the original signal was set to 2500 Hz. However, for hunting, the characteristic frequency of the lateral acceleration of the frame is only 3–7 Hz. Therefore, in the preprocessing stage, the signal was resampled to 250 Hz [37] and a 0–10 Hz low-pass filter was applied. Then, a zero-mean smoothing process was used to eliminate trend terms. In accordance with common practice, parameters such as the amplitude of the lateral acceleration signals from the bogie frame and axle box were used to obtain a synthetic assessment of the lateral stability of the high-speed train tested. In this paper, the filtered lateral acceleration signals were divided into four states: normal, small amplitude convergent hunting, small amplitude divergent hunting, and hunting. Ten groups of sample data were used for each of the four states and each sensor, yielding a total of 120 sample groups. The length of each sample was 500 points, corresponding to a sample time of 2 s.
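A sketch of this preprocessing chain, assuming SciPy's signal tools (the Butterworth filter order is our choice; the paper specifies only the 250 Hz resampling, the 0–10 Hz band, and the zero-mean step):

```
from scipy import signal

def preprocess(raw, fs=2500, fs_target=250, cutoff=10.0, order=4):
    """Resample the 2500 Hz acceleration signal to 250 Hz, low-pass it
    below 10 Hz (the hunting band is ~3-7 Hz), and remove the mean."""
    x = signal.decimate(raw, q=fs // fs_target, zero_phase=True)
    b, a = signal.butter(order, cutoff / (fs_target / 2), btype="low")
    x = signal.filtfilt(b, a, x)
    return x - x.mean()   # zero-mean step to strip trend terms
```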

Nonlinear factors [38,39] have been proven to affect the bifurcation evolution of small amplitude hunting and, according to [23], all the values of the Lyapunov exponent of the lateral acceleration are greater than 1, which means that the lateral acceleration signals from the bogie frame have non-stationary characteristics.

#### 4.2.3. Feature Fusion

First, EEMD was applied on each signal. Then, 7 IMFs and 1 residue were obtained using the EEMD method. Figure 7 shows an EEMD decomposition view of the three signals at different points during the same time period.

**Figure 7.** Ensemble Empirical Mode Decomposition (EEMD) diagram of signals at different positions. (**a**) S1 position (bogie frame); (**b**) S2 position (bogie frame); (**c**) S3 position (axle box).

The samples of the three sensors were processed, and the three corresponding energy matrices E1, E2, and E3, with a size of 40 × 8 were obtained. The NKJADE method was used to fuse the three high-dimensional energy matrices and the data dimensionality was reduced.

#### *4.3. Result and Discussion*

#### 4.3.1. Single Sensor Classification Using NKJADE

A scatter plot of the features extracted using a single sensor and NKJADE is shown in Figure 8. The separability *J* and accuracy of the NKJADE method are shown in Table 4.

**Figure 8.** Scatter plot of feature extraction from single sensor using Non-stationary Kernel JADE (NKJADE). (**a**) S1 (bogie frame); (**b**) S2 (bogie frame); (**c**) S3 (axle box).


**Table 4.** Separability *J*, running time, and accuracy of the NKJADE method.

From Table 4, the accuracy achieved using S1 data only was 97.23%, which was greater than that attained at the other sensor locations. However, when the three sensors were used together, the accuracy rate became 100%, and *J* reached a much greater value.

#### 4.3.2. Multi-Sensor Fusion Using NKJADE

The EEMD-SVD-LTSA and EEMD-JADE methods were used to compare the identification accuracy and calculation speed of the feature fusion method in multi-sensor conditions. The results of the methods are shown in Figure 9. The separability *J* and accuracy obtained using the different sensor fusion methods with data from all three sensors are shown in Table 5.

The JADE and SVD-LTSA methods were compared with the proposed method under the multi-sensor feature fusion conditions. From Table 5, the accuracy rate of the SVD-LTSA method, which uses the non-stationary LTSA technique, was 93.75%. The accuracy rate of the JADE method, a stationary method, was only 30.12%. The accuracy rate of NKJADE (Gaussian kernel, σ = 0.6) was 100%, and the separability $J$ of NKJADE was greater than those of the other methods. Considering that data from high-speed trains have more typically non-stationary characteristics than the bearing fault data from CWRU, this result shows that the proposed method may be more suitable for non-stationary data. The separability $J$ of the JADE method was only 26.67, which is even lower than that achieved using single-sensor S2 or S3 data. The reason for this is probably that the JADE method does not consider the non-stationary condition.

Besides, because real-time processing is a significant factor in the small amplitude hunting monitoring of high-speed trains, the calculation time is very important. If the calculation time is too long, the diagnostic information cannot be fed back to the system in time. Compared with the SVD-LTSA method, fusion features can be extracted very quickly using the proposed NKJADE method. The run time of the NKJADE method was nearly the same as that of the JADE method, while the separability $J$ of NKJADE was greater than that of JADE. This shows that NKJADE is a rapid multi-sensor feature fusion method for non-stationary conditions, which outperforms the SVD-LTSA and JADE methods. It can be applied to the monitoring of small amplitude hunting bifurcation evolution in high-speed trains.

**Figure 9.** Scatter plot of feature extraction from multi-sensor data. (**a**) EEMD-SVD-LTSA; (**b**) EEMD-JADE; (**c**) EEMD-NKJADE.

**Table 5.** The separability *J*, running time, and accuracy using different sensor fusion methods.


As shown in Figure 10, the parameter is incremented step by step (with a step size of 0.01, shown on the X-axis), and the separability index $J$ is calculated (shown on the Y-axis). The larger $J$ is, the better the classification result.

**Figure 10.** Parameter selection results of Gaussian kernel of EEMD-NKJADE method.

Because the raw data collected are mainly distributed in the range of −1 to 1 g (gravity), and σ in Equation (8) is the width parameter of the kernel function, we set the range of σ to (0, 1). As shown in Figure 10, as σ increases, $J$ increases at first and then decreases, indicating that the range of parameter selection is reasonable. When σ = 0.6, the classification result is the best.

#### **5. Conclusions**

In this paper, a rapid multi-sensor feature fusion method based on NKJADE is proposed, with which the features of multiple sensors can be extracted quickly and accurately. In order to use the algorithm in a non-stationary environment, the whole time series is divided into $M$ time periods and the kernel function is introduced. Then, Jacobi rotation is used to obtain the unitary matrix by diagonalizing multiple kernel matrices simultaneously to extract the non-stationary fusion features.

Our main findings are that:


**Author Contributions:** Data analysis and funding acquisition, J.N.; Writing—original draft preparation, M.F.; data collection: W.R.; experiment design, C.C.; detection equipment, Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundations of China, grant number 51975486 and 51975487.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**



### *Article* **A Cognitive Method for Automatically Retrieving Complex Information on a Large Scale**

#### **Yongyue Wang 1, Beitong Yao 1, Tianbo Wang 2, Chunhe Xia 1,3 and Xianghui Zhao 4,\***


Received: 19 April 2020; Accepted: 26 May 2020; Published: 28 May 2020

**Abstract:** Modern retrieval systems tend to deteriorate because of their large output of useless and even misleading information, especially for complex search requests on a large scale. Complex information retrieval (IR) tasks requiring multi-hop reasoning need to fuse multiple scattered text across two or more documents. However, there are two challenges for multi-hop retrieval. To be specific, the first challenge is that since some important supporting facts have little lexical or semantic relationship with the retrieval query, the retriever often omits them; the second challenge is that once a retriever chooses misinformation related to the query as the entities of cognitive graphs, the retriever will fail. In this study, in order to improve the performance of retrievers in complex tasks, an intelligent sensor technique was proposed based on a sub-scope with cognitive reasoning (2SCR-IR), a novel method of retrieving reasoning paths over the cognitive graph to provide users with verified multi-hop reasoning chains. Inspired by the users' process of step-by-step searching online, 2SCR-IR includes a dynamic fusion layer that starts from the entities mentioned in the given query, explores the cognitive graph dynamically built from the query and contexts, gradually finds relevant supporting entities mentioned in the given documents, and verifies the rationality of the retrieval facts. Our experimental results show that 2SCR-IR achieves competitive results on the HotpotQA full wiki and distractor settings, and outperforms the previous state-of-the-art methods by a more than two points absolute gain on the full wiki setting.

**Keywords:** information retriever sensor; multi-hop reasoning; evidence chains; complex search request

#### **1. Introduction**

Most previous work on retrievers focuses on searching for relevant lexicon within a single paragraph, and some easy search queries can be satisfied by a single paragraph or document using the traditional retrieval methods of search engines, such as matching lexical or semantic similarity or relatedness. However, a very challenging retrieval task for complex search requests, called multi-hop reasoning retrieval, requires combining evidence from multiple sources, as shown in Figure 1. Single-hop retrieval methods are almost incompetent for such intricate search tasks. The objective of a multi-hop retriever is to obtain and fuse scattered information from the internet step by step [1,2]. Figure 1 provides an example of a search request requiring multi-hop reasoning. In response to the search request, the information corresponding to the query can only be inferred with confidence from the second context after one has first drawn an inference from the first context.

**Figure 1.** An example of a multi-hop retrieval task. The method in this paper mimics the human reasoning process while searching for information online. The retrieval sensor will finally provide users with reasoning chains, which will help obtain structured retrieval information.

Multi-hop IR is of great difficulty because it requires reading and aggregating information over multiple pieces of textual evidence. Although Reading Comprehension models have been applied to candidate paragraphs for prediction (e.g., QFE [3], MINIMAL [4]), aggregating evidence from multiple sources for supporting fact prediction remains critically difficult. One of the challenges is that since some pivotal supporting facts have little lexical or semantic relationship with the original retrieval query, the retriever often omits these crucial facts, as illustrated in Figure 2. The reason for this challenge is that multi-hop IR is open-domain retrieval on a large scale, where search requests are given without any accompanying contexts. In this case, one is required to locate contexts relevant to queries from a large internet source before extracting the most relevant information. Retrieving a fixed list of documents independently does not capture the relationships between evidence documents through the bridge entities required for multi-hop reasoning, and such methods often fail to retrieve the required evidence for a multi-hop query. Recent open-domain IR methods [5] do not capture lexical information in entities and therefore struggle with entity-centric problems. The complex cognitive process in humans seems to offer more inspiration for this retrieval process, and some researchers have recently begun to mimic human reasoning to search for complex information. Although Ding's Cognitive Graph [6] exploits entity relations, it cannot find reasoning paths automatically. Multi-Step Reasoner introduces a multi-step reasoning method [7], but it only repeats the retrieval process a fixed number of times, so information is not effectively integrated. Qi et al. propose GoldEn IR [8], which cannot use the relations between documents during the iterative process. Most state-of-the-art approaches [2,9–11] for open-domain retrieval leverage non-parameterized models to retrieve a fixed set of documents, from which an information span is extracted by a neural model; however, they often fail to retrieve the evidence needed for multi-hop reasoning. Some research on the reasoning process does not separate the information from the cognitive graph; if information helpful to users is not selected on the cognitive graph, the retriever may fail. In addition, the explosion of online textual content of unknown veracity raises a vital concern about misinformation, such as fake news, rumors, and online "water army" opinions [12–14]. While some misinformation may have semantic relations with the retrieval query, it cannot be effectively identified by existing methods [15–17]. If scattered information does not have strong interactions with its neighbors on the cognitive graph, a retriever is apt to select this misinformation over the other nodes. While RE<sup>3</sup> and Multi-Passage [11,18] introduce an extract-then-select framework to re-rank candidates and improve these interactions, they overlook the information shared between different evidence candidates.

**Figure 2.** A complex example of an open-domain retrieval task from HotpotQA. Paragraph 2 is difficult to retrieve using traditional retrievers due to its little lexical overlap with the given query.

In this paper, we study the problem of multi-hop retrieval, which requires multi-hop reasoning among evidence scattered across multiple raw documents on a large scale. Given a query utterance and a set of accompanying documents, not all of the documents are relevant. The information can be obtained by selecting two or more pieces of evidence from the documents and reasoning among them. The models are expected to find the most useful information for the search request in open domains. We propose a sub-scope cognitive reasoning information retrieval (2SCR-IR) sensor, a novel method to address the above-mentioned difficulties of multi-hop retrieving, as shown in Figure 3 (nodes are fundamental entities, with the color denoting the identical entity in paragraphs. The blue edges are the relationships between different entities in the same paragraph, while the red ones are the relationships between the same entities in different paragraphs. One search request and three paragraphs are given. Our 2SCR-IR sensor conducts multi-step reasoning over the supporting evidence by constructing a cognitive graph from the supporting information, automatically adjusting the sub-scope to select a subgraph, propagating information through the graph, verifying the reliability of the information, and finally producing supporting evidence chains. The circles, denoting the sub-scope, are chosen by 2SCR-IR in each step). For the first challenge, the 2SCR-IR sensor constructs a self-adjusting sub-scope cognitive graph based on entities referred to in the search request and web documents, which is iterated to accomplish multi-hop reasoning. In each step, the 2SCR-IR sensor reasons on a dynamically adjusted graph, with unrelated entities left out and reasoning information exclusively preserved in a scope prediction process. To solve the second problem, we use an information fusion process in the 2SCR-IR sensor to eliminate misinformation. The process of iteratively expanding with clues in the sub-scope can discover the paragraphs weakly related to the query in our framework, which can also play a pivotal role in filtering out misinformation. To further verify the authenticity of the retrieved information, we introduce pageview verification to our sensor. Overall, our work incorporates a four-fold contribution:


#### **2. Related Works**

#### *2.1. Search Engine and Information Retrieval*

The search engine is a system that assists users in retrieving information that they wish to obtain after submitting searching queries. One search engine, in response to a retrieval query, typically compares keywords of the query with an index generated from a sea of web sources, such as text files, image files, video files, and other content items. Based on this comparison, it provides users with the most relevant content items [19]. In our paper, only textual information retrieved from the Internet is taken into account. With the rapid development of deep learning, researchers have an increasing interest in bringing neural networks into IR tasks. Although IR tasks frequently use sentence similarity, as discussed by Guo et al. [20], they pay more attention to relevance matching, in which the match of specific content plays a significant part. Some researchers [21] have confirmed that IR is more about retrieving sentences with identical semantic meaning, while traditional methods based on relevance matching show a serious deficiency in semantic understanding. Therefore, we used natural language reasoning techniques, instead of relevance-focused IR methods, to solve this issue. In many cases, the paragraph containing the information corresponding to the searching query has little lexical overlap with the query, which makes it difficult for common search engines to retrieve from a large open scope. For instance, the accuracy of a BM25 retriever for finding all supporting evidence for a query diminishes from 57.3% to 25.9% on the "easy" and "hard" subsets, respectively [2].
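
A quick way to see this failure mode is to score a multi-hop query against its two supporting paragraphs with plain BM25. The sketch below uses the third-party rank_bm25 package (an assumed dependency, not something the paper specifies), and the toy corpus and query are illustrative; the second paragraph is a required supporting fact but shares almost no terms with the query, so its lexical score stays low.

```
# Toy BM25 scoring: a required second-hop paragraph gets a low lexical score.
# rank_bm25 is an assumed third-party dependency (pip install rank-bm25).
from rank_bm25 import BM25Okapi

corpus = [
    "Rand Paul held a 2016 campaign event at the Galt House hotel.",
    "The Galt House is a hotel on the Ohio River in Louisville.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# The query shares terms with paragraph 1 but almost none with paragraph 2,
# even though paragraph 2 is needed to answer "on which river was it held?".
query = "where was the Rand Paul campaign event held".lower().split()
print(bm25.get_scores(query))  # paragraph 2 scores far lower than paragraph 1
```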


**Figure 3.** An example of a multi-hop text-based information retrieval task. (**a**). Search request and corresponding information source. (**b**). Reasoning process.

#### *2.2. Single-Hop and Multi-Hop QA*

Depending on the complexity of the underlying reasoning, information retrieval tasks can be categorized into single-hop and multi-hop ones. The former only needs one piece of evidence extracted from the underlying information, e.g., "which city is the capital of China". On the contrary, a multi-hop retrieval task requires recognizing multiple pieces of relevant evidence and reasoning about them, e.g., "what is the capital city of the largest province in China". Many IR techniques that can answer a single-hop searching query [22] are hard to apply to multi-hop tasks, since a single piece of evidence can only partially satisfy such a query.

#### *2.3. Open-Domain Retrieval Task*

The open-domain retrieval task refers to the setting where the search scope of the supporting evidence is extremely large and complex. Recent open-domain IR systems based on deep learning methods follow a two-step process, namely selecting potentially relevant documents from a large corpus and extracting useful sources from these selected documents. Chen et al. [23] were the first to successfully apply this framework to IR; [24] introduced new benchmarks; [11,25] improved this framework by introducing feedback signals; and [26] proposed a method that can dynamically adjust the number of retrieved documents. However, these mainstream methods are weak in mining information loosely related to the searching queries and fail to identify misinformation.

#### *2.4. Similarity Matching Based Method*

Traditional frequency-based approaches, such as the "bag of words" model [27], have been employed to extract features in search engines. However, these frequency-based techniques do not preserve the text sequence, resulting in a lack of understanding of the context's full meaning [28]. While other text-similarity-matching approaches, such as tree kernels [29] and high-order n-grams [30], may capture word order and some semantics, they cannot fully master the context's meaning, which heavily affects recognition accuracy.

#### *2.5. Multi-Hop Reasoning for Retrieval Task*

Previous research on popular GNN frameworks has shown promising results in retrieval tasks requiring multi-hop reasoning. Coref-GRN aggregates information from disparate references scattered across paragraphs [31]. The different mentions of the same entity are combined with a graph recurrent neural network (GRN) to produce aggregated entity representations [32]. Based on Coref-GRN, MHQA-GRN [33] refines the graph construction procedure with more connections, which shows further improvements. Entity-GRN [34] proposes a method to distinguish different relations in cognitive graphs through a convolutional neural network. Besides, Core-GRN [35] explores the cognitive graph construction problem. Nonetheless, how to reason effectively on the constructed cognitive graphs has rarely been investigated, and this is the primary problem studied in this paper. Other frameworks, such as Memory Networks [36], deal with multi-hop reasoning. These frameworks develop representations for queries and retrieval sources, between which interactions are then made in a multi-hop manner. The IR-Net [37] generates an entity state and a relation state at each step, computing the similarity between all entities and relations given by the dataset. However, these frameworks perform reasoning on simple datasets with a limited number of entities and relations, which is quite different from our work with large-scale retrieval and intricate search queries.

#### *2.6. Pageview*

The pageview is a collection of Web objects or resources representing a specific user event, such as clicking on a link or viewing a product page [38]. Besides, pageview frequency has been shown to help improve the performance of evidence retrieval results [39].

#### *2.7. Reasoning Chains*

A reasoning chain is a sequence that logically connects users' search requests to supporting facts [40]. Reasoning chains should be intuitively related: they should exhibit a logical structure [41], or some other kind of textual relation that would allow human readers to quickly obtain what they really need.

#### **3. System Architecture**

#### *3.1. Overview*

In this section, we describe our new graph-based reasoning method that learns to find supporting evidence as reasoning paths for complex retrieval tasks. Our method is inspired by the human cognitive process of searching an open question online, and it starts by selecting the critical entity in the retrieval query. To make better use of the current entity's surrounding information, our method connects the start entity to related entities that are either found in the neighborhood or linked by the same surface mention. This process is repeated to construct reasoning chains.

The 2SCR-IR framework comprises two components in our proposed system (Figure 4): one system constructs the cognitive graph, on which the other system reasons explicitly. Our method uses strong interconnections between evidence to prevent important documents from being omitted from the reasoning paths.

**Figure 4.** Model architecture of the proposed 2SCR-IR sensor.

We used Wikipedia to retrieve relevant information for an open-domain search request, in which each article falls into paragraphs, resulting in billions of paragraphs in total. Any paragraph *p* can be perceived as our retrieval target. Given a query *q*, our framework aims at deriving its relevant textual sources to make up evidence chains by retrieving and reasoning paths, each of which is represented with a sequence of paragraphs.

#### *3.2. Gold Paragraph Selection*

To ease the computational burden of the following steps, we first applied paragraph selection to the given context. Documents with titles matching the whole query were our priority. The selector calculates a relevance score between 0 and 1 for the query and each paragraph. We set a threshold parameter τ, and paragraphs with predicted scores greater than τ are chosen and concatenated together as the context C. Since not every piece of text is relevant to the query, we further search for paragraphs that contain entities appearing in the searching query. If the two previous ways fail, a BERT-based paragraph ranker [42] is used to select the paragraphs with scores higher than τ. Q and C are further processed by upper layers. After the first-hop paragraphs are identified, the next step is to search for evidence within paragraphs leading to other relevant paragraphs. Instead of depending on entity linking, we use hyperlinks in the first-hop paragraphs to find second-hop paragraphs, to avoid introducing noise. After all the links are chosen, we add edges between the evidence with these links and hyperlinks. When the two-hop paragraphs are selected, we obtain some candidate paragraphs. To further lower noise, we use the paragraph ranking model based on the BERT encoder to select the top-*N* paragraphs in each step. Specifically, a retrieval query *q* and each of the 10 paragraphs *pi* are fed into a BERT model [42], and a softmax function is used to calculate the probability *P*(*q*, *pi*) of *pi*. A paragraph *pi* is chosen as a gold paragraph for the query *q* if *P*(*q*, *pi*) is larger than the threshold τ. To collect most of the gold paragraphs for each retrieval query, the threshold was set so as to achieve 97.0% recall and 69.0% precision.
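
The selection logic above can be summarized in a short sketch. Here `score_paragraph` is a hypothetical stand-in for the BERT-based ranker [42], the title- and entity-matching priorities are collapsed into a single substring check for brevity, and τ = 0.1 follows Section 4.2; none of these names come from the authors' code.

```
# Sketch of threshold-based gold paragraph selection (simplified).
from typing import Callable, List, Tuple

def select_gold_paragraphs(
    query: str,
    paragraphs: List[str],
    score_paragraph: Callable[[str, str], float],  # stand-in for BERT ranker [42]
    tau: float = 0.1,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    # Priority: paragraphs that contain the whole query (title/entity match).
    exact = [(p, 1.0) for p in paragraphs if query.lower() in p.lower()]
    if exact:
        return exact[:top_n]
    # Fallback: keep ranker scores above tau, then take the top-N.
    scored = [(p, score_paragraph(query, p)) for p in paragraphs]
    kept = [(p, s) for p, s in scored if s > tau]
    kept.sort(key=lambda ps: ps[1], reverse=True)
    return kept[:top_n]

# Usage with a toy lexical-overlap scorer standing in for BERT:
toy = lambda q, p: len(set(q.split()) & set(p.split())) / (len(p.split()) + 1)
print(select_gold_paragraphs("Galt House", ["The Galt House is a hotel.", "Unrelated text."], toy))
```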

#### *3.3. Encoder and Attention Module*

We introduced a BERT model to acquire the representation of the query $Q = [q\_1, q\_2, \ldots, q\_k] \in \mathbb{R}^{k \times d}$ and that of the context $C = [c\_1, \ldots, c\_l] \in \mathbb{R}^{l \times d}$, where $k$ and $l$ are the lengths of the searching query and the context, and $d$ is the size of the BERT hidden layer.

To gain more semantic information, following [43], we use the bidirectional attention module to update the representation for each word and a weighted self-aligned vector to represent the query.

The representation of the context *C* is

$$\mathcal{C} = [\mathcal{C}; \mathcal{C}^{Q}; \mathcal{C} \odot \mathcal{C}^{Q}; \mathcal{Q}^{C} \odot \mathcal{C}^{Q}] \in \mathbb{R}^{l \times 4d\_h} \tag{1}$$

where $\odot$ denotes elementwise multiplication, and $d\_h$ is the hidden dimension.

The representation of the searching query is

$$\varphi = \operatorname{softmax}(w\_{\varphi}h) \tag{2}$$

$$q = w\_a \tanh\left(W\_a \sum\_k \varphi\_k h\_k\right) \tag{3}$$

where $h$ represents the query embedding; $h\_k$ denotes the $k$-th word in the query; and $w\_{\varphi}$, $w\_a$, and $W\_a$ are linear projection parameters.
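
The self-aligned query vector of Equations (2) and (3) can be written in a few lines of PyTorch. This is a reading of the formulas above under illustrative dimensions, not the authors' released code.

```
# Self-aligned query vector: softmax word weights, then a tanh projection.
import torch
import torch.nn as nn

class SelfAlignedQuery(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w_phi = nn.Linear(d, 1, bias=False)  # scores each word h_k, Eq. (2)
        self.W_a = nn.Linear(d, d, bias=False)    # inner projection, Eq. (3)
        self.w_a = nn.Linear(d, d, bias=False)    # outer projection, Eq. (3)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (k, d) word representations of the query
        phi = torch.softmax(self.w_phi(h).squeeze(-1), dim=0)  # Eq. (2)
        pooled = (phi.unsqueeze(-1) * h).sum(dim=0)            # sum_k phi_k h_k
        return self.w_a(torch.tanh(self.W_a(pooled)))          # Eq. (3)

q_vec = SelfAlignedQuery(d=768)(torch.randn(12, 768))  # 12 words, BERT-sized d
print(q_vec.shape)  # torch.Size([768])
```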

#### *3.4. Construction of Cognitive Graph*

To perform multi-hop reasoning, we first needed to construct a graph of the information source to cover all the Wikipedia paragraphs. The Stanford CoreNLP Toolkit [44] was employed to generate semantic graphs from the source text. The cognitive graph was constructed with entities as nodes, and the number of extracted entities is denoted as *N*. We constructed the entity graph based on Entity-GCN [34] and DFGN [16], which means that all mentions of the candidates found in the documents are used as nodes in this cognitive graph. We used hyperlinks in Wikipedia to create the directed edges, while the undirected edges were defined in line with the positional properties of every pair of nodes. The edges in this graph were split into two categories:


It needs to be emphasized that we do not apply the co-reference resolution to pronouns because it will introduce complex and unnecessary links.
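
To make the two edge categories concrete, the sketch below builds them from pre-extracted entity lists, one list per paragraph (the paper extracts entities with the Stanford CoreNLP Toolkit [44]): undirected edges between different entities in the same paragraph, and edges between mentions of the same entity across paragraphs. Hyperlink-based directed edges are omitted for brevity, and the example entities are illustrative.

```
# Sketch of entity-graph edge construction from per-paragraph entity lists.
from collections import defaultdict
from itertools import combinations

def build_entity_graph(paragraph_entities):
    """paragraph_entities: one list of entity names per paragraph."""
    edges = set()
    mentions = defaultdict(list)  # entity name -> paragraphs that mention it
    for p_idx, ents in enumerate(paragraph_entities):
        for e in set(ents):
            mentions[e].append(p_idx)
        # intra-paragraph edges between different entities ("blue" edges)
        for e1, e2 in combinations(sorted(set(ents)), 2):
            edges.add(((e1, p_idx), (e2, p_idx)))
    # cross-paragraph edges between mentions of the same entity ("red" edges)
    for e, locs in mentions.items():
        for p1, p2 in combinations(sorted(locs), 2):
            edges.add(((e, p1), (e, p2)))
    return edges

g = build_entity_graph([["Rand Paul", "Galt House"], ["Galt House", "Ohio River"]])
print(len(g))  # 3 edges: 2 intra-paragraph + 1 cross-paragraph
```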

#### *3.5. Multi-Step Reasoning*

After context encoding, the 2SCR-IR sensor performs reasoning over the cognitive graph. Having embedded the query Q and the context C, a major challenge remains: identifying supporting entities and the text spans of potential evidence paragraphs, as well as capturing relationships between evidence documents that have little lexical overlap or semantic relationship with the original query. Based on the Cognitive Graph [6] and the DFGN method [16] for the QA system, an advanced knowledge-fusion module is adopted to mimic humans' step-by-step thinking and reasoning behavior, starting from *Q*<sup>0</sup> as well as *C*<sup>0</sup> and finding one-step supporting entities. The module achieves the following:


After obtaining the entity embeddings from the input context *Ct*, we applied a graph neural network to propagate the node information to its neighbors. Furthermore, a dynamic graph attention mechanism, which is a hierarchical and top-down process, was utilized to imitate humans' step-by-step exploring and reasoning behavior. It first attentively reads all knowledge graphs and then all triples in each graph for the final word generation. Yan et al. [45] show that higher relevance to the query can help a neighboring entity obtain more information from nearby. To calculate the degree of relation between the context and the query, 2SCR-IR computes the attention between them. The more relevant an entity is to the query, the more pivotal it is. Following [16], we multiply the entity embeddings by this relevance score:

$$\gamma\_{i}^{(t)} = \operatorname{sigmoid}\left(\frac{\tilde{\mathbf{q}}^{(t-1)}\mathbf{V}^{(t)}\mathbf{e}\_{i}^{(t-1)}}{\sqrt{d\_{2}}}\right) \tag{4}$$

$$\tilde{\mathbf{E}}^{(t-1)} = \gamma^{(t)} \odot \mathbf{E}^{(t-1)} \tag{5}$$

where $\mathbf{V}^{(t)}$ is a linear projection matrix; $d\_2$ represents the dimension of each entity; and $\gamma\_i^{(t)}$ is the relevance score of the entity $e\_i$ at step $t$. The entity graph is updated in a loop.
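
A PyTorch reading of Equations (4) and (5) is sketched below: each entity embedding is gated by a sigmoid relevance score computed against the query. All shapes and dimensions here are illustrative assumptions.

```
# Soft relevance mask over entity embeddings, Eqs. (4)-(5).
import torch

def relevance_gate(q_tilde, E, V):
    """q_tilde: (d,) query vector; E: (N, d2) entity embeddings;
    V: (d, d2) projection matrix. Returns the gated entity matrix."""
    d2 = E.shape[-1]
    gamma = torch.sigmoid((q_tilde @ V @ E.t()) / d2 ** 0.5)  # Eq. (4)
    return gamma.unsqueeze(-1) * E                            # Eq. (5)

E_masked = relevance_gate(torch.randn(768), torch.randn(100, 300),
                          torch.randn(768, 300))
print(E_masked.shape)  # torch.Size([100, 300])
```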

As a result, this step of the information propagation is confined to a dynamic scope on the cognitive graph. The next step is to disseminate information across the dynamic scope.

We now describe how to integrate the output of the entity graph into the context. To identify which sentences are more crucial for answering the question, we used a weighted self-aligned vector to represent the context. For each sentence *si*, we compute the weight ε*<sup>i</sup>* by

$$\varepsilon\_{i} = \operatorname{softmax}(w\_{\psi}s\_{i})\tag{6}$$

The context vector is computed by

$$\mathbf{C}\_{V} = w\_{a} \tanh\left(\mathcal{W}\_{a} \sum\_{i} \varepsilon\_{i} \mathbf{s}\_{i}\right) \tag{7}$$

where $w\_{\psi}$ and $w\_a$ are linear projection parameters, and the context vector $C\_V \in \mathbb{R}^d$.

After the context vector is concatenated to each word in the context, the *i*-th word *ci* in the context C is updated as follows:

$$c\_i = [c\_i; C\_V] \tag{8}$$

After completing the contextual information, we used an LSTM layer to fuse the output of the entity graph and the context:

$$\mathbb{C} = LSTM([\mathbb{C}; E]) \tag{9}$$

The prediction structure we used is the same as that of Yang et al. [2]. This prediction process has three outputs: the supporting sentences and the start and end positions of the information needed, denoted as $O\_{sup}$, $O\_{start}$, and $O\_{end}$, respectively. We stack three RNNs $P\_i$ layer by layer to solve the problem of output dependency. The information from the fusion block is sent to the first RNN $P\_0$, and the outputs of this module are as follows:

$$\mathbf{O}\_{\text{sup}} = P\_0 \left( \mathbf{C}^{(t)} \right) \tag{10}$$

$$\mathbf{O}\_{start} = P\_1(\left[\mathbf{C}^{(t)}, \mathbf{O}\_{sup}\right]) \tag{11}$$

$$\mathbf{O}\_{end} = P\_2(\left[\mathbf{C}^{(t)}, \mathbf{O}\_{sup}, \mathbf{O}\_{start}\right]) \tag{12}$$

Each $P\_i$ outputs a logit $\mathbf{O} \in \mathbb{R}^{M \times d\_2}$.

In order to optimize the combined effect of these three sections, we compute these three cross entropy losses:

$$L = \mu\_{\text{start}} L\_{\text{start}} + \mu\_{\text{end}} L\_{\text{end}} + \mu\_{\text{support}} L\_{\text{support}} \tag{13}$$
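
One way to realize the cascaded heads of Equations (10)-(12) and the weighted loss of Equation (13) is sketched below. GRUs stand in for the unspecified RNN type, each head consumes the fused context concatenated with the previous heads' states (a stand-in for their logits), and all sizes and loss weights are illustrative assumptions.

```
# Cascaded prediction heads for supporting facts, start, and end positions.
import torch
import torch.nn as nn

class CascadedHeads(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.p0 = nn.GRU(d, d, batch_first=True)      # supporting facts, Eq. (10)
        self.p1 = nn.GRU(2 * d, d, batch_first=True)  # start position,  Eq. (11)
        self.p2 = nn.GRU(3 * d, d, batch_first=True)  # end position,    Eq. (12)
        self.out = nn.ModuleList([nn.Linear(d, 1) for _ in range(3)])

    def forward(self, C: torch.Tensor):
        # C: (batch, M, d) fused context representation C^(t)
        h0, _ = self.p0(C)
        o_sup = self.out[0](h0)
        h1, _ = self.p1(torch.cat([C, h0], dim=-1))
        o_start = self.out[1](h1)
        h2, _ = self.p2(torch.cat([C, h0, h1], dim=-1))
        o_end = self.out[2](h2)
        return o_sup, o_start, o_end

o_sup, o_start, o_end = CascadedHeads(d=128)(torch.randn(2, 64, 128))
# Eq. (13), with assumed weights mu_*:
# loss = mu_start * l_start + mu_end * l_end + mu_support * l_support
```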

Our method learns to extract reasoning paths within the constructed cognitive graph as evidence chains. Besides, evidence resources for a complex searching query do not necessarily have lexical overlaps. We adopted an automatically adjusting sub-scope to collect those entities with some sort of relation. Once one of the resources is retrieved, its entity mentions and the query often entail another resource.

#### *3.6. Producing Plausible Evidence Chains*

Finally, the 2SCR-IR sensor will provide users with reasoning chains, which first verifies every reasoning path, and then extracts a retrieval span from the most reasonable paths using a common method [42]. Our retriever model learns to predict plausible reasoning paths by capturing the paragraph interactions through the BERT (CLS) representations, after independently encoding the paragraphs along with the searching query. This paragraph interaction is crucial for multi-hop reasoning [46], especially when faced with open-domain information. However, as there is invariably some noise or misinformation, the best evidence chain is not always enough to fully cover all information to be retrieved. Hence, we re-computed the probability of the path with the increase in the pageview to lower the uncertainty and make our framework more robust. In the end, we select the top-*m* evidence chains for users.

#### *3.7. Example*

After *N* paragraphs and the query are put in, the selector module filters out irrelevant paragraphs. Then, BERT is used as the encoder module to represent the filtered content in line with Formulas (1)–(3). After knowledge representation, a classification layer is applied to predict relevance scores between the paragraphs and the query. A paragraph is labeled as 1 if it contains supporting facts; otherwise, it is labeled as 0. In the inference process, a paragraph is chosen if its relevance score exceeds τ. In the reasoning process, in order to make inferences on the information, we calculated the degree of the relationship between the entities and the query according to Formulas (4) and (5) and built the entity graph. After obtaining the output of the entity graph, we fused it into the context. We used Formulas (6)–(9) to recognize the relevance of the search request to sentences. Next, Formulas (10)–(12) were utilized to predict the supporting sentences, as well as the start and end positions corresponding to each search request. Moreover, Formula (13) was employed to optimize the combined effect. Finally, we obtain supporting facts and construct reasoning chains corresponding to the search request.

#### **4. Experiments**

#### *4.1. Datasets*

We evaluated our method on an open-domain Wikipedia-sourced dataset named HotpotQA [2], which is a human-annotated large-scale dataset that requires multi-hop reasoning and provides annotations to evaluate the prediction of supporting sentences. Almost 84% of queries require multi-hop reasoning. We used the full wiki and the distractor settings of HotpotQA to conduct our experiments, with our primary target being the full wiki setting. Due to its open-domain scenario, we used the distractor setting to evaluate the performance of our method in a closed scenario where the evidence candidates are provided. For optimization, we used the Adam optimizer [47] with an initial learning rate of 0.0005 and a mini-batch size of 32.

In addition, a good information retriever should be robust enough to prevent noise from disturbing it. IR can be armed with the well-structured and large-scale knowledge graph DBpedia [48] to offer robust knowledge between concepts. However, the biggest disadvantage of this approach is that the cognitive model cannot be trained end-to-end and errors may cascade. Inspired by [23], our retriever is trained to recognize relevant and irrelevant paragraphs at each step. We therefore mixed noise information (negative examples) with generally correct paragraphs; to be more specific, we used two types of negative examples: Term Frequency–Inverse Document Frequency-based (TF–IDF-based) and hyperlink-based ones. For the single-hop retriever, only the former type is used. For the multi-hop retriever, we used both sorts, with the latter type carrying more weight to prevent our retriever from being distracted by reasoning paths without correct answer spans. Generally, the number of negative examples is set to 50.

#### *4.2. Implementation Details*

We performed our experiments in Google Colab. In the paragraph selector, the threshold τ was set to 0.1, with the length of the context restricted to 512. Besides, the number of entities in the cognitive graph was set to 100. The input dimension of the BERT, the hidden dimension, and the batch size were set to 800, 310, and 10, respectively. The total number of training epochs was set to 30.

#### *4.3. Baseline*

We used Yang's model [2] proposed in the original HotpotQA paper as a baseline model that follows the retrieval–extraction framework of DrQA [23] and introduces the advanced techniques, such as self-attention and bi-attention.

#### *4.4. Evaluation Metrics*

First, we introduced two common evaluation metrics: the Exact Match (EM) and the F1 score [2]. For a better performance evaluation of the multi-hop reasoning of our sensor, we also introduced supporting fact retrieval metrics based on these common metrics. We adopted Supporting fact Prediction F1 (SPF1) and Supporting fact Prediction EM (SPEM) to evaluate the sentence-level supporting fact retrieval accuracy. It is worth noting that paragraph-level retrieval accuracy matters for multi-hop reasoning as well. Thus, we adopted Paragraph Recall (PR), which evaluates whether at least one of the ground-truth paragraphs is included among the retrieved paragraphs. To evaluate whether both of the ground-truth paragraphs used for multi-hop reasoning are included among the retrieved paragraphs, we used Paragraph Exact Match (PEM).
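
For clarity, minimal implementations of the two paragraph-level metrics are given below; the example titles are illustrative.

```
# Paragraph Recall (PR): at least one gold paragraph retrieved.
# Paragraph EM (PEM): every gold paragraph retrieved.
from typing import Iterable, Set

def paragraph_recall(retrieved: Iterable[str], gold: Set[str]) -> float:
    return float(any(p in gold for p in retrieved))

def paragraph_em(retrieved: Iterable[str], gold: Set[str]) -> float:
    return float(gold.issubset(set(retrieved)))

gold = {"Galt House", "Rand Paul presidential campaign, 2016"}
retrieved = ["Galt House", "Ohio River"]
print(paragraph_recall(retrieved, gold), paragraph_em(retrieved, gold))  # 1.0 0.0
```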

#### **5. Results**

#### *5.1. Overall Results*

Table 1 shows the performance of different IR models on the HotpotQA development set. From the table, we can see that the 2SCR-IR sensor outperforms all previous results under both the full wiki and distractor settings. Compared with the state-of-the-art model [15], 2SCR-IR achieves 1.1 F1 and 1 EM gains on the full wiki setting, as well as 1.2 F1 and 1.9 EM gains on the distractor setting. The results show that 2SCR-IR achieves improvement in predicting supporting facts in both the full wiki and distractor settings.

**Table 1.** Primary results for HotpotQA development set results: SP results on the HotpotQA's full wiki and distractor settings. The highest value per column is marked in bold.


After comparing our 2SCR-IR sensor with other competitive retrieval methods on the SQuAD dataset, our model is found to outperform the current state-of-the-art model [49] by 2.6 F1 and 3 EM scores, as shown in Table 2. At the beginning of the paper, we predicted that low lexical overlap between the search query and contexts would pose a challenge to methods using lexical-based retrievers in finding relevant articles. The experiment proved that our prediction is valid.

**Table 2.** SQuAD results: We report the F1 and EM scores on SQuAD, following previous work. The highest value per column is marked in bold.


#### *5.2. Performance of Evidence Chain Retrieval*

Retrieval results in Table 3 show that our 2SCR-IR sensor yields improvements of 1.3 in PR, 7.2 in PEM, and 7.8 in EM, respectively. The conspicuous improvement from TF-IDF to Entity-centric IR demonstrates that exploring different reasoning granularities helps to retrieve the paragraphs with fewer term overlaps. Moreover, the comparison of our retriever with Entity-centric IR Retrieval shows the importance of explicitly retrieved reasoning paths in the cognitive graph, especially for complex multi-hop searching queries.

**Table 3.** Retrieval evaluation: Comparing our retrieval method with other methods across Paragraph Recall, Paragraph EM, and EM metrics. The highest value per column is marked in bold.


#### *5.3. Ablation Study*

To evaluate the performance of the disparate components in our 2SCR-IR sensor, we performed ablation studies, where we simply use gold supporting facts as the input context. Table 4 shows the ablation results of the multi-hop retrieval performance on the development set of HotpotQA.

From Table 4, we can observe that the sub-scope and pageview parts help our sensor obtain relative gains of 3–7% in F1 and 5–8% in EM, respectively. "-attention module" means discarding the information of bidirectional attention. In this case, the performance drops considerably in each metric, which proves the capability of the bidirectional attention module to comprehend semantics.

**Table 4.** Ablation results on dev set.


At the beginning of the paper, we predicted that the pageview and the sub-scope can improve the ability to find useful supporting facts. In the experiment, our observation and proposed scheme are proved valid.

#### **6. Case Study**

Our case study is presented in Figure 5, which shows the reasoning process of the 2SCR-IR sensor. Firstly, our model produces scope 1 as the first entity of reasoning by comparing the searching query with entities, in which "Rand Paul" and "hotel" are selected as the first entities of the two reasoning chains, whose information is then passed to their neighbors on the cognitive graph. Secondly, mentions of the same entity "Galt House" are detected by scope 2, serving as a bridge for propagating information across the two paragraphs. Thirdly, the two reasoning chains are linked together by the bridge entity "Galt House". Both reasoning chains and supporting facts are provided to users, which will assist them in obtaining what they want to retrieve promptly and efficiently. Finally, users can acquire "The Rand Paul presidential campaign 2016 event was held at a hotel which is on the Ohio River", nearly without any complex manual retrieval process.

**Figure 5.** A case study of the development set. The number on the left side shows the importance scores of the predicted sub-scope. The text on the right side includes queries, predicted top-1 reasoning chains, and supporting facts.

Figure 6 shows another multi-hop reasoning example of 2SCR-IR from the HotpotQA development set. In this case, we need to search for information about the actress who played *Corliss Archer* in *A Kiss for Corliss*. The information relevant to the search request may appear in multiple positions in the texts, but it has little lexical or semantic relationship to the original retrieval query. It is generally difficult for common search engine systems to know which entity mentions might eventually lead to texts containing what the user really needs. Our 2SCR-IR can provide users with a logically structured text. From the reasoning chain and three supporting facts, we can easily obtain the information: *The woman who portrayed Corliss Archer in the film A Kiss for Corliss held Chief of Protocol in the government*. As shown in Figure 7, our 2SCR-IR method has been successfully applied to a KLBNT Retrieval System. For complex problems, we can obtain the reasoning process and the logically structured documents.

**Figure 6.** Another case study in the development set. Several documents are given as supporting facts for the search query.

**Figure 7.** The KLBNT Retrieval System supported by 2SCR-IR technology.

#### **7. Conclusions**

We presented a new framework, the 2SCR-IR sensor, to tackle multi-hop retrieval problems on a large scale; it retrieves reasoning paths over the cognitive graph to provide users with useful explicit evidence chains. Our retriever model learns to sequentially retrieve evidence paragraphs to construct reasoning paths, which are subsequently re-ranked by the sensor, and the final information presented is the one extracted from the best reasoning path. Our retriever obtains state-of-the-art results on the HotpotQA dataset, which shows the efficiency of our framework. State-of-the-art performance on SQuAD is also achieved, demonstrating the robustness of our method. Besides, our analysis shows that 2SCR-IR can produce reliable and explainable reasoning chains. In the future, we may incorporate new advances in building cognitive graphs from the web context to solve more difficult reasoning problems.

**Author Contributions:** X.Z. and C.X. planned and supervised the whole project; Y.W. developed the main theory and wrote the manuscript; B.Y. and T.W. contributed to doing the experiments and discussing the results. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant numbers U1636208, 61862008, and 61902013.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Multi-View Visual Question Answering with Active Viewpoint Selection**

#### **Yue Qiu 1,2,\*, Yutaka Satoh 1,2, Ryota Suzuki 2, Kenji Iwata <sup>2</sup> and Hirokatsu Kataoka <sup>2</sup>**


Received: 31 March 2020; Accepted: 14 April 2020; Published: 17 April 2020

**Abstract:** This paper proposes a framework that allows the observation of a scene iteratively to answer a given question about the scene. Conventional visual question answering (VQA) methods are designed to answer given questions based on single-view images. However, in real-world applications, such as human–robot interaction (HRI), in which camera angles and occluded scenes must be considered, answering questions based on single-view images might be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to discuss the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that allows the observation of a scene actively until the necessary scene information to answer a given question is obtained. The proposed framework achieves comparable performance to a state-of-the-art method in question answering and simultaneously decreases the number of required observation viewpoints by a significant margin. Additionally, we found our framework plausibly learned to choose better viewpoints for answering questions, lowering the required number of camera movements. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework shows high accuracy (94.01%) for the unseen real image dataset.

**Keywords:** visual question answering; three-dimensional (3D) vision; reinforcement learning; deep learning; human–robot interaction

#### **1. Introduction**

Recent developments in deep neural networks have resulted in significant technological advancements and have broadened the applicability of human–robot interaction (HRI). Vision and language tasks, such as visual question answering (VQA) [1–6], and visual dialog [7,8], can be extremely useful in HRI applications. In a VQA system, the input is an image along with a text question that the system needs to answer by recognizing the image, interpreting the natural language in the question, and determining the relationships between them. VQA tasks play an essential role in various real-world applications. For example, in HRI applications, VQA can be used to connect robot perceptions with human operators through a question-answering process. In video surveillance systems, the question–answer process can serve as an interface to help avoid the manual checking of each video frame, significantly reducing labor costs.

Although VQA methods can be useful in real-world applications, there are numerous problems related to their implementation in real-world environments. For example, conventional VQA methods answer a given question based on a single image. However, in real-world environments, because it is challenging to take photographs continuously from optimal viewpoints, objects can be greatly occluded, and thus answering questions based on single-view images can be difficult. Considering that multi-view observations are possible in HRI applications, this study discusses VQA under multi-view settings. Qiu et al. [9] proposed a multi-view VQA framework that uses perimeter viewpoint observation for answering questions. However, perimeter viewpoints may be difficult to set up in real-world environments due to environmental constraints, making their method difficult to implement. In addition, observing each scene from perimeter viewpoints is relatively inefficient, especially for applications that require real-time processing. Moreover, the authors did not evaluate their method under real-world image settings.

Here, we propose a framework, shown in Figure 1, in which a scene is actively observed. Answers to questions are obtained based on previous observations of the scene. The overall framework consists of three modules, namely a scene representation network (SRN) that integrates multi-view images into a compact scene representation, a viewpoint selection network that selects the next observation viewpoint (or ends the observation) based on the input question and the previously observed scene, and a VQA network that predicts the answer based on the observed scene and question. We built a computer graphics (CG) multi-view VQA dataset with 12 viewpoints. For this dataset, the proposed framework achieved accuracy comparable to that of a state-of-the-art method [9] (97.11% (ours) vs. 97.37%) that answers questions based on the perimeter observation of scenes from fixed directions, while using an average of just 2.98 viewpoints, in contrast to the 12 viewpoints of the previous method. In addition, we found that our framework learned to choose viewpoints efficiently for answering questions, either by ending the observation once the observed scene contains all objects needed to answer the question, or by additionally observing the scene from viewpoints with large spacing to increase the accessible scene information, thereby lowering the camera movement cost. Furthermore, to evaluate the effectiveness of the proposed method in realistic settings, we also created a real image dataset. In experiments conducted with this dataset, the proposed framework outperformed the existing method by a significant margin (+11.39%).

**Figure 1.** Illustration of multi-view VQA with active viewpoint selection. Our framework selects the observation viewpoint (highlighted in red bounding boxes) to answer questions based on the observed scene information.

The contributions of our work are three-fold:

• we discuss the VQA task under a multi-view and interactive setting, which is more representative of a real-world environment compared to traditional single-view VQA settings. We also built a dataset for this purpose.


#### **2. Related Work**

#### *2.1. Visual Question Answering*

In recent years, a variety of VQA methods have been proposed. Holistic VQA methods, such as feature-wise linear modulation (FiLM) [10], combine image features with question features by concatenation [1], bilinear pooling [6], or the Hadamard product [3], and then predict answers from these combined multi-modal features.

A series of VQA datasets have been proposed. The VQA 2.0 dataset [2], one of the most popular datasets, contains real images sampled from the MS COCO dataset [11], and question–answer pairs collected using crowdsourcing (Amazon Mechanical Turk). Johnson et al. [12] pointed out that the VQA 2.0 dataset has high dataset distribution biases and proposed the Compositional Language and Elementary Visual Reasoning (CLEVR) [12] dataset, in which images and questions are generated by pre-defined programs. The CLEVR dataset has been widely adopted for evaluating visual reasoning ability. Although significant progress related to VQA tasks has been made, there is relatively little discussion about multi-view VQA. Qiu et al. [9] proposed a multi-view VQA framework. However, the authors used a limited perimeter viewpoint setting, and their study did not conduct real-world environment experiments. Here, we propose a framework with high efficiency in terms of the number of observation viewpoints and discuss its application in real-world image settings.

#### *2.2. Learned Scene Representation*

Traditional approaches based on three-dimensional (3D) convolutional neural networks (CNNs) learn a 3D representation from 3D data, such as point cloud data [13,14], mesh [15], and voxel data [16]. These methods can extract representations that contain the underlying 3D structure of the input data. However, due to the discrete nature of 3D data formats, such as point clouds, the resolution of the learned representation is limited. In addition, learning from direct 3D data usually requires a massive amount of training data, high memory cost, and a long execution time.

In contrast, a series of continuous 3D scene representations have recently been proposed. Generative query network [17] is a network based on a conditional variational autoencoder [18] that learns a meaningful 3D representation described by the parameters of a scene representation network (SRN) from a massive amount of 3D scene data annotated with only camera viewpoint information. Deep signed distance function [19] is an auto decoder-based structure that learns continuous signed distance functions to map 3D coordinates into the signed distance of a group of objects. Scene representation networks [20] proposed by Sitzmann et al. predict 3D range maps along with a scene representation of the input scene and thus exhibit robustness to unknown camera poses.

A learned continuous scene representation contains the underlying 3D structure and object information of the input scene and usually requires a relatively small amount of annotation signals and relatively short execution costs. Therefore, in this work, we use a continuous SRN to extract scene information and use the representation to answer questions.

#### *2.3. Deep Q-Learning Networks*

Q-learning [21] is a type of value-based reinforcement learning framework that holds a Q-table of the values of taking actions *a* in environment states *s*. However, for a large number of environment states *s* or continuous state spaces, the time and space consumption of Q-learning can become rather high. Instead of holding a Q-table in memory, deep Q-learning networks (DQNs) [22] use neural networks to predict action values from environment states *s*. Conventional DQN methods that involve image information learn policies from raw image inputs [23]. In the present work, instead of raw images, we use scene representations that contain the underlying 3D structure and object information of scenes, together with encoded question information, as input, and then adopt a DQN to select observation viewpoints based on the previously observed scene and question information.

#### *2.4. Embodied Question Answering*

The work most similar to ours is the embodied question answering (EQA) task defined by Das et al. [24]. The authors define an EQA task for which an agent is embodied within a 3D environment. The agent attempts to answer given questions by navigating and gathering needed image information using egocentric perceptions of the environment. The authors proposed an EQA baseline method that splits the EQA task into a recurrent neural network-based hierarchical planner-controller navigation policy for navigating to the target region in the environment and a VQA model that combines the weighted sum of the image features of five images and the question features to give a final answer. Yu et al. [25] proposed a multi-target EQA task, in which every question involves two targets. In [26], the authors extended the original CG EQA dataset setting to photorealistic environments and suggested the use of point cloud data.

Most EQA methods focus on the vision-language grounded navigation process and finding the shortest paths in a given environment. They usually use relatively simple VQA methods, such as in [24], where the authors simply concatenate five images into a VQA model for answering questions. In the above study, relatively little attention was given to the kind of information necessary to answer questions, and no evidence was given that the framework answers questions based on a 3D understanding of the given scene. In contrast, our work focuses on exploring necessary scene information to answer questions by selecting observation viewpoints. Therefore, our approach can avoid unnecessary observation and reduce camera movement costs. In addition, our framework is based on an integrated 3D understanding of the given scene observed from multiple viewpoints. Therefore, our framework can associate 3D information with viewpoints and determine the minimum necessary scene information required to answer questions via the selection of observation viewpoints.

#### **3. Approach**

In real-world HRI applications, it can be challenging to obtain photographs from perimeter viewpoints. In addition, it is efficient to end the observation when the input scene information is sufficient to answer the question. Based on the above, we propose a framework that actively observes the environment and decides when to answer the question based on previously observed scene information.

As shown in Figure 2, the proposed framework consists of three modules, namely an SRN, a viewpoint selection network, and a VQA network.

The inputs of the overall framework are the default viewpoint *v0*, the image *x0* of the scene observed from *v0*, and the question *q*. *v0* and *x0* are first processed by the SRN to obtain the original scene representation *s0*. The question is processed by an embedding layer and a two-layered long short-term memory (LSTM) [27] network to obtain *q*′. Then, the viewpoint selection network predicts the next observation viewpoint (or selects the end action) based on *s0* and *q*′. If viewpoint *vt* is chosen at time step *t*, the agent obtains the image *xt* from viewpoint *vt* of the scene (e.g., for a robot, a scene image from *vt* is taken). Next, the SRN updates the scene representation based on *xt* and *vt*. If the end action is chosen, the VQA network predicts an answer based on *q*′ and the integrated scene representation at that time. In the following sections, we discuss these three networks in greater detail.
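
The control flow of this loop can be summarized in the sketch below, where `observe`, `srn_update`, `select_action`, and `vqa_answer` are hypothetical stand-ins for the camera interface and the three trained modules, and the step cap is an assumption.

```
# Active observation loop: observe, update the scene, repeat until END.
def active_vqa(v0, q_enc, observe, srn_update, select_action, vqa_answer,
               max_steps: int = 12):
    # q_enc: encoded question q' (embedding + 2-layer LSTM in the paper)
    s = srn_update(None, observe(v0), v0)     # initial scene representation s_0
    for _ in range(max_steps):
        action = select_action(s, q_enc)      # next viewpoint index, or "END"
        if action == "END":
            break
        s = srn_update(s, observe(action), action)  # fuse view x_t from v_t
    return vqa_answer(s, q_enc)               # answer from the final scene
```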

**Figure 2.** Overall framework. The viewpoint selection network (multi-layer perceptron (MLP)-structured) chooses the observation viewpoint (or ends the observation) iteratively based on the scene *st* and question *q*′ (blue arrow flow). When the end action is chosen, the scene representation up to that time step *sT* and the encoded question *q*′ will be processed by the VQA network to provide an answer (orange arrow flow).

#### *3.1. Scene Representation*

We use the SRN proposed by Eslami et al. [17] to obtain integrated scene representations *st* from viewpoints {*v*0, *v*1, ..., *vt*} and images {*x*0, *x*1, ..., *xt*}. For scene *i* observed from *K* viewpoints, the observation *oi* is defined as follows:

$$o\_i = \{ (\mathbf{x}\_i^k, \mathbf{v}\_i^k) \}\_{k=0,\dots,K-1} \tag{1}$$

The scene representation $s = f\_{SRN}(o\_i)$ and the generator $g(x\_m | v\_m, s)$ are jointly trained for image rendering from an arbitrary viewpoint $m$ to maximize the likelihood between the predicted $x\_m$ and the ground-truth images. $f\_{SRN}$ integrates multi-view information into a compact scene representation. We use the above framework to train the SRN.

#### *3.2. Viewpoint Selection*

For VQA in real-world environments, it is necessary to choose an observation based on the question and previously observed visual information. For example, for the question "is there a red ball?", if a red ball has previously been observed, the question can be answered instantly. Additionally, for highly occluded scenes, it may be necessary to make observations from a variety of viewpoints. Therefore, we propose a DQN-based viewpoint selection network *fVS* to actively choose actions.

More specifically, assuming that the scene can be observed from *K* viewpoints, we define an action set *A* = {*ai*}*i*=0,...,*K*, where *a*0, ..., *aK*−<sup>1</sup> denote the viewpoint selection actions and *aK* represents the end observation action.

The viewpoint selection network *fVS* predicts a (*K*+1)-dimensional vector that represents the obtainable reward value of each action (after it is executed) from the input of the previously observed scene *st* and question *q*′. Assuming that the *j*-th action *at* is chosen at time step *t*, we denote the predicted reward value $r\_t^{\text{eval}}$ of action *at* under the environment state *st* as follows:

$$r\_t^{\text{eval}} = \{f\_{VS}(s\_t, q')\}\_j \tag{2}$$

The real reward value of action *at* can be formulated as Equation (3), where $r\_t^{\text{env}}$ denotes the reward obtained from the environment, and *γ* is the discount factor of future rewards:

$$r\_t^{\text{real}} = r\_t^{\text{env}} + \gamma \max\_{a\_{t+1}} f\_{VS}(a\_{t+1}|s\_{t+1}, q') \tag{3}$$

The overall objective of viewpoint selection is to minimize the distance between $r\_t^{\text{eval}}$ and $r\_t^{\text{real}}$. We designed the reward $r^{env}$ based on the correctness of the question answering and the number of selected viewpoints. For each newly added observation viewpoint or repeatedly chosen viewpoint, we assign the penalties $p\_{sp}$ and $p\_{rt}$ (hyperparameters), respectively. We denote the VQA loss as $loss\_{vqa}$ (normalized to [−1, 1]). We designed $r^{env}$ for the three action types as shown below.

$$r^{env} = \begin{cases} p\_{sp} + p\_{rt} - loss\_{vqa} & (i) \\ p\_{sp} + p\_{rt} & (ii) \\ -loss\_{vqa} & (iii) \end{cases} \tag{4}$$

For the final viewpoint selection action, the reward is (*i*); for the other viewpoint selection actions, the reward is (*ii*); for the end action, the reward is (*iii*).
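
The reward cases and the temporal-difference target of Equations (3) and (4) translate directly into code. The penalty values and γ follow Section 4.1 (–0.25, –0.05, and 0.8); the case handling simply mirrors the three branches above, and the example Q-values are illustrative.

```
# Reward (Eq. 4) and TD target (Eq. 3) for the viewpoint selection DQN.
import torch

def env_reward(action_type: str, loss_vqa: float,
               p_sp: float = -0.25, p_rt: float = -0.05) -> float:
    if action_type == "final_viewpoint":  # case (i)
        return p_sp + p_rt - loss_vqa
    if action_type == "viewpoint":        # case (ii)
        return p_sp + p_rt
    return -loss_vqa                      # case (iii): end action

def td_target(r_env: float, q_next: torch.Tensor, gamma: float = 0.8):
    # r_t^real = r_t^env + gamma * max_a' f_VS(a' | s_{t+1}, q')
    return r_env + gamma * q_next.max()

print(td_target(env_reward("viewpoint", 0.1), torch.tensor([0.2, 0.5, -0.1])))
```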

#### *3.3. Visual Question Answering*

VQA predicts an answer based on integrated scene information *s* and the processed question *q*′. We denote the VQA network to be trained as *fVQA*. The answer *ans* can be predicted by the following:

$$\text{ans} = \arg\max\_{\text{ans}} f\_{VQA}(\text{ans}|\mathbf{s}, q') \tag{5}$$

The network is optimized by minimizing the cross-entropy loss between the predicted *ans* and the ground truth answer. In this study, we used the state-of-the-art FiLM method [10] as the VQA network. However, it is noteworthy that the VQA framework can be arbitrary.

The proposed framework could not deal with ill-structured or non-English questions that are not included in the training dataset. However, the framework could be expanded for these situations by further integrating sentence structure checking or translation modules.

#### **4. Experiments**

We conducted experiments using settings with CG and real image datasets, respectively. In the following sections, we discuss the experimental details for these two settings.

#### *4.1. Implementation Details*

Our overall framework is composed of three modules, namely an SRN, a viewpoint selection network, and a VQA network. Here, we introduce the implementation details of these three modules. The same implementation setting is used for both the CG and real image experiments. We used PyTorch [28] as the implementation framework. We also show the detailed network structure in Figure 3.

**SRN:** we adopted a 12-layered tower-structured SRN proposed in [17] to extract scene representations. The SRN transforms the input of 64 × 64 × 3-dimensional images and 1 × 1 × 7-dimensional camera information (a vector representing camera position and rotation) into a 16 × 16 × 256-dimensional scene representation. We used the Adam optimizer [29] and an initial learning rate of 5 × 10<sup>−4</sup> for training. In all experiments, we trained the SRN network for 140 epochs.

**Viewpoint Selection Network:** to extract question information, we first adopted an embedding layer (torch.nn.Embedding defined in PyTorch), which transforms the one-hot vector of each word into a 300-dimensional vector. Next, we adopted a two-layered LSTM with hidden dimensions of 1024 to encode each question sentence into a 1024-dimensional vector. Alternative structures, such as the gated recurrent unit [30] or transformer [31], can also be used for question feature extraction. Next, we combined the encoded question with the scene representation by concatenation, which resulted in a 1024 + 16 × 16 × 256-dimensional feature. Next, we processed the concatenated feature via three fully connected layers to predict a (*K*+1)-dimensional action reward value vector (*K* indicates the viewpoint number). We adopted a tangent activation function before the final fully connected layer. We set the hyperparameter *γ* of the viewpoint selection network to 0.8 and used an ε-greedy policy to choose actions. The initial ε was set to 0.5 and multiplied by 1.04 every three epochs. The penalties for viewpoint action and repeated action were set to –0.25 and –0.05, respectively. This network was jointly trained with the VQA network.
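
Putting these details together, the viewpoint selection network might look like the following PyTorch sketch; the vocabulary size and the ReLU after the first fully connected layer are assumptions, while the layer widths, the tanh before the final layer, and the K + 1 outputs follow the text above.

```
# Viewpoint selection network: embedding -> 2-layer LSTM -> MLP -> K+1 values.
import torch
import torch.nn as nn

class ViewpointSelector(nn.Module):
    def __init__(self, vocab_size: int = 100, K: int = 12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, 1024, num_layers=2, batch_first=True)
        scene_dim = 16 * 16 * 256  # tower-structured SRN output (Sec. 4.1)
        self.mlp = nn.Sequential(
            nn.Linear(1024 + scene_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.Tanh(),   # tanh before the final layer
            nn.Linear(256, K + 1),             # K viewpoints + end action
        )

    def forward(self, question_ids, scene):
        # question_ids: (batch, words); scene: (batch, 256, 16, 16)
        _, (h, _) = self.lstm(self.embed(question_ids))
        feat = torch.cat([h[-1], scene.flatten(1)], dim=-1)
        return self.mlp(feat)  # predicted action reward values

q_values = ViewpointSelector()(torch.randint(0, 100, (2, 8)),
                               torch.randn(2, 256, 16, 16))
print(q_values.shape)  # torch.Size([2, 13])
```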

**Figure 3.** Detailed network structure. Our overall framework is composed of a tower-structured SRN [17] (shown in blue), a viewpoint selection network (shown in yellow), and a VQA framework (shown in orange), which is modified from FiLM [10]. The viewpoint selection network and VQA share the question feature extraction structure (shown in green). Based on the previous observation, the viewpoint selection network chooses the next observation or ends observation (red dotted arrow flow). When the end observation is chosen, the scene representation up to that time step and the question feature (red arrow flows) will be processed by the VQA framework to give an answer prediction.

**VQA:** we modified the FiLM implementation code [32] for our VQA module. We removed the image feature extraction component used in [32] and used the integrated scene representation as the input image feature. The question feature extraction component is shared with the viewpoint selection network during training. We jointly trained the modified FiLM network with the viewpoint selection network using the Adam optimizer with an initial learning rate of 3 × 10<sup>−4</sup> for 40 epochs in all experiments.
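
The joint optimization can be wired up as in the following hedged sketch, where placeholder linear modules stand in for the modified FiLM head and the viewpoint selector, and the reinforcement learning loss is omitted for brevity.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules; the real FiLM head and selector are described above.
vqa_net = nn.Linear(64, 28)        # e.g., 28 candidate answers (assumed)
selector = nn.Linear(64, 13)       # 12 viewpoints + "end observation"

# One Adam optimizer over both modules, lr = 3e-4, trained for 40 epochs.
optimizer = torch.optim.Adam(
    list(vqa_net.parameters()) + list(selector.parameters()), lr=3e-4)

for epoch in range(40):
    features = torch.randn(8, 64)              # stand-in joint features
    gt_answer = torch.randint(28, (8,))
    loss = F.cross_entropy(vqa_net(features), gt_answer)  # + selection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```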

#### *4.2. Experiments with CG Images*

**Dataset Setting:** we modified the CLEVR dataset generation program by placing multiple virtual cameras in each scene, thereby generating a multi-view CLEVR dataset (Multi-view-CLEVR\_12views\_CG). The camera setup, which is widely used in multi-view object recognition tasks (e.g., [33,34]), is shown in Figure 4 (left). The cameras are elevated 30 degrees above the scene and placed at 30-degree intervals around the scene center. We placed a relatively large object in the middle of each scene and three to five other objects around it. This setting was designed to make it difficult to observe all objects from a single viewpoint, thereby providing a simple testbed for viewpoint selection efficiency. We created the CG objects and set the illumination, lighting, scene backgrounds, and virtual camera positions using the 3D creation software Blender [35] and Python–Blender scripts based on the CLEVR dataset generation program [36]. We set two types of questions: exist (query the existence of an object) and query color (query an object's color). By considering the presence of relationship words (left of, right of, behind, in front of), the questions can be further separated into spatial and non-spatial types. Example images and questions for a scene from the Multi-view-CLEVR\_12views\_CG dataset are shown in Figure 4 (right), and detailed statistics of the dataset are given in Table 1.
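
The virtual camera layout described above can be computed as in the following sketch; the camera radius is an assumed value for illustration.

```
import math

def camera_positions(radius=7.0, elevation_deg=30.0, n_views=12):
    # 12 cameras at 30-degree azimuth intervals, elevated 30 degrees,
    # all looking at the scene center (assumed at the origin).
    positions = []
    elev = math.radians(elevation_deg)
    for i in range(n_views):
        azim = math.radians(i * 360.0 / n_views)   # 0, 30, 60, ... degrees
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)                # constant camera height
        positions.append((x, y, z))
    return positions

print(camera_positions()[:3])   # first three camera positions
```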

**Figure 4.** Virtual camera setup (left) and example images and questions from Multi-view-CLEVR\_12views\_**CG** dataset (right). The default view determines the spatial relationships of objects in each scene.


**Table 1.** Statistics for datasets used in this study.

**Quantitative Results:** the overall and per-question accuracies are shown in Table 2. Our framework (SRN\_FiLM\_VS) showed a slight drop in overall accuracy compared with SRN\_FiLM proposed in [9]. However, our framework used far fewer viewpoints on average (2.98 vs. 12). This indicates that the proposed framework learned to obtain the information required for answering questions while lowering the camera movement cost. This ability, which is especially crucial for applications with restricted observation viewpoints, makes our framework more efficient because it decreases the required number of camera (or robot) movements.

**Table 2.** Evaluation results for Multi-view-CLEVR\_12views\_**CG**.


**Qualitative Results:** the results for four example questions are shown in Figure 5. For Question 1, the queried object "tiny cube" appears in the default view, so our model ended viewpoint selection instantly and answered the question. For Questions 2 and 3, the queried objects do not appear in the default view, so our model selected an additional viewpoint for observation. These results show that our model learned how to make observations to obtain the necessary scene information for answering questions. For Question 4, our model used three additional viewpoints and gave an incorrect answer. In this case, the queried object was a purple cylinder. The shadow cast by the center object made it difficult for the model to recognize the color. It is believed that using higher-resolution input images could improve performance in such situations.

**Effect of Viewpoint Selection:** we conducted an experiment using randomly sampled viewpoints (SRN\_FiLM\_Random) and equally spaced viewpoints (SRN\_FiLM\_Equal), in which the viewpoints were sampled evenly from Multi-view-CLEVR\_12views\_CG. The results are shown in Table 3. We first evaluated the accuracy of answering questions based on a single randomly picked viewpoint (SRN\_FiLM\_Random (1 view)). This setup achieved an overall accuracy of 66.71%, which indicates that our dataset is challenging with single-view images. Moreover, because our viewpoint selection network predicts answers based on an average of 2.98 viewpoints, we sampled three viewpoints for a fair comparison. The results show that our viewpoint selection network greatly outperforms both the randomly and equally sampled settings, indicating that it uses viewpoint information more effectively by constructing, from partial observations of a scene, the scene representation needed to answer questions.
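
For clarity, the following sketch shows how the two baselines can pick viewpoints from the 12 available cameras, as we understand the setup; the function names are illustrative.

```
import random

ALL_VIEWS = list(range(12))   # the 12 camera indices

def sample_random(n=3):
    # SRN_FiLM_Random: n viewpoints drawn uniformly at random
    return random.sample(ALL_VIEWS, n)

def sample_equal(n=3):
    # SRN_FiLM_Equal: n evenly spaced viewpoints, e.g., cameras 0, 4, 8
    step = len(ALL_VIEWS) // n
    return ALL_VIEWS[::step][:n]
```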

**Figure 5.** Example results of SRN\_FiLM\_VS for the Multi-view-CLEVR\_12views\_**CG** dataset. The default view and selected view images are highlighted in red bounding boxes. The incorrect answers are shown in red.



**Runtime information:** the test process was conducted on a machine with an Intel(R) Core(TM) i9-7920X CPU @ 2.90 GHz and an Nvidia TITAN V graphics processing unit. The average runtime was 0.043 s per scene with a single GPU and a single process.

#### *4.3. Experiments with Real Images*

#### 4.3.1. Training on CG Images and Testing on the Real Images Dataset

Real-world environment applications, such as HRI applications, require the ability to sense and recognize real-world scenes. Thus, the ability to transfer the learned scene representation and understanding obtained in a simulated environment to a real-world scene is important for AI-based frameworks. To evaluate the transfer ability of our framework, we built a multi-view dataset with real images of objects similar to those in the CLEVR dataset. We then evaluated the performance of the model that learned in the simulated environment (multi-view CLEVR dataset) on the real image dataset.

**Dataset Setting:** to evaluate the model performance for real images, we built a dataset (Multi-view-CLEVR\_4views\_Real) consisting of photographed real images. Using a 3D printer, we printed and colored models with attributes similar to those of the objects described in the previous section. We then placed these models on a scene table in a laboratory with normal fluorescent lighting. We used a camera (Sony Alpha 7 III Mirrorless Single Lens Digital Camera [37]) to take photographs of scenes from four viewpoints. The camera was elevated 30 degrees above the table and placed at 90-degree intervals around it, at a distance of 1.1 m from the scene center.

An average of seven photographs was taken from each viewpoint, with the camera slightly moved and rotated between shots. In total, we collected 100 scene photographs, which were augmented to 500 scene photographs by randomly sampling one image per viewpoint. The question–answer pair settings were the same as those used in the previous section. The statistics are shown in Table 1. Several scene examples are shown in Figure 6 (middle).
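
The augmentation step can be sketched as follows, assuming a mapping from viewpoint index to the list of photographs taken at that viewpoint; the names are illustrative.

```
import random

def augment_scene(photos, n_augmented=5):
    # photos: dict mapping viewpoint index -> list of image file paths.
    # Draws one photo per viewpoint at random to assemble each augmented
    # scene; 100 scenes x 5 augmentations -> 500 scenes.
    scenes = []
    for _ in range(n_augmented):
        scenes.append({v: random.choice(imgs) for v, imgs in photos.items()})
    return scenes
```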

**Figure 6.** Example images from Multi-view-CLEVR\_4views\_**CG** (**left**), which is built entirely from simulated scenes; Multi-view-CLEVR\_4views\_**Real** (**middle**), which was created by photographing real table scenes composed of real object models; and Multi-view-CLEVR\_4views\_**3DMFP** (**right**), which was built by importing scanned real object models into simulated scenes and is intended for fine-tuning.

**Quantitative Results:** we first trained and tested the SRN\_FiLM and our framework on Multi-view-CLEVR\_4views\_CG (Figure 6 (left)) for 40 epochs (the statistics are shown in Table 1). The test split results are shown in Table 4. Our framework achieved performance comparable to that of SRN\_FiLM while using fewer viewpoints. We then evaluated the performance for the Multi-view-CLEVR\_4views\_Real, the results of which are shown in Table 5. Compared with the test results for CG images (Table 4), those for the real images dropped by nearly 30% for both methods without fine-tuning (first and fifth rows of Table 5).




**Table 4.** Evaluation results for Multi-view-CLEVR\_4views\_**CG**.

**Table 5.** Evaluation results for Multi-view-CLEVR\_4views\_**Real**.

#### 4.3.2. Fine-Tuning on the Semi-CG Dataset

This reduction in performance is likely caused by the domain gap between CG and real images, such as differences in color distribution, lighting, and object texture. To obtain high performance for real images with models that were trained directly on simulated images, a common approach is to fine-tune the models with real images. However, in our data setting, each scene is composed of objects with random attributes and locations and is observed from four viewpoints. Therefore, photographing a real scene under our dataset setting requires manually arranging objects and moving cameras, which is both labor- and time-consuming.

To solve the problem mentioned above, we introduce a method for creating a fine-tuning dataset by scanning real objects and importing the obtained models into a simulation environment. In detail, the dataset construction process comprises three steps. First, we placed the real objects used to build Multi-view-CLEVR\_4views\_Real in a 3D photography studio (3D MFP, Ortery [38]) for scanning. The 3D MFP photographs object models from multiple viewpoints. Next, we used the software 3DSOM Pro V5 [39] to generate textured 3D models by aligning and integrating the multi-view images obtained with the 3D MFP. Finally, we imported the obtained 3D object models into the CLEVR image generation process and created a dataset, denoted Multi-view-CLEVR\_4views\_3DMFP, using the same scene and camera setup as Multi-view-CLEVR\_4views\_Real. We used the Multi-view-CLEVR\_4views\_3DMFP dataset for fine-tuning (its statistics are shown in Table 1; example images are shown in Figure 6 (right)).

Next, we conducted ablation experiments on the fine-tuned network parts (freezing the VQA model, freezing the SRN model, and fine-tuning both models). The results are shown in Table 5. We found that fine-tuning the models improved their performance by a large margin, especially when both the SRN and VQA models were fine-tuned (up to +27.19%). After fine-tuning, the proposed model achieved an accuracy of 94.01% on the unseen real image dataset, outperforming SRN\_FiLM by 11.39% (fourth and eighth rows of Table 5). We attribute this performance increase to the proposed SRN\_FiLM\_VS model being less affected by noise in images from unnecessary viewpoints, which enables it to adapt faster to unfamiliar scenes; further discussion is left for future work. It is noteworthy that the fine-tuning dataset Multi-view-CLEVR\_4views\_3DMFP significantly improves the performance of both models. This result shows that introducing such a "semi-CG" dataset, consisting of scanned real object models placed in simulated environments, gives real-world applications the potential to improve the performance of models trained purely on simulation data.
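
The three ablation settings can be realized by freezing parameters in PyTorch, as in the following sketch; the module names are illustrative placeholders for the SRN and VQA parts.

```
def set_trainable(module, trainable):
    # Enable or disable gradient updates for all parameters of a module.
    for p in module.parameters():
        p.requires_grad = trainable

# (1) freeze the VQA model, fine-tune the SRN:
#     set_trainable(vqa_net, False); set_trainable(srn, True)
# (2) freeze the SRN, fine-tune the VQA model:
#     set_trainable(srn, False); set_trainable(vqa_net, True)
# (3) fine-tune both models:
#     set_trainable(srn, True); set_trainable(vqa_net, True)
```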

**Qualitative Results:** Here, we discuss the qualitative results shown in Figure 7. The SRN\_FiLM method answered questions based on all four viewpoints; in contrast, our framework selected the observation viewpoints (highlighted in red bounding boxes). For Questions 1, 2, and 3, the proposed model made reasonable viewpoint selections and answered the questions correctly. For Question 4, the spatial reasoning was difficult, and both models failed to give the correct answer. Narrowing the gap between the CG and real images might improve performance.

**Figure 7.** Example results for Multi-view-CLEVR\_4views\_**Real** dataset. The default view and selected view images are highlighted in red bounding boxes. Incorrect answers are shown in red.

#### **5. Conclusions**

In this study, we proposed a multi-view VQA framework that actively chooses observation viewpoints to answer questions. VQA, the task of answering a text question based on an image, is essential in various HRI systems. Existing VQA methods answer questions based on single-view images. However, in real-world applications, single-view image information can be insufficient for answering questions, and observation viewpoints are usually limited. Moreover, the ability to observe scenes efficiently from optimal viewpoints is crucial for real-world applications with time restrictions. To resolve these issues, we proposed a framework that makes iterative observations under a multi-view VQA setting. The proposed framework iteratively selects additional observation viewpoints and updates the scene information until it is sufficient to answer the question. The proposed framework achieves performance comparable to that of a state-of-the-art method on a VQA dataset with CG images while greatly reducing the number of observation viewpoints (from 12 to 2.98 on average). In addition, the proposed method outperforms the existing method by a significant margin (+11.39% in overall accuracy) on a dataset with real images, which is closer to the real-world setting and makes our method more promising for real-world applications. However, directly applying a model trained on CG images to real images results in a performance gap (a drop of 30.82% in accuracy). In the future, we will consider various methods to narrow this gap.

**Author Contributions:** Y.Q. has proposed and implemented the approaches and conducted the experiments. She also wrote the paper, together with Y.S., R.S., K.I., and H.K. All authors have read and approved the final version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to acknowledge the assistance and comments of Hikaru Ishitsuka and Tomomi Satoh.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

CG Computer graphics
HRI Human–robot interaction
LSTM Long short-term memory
SRN Scene representation network
VQA Visual question answering

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
