**Facial Expressions Recognition for Human–Robot Interaction Using Deep Convolutional Neural Networks with Rectified Adam Optimizer**

#### **Daniel Octavian Melinte and Luige Vladareanu \***

Department of Robotics and Mechatronics, Romanian Academy Institute of Solid Mechanics, 010141 Bucharest, Romania; octavian.melinte@imsar.ro

**\*** Correspondence: luige.vladareanu@vipro.edu.ro

Received: 16 March 2020; Accepted: 21 April 2020; Published: 23 April 2020

**Abstract:** This paper presents the interaction between humans and a NAO robot using deep convolutional neural networks (CNN), based on an innovative end-to-end pipeline that serializes two optimized CNNs, one for face recognition (FR) and one for facial expression recognition (FER), in order to obtain real-time inference speed for the entire process. Two different models are considered for FR: one known to be very accurate but with low inference speed (faster region-based convolutional neural network), and one less accurate but with high inference speed (single shot detector convolutional neural network). For emotion recognition, transfer learning and fine-tuning of three CNN models (VGG, Inception V3 and ResNet) have been used. The overall results show that the single shot detector convolutional neural network (SSD CNN) and faster region-based convolutional neural network (Faster R-CNN) models for face detection share almost the same accuracy: 97.8% for Faster R-CNN on PASCAL visual object classes (PASCAL VOC) evaluation metrics and 97.42% for SSD Inception. In terms of FER, ResNet obtained the highest training accuracy (90.14%), while the visual geometry group (VGG) network reached 87% and Inception V3 reached 81%. The results show improvements of over 10% when using two serialized CNNs instead of the FER CNN alone, while the recent optimization method, called rectified adaptive moment optimization (RAdam), led to better generalization and an accuracy improvement of 3–4% on each emotion recognition CNN.

**Keywords:** computer vision; deep learning; convolutional neural networks; advanced intelligent control; facial emotion recognition; face recognition; NAO robot

#### **1. Introduction**

Humans use facial expressions to show their emotional states. In order to achieve accurate communication between humans and robots, the robot needs to understand the facial expression of the person it is interacting with. The aim of this paper is to develop an end-to-end pipeline for the interaction between a human and the NAO robot using computer vision based on deep convolutional neural networks. The paper focuses on enhancing the performance of different types of convolutional neural networks (CNN), in terms of accuracy, generalization and inference speed, using several optimization methods (including the state-of-the-art rectified Adam), FER2013 database augmentation with images from other databases, and asynchronous threading at inference time using the Neural Compute Stick 2 coprocessor, in order to develop a straightforward pipeline for emotion recognition in robot applications, mainly the NAO robot. The pipeline first localizes and aligns one face from the input image using a CNN face detector, which acts as a preprocessing tool for emotion recognition. The proposal from the face recognition (FR) CNN is then used as input for the second CNN, which is responsible for the actual facial emotion recognition (FER CNN). The performance of this serialized-CNN model, hereafter referred to as NAO-SCNN, depends on many factors, such as the number of images in the training dataset, data augmentation, the CNN architecture, the loss function, hyperparameter adjustment, transfer learning, fine-tuning and evaluation metrics, which leads to a complex set of actions required to develop an accurate, real-time pipeline. The interaction between human and robot can be regarded as a closed loop: the images captured by the robot are first preprocessed by the FR CNN module; the output of this module is a set of aligned face proposals, which are transferred to the FER CNN, the second module, responsible for emotion recognition.
Seven emotions are considered: happiness, surprise, neutral, sadness, fear, anger and disgust. Based on the emotion detected from the images/videos, the robot will adjust its behavior according to a set of instructions drawn up using medical feedback from psychologists, physicians, educators and therapists, in order to reduce frustration and anxiety through communication with a humanoid robot. If there is no change in the user's emotion, the robot will perform another set of actions. Since, at this time, the research is focused on the artificial intelligence solution for emotion recognition, the set of actions for the robot interaction is basic (the robot displays the user's emotion using video, audio and posture feedback) and the conclusions will regard this matter. Based on the results of this study, future social studies regarding the behavior of the persons the robot interacts with will be developed. Expression recognition has previously been carried out on the NAO robot using a Viola–Jones based CNN trained on AffectNet in order to detect facial expressions in children, with a test accuracy of 44.88% [1]. Additionally, a dynamic Bayesian mixture model classifier has been used for FER human interaction with NAO, achieving an overall accuracy of 85% on the Karolinska Directed Emotional Faces (KDEF) dataset [2].
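The closed loop described above can be sketched as a two-stage serialized pipeline. In the following minimal sketch, `detect_faces` and `classify_emotion` are hypothetical stand-ins for the FR and FER CNN inference calls, not the actual models:

```python
# Minimal sketch of the serialized FR -> FER pipeline (stand-in functions only).
EMOTIONS = ["happiness", "surprise", "neutral", "sadness", "fear", "anger", "disgust"]

def detect_faces(image):
    # FR CNN stand-in: returns a list of aligned face crops (here, the whole image).
    return [image]

def classify_emotion(face):
    # FER CNN stand-in: returns a (label, probability) pair.
    return ("neutral", 1.0)

def recognize(image):
    """Run the two serialized CNNs: face detection first, then emotion classification."""
    results = []
    for face in detect_faces(image):          # stage 1: FR CNN proposals
        label, prob = classify_emotion(face)  # stage 2: FER CNN on each proposal
        results.append((label, prob))
    return results
```

In the real pipeline, the robot's behavior module would consume the `(label, prob)` pairs to select its video, audio and posture feedback.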

In terms of facial emotion recognition, recent studies have focused on achieving neural models with high accuracy using deep learning. For static images, different approaches have been developed, involving pretraining and fine-tuning, adding auxiliary blocks and layers, multitask networks, cascaded networks and generative adversarial networks. In [3], a multitask network was developed in order to predict the interpersonal relation between individuals. It aims at high-level interpersonal relation traits, such as friendliness, warmth and dominance, for faces that coexist in an image. For dynamic image sequences there are other techniques: frame aggregation, expression intensity networks and deep spatiotemporal FER networks. Zhao et al. proposed a novel peak-piloted deep network that uses a sample with peak expression (easy sample) to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject. The expression evolving process from non-peak to peak expression can thus be implicitly embedded in the network to achieve invariance to expression intensities [4]. Fine-tuning works for networks pretrained on datasets that include face or human images, but the face information remains dominant: the inference will focus on detecting faces rather than detecting expressions. H. Ding et al. proposed a new FER model that deals with this drawback. The method implies two learning stages: in the first training stage the emotion net convolutional layers are unfrozen, while the face net layers are frozen and provide supervision for the emotion net; the fully connected layers are trained in the second stage with expression information [5]. In [6], a two-stage fine-tuning is presented: a CNN pretrained on the ImageNet database is fine-tuned on datasets relevant to facial expressions (FER2013), followed by fine-tuning on EmotiW.
To avoid variations introduced by personal attributes, a novel identity-aware convolutional neural network (IACNN) is proposed in [7]. An expression-sensitive contrastive loss was also developed in order to measure the expression similarity. Other studies research further aspects of emotion recognition, such as annotation errors and bias, which are inevitable among different facial expression datasets due to the subjectivity of annotating facial expressions. An inconsistent pseudo annotations to latent truth (IPA2LT) framework has been developed by Jiabei Zeng et al. to train a FER model from multiple inconsistently labeled datasets and large-scale unlabeled data [8]. A novel transformation of image intensities to 3D spaces was designed to simplify the problem domain by removing image illumination variations [9]. In [10], a novel deep CNN for automatically recognizing facial expressions is presented with state-of-the-art results, while in [11] a new feature loss is developed in order to distinguish between similar features.

There are different types of databases for FER such as Japanese Female Facial Expressions (JAFFE) [12], Extended Cohn–Kanade (CK+) [13], FER 2013 [14], AffectNet [15], MMI (MMI, 2017) [16,17], AFEW [18] and Karolinska Directed Emotional Faces (KDEF) [19].

CK+ is the most used dataset for emotion detection. The performance of CNN models trained on this dataset is greater than 95%, due to the fact that the images are captured in a controlled environment (lab) and the emotions are overacted. The best performance was reached by Yang et al., with 97.3% [20]. The dataset contains 593 video sequences from 123 subjects. The sequences vary from 10 to 60 frames and display a transition from neutral to one of seven emotions: anger, contempt, disgust, fear, happiness, sadness and surprise. Not all subjects provide a video sequence for each emotion. The images in the dataset are not divided into training, validation and testing sets, so it is not possible to perform a standardized evaluation.

JAFFE is a laboratory-controlled dataset with fewer images than CK+. CNN models tested on this database have performances over 90%, with Hamester et al. reaching 95.8% [21]. KDEF is also a laboratory-controlled database, which focuses on emotion recognition from five different camera angles. FER2013 is the second most widely used dataset after CK+. The difference is that the FER2013 images are taken from the web and are not laboratory-controlled; the expressions are not exaggerated and are thus harder to recognize. There are 28,709 training images, 3589 validation images and 3589 test images belonging to seven classes. The performances using this dataset do not exceed 75.2%. This accuracy was achieved by Pramerdorfer and Kampel [22], while other researchers reached 71.2% using linear support vector machines [23], 73.73% by fusing aligned and non-aligned face information for automatic affect recognition in the wild [24], 70.66% using an end-to-end deep learning framework based on an attentional convolutional network [25], 71.91% using three subnetworks with different depths [26] and 73.4% using a hybrid CNN–scale invariant feature transform aggregator [27].

The research carried out in this paper aims to improve the accuracy of FER2013-based CNN models by adding laboratory-controlled images from CK+, JAFFE and KDEF. An emotion compilation that satisfies these requirements is available on Kaggle [28]. This dataset is divided into 24,336 training, 6957 validation and 3479 test images. The FER2013 images make up the biggest part of the dataset, around 90% for each emotion.

When it comes to face detection, backbone networks such as VGGNet [29], Inception by Google [30] or ResNet [31] play an important role. In [32], the DeepFace architecture, a nine-layer CNN with several locally connected layers, is proposed for face detection. The authors reported an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset. The FaceNet model uses an Inception CNN as the backbone network in order to obtain an accuracy of 99.63% on the same widely used LFW dataset [33]. Another widely used FR network is VGGFace, which uses a VGGNet base model trained on data collected from the Internet. The network is then fine-tuned with a triplet loss function similar to FaceNet and obtains an accuracy of 98.95% [34]. In 2017, SphereFace [35] used a 64-layer ResNet architecture and proposed an angular softmax loss that enables CNNs to learn angularly discriminative features, with an accuracy of 99.42%.

In terms of human–robot interaction, research is concentrated on different levels and implies, among others, the grasping configurations of robot dexterous hands using Dezert–Smarandache theory (DSmT) decision-making algorithms [36], developing advanced intelligent control systems for the upper limb [37–41] and other artificial intelligence techniques such as neutrosophic logic [36,41,42], extenics control [43,44] and fuzzy dynamic modeling [45–47], applicable to human–robot interaction with feedback through facial expression recognition. The research in this paper is focused on developing an AI-based computer vision system to achieve human–NAO robot communication, with future research aiming to develop an end-to-end pipeline system with advanced intelligent control using feedback from the interaction between a human and the NAO robot.

#### **2. Methods**

Direct training of deep networks on relatively small facial datasets is prone to overfitting, as public databases are no larger than 100k images. To compensate for this, many studies used data preprocessing or augmentation and additional task-oriented data to pretrain their self-built networks from scratch or to fine-tune well-known pretrained models.

In order to choose the appropriate solution for human–NAO robot face detection, two different types of object detectors were taken into account: the single shot detector (SSD) and the regional proposal network (RPN). These two architectures behave differently in practice. The SSD architecture is a small CNN (around 3 million parameters) with good accuracy for devices with limited computational power, and its inference time is very small compared to other, larger object detectors. The faster region-based convolutional neural network (Faster R-CNN), instead, has high accuracy and better results on small objects, but its inference time is longer than that of SSD models, making it less suitable for real-time applications on low computational power.

The FR architecture consists of a base model (Inception for SSD architecture and ResNet + Inception for Faster R-CNN) that is extended to generate multiscale or regional proposals feature maps. These models were pretrained on different databases: the SSD+InceptionV2 was trained on the common objects in context (COCO) dataset, while the Faster R-CNN was pretrained on a much larger database, the Google OpenImage database.

#### *2.1. Face Detection CNN*

Since human faces share a similar shape and texture, a representation learned from a representative proportion of faces can generalize well to detect others not used in the network training process. The performance of the trained model depends on many factors, such as the number of images in the training dataset, data augmentation, the CNN architecture, the loss function, hyperparameter adjustment, transfer learning, fine-tuning and evaluation metrics, which leads to a complex set of actions required to develop the entire pipeline.

The face detector was used to localize faces in images and to align them to normalized coordinates afterwards. The CNN architecture is made up of a backbone network, a localization CNN and fully connected layers for classification. The backbone networks used for facial recognition were Inception V2 for the SSD architecture and ResNet-Inception V2 for Faster R-CNN, respectively. In terms of the localization network, there were two types of architecture: SSD and Faster R-CNN. The SSD architecture adds auxiliary CNNs after the backbone network, while Faster R-CNN uses a regional proposal network (RPN) for proposing regions in the images, which are further sent to the backbone CNN. The classification layers of both face detectors were at the top of the architecture and used the softmax loss function together with L2 normalization in order to adjust the localization of the face region and to classify the image.

#### 2.1.1. Inception Based SSD Model

The SSD architecture uses an Inception V2 pretrained model as a backbone. The Inception model was designed to handle high variation in the location of information, and is thus useful for localization and object detection when added as the base network for SSD and RPN architectures. Different kernels with multiple sizes (7 × 7, 5 × 5 and 3 × 3) were used: larger kernels look for information that is distributed more globally, while smaller ones search for information that is distributed more locally. One other important filter used in this type of CNN is the 1 × 1 convolution, for reducing or increasing the number of feature maps. The network is 22 layers deep when counting only layers with parameters (or 27 layers including pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The SSD architecture adds an auxiliary structure to the network to produce multiscale feature maps and convolutional predictors for detection. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes [48].
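The channel-reducing role of the 1 × 1 convolution mentioned above can be illustrated in a few lines of numpy. A 1 × 1 convolution is simply a per-pixel linear map across channels; the sketch below (an illustration, not the actual model code) reduces a feature map from 64 to 16 channels:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is an (H, W, C_in) feature map, w a (C_in, C_out)
    kernel; the matmul over the channel axis yields an (H, W, C_out) map."""
    return x @ w

x = np.random.rand(28, 28, 64)   # input feature map with 64 channels
w = np.random.rand(64, 16)       # 1x1 kernel weights: 64 -> 16 channels
y = conv1x1(x, w)                # reduced feature map, shape (28, 28, 16)
```

Placed before a 5 × 5 or 3 × 3 convolution, this reduction cuts the cost of the larger kernel by the same channel ratio, which is why Inception uses it so heavily.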

#### 2.1.2. Inception-ResNetV2 Based Faster R-CNN Model

Face detection and alignment were tested with a second object detector, Faster R-CNN with Inception and ResNet as the backbone CNN. Faster R-CNN networks use a regional proposal method to detect the class and localization of objects. The architecture is made up of two modules: the regional proposal network (RPN) and the detection network. The RPN is a CNN with anchors and multiple region proposals at each anchor location, outputting a set of bounding box proposals with detection scores. The detection network is a Fast R-CNN network that uses the region proposals from the RPN to search for objects in those regions of interest (ROI). ROI pooling is performed and the resulting feature maps pass through the CNN and two fully connected layers for classification (softmax) and bounding box regression. The RPN and the detection network share a common set of convolutional layers in order to reduce computation [49].

By comparison, it is well established that Faster R-CNN performs much better when detecting small objects, but shares roughly the same accuracy as the SSD network when detecting large objects. Due to the complex architecture of Faster R-CNN, its inference time is three times higher than that of the SSD architecture.

For *face detection*, the backbone of the Faster R-CNN is an Inception+ResNetV2 network with an architecture similar to Inception V4. The overall diagram comprises a block of five convolutions in the first layers, five Inception-ResNetV2-typeA, ten Inception-ResNetV2-typeB and five Inception-ResNetV2-typeC blocks. The type A and type B Inception-ResNet layers were followed by a reduction in feature size, while the type C Inception-ResNet layer was followed by average pooling. The average pooling was performed before the fully connected (FC) layers, and the dropout was chosen to be 0.8.

#### 2.1.3. Pipeline for Deep Face Detection CNN

Transfer learning and fine-tuning were used for training the SSD and Faster R-CNN models pretrained on the COCO and Open Image databases. The input for the fine-tuned training consisted of 2800 face images randomly selected from the Open Image V4 dataset [50,51]. The dataset was split into three categories: 2000 images for training, 400 for validation and 400 for testing.

The hyperparameters for training each network are described below. The values were obtained experimentally and represent the best configuration in terms of accuracy, generalization and inference speed obtained after training each model. Due to the limitations of the graphics processing unit used for training, the number of operations was reduced for each training epoch, which implied a low number of images per batch, especially for the Faster R-CNN model.

For fine-tuning the SSD architecture, the hyper-parameters were set as follows:


For fine-tuning the Faster R-CNN architecture, the hyperparameters were set as follows:


#### *2.2. Facial Emotion Recognition CNN*

The database used for training the FER CNN is a selection of uncontrolled images from FER2013 and laboratory-controlled images from CK+, JAFFE and KDEF [28]. This dataset was divided into 24,336 training, 6957 validation and 3479 test images. The labeling of the training and testing datasets was previously made by the authors of the database and, in addition, was verified by our team in order to avoid bias. A small number of images that did not meet our criteria in terms of class annotations and distinctiveness were dropped or moved to the corresponding class. In addition, data augmentation was used during training in order to increase the generalization of the model: a series of rotations, zooming, width and height shifting, shearing, horizontal flipping and filling were applied to the training dataset. The FER2013 images represent the majority of the dataset, around 90% for each emotion.
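Two of the augmentations listed above (horizontal flipping and width/height shifting with zero filling) can be sketched in plain numpy; this is an illustration of the idea, not the authors' training code, and rotation, zoom and shear would follow the same pattern with an affine warp:

```python
import numpy as np

def hflip(img):
    """Horizontal flip: mirror the image along its width axis."""
    return img[:, ::-1]

def shift(img, dx, dy):
    """Shift the image by (dx, dy) pixels, filling the exposed border with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

face = np.arange(48 * 48, dtype=np.float32).reshape(48, 48)  # FER-sized grayscale image
augmented = [hflip(face), shift(face, 3, 0), shift(face, 0, -2)]
```

Applying such random transforms every epoch effectively enlarges the training set and improves generalization, at no storage cost.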

The facial expressions were divided into seven classes: angry, disgust, fear, happy, neutral, sad and surprise. The training class distribution over the dataset was: angry 10%, disgust 2%, fear 3%, happy 26%, neutral 35%, sad 11% and surprise 13%. The validation and test sets followed the same distribution.

The CNN models used for FER were pretrained on the ImageNet database and could recognize objects from 1000 classes. We did not need an SSD or RPN architecture, as face localization was already achieved by the face detection CNN. ImageNet does not provide a class related to faces or humans, but there are some other classes (e.g., t-shirt or bowtie) that helped the network extract these kinds of features during the prelearning process; the network needs such features in order to classify emotions. Taking into account that only some bottleneck layers would be trained, we used transfer learning and fine-tuning as follows: in the first phase, the fully connected layers of the pretrained model were replaced with two new randomly initialized FC layers able to classify the input images according to our dataset and classes. During this warm-up training, all the convolutional layers were frozen, allowing the gradient to back-propagate only through the new FC layers. In the second stage, the last layers of the convolutional network, where high-level representations are learned, were unfrozen, allowing the gradient to back-propagate through these layers, but with a very small learning rate in order to allow only small changes to the weights. The representation of the three CNNs (VGG, ResNet and Inception V3) and the transfer learning approach is shown in Figure 1.
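The two-stage schedule above (frozen base with a trainable head, then unfreezing with a tiny learning rate) can be demonstrated on a toy two-layer linear model. This is a didactic numpy sketch of the gradient flow, not the actual framework code; `w1` stands in for the convolutional base and `w2` for the new FC head:

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 1))
x, y = rng.standard_normal((8, 4)), rng.standard_normal((8, 1))

def step(w1, w2, lr1, lr2):
    """One gradient step on the squared error; lr1 = 0 freezes the base."""
    h = x @ w1
    err = h @ w2 - y
    g2 = h.T @ err / len(x)           # gradient w.r.t. the head
    g1 = x.T @ (err @ w2.T) / len(x)  # gradient w.r.t. the base
    return w1 - lr1 * g1, w2 - lr2 * g2

# Stage 1 (warm-up): base frozen, only the randomly initialized head learns.
w1_s1, w2_s1 = step(w1, w2, lr1=0.0, lr2=0.1)
# Stage 2 (fine-tuning): base unfrozen with a much smaller learning rate.
w1_s2, w2_s2 = step(w1_s1, w2_s1, lr1=1e-4, lr2=0.01)
```

Setting the base learning rate to zero is exactly what "freezing" a layer does: the gradient still flows through it to earlier computations, but its weights are not updated.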

**Figure 1.** Architectures of the models used for facial emotion recognition (FER): (**a**) VGG; (**b**) ResNet50 and (**c**) Inception V3.

#### 2.2.1. VGG16

In order to avoid overfitting, several techniques were taken into account, such as shuffling data during training and adopting dropout. The images were shuffled during training to reduce variance and to make sure that the batches were representative of the overall dataset. On the other hand, using dropout in the fully connected layers reduced the risk of overfitting by improving generalization; this type of regularization reduces the effective size of the network by "dropping" a number of different neurons at each iteration. The CNN was made of five convolutional blocks (16 or 19 layers), each followed by feature map reduction using max-pooling. The bottleneck layer output was passed through average pooling in order to flatten the feature map to 1 × 1 × 4096, and the representation was forwarded to the FC network. The FC network had one dense layer of 512 neurons with a rectified linear unit (ReLU) activation function and dropout of 0.7, followed by a softmax classifier.
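A forward pass of the classification head just described (dense 512 with ReLU, dropout with rate 0.7, softmax over the seven emotions) can be sketched in numpy; this is an illustration of the layer arithmetic under assumed random weights, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)

def head_forward(features, w1, w2, drop_rate=0.7, training=True):
    h = np.maximum(features @ w1, 0.0)           # dense 512 + ReLU
    if training:
        mask = rng.random(h.shape) >= drop_rate  # randomly drop 70% of units
        h = h * mask / (1.0 - drop_rate)         # inverted-dropout rescaling
    logits = h @ w2                              # dense 7 (one unit per emotion)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax probabilities

w1 = rng.standard_normal((4096, 512)) * 0.01     # flattened 1x1x4096 -> 512
w2 = rng.standard_normal((512, 7)) * 0.01        # 512 -> 7 emotion classes
probs = head_forward(rng.standard_normal((1, 4096)), w1, w2)
```

At inference time `training=False` disables the mask; the inverted-dropout scaling during training keeps the expected activation magnitude unchanged between the two modes.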

For fine-tuning the VGG model, the hyperparameter values were obtained experimentally, representing the best configuration in terms of accuracy, generalization and inference speed after training, and were set as follows:


#### 2.2.2. ResNet

ResNet is a deep convolutional network that uses identity convolutional blocks in order to overcome the problem of vanishing gradients (Figure 2). The gradient may become extremely small as it is back-propagated through a deep network. The identity block uses shortcut connections, which provide an alternate path for the gradient to flow through, thus mitigating the vanishing gradient problem. We used ResNet with 50 convolutions, divided into five stages. Each stage had a convolutional and an identity block, while each block had three convolutions, with 1 × 1, 3 × 3 and 1 × 1 filters, where the 1 × 1 kernels (filters) were responsible for reducing and then increasing (restoring) dimensions. The parameter-free identity shortcuts are particularly important for the bottleneck architectures: if the identity shortcut is replaced with a projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. Thereby, identity shortcuts lead to more efficient models for the bottleneck designs [31].
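The bottleneck identity block just described (1 × 1 reduce, 3 × 3, 1 × 1 restore, plus the parameter-free shortcut) can be sketched as a naive numpy forward pass; the channel sizes below are illustrative, not the actual ResNet-50 widths:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):                      # (H, W, C_in) @ (C_in, C_out)
    return x @ w

def conv3x3_same(x, w):
    """Naive 3x3 'same' convolution; w has shape (3, 3, C_in, C_out)."""
    h, wd, _ = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += pad[i:i + h, j:j + wd] @ w[i, j]
    return out

def identity_block(x, w_reduce, w_mid, w_restore):
    f = relu(conv1x1(x, w_reduce))       # 1x1: reduce channels
    f = relu(conv3x3_same(f, w_mid))     # 3x3 on the reduced representation
    f = conv1x1(f, w_restore)            # 1x1: restore channels
    return relu(f + x)                   # parameter-free identity shortcut

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 8, 64))
y = identity_block(x,
                   rng.standard_normal((64, 16)) * 0.1,
                   rng.standard_normal((3, 3, 16, 16)) * 0.1,
                   rng.standard_normal((16, 64)) * 0.1)
```

The `f + x` line is the key: during back-propagation the gradient reaches `x` directly through the addition, bypassing the three convolutions entirely.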

Fine-tuning of ResNet architecture, presented at the bottom of Figure 2, implied the following hyperparameter configuration.


The values were obtained experimentally and represent the best configuration in terms of accuracy, generalization and inference speed after training each model.

**Figure 2.** ResNet model using transfer learning and fine-tuning.

#### 2.2.3. InceptionV3

Inception V3, presented in Figure 3, is a deep neural network with 42 layers, which reduces representational bottlenecks. It is composed of five stem convolutional layers, three type A Inception blocks followed by a type A reduction block, four type B Inception blocks and one reduction block, and two type C Inception blocks followed by an average pooling layer and the fully connected network. In order to reduce the size of a deep neural network, factorization was taken into account: different factorization modules were introduced in the convolutional layers to reduce the size of the model in order to avoid overfitting. Neural networks perform better when convolutions do not change the size of the input drastically, as reducing the dimensions too much causes loss of information. One factorization implied splitting 5 × 5 convolutions into two 3 × 3 convolutions (type A Inception block). In addition, factorization of an n × n filter into a combination of 1 × n and n × 1 asymmetric convolutions (type B Inception block) was found to dramatically reduce the computational cost. In practice, it was found that employing this factorization does not work well on early layers, but it gives very good results on medium grid sizes [52]. The last factorization taken into account addressed high-dimensional representations, by replacing the 3 × 3 convolutions with asymmetric convolutions of 1 × 3 and 3 × 1.
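The savings from these factorizations are easy to quantify by counting kernel parameters. The short calculation below (bias-free, with illustrative channel counts) compares one 5 × 5 kernel against two stacked 3 × 3 kernels, and one 7 × 7 kernel against a 1 × 7 plus 7 × 1 asymmetric pair:

```python
# Parameter counts behind the Inception factorizations (illustrative channel
# counts; biases omitted).
c_in, c_out = 192, 192

params_5x5     = 5 * 5 * c_in * c_out             # single 5x5 kernel
params_two_3x3 = 2 * (3 * 3 * c_in * c_out)       # two stacked 3x3: same
                                                  # receptive field, 28% fewer
params_7x7     = 7 * 7 * c_in * c_out             # single 7x7 kernel
params_1x7_7x1 = (1 * 7 + 7 * 1) * c_in * c_out   # asymmetric pair: ~71% fewer
```

The ratios (18/25 and 14/49) are independent of the channel counts, so the relative savings hold at any layer width.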

**Figure 3.** Inception model using transfer learning and fine-tuning.

For fine-tuning the Inception V3 model, the hyperparameter configuration was as follows:


The values were obtained experimentally and represent the best configuration in terms of accuracy, generalization and inference speed after training each model.

#### *2.3. Optimization Using Rectified Adam and Batch Normalization*

There are several methods that accelerate deep learning model optimization by applying adaptive learning rate, such as the adaptive gradient algorithm (Adagrad), Adadelta, Adamax, root mean square propagation (RMSprop), adaptive moment optimization (Adam) or Nesterov adaptive moment optimization (Nadam). Rectified Adam is a state-of-the-art version of the Adam optimizer, developed by [53], which improves generalization and introduces a term to rectify the variance of the adaptive learning rate, by applying warm up with a low initial learning rate.

Computing the weights according to the Adam optimizer:

$$w\_t = w\_{t-1} - \eta \frac{\hat{m}\_t}{\sqrt{\hat{\upsilon}\_t} + \varepsilon} \tag{1}$$

The first moving momentum:

$$m\_t = (1 - \beta\_1) \sum\_{i=0}^t \beta\_1^{t-i} \mathbf{g}\_i \tag{2}$$

The second moving momentum:

$$\upsilon\_t = (1 - \beta\_2) \sum\_{i=0}^t \beta\_2^{t-i} g\_i^{\;2} \tag{3}$$

The bias correction of the momentums:

$$\hat{m}\_t = \frac{m\_t}{1 - \beta\_1^t} \tag{4}$$

$$\hat{\upsilon}\_t = \frac{\upsilon\_t}{1 - \beta\_2^t} \tag{5}$$

Adding the rectification term in Equation (1), the recent variant of Adam optimization, named rectified Adam (RAdam), has the form:

$$w\_t = w\_{t-1} - \eta r\_t \frac{\hat{m}\_t}{\sqrt{\hat{\upsilon}\_t}} \tag{6}$$

where the step size, η, is an adjustable hyperparameter and rectification rate is:

$$r\_t = \sqrt{\frac{(p\_t - 4)(p\_t - 2)p\_{\infty}}{(p\_{\infty} - 4)(p\_{\infty} - 2)p\_t}}\tag{7}$$

while $p\_t = p\_{\infty} - \frac{2t\beta\_2^t}{1 - \beta\_2^t}$ and $p\_{\infty} = \frac{2}{1 - \beta\_2} - 1$.

When the length of the approximated simple moving average is less than or equal to 4, the variance of the adaptive learning rate is deactivated and the un-adapted momentum update is used. Otherwise, the variance rectification term is calculated and the parameters are updated with the adaptive learning rate.
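A single RAdam parameter update, following Equations (2)–(7) and the branching rule just described, can be written directly in numpy; this is an illustrative reference implementation, not the training code used in the paper:

```python
import numpy as np

def radam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999):
    """One RAdam update for weights w given gradient g at step t (t >= 1)."""
    m = b1 * m + (1 - b1) * g                # first moment, Eq. (2)
    v = b2 * v + (1 - b2) * g ** 2           # second moment, Eq. (3)
    m_hat = m / (1 - b1 ** t)                # bias-corrected momentum, Eq. (4)
    rho_inf = 2 / (1 - b2) - 1
    rho_t = rho_inf - 2 * t * b2 ** t / (1 - b2 ** t)
    if rho_t > 4:                            # variance is tractable: rectify
        v_hat = np.sqrt(v / (1 - b2 ** t))   # Eq. (5)
        r = np.sqrt((rho_t - 4) * (rho_t - 2) * rho_inf /
                    ((rho_inf - 4) * (rho_inf - 2) * rho_t))  # Eq. (7)
        w = w - lr * r * m_hat / v_hat       # rectified adaptive step, Eq. (6)
    else:                                    # warm-up: un-adapted momentum step
        w = w - lr * m_hat
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 6):                        # the earliest steps use the warm-up branch
    w, m, v = radam_step(w, np.array([0.1, -0.3]), m, v, t)
```

With the default β₂ = 0.999, `rho_t` stays at or below 4 for the first few steps, which is exactly the automatic warm-up that makes RAdam insensitive to the initial learning rate.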

After applying batch normalization to the activation function output of the convolutional layers, the normalized output will be:

$$z\_i = \gamma \hat{w}\_i + \beta \tag{8}$$

where γ and β are parameters used for scale and shift that are learned during training. Moreover, the weights normalization over a mini-batch is:

$$\hat{w}\_i = \frac{w\_i - \mu\_B}{\sqrt{\sigma\_B^2 + \varepsilon}}\tag{9}$$

where the mini-batch average is:

$$\mu\_B = \frac{1}{m} \sum\_{i=1}^{m} w\_i \tag{10}$$

and the mini-batch variance:

$$
\sigma\_B^2 = \frac{1}{m} \sum\_{i=1}^m \left( w\_i - \mu\_B \right)^2 \tag{11}
$$
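Equations (8)–(11) translate directly into a few lines of numpy. The sketch below normalizes a toy mini-batch and applies the learned scale and shift (γ and β are fixed here for illustration; in training they are learned parameters):

```python
import numpy as np

def batch_norm(w, gamma, beta, eps=1e-5):
    """Batch normalization over the first (batch) axis, per Eqs. (8)-(11)."""
    mu = w.mean(axis=0)                     # mini-batch mean, Eq. (10)
    var = ((w - mu) ** 2).mean(axis=0)      # mini-batch variance, Eq. (11)
    w_hat = (w - mu) / np.sqrt(var + eps)   # normalization, Eq. (9)
    return gamma * w_hat + beta             # scale and shift, Eq. (8)

batch = np.array([[1.0, 2.0],
                  [3.0, 6.0],
                  [5.0, 10.0]])
out = batch_norm(batch, gamma=1.0, beta=0.0)
```

With γ = 1 and β = 0 the output has (approximately) zero mean and unit variance per feature; learned γ and β let the network recover any other scale it needs.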

For FER, our end-to-end human–robot interaction pipeline used convolutional neural network models trained with batch normalization after ReLU activation and the RAdam optimizer. As shown later in the paper, the best results using RAdam were obtained for the ResNet model.

#### *2.4. Experiment Setup*

In order to develop an end-to-end pipeline for the interaction between a human and the NAO robot using computer vision based on deep convolutional neural networks, a preprocessing CNN for face detection was added before the FER CNN.

The entire facial expression pipeline was implemented on the NAO robot and is presented in Figure 4. The system was divided into four parts: the NAO robot image capture (NAO camera), the face recognition model (FR CNN), the facial emotion recognition model (FER CNN) and the robot facial expression (output to the user/human). For image capture and output to the user we used the Naoqi library functions running on the robot, while the FR and FER models were uploaded to the robot and enabled when emotion recognition was activated.

**Figure 4.** End-to-end emotion detection pipeline.

NAO is a humanoid robot developed by Aldebaran (SoftBank Robotics). It has a height of 57 cm, weighs 4.3 kg and is widely used in research and education due to its good performance, small size, affordable price and the wide range of sensors it is equipped with. Since scripts can be developed in several programming languages and can be compiled both locally (on the robot) and remotely, NAO can be used in various applications. To achieve human–robot interaction, NAO's top front camera was used for taking pictures or videos of the person in front, and NAO's voice module, together with the eye and ear LEDs, was used for outputting the robot emotion. NAO has two identical RGB video cameras located in the forehead, which provide a 640 × 480 resolution at 30 frames per second. The field of view is 72.6° DFOV with a focus range between 30 cm and infinity [54]. The eye and foot LEDs are full-color RGB, so it is possible to combine the three primary colors (red, green and blue) to obtain different colors. This feature is used to associate each emotion with one color: happy is green, angry is red, sad is blue, disgust is yellow, neutral is black, surprise is white and fear is orange (Figure 5). The intensity of the color is adjusted according to the probability of the detected emotion.
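The color mapping above can be sketched as a small lookup whose RGB triplet is dimmed by the detection probability. The helper name `led_color` is ours, not part of the NAOqi API.

```python
# Emotion-to-LED mapping: each emotion maps to an RGB triplet, scaled by
# the detection probability (the orange value for "fear" is illustrative).

EMOTION_RGB = {
    "happy":    (0, 255, 0),      # green
    "angry":    (255, 0, 0),      # red
    "sad":      (0, 0, 255),      # blue
    "disgust":  (255, 255, 0),    # yellow
    "neutral":  (0, 0, 0),        # black (LEDs off)
    "surprise": (255, 255, 255),  # white
    "fear":     (255, 165, 0),    # orange
}

def led_color(emotion, probability):
    """RGB value for an emotion, dimmed by the detection confidence."""
    return tuple(int(round(c * probability)) for c in EMOTION_RGB[emotion])
```

A low-confidence detection therefore produces a dim LED color, while a confident one lights the LEDs at full intensity.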

In terms of computer vision, NAO's embedded module is able to detect faces with a miss rate of 19.16% for frontal head positions, based on a face detection solution developed by OMRON. The NAO robot software (Choregraphe) and libraries provide a computer vision module for facial recognition named ALFaceDetection, which is able to recognize one particular face previously learned and stored in the robot's limited flash memory. The learning and storing process for one face is tedious and involves several steps, in which, for example, NAO checks that the face is correctly exposed (no backlighting, no partial shadows) in three consecutive images. Compared to the NAO face detection module, the CNN-based facial recognition, which represents the preprocessing module of the pipeline presented in this paper, is straightforward and could recognize one random face with an accuracy of 97.8%, as presented in Section 3.1.1.

**Figure 5.** NAO color expression.

The training of the FR and FER models was performed on a GeForce GTX 1080 graphics processing unit (GPU) running the compute unified device architecture (CUDA) and the CUDA deep neural network library (cuDNN), with the following specifications: 8 GB GDDR5X memory, 1.8 GHz processor and 2560 CUDA cores. The inference (detection) ran on the NAO robot, which is equipped with an Intel Atom Z530 1.6 GHz CPU and 1 GB of RAM.

In order to increase the inference speed we used a neural network coprocessor developed by Intel, called the Neural Compute Stick 2 (NCS 2), together with asynchronous threading, which boosts the NAO performance from 0.25 frames per second (FPS) to 4–6 FPS on our FR+FER pipeline. The inference speed when using only one CNN reached 8–9 FPS with the NCS 2.
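The asynchronous set-up can be sketched with a frame queue and an inference worker so that frame grabbing overlaps with CNN inference. The stub workloads below stand in for the NAOqi capture call and the NCS 2 inference, which are not reproduced here.

```python
import queue
import threading

# Sketch of asynchronous threading: a producer grabs frames while a
# consumer runs the (stubbed) FR+FER inference, so neither waits on the
# other. A bounded queue keeps memory use constant.

frames = queue.Queue(maxsize=4)
results = []

def camera(n_frames):
    for i in range(n_frames):
        frames.put(f"frame-{i}")   # grab the next frame while inference runs
    frames.put(None)               # sentinel: no more frames

def inference_worker():
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.append((frame, "happy"))  # stub FR+FER inference result

producer = threading.Thread(target=camera, args=(8,))
consumer = threading.Thread(target=inference_worker)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

Overlapping capture and inference in this way is what lifts the throughput from the sequential 0.25 FPS baseline.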

#### **3. Results**

#### *3.1. Facial Recognition*

#### 3.1.1. ResNet-Inception Based Faster R-CNN Model

The loss variation of Faster R-CNN is presented in Figure 6. It can be seen that the loss dropped very quickly. This happened because the complex architecture of the R-CNN, which combines the ResNet and Inception models, increased the training accuracy. Another parameter that artificially boosted the accuracy of the network was the number of images in the batch; in our case there was only one image per batch, due to the limitations of the GPU. Since gradient descent was applied to every batch, not to the entire dataset, a rapid drop and a small overall loss were to be expected. After the rapid drop to a value of 0.4 in the first 5k steps, the loss decreased slowly until it started to converge at around 60k steps, at a value of 0.25. The accuracy of the Faster R-CNN model was 97.8% using the PASCAL visual object classes (VOC) evaluation metrics.

**Figure 6.** Total loss of the faster region-based convolutional neural network (Faster R-CNN) model.

#### 3.1.2. Inception Based SSD Model

In contrast with Faster R-CNN, the smoothed loss for SSD Inception dropped exponentially and converged at around 120k steps. The loss variation can be observed in Figure 7, with a minimum value of around 1.8. The precision of the SSD Inception model was 97.42% using the PASCAL VOC evaluation metrics.

**Figure 7.** Total loss of the Inception based single shot detector (SSD) model.

#### *3.2. Facial Emotion Recognition*

#### 3.2.1. VGG16

The fine-tuning of the FC layers and the training of the last convolutional layers of the VGG model for FER are presented in Figure 8. The training loss (red line) and validation loss (blue line) during the "warm-up" dropped after several epochs and, after the FC layers training was complete, the training loss was 1.1 and the validation loss was 0.8 (Figure 8a). The accuracy was 0.6 (60%) for training (purple line) and 0.7 (70%) for validation (gray line). Once the classification layers' weights were initialized according to the new output classes and had learned some patterns from the new dataset, it was possible to move to the next step. The last two convolutional layers and the max pooling layer of the VGG model were unfrozen in order to obtain feature maps related to emotion recognition. Initially, the training was set for 100 epochs, but because the model started to overfit, the learning process was stopped after 50 epochs. When the training completed, the training loss (red line) had dropped to 0.36, while the validation loss (blue line) became unstable and converged around 0.5. The training accuracy (purple line) increased to 0.87 (87%), while the validation accuracy (gray line) reached 0.84 (84%; Figure 8b).
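The two-phase schedule above (FC warm-up with a frozen backbone, then unfreezing the last convolutional layers) can be sketched framework-agnostically. The layer names below are illustrative, not the exact VGG layer identifiers.

```python
# Sketch of the two-phase transfer-learning schedule: phase 1 trains only
# the new FC head; phase 2 additionally unfreezes the top conv layers.
# Layer names are illustrative placeholders.

LAYERS = [f"conv{i}" for i in range(1, 14)] + ["fc_1024", "fc_softmax"]

def set_trainable(layers, phase):
    """Return the set of layer names that receive gradient updates."""
    trainable = set()
    for name in layers:
        if name.startswith("fc"):
            trainable.add(name)                  # always train the new head
        elif phase == 2 and name in ("conv12", "conv13"):
            trainable.add(name)                  # unfreeze the top convs
    return trainable

warmup = set_trainable(LAYERS, phase=1)     # FC warm-up, backbone frozen
finetune = set_trainable(LAYERS, phase=2)   # fine-tune last conv layers
```

In a framework such as Keras, the same effect is obtained by toggling each layer's `trainable` flag and recompiling between the two phases.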

#### 3.2.2. ResNet

In Figure 9, the fine-tuning of the fully connected layers and the last unfrozen layers of the ResNet model is presented. The fully connected network was made of one dense layer of 1024 neurons activated by a ReLU function and one dense layer for classification activated by a Softmax function. The model that achieved the best results was compiled using the rectified Adam optimizer with a learning rate of 0.0001.

**Figure 8.** VGG training loss and accuracy: (**a**) fully connected (FC) layers warm-up and (**b**) training of last two convolutional layers.

The training of these two layers was performed while keeping all the other model layers frozen. The losses of the training and validation sets converged around a value of 1.1 and the accuracy of both sets around 0.6 (Figure 9a). For the actual training of the entire model, we tried different fine-tunings of the hyperparameters while unfreezing different top layers of the ResNet architecture. The settings that achieved the best results were: batch size 16, rectified Adam optimizer, learning rate 0.0001 and unfreezing the layers from the 31st convolution onward (the three ID blocks before the last convolutional block). This model reached a training accuracy of 0.9014 (90.14%) when the learning process was stopped after 50 epochs, because the validation loss had started to increase and the model was overfitting (Figure 9b). In other tests, the training accuracy increased to 0.96 while overfitting the model, but for our FER inference system we want the network to generalize well to different datasets. ResNet gave the best results of all the tested FER architectures in terms of generalization and was used for the NAO-SCNN inference system pipeline.
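The stopping rule described above (halt once the validation loss starts to rise) can be sketched as a simple patience check; the loss values and patience setting below are illustrative.

```python
# Sketch of early stopping on the validation loss: stop once the loss has
# not improved for `patience` consecutive epochs (signalling overfitting).

def early_stop_epoch(val_losses, patience=5):
    """Return the epoch at which training should stop, or None if the
    run completes without triggering the patience criterion."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch       # new best validation loss
        elif epoch - best_epoch >= patience:
            return epoch                         # loss stalled: overfitting
    return None

# illustrative validation-loss curve: improves, then starts rising
losses = [1.1, 0.9, 0.8, 0.75, 0.78, 0.8, 0.83, 0.85, 0.9]
stop_at = early_stop_epoch(losses, patience=3)
```

With the curve above, the best loss occurs at epoch 3 and training stops three epochs later, analogous to the 50-epoch cut-off used for ResNet.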

**Figure 9.** ResNet training loss and accuracy of the FC layers warm-up (**a**) and training of the last five convolutional layers (**b**).

#### 3.2.3. InceptionV3

The results of fine-tuning the InceptionV3 network are presented in Figure 10. As with the VGG and ResNet models, the training was divided into two steps: first, the weights of the fully connected layers were updated while freezing the rest of the network, and then some of the layers of the Inception B block, together with the Reduction B and Inception C blocks, were unfrozen (Figure 10a). The fully connected network was made of one dense layer with 512 neurons activated by a ReLU function and one dense layer for classification activated by a Softmax function. The model was compiled using the rectified Adam, Adam, RMSProp and stochastic gradient descent (SGD) optimizers with a learning rate of 0.0001. The training of the entire network was carried out by combining different hyperparameter values while unfreezing different top layers of the InceptionV3 architecture.

The best results were achieved for batch size 32, the Adam optimizer, a learning rate of 0.0001 and unfreezing the layers starting from the fifth convolution of the first Inception block type B. This model reached a training accuracy of 0.81 (81%) and a validation accuracy of 0.78 (78%) when the training was stopped, after 50 epochs (Figure 10b), due to overfitting.

**Figure 10.** InceptionV3 training loss and accuracy for: (**a**) FC layers warm-up and (**b**) training of the last five convolutional layers.

The confusion matrices for the FER predictions of the three architectures are presented in Figure 11. The network with the best overall results from the confusion matrix perspective was again ResNet (Figure 11b), followed by VGG (Figure 11a) and InceptionV3 (Figure 11c). The first thing to notice is that the class with the highest score was the "happy" class, regardless of the network model. There were two crucial factors that enabled such a good prediction: high variance and a large dataset, with the first being the more important. Training samples play an important role in learning a complex feature, but the variance was decisive when it came to highlighting crucial features. This can be seen from the fact that the happy class is not the largest among the training samples, being overtaken by the neutral class. Although it had the largest share of the dataset, the "neutral" class failed to generalize as well as the others. The same holds when comparing the angry, sad and surprise classes: they shared the same proportion of the dataset, but their accuracy differed.

All of the observations mentioned above were influenced by emotion variance. The similarity of the emotion variation can also be observed in the confusion matrix. For example, "fear" can easily be mistaken for "surprise" and "sad" for "neutral". A significant number of images from the "angry" and "sad" classes were classified as "neutral" images. This happens because these classes had low variance in terms of mouth and eyebrow shape: the shape of the mouth did not change significantly, while the displacement of the eyebrows was hard for the CNN model to distinguish. Interestingly, the reverse did not happen, probably because of the high share of training samples. Another important misclassification was the fear emotion classified as surprise. Due to the similarity of these emotions in terms of mouth shape and the small changes in eyebrow shape, fear was frequently misclassified as surprise (19%). The misclassification of the classes could be improved by increasing the number of images and by removing the neutral class in order to force the model to learn other distinct features. Bias played an important role, mainly when the expressions were difficult to label, thereby affecting the models that were trained on them.
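The per-class scores discussed above are read off a confusion matrix. A minimal sketch of building one and computing per-class accuracy (recall) follows, using toy labels rather than the paper's data.

```python
import numpy as np

# Sketch of confusion-matrix construction and per-class recall; the toy
# labels below illustrate a "fear" -> "surprise" confusion, not the
# actual FER2013 results of Figure 11.

CLASSES = ["angry", "fear", "happy", "neutral", "sad", "surprise"]

def confusion_matrix(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                 # row = true class, column = prediction
    return cm

def per_class_accuracy(cm):
    """Recall of each true class: diagonal over row sums."""
    return cm.diagonal() / cm.sum(axis=1)

y_true = [0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
y_pred = [0, 3, 5, 1, 2, 2, 2, 3, 4, 3, 4, 5]
cm = confusion_matrix(y_true, y_pred, len(CLASSES))
acc = per_class_accuracy(cm)
```

Off-diagonal mass in a row shows exactly which class absorbs the misclassifications, which is how the fear-to-surprise and sad-to-neutral confusions above are quantified.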

**Figure 11.** Confusion matrix for: (**a**) VGG; (**b**) ResNet and (**c**) InceptionV3.

In Table 1, the performances of state-of-the-art CNN models trained on the FER2013 database are presented with respect to CNN type, preprocessing method and optimizer. The NAO-SCNN models developed in this research achieved the highest accuracy, with the ResNet-based architecture obtaining the best performance.

In Table 2, the performances of state-of-the-art CNN models used for emotion recognition on the NAO robot are presented. The NAO-SCNN models again achieved the highest accuracy, with the ResNet-based architecture obtaining the best performance.


**Table 1.** Results for models trained on FER2013 database.



In Figure 12, the manner in which the VGG model learns patterns for a given input image is presented. We followed the feature map transformations of two outputs, from convolutions four and nine. After the fourth convolution, our fine-tuned model focused on learning different edges, such as the face, mouth, nose, eye or teeth outlines. This was the result of the convolutional filters learning how to identify lines, colors, circles and other basic information.

**Figure 12.** VGG feature maps at convolution 4 and convolution 9.

The feature maps after the second max pooling concentrated on learning more complex representations of the picture. At this stage, the convolutional filters were able to detect basic shapes. By the end of training, the filters in the deeper layers were able to distinguish more difficult shapes, and the representation of the feature maps became more abstract.

The feature maps for two layers of the ResNet model are presented in Figure 13. The first set of images was captured after the second convolutional block. At this point, the feature maps had high resolution and focused on sparse areas of the face. The second set of images was taken from the second convolution of the third ID block. It can be observed that at this point the filters concentrated on particular areas, such as the eyes, mouth, teeth and eyebrows, and the representation of the feature maps was more abstract.

The feature maps for two layers of the Inception V3 model are presented in Figure 14. The first set of images was captured after the first Inception type-A block. As with VGG and ResNet, the convolutional filters were able to detect the shape of the areas responsible for emotion detection, such as the eyes, mouth and eyebrows, as well as neighboring regions with relevant variance.

**Figure 13.** ResNet feature maps after the second convolution block and 8th ID block.

**Figure 14.** Inception V3 feature maps after the Inception block type A and Inception block type B.

The second set of images was taken after the first Inception block type B. At this point, the resolution was low as the filters concentrated on small regions of the input image and the feature maps had a more abstract representation.

#### **4. Conclusions**

The SSD and Faster R-CNN models for face detection shared almost the same total loss and accuracy. The accuracy of the Faster R-CNN model was 97.8% based on the VOC metrics, while for the SSD Inception model it was 97.42%. Thus, taking into account that Faster R-CNN is a large network in terms of total parameters, the SSD architecture was chosen for FR detection in order to keep the inference time low.

The network with the best overall performance, in terms of accuracy and loss for FER detection, was ResNet, and the results were confirmed by the confusion matrix. This model obtained a training accuracy of 90.14%, while VGG had 87% accuracy and Inception V3 reached 81%. By adding 5–6% laboratory-controlled images from other databases, the accuracy on FER2013 increased by more than 10% over the highest score so far, using the ResNet model together with other optimizations. In order to develop a model that generalizes FER better, it is important to mix controlled and uncontrolled images.

We also noticed that using the preprocessing module for FR before running the inference for emotion recognition enhances the FER confidence score for the same images. The results show improvements well over 10% when using the two serialized CNNs instead of just the FER CNN.

A recent optimization model, called rectified Adam, was applied for training the three FER CNN models, which led to better generalization. RAdam introduces a term to rectify the variance of the adaptive learning rate, applying a warm-up with a low initial learning rate. The tests showed a more robust loss during training and an accuracy improvement of 3–4% on each FER CNN compared to other optimizers.
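A minimal sketch of the RAdam update rule is given below for a scalar parameter: plain momentum steps during the warm-up, then adaptive steps scaled by the rectification term r_t once the variance of the adaptive learning rate becomes tractable. The hyperparameter values are the common defaults, not necessarily those used in the paper.

```python
import math

def radam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999,
                   eps=1e-8, steps=200):
    """Minimize a scalar objective with the RAdam update rule:
    momentum-only steps while the adaptive-learning-rate variance is
    intractable, rectified adaptive steps afterwards."""
    x, m, v = x0, 0.0, 0.0
    rho_inf = 2.0 / (1.0 - beta2) - 1.0      # maximum SMA length
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias-corrected momentum
        rho_t = rho_inf - 2 * t * beta2 ** t / (1 - beta2 ** t)
        if rho_t > 4:                        # variance is tractable
            v_hat = math.sqrt(v / (1 - beta2 ** t))
            r = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                          / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
            x -= lr * r * m_hat / (v_hat + eps)
        else:                                # warm-up: un-rectified step
            x -= lr * m_hat
    return x

# Example: minimize f(x) = x**2 (gradient 2x) starting from x = 5
x_final = radam_minimize(lambda x: 2.0 * x, 5.0)
```

Because r_t starts small and grows toward 1, the early adaptive steps are automatically damped, which is the mechanism behind the smoother training loss observed in the experiments.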

The FR and FER architectures were implemented on the NAO humanoid robot in order to switch from the current computer vision library running on the robot to a newer and more accurate serialized model based on deep convolutional neural networks.

In order to increase the inference speed we used a neural network coprocessor developed by Intel, called the Neural Compute Stick 2 (NCS 2), together with asynchronous threading, which boosts the NAO performance from 0.25 to 4–6 FPS on our FR+FER pipeline. The inference speed when using only one CNN reached 8–9 FPS with the NCS 2.

Future developments will involve the fusion of other inputs, such as audio models, infrared images, depth information from 3D face models and physiological data, which will provide additional information and further enhance the accuracy of the process.

The human–robot interaction model obtained during this research will be further developed so that it can be applied for medical purposes, mainly for improving the communication and behavior of children with autism spectrum disorder. The research will be carried out using medical feedback from psychologists, physicians, educators and therapists to reduce frustration and anxiety through communication with a humanoid robot.

**Author Contributions:** Conceptualization: D.O.M. and L.V.; methodology: L.V. and D.O.M.; software: D.O.M.; validation: D.O.M. and L.V.; formal analysis: L.V.; investigation: D.O.M. and L.V.; resources: L.V.; data curation: D.O.M.; writing—original draft preparation D.O.M. and L.V.; writing—review and editing: L.V. and D.O.M.; visualization: D.O.M. and L.V.; supervision: L.V.; project administration: L.V.; funding acquisition: L.V. and D.O.M. Both authors have read and agreed to the published version of the manuscript.

**Funding:** The paper was funded by: 1. UEFISCDI Multi-MonD2 Project, Multi-Agent Intelligent Systems Platform for Water Quality Monitoring on the Romanian Danube and Danube Delta, PN-III-P1-1.2-PCCDI2017-0637/33PCCDI/01.03.2018; 2. Romanian Ministry of Research and Innovation, CCCDI–UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0086/contract no. 22 PCCDI/2018, within PNCDI III; 3. Yanshan University: "Joint Laboratory of Intelligent Rehabilitation Robot" project, KY201501009, Collaborative research agreement between Yanshan University, China and Romanian Academy by IMSAR, RO; 4. The European Commission Marie Skłodowska-Curie SMOOTH project, Smart Robots for Fire-Fighting, H2020-MSCA-RISE-2016-73487.

**Acknowledgments:** This work was supported by: 1. UEFISCDI Multi-MonD2 Project, Multi-Agent Intelligent Systems Platform for Water Quality Monitoring on the Romanian Danube and Danube Delta, PN-III-P1-1.2-PCCDI2017-0637/33PCCDI/01.03.2018; 2. Romanian Ministry of Research and Innovation, CCCDI–UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0086/contract no. 22 PCCDI/2018, within PNCDI III; 3. Yanshan University: "Joint Laboratory of Intelligent Rehabilitation Robot" project, KY201501009, Collaborative research agreement between Yanshan University, China and Romanian Academy by IMSAR, RO; 4. The European Commission Marie Skłodowska-Curie SMOOTH project, Smart Robots for Fire-Fighting, H2020-MSCA-RISE-2016-73487. The authors gratefully acknowledge the support of the Robotics and Mechatronics Department, Institute of Solid Mechanics of the Romanian Academy.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **The Growth of Ga2O3 Nanowires on Silicon for Ultraviolet Photodetector**

**Badriyah Alhalaili 1,2, Ruxandra Vidu 2,3,\* and M. Saif Islam <sup>2</sup>**


Received: 4 September 2019; Accepted: 27 November 2019; Published: 2 December 2019

**Abstract:** We investigated the effect of silver catalysts on enhancing the growth of Ga2O3 nanowires. The growth of Ga2O3 nanowires on a P+-Si (100) substrate was demonstrated by using a thermal oxidation technique at high temperatures (~1000 °C) in the presence of a thin silver film that serves as a catalyst layer. We present the results of the morphological, compositional and electrical characterization of the Ga2O3 nanowires, including measurements of photoconductance and transient time. Our results show that highly oriented, dense and long Ga2O3 nanowires can be grown directly on the surface of silicon. The Ga2O3 nanowires, with their inherent n-type characteristics, formed a pn heterojunction when grown on silicon. The heterojunction showed rectifying characteristics and an excellent UV photoresponse.

**Keywords:** β-Ga2O3; nanowires; oxidation; silver catalyst; electrical conductivity; photodetector

#### **1. Introduction**

The development of wide band gap semiconductor technology has received considerable attention, as these basic materials facilitate various ultraviolet (UV) applications in nanoscale electronics and optoelectronics [1], such as engine control, solar UV monitoring, astronomy, communications and missile detection. Recently, UV photodetectors (PDs) have received special attention because the civil, military, environmental and industrial markets need improved UV instrumentation that operates in extremely harsh environments. Therefore, numerous approaches have been proposed to fabricate UV photodetectors with specialized features that operate and survive in the UV region of the spectrum.

Semiconductor nanowires exhibit different and often improved material properties [2,3] compared to bulk or thin-film semiconductors. In recent years, gallium oxide (Ga2O3) has become one of the most important materials that can operate in harsh conditions. With a band gap of 4.8 eV, a high melting point of 1900 °C, excellent electrical conductivity, a high figure of merit for high-frequency applications, and photoluminescence [4,5], it is an ideal candidate for visible-blind UV-light sensors, particularly for power electronics, solar-blind UV detectors and devices for harsh environments [6,7]. New processes have been investigated to synthesize Ga2O3 nanowires (NWs) through a bottom-up approach, including thermal oxidation [8,9], the vapor-liquid-solid mechanism [10], pulsed laser deposition [11], sputtering [12], thermal evaporation [13–15], molecular beam epitaxy [16], laser ablation [17], arc discharge [18], carbothermal reduction [19], microwave plasma [20], metalorganic chemical vapor deposition [21] and the hydrothermal method [22,23].

Due to their large surface area, small diameter and high photoconductivity, nanowires enable high responsivity in UV photodetectors. Additionally, one of the beneficial properties of nanowires is their ability to enhance light absorption and confinement, increasing photosensitivity [24]. The advantage of growing Ga2O3 nanowires, compared to thin films, is the higher surface-to-volume ratio, which increases the detection sensitivity by providing more available surface states at the interface and, thus, exceptional interaction with analytes or physical states [25]. Although there are various reports on growing Ga2O3 thin films on Si [26,27], there have been few reports on the growth of nanowires on a silicon (Si) substrate [28], which would pave the way for future sensing devices and circuit technology integration. The sensors obtained using this innovative approach will lead to new trends in the design, control and application of real-time intelligent sensor systems controlled by advanced intelligent control methods and techniques. The effect of an Ag thin film as a catalyst to enhance the growth of Ga2O3 nanowires and crystalline thin films on quartz has been reported [29], but it has not been explored on a silicon surface. We also wanted to observe the contribution of silicon atoms to enhancing the conductivity of Ga2O3 nanowires via diffusion-enabled incorporation into the nanowires during the growth process.

In our previous work, the surface of the quartz was intentionally coated with a 5 nm Ag catalyst using a shadow mask to examine the effect of the Ag nanoparticle (NP) distribution. In this work, the entire silicon surface was coated with a 5 nm catalyst to enhance the growth of highly oriented nanowires, which has not been shown before. Compared to other reported works [28], the nanowires were much longer and highly oriented when the Ag catalyst was used, whereas with an Au catalyst the nanowires were randomly oriented.

In this work, we propose the growth of β-Ga2O3 nanowires on a P+-silicon substrate by thermal oxidation at 950 °C using an Ag catalyst. We studied the sensitivity of β-Ga2O3 nanowires for UV detection.

#### **2. Materials and Methods**

The UV photodetector was fabricated on a (100) P+-Si substrate doped with phosphorus. The substrate was 500 μm thick and had a resistivity between 0.001 and 0.005 Ω·cm. Before each experiment, the silicon substrate was cleaned for 5 min in acetone and then for 5 min in methanol in an ultrasonic bath. Following the cleaning procedure, the wafer was rinsed with deionized water for 5 min. To obtain Ga2O3, 0.2 g of gallium (Ga, purity 99.999%) was dripped onto a cleaned quartz crucible. Silver was used as a catalyst to enhance the growth of the gallium oxide NWs. An ultrathin layer of 5 nm Ag was sputtered on the silicon. The silicon wafer was positioned with the Ag-coated surface facing the quartz crucible containing Ga. The distance between the substrate and the gallium pool was 10 mm. Then, the substrate was loaded into a quartz crucible, which was placed into an OTF-1200X-50-SL horizontal alumina tube furnace made by MTI Corporation (Richmond, CA, USA). The oxidation was performed at 950 °C for 1 h in a 20 sccm nitrogen flow.

Figure 1 illustrates the setup of the UV photodetector fabrication process. After the system cooled down to room temperature, the samples were removed from the furnace, cleaved, and characterized by scanning electron microscopy (SEM), X-ray photoelectron spectroscopy (XPS) and high-resolution transmission electron microscopy (HRTEM) equipped with energy-dispersive X-ray spectroscopy. The electrical contacts were patterned on top of the nanowires using a shadow mask, and then 1 nm Cr and 150 nm Au were sputtered using a Lesker sputtering system. Electrical characterization of the system was also carried out to assess the performance of the UV photodetector. For the electrical measurements, a custom probe station attached to a Keithley 2400 series SMU instrument was used. For the photocurrent measurements, UV illumination was provided by a Dymax Bluewave 75 UV lamp (280–320 nm) (Dymax Corporation, Torrington, CT, USA) with a light intensity of 1.5 W/cm2.

**Figure 1.** Schematic of the growth process of Ga2O3 NWs on a Si substrate coated with a 5 nm thin film of Ag and positioned downward to face the liquid Ga pool in a quartz crucible. The gap between the Ga pool and the silicon substrate is about 10 mm.

#### **3. Results and Discussion**

#### *3.1. Surface Morphology*

Ga2O3 nanowires were grown on P+-Si at 950 °C. As shown in Figure 2, the silver catalyst plays a major role in the growth mechanism. Using 5 nm Ag as a catalyst, a homogeneous coating and denser nanowires were achieved due to the low contact angle. A low contact angle reflects the extent of wetting, i.e., the liquid advances on the surface and wets it homogeneously. To control the wetting contact angle, the deposition or incorporation of elements and molecules onto the surface is a standard procedure. We believe that the role of Ag is to improve wettability, which enhances the homogeneous appearance of Ga2O3 nuclei and can lead to dense nanowires. The contact angle of Ga on a silver film is 30° [30], while on a silicon substrate it is 73.9° [31], leading to better wetting of Ga on the Ag surface and uniform growth of Ga2O3 nanowires (Figure 3).

**Figure 2.** SEM images of Ga2O3 nanowire growth on Si at 950 °C. (**a**) Top view and (**b**) side view of Ga2O3 nanowires grown on Si. Denser and longer nanowire growth was attained.

**Figure 3.** Contact angle of liquid Ga droplet on different surfaces. (**a**) Silicon. (**b**) 5 nm silver thin film. Areas coated with 5 nm Ag show uniform and high-dense growth of Ga2O3 nanowires.

Various research strategies have been pursued in the past, mainly to enhance nanowire growth on the target substrate [10,32–34]. However, these techniques for growing Ga2O3 nanowires have shown lateral growth, overlapping nanowires, lower density and weak adhesion to the substrate. None of the previous techniques was able to produce a conformal growth process of Ga2O3 nanowires on the substrate surface.

The results obtained with the 5 nm Ag catalyst showed a remarkable improvement in the length and density of the nanowires, most of them perpendicular to the surface. Even though the lengths of these nanowires increased, their diameters decreased. The diameters of the nanowires were in the range of 70–90 nm at the tip and 120–160 nm at the bottom. The average length of these nanowires was in the range of about 30–70 μm.

#### *3.2. X-ray Photoelectron Spectroscopy (XPS)*

To analyze the elemental composition of Ga2O3 nanowires, XPS was performed on a PHI 5800 model.

Figure 4 shows the XPS spectra of Ga2O3 nanowires on Si. The XPS spectrum shows the chemical composition of the particles at the surface of the β-Ga2O3 nanowires on Si in the presence of Ag. The binding energies of Ga2p3, O1s, Ag3d (with two peaks) and Si2p are 1119.1 eV, 532 eV, 369.07 eV and 379.66 eV, and 105.18 eV, respectively. The peaks of Ga and O for Ga2O3 and of Ag are in agreement with the handbook of XPS spectra [35,36]. XPS analysis of the β-Ga2O3 nanowires on Si in the presence of the Ag catalyst showed a positive shift due to the effect of the electronegativity difference [37]. In addition, this shift could appear in Ag3d, as the size of the Ag nanoparticles decreased substantially [38].

**Figure 4.** XPS of the β-Ga2O3 nanowires obtained at 950 °C in the presence of an Ag catalyst. Different peaks were detected by XPS. (**a**) Ga. (**b**) O. (**c**) Ag. (**d**) Si. The peaks of Ag and Ga show slight positive shifts due to the differences in electronegativity and work function.

#### *3.3. High-Resolution Transmission Electron Microscopy (HRTEM)*/*Energy-Dispersive Spectroscopy (EDS)*

An energy-dispersive spectroscopy (EDS) profile analysis was performed on β-Ga2O3 nanowires grown on Si (Figure 5). Interestingly, no Ag nanoparticles were clearly observed on the surface of the nanowires. However, a very small atomic percentage of Ag was detected by HRTEM equipped with EDS. Because no Ag was observed on the nanowire surface, this small amount of Ag might be embedded in the Ga2O3 nanowires. These remaining Ag nanoparticles could have been trapped inside the nanowires after the rest of the Ag was consumed and evaporated.

**Figure 5.** HRTEM image and the corresponding EDS mapping of Ga, O, Si and Ag of Ga2O3 NWs on P-doped (100) silicon substrate coated with 5 nm Ag.

Because silicon atoms can interact with silver at high temperature (i.e., the oxidation temperature of 950 °C), the background impurity of silicon in the Ga2O3 nanowires was measured. At high temperature and a few atomic percent of Si, the Si-Ag phase diagram [39] shows that Si can interact with Ag. Silicon is one of the major impurities that strongly correlates with n-type conductivity [40]. If silicon were incorporated into the Ga2O3 nanowires during oxidation, it could increase the n-type conductivity of the nanowires. In addition, since Si has a strong effect on the dissolution of large Ag NPs [41], more Ag atoms become available for diffusion on the Si surface, which could result in denser nanowire growth.

#### *3.4. Growth Mechanism of* β*-Ga2O3 Nanowires*

The contribution of the silver catalyst to the growth enhancement of β-Ga2O3 nanowires on Si showed a growth reaction rate strongly influenced by the oxidation temperature, following the Arrhenius law [42]. Oxygen diffusivity and solubility are the key parameters that distinguish Ag as an effective catalyst for Ga2O3 nanowire growth.

Diffusion is a result of the kinetic properties of atoms. In this case, diffusion appears to be driven by the high capability of Ag to absorb oxygen, and it is greatly influenced by temperature. Several studies have focused on the oxygen diffusivity (D) in gallium [43] and silver [44]. Table 1 summarizes the reported diffusivity coefficients of oxygen in solid silver, liquid silver and liquid gallium. The high diffusion coefficient of oxygen in silver reflects its strong tendency to absorb oxygen and, hence, to boost nanowire growth.


**Table 1.** Summary of the reported diffusivity coefficients and activation energies of oxygen in silver and gallium.

The solubility of oxygen is another factor that is essential for speeding up the growth of Ga2O3 nanowires. The activation energy of oxygen solubility in silver was 0.01192 eV/K over a temperature range of 763–937 °C [45], whereas in gallium it was 2.38 × 10<sup>−4</sup> eV/K over a temperature range of 750–1000 °C [48]. Oxygen thus exhibits a higher solubility in silver than in gallium. Further studies are needed to measure the Ag-Ga-O thermodynamics at higher temperatures.

Taking these results into consideration, the growth mechanism of the nanowires can be explained as follows. First, at higher temperatures, liquid gallium forms gallium oxide in the presence of oxygen. The oxide is then reduced by the liquid metallic gallium to form a gas-phase gallium suboxide (Ga2O), as shown in Equation (1) [29,49]:

$$\mathrm{Ga_2O_{3(s)} + 4Ga_{(l)} \to 3Ga_2O_{(g)}\uparrow} \tag{1}$$

The Ga2O gas phase is transported to cooler regions, where it decomposes into liquid gallium and Ga2O3 [50,51], leading to a vapor-liquid-solid (VLS) growth mechanism. At high temperatures (T > 950 °C), denser Ga2O3 grows as nanowires. It has been shown that Ga atoms can easily etch the silica surface of the substrate at around 950 °C, as shown in Equation (2) [52].

$$\mathrm{SiO_2 + 4Ga \to 2Ga_2O\uparrow + Si} \tag{2}$$

In addition, the Si-Ag phase diagram shows that a liquid phase exists in this system at high temperatures (T > 800 °C) for small percentages of Si [39]. Small concentrations of silicon can detach Ag surface atoms and lower their melting point [41]. Although carrier doping of β-Ga2O3 is a difficult task, impurity doping with Sn or Si has been shown to achieve electrical conduction [40,53–55]. In this growth mechanism, silicon was detected by EDS (Figure 5), unintentionally improving the background conductivity of the nanowires. Oxygen atoms segregated on the surface of the Ag catalyst react with Ga. This increases the flux of O atoms and the Ga segregation at the Ag-Si interface, leading to the formation of an equilibrium Ag-Si-Ga-O mixture that becomes a solid-phase source for Ga2O3 nucleation (Figure 6).

**Figure 6.** The growth mechanism of Ga2O3 NWs on a silicon substrate coated with 5 nm Ag as a catalyst. The equilibrium liquid mixture of Ag-Ga-O at higher temperature (>900 °C) enhances the growth mechanism and increases the density of Ga2O3 NWs.

#### *3.5. Electrical Characterization*

#### 3.5.1. I-V Characterization

The β-Ga2O3/P+-Si PN heterojunction (Figure 7) was fabricated to determine the electronic properties of the β-Ga2O3 nanowires. The P+-Si substrate was chosen because of the availability of low-cost materials for electronics and in order to observe how silicon from the substrate can influence the conductivity of Ga2O3. The impact of silicon doping of Ga2O3 during growth was reported in references [40,53–55] for thin films and bulk materials, and we wanted to investigate whether the migration of silicon atoms from the substrate can have a similar effect. In addition, the formation of n-Ga2O3 nanowires on the surface of highly doped silicon substrates has not been reported so far. The results lead to a simple growth technique for the large-scale production of a highly sensitive and stable structure. In previous works, the growth of Ga2O3 was obtained using an Au catalyst (instead of Ag) on the surface of a SiO2/Si template [28].

**Figure 7.** Schematic diagram of Au/β-Ga2O3/Silicon photoconductor. The distance between the gold probes is 0.8 mm.

The current-voltage (I-V) characteristics were measured in the dark and under UV illumination at two voltages, 10 and 50 V. The photocarriers were excited by UV illumination from a Dymax Bluewave 75 UV lamp (280–320 nm) (Dymax Corporation, Torrington, CT, USA) (Figure 8). The photoconductivity mechanism of the β-Ga2O3 NWs is attributed to a surface oxygen adsorption and desorption process [56], which is strongly influenced by the presence of silver as a catalyst, leading to improved oxygen detection and, hence, better electrical properties of the β-Ga2O3 nanowires.
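For reference, the photon energy over the lamp's emission window can be computed from E = hc/λ ≈ 1239.84 eV·nm / λ; a minimal sketch:

```python
PLANCK_EV_NM = 1239.84  # h*c expressed in eV*nm

def photon_energy_ev(wavelength_nm: float) -> float:
    """Photon energy E = h*c / lambda, returned in eV."""
    return PLANCK_EV_NM / wavelength_nm

# Emission window of the UV lamp reported in the text: 280-320 nm
for wl in (280.0, 320.0):
    print(f"{wl:.0f} nm -> {photon_energy_ev(wl):.2f} eV")
# 280 nm -> 4.43 eV, 320 nm -> 3.87 eV
```

These energies sit close to, but below, the 4.9 eV bandgap of β-Ga2O3 quoted later in the text, which is consistent with the detection mechanism relying on contributions from the Ag NPs and the silicon substrate in addition to band-to-band absorption.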

**Figure 8.** Semi-logarithmic plots of the dark and photocurrent density characteristics of Ga2O3 NWs grown on a silicon substrate at 950 °C with an Ag catalyst: (**a**) 10 V, (**b**) 50 V.

The ratio of photocurrent to dark current at 10 V was 3066.11, which is higher than in other reported studies [57,58]. The reduction in performance at higher voltage could be attributed to the presence of Ag NPs, which were detected by XPS although they are difficult to see in the scanning electron microscopy (SEM) and transmission electron microscopy images. The hot carriers of Ag NPs could increase self-heating effects [59]. This issue is one of the major challenges still under investigation to improve the thermal conductivity of Ga2O3. It is well known that Ga2O3 generates self-heating effects that degrade the carrier mobility [60], reducing the performance of Ga2O3 at high voltage.

Even though the Ag catalyst could cause this drawback, it can also enhance the sensitivity of the photodetector. The effect of the catalytic Ag nanoparticles can be explained as follows. First, Ag nanoparticles make a significant contribution to improving the conductivity of the Ga2O3 nanowires, leading to better sensing performance. Second, Ag nanoparticles can greatly enhance the adsorption and desorption of O2 on their surface owing to the highly conductive behavior of Ag metal [61]. Consequently, the number of electrons drawn to O2 increases greatly. Third, Ag nanoparticles act as electron mediators that allow electrons to migrate from the surface of the Ga2O3 nanowires to the O2 through the defect states of Ga2O3. As a result, the bulk defects of Ga2O3 may act as a secondary factor in the sensing mechanism in addition to the surface defects [4]. Consequently, the Ag NPs significantly reduce the electron density of the Ga2O3 and improve its electrical conductivity, leading to better selectivity and sensitivity.

#### 3.5.2. Transient Time

The transient response of the photodetector was measured by switching a UV light source (280–450 nm) on and off (Figure 9). Under UV illumination, the oxygen adsorption and desorption processes improve the photoconductivity response by increasing the carrier mobility. In contrast, when the UV illumination is switched off, the excess electrons and holes recombine rapidly. Ga2O3 on silicon with the Ag catalyst showed a rapid transient response due to the enhanced carrier transport: the rise time was 0.8 s and the fall time was 1.5 s.
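A first-order exponential model is a common way to describe such on/off transients. The sketch below assumes single-exponential rise and decay, and treats the reported rise and fall times as the characteristic time constants (an assumption, not something stated in the text):

```python
import math

TAU_RISE_S = 0.8   # reported rise time, used here as the rise constant (assumption)
TAU_FALL_S = 1.5   # reported fall time, used here as the decay constant (assumption)

def photocurrent_on(t_s: float, i_max: float = 1.0) -> float:
    """Rise after UV is switched on: I(t) = I_max * (1 - exp(-t/tau_rise))."""
    return i_max * (1.0 - math.exp(-t_s / TAU_RISE_S))

def photocurrent_off(t_s: float, i_max: float = 1.0) -> float:
    """Decay after UV is switched off: I(t) = I_max * exp(-t/tau_fall)."""
    return i_max * math.exp(-t_s / TAU_FALL_S)

# After one rise constant the current reaches ~63% of its steady-state value:
print(f"I(0.8 s)/I_max = {photocurrent_on(0.8):.3f}")  # ~0.632
```

Under this model, the slower decay (1.5 s vs. 0.8 s) reflects the recombination of excess carriers with re-adsorbed oxygen after illumination is removed.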

**Figure 9.** Transient response of the UV photodetector fabricated with Ag catalyst based on Au/β-Ga2O3/Silicon photojunction at 10 V.

#### 3.5.3. Detection Mechanism

In dark-current measurements, the Ag NPs form localized Schottky junctions and deplete the carriers at the interface with the β-Ga2O3 nanowires. The resulting large depletion width at the interface between the Ag NPs and the β-Ga2O3 nanowires decreases the dark current of the UV photodetector.

The UV detection mechanism is determined by the contributions of two different parts, namely, the Ag nanoparticle catalyst and the P+-silicon. Under UV illumination, when the photon energy is larger than the bandgap of Ga2O3, electron-hole pairs are generated [hν → e<sup>−</sup> + h<sup>+</sup>]. These photo-generated carriers, accelerated by the large electric field, increase the carrier density in the β-Ga2O3 nanowires and improve the photocurrent response. The energy band diagram of the Ag NPs/β-Ga2O3/p-Si p-n junction is shown in Figure 10a. The band offsets are estimated using the electron affinities of 4.05 eV [62] and 4.00 eV [63] and the band gaps of 1.12 eV and 4.9 eV for p-Si and β-Ga2O3, respectively. The work function (ϕGa2O3) and electron affinity (χGa2O3) of β-Ga2O3 are 4.11 eV and 4.00 eV [63], respectively. This is lower than the work function of Ag (4.26 eV), leading to the formation of a Schottky barrier that prevents electron transport from the Ag NP side to the Ga2O3. In addition, Ag NPs on the surface of Ga2O3 are strongly affected by UV light below 320 nm due to interband transitions, which excite highly energetic hot electrons from the 4d and 5sp bands [64–66]. These hot electrons surmount the small Schottky barrier and bend the bands downward locally on the Ga2O3 side, enabling electron transfer into the conduction band of the Ga2O3 nanowires.
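The band alignment quoted above can be reproduced with simple electron-affinity (Anderson-rule) and Schottky-Mott arithmetic; a sketch using the values given in the text:

```python
# Values quoted in the text (all in eV)
CHI_SI, EG_SI = 4.05, 1.12        # electron affinity and band gap of p-Si
CHI_GA2O3, EG_GA2O3 = 4.00, 4.9   # electron affinity and band gap of beta-Ga2O3
PHI_AG = 4.26                     # work function of Ag

# Anderson-rule band offsets at the Si / Ga2O3 heterojunction
delta_ec = CHI_SI - CHI_GA2O3                         # conduction-band offset
delta_ev = (CHI_GA2O3 + EG_GA2O3) - (CHI_SI + EG_SI)  # valence-band offset

# Schottky-Mott barrier for electrons at the Ag / Ga2O3 contact
phi_b = PHI_AG - CHI_GA2O3

print(f"dEc = {delta_ec:.2f} eV, dEv = {delta_ev:.2f} eV, phi_B = {phi_b:.2f} eV")
# dEc = 0.05 eV, dEv = 3.73 eV, phi_B = 0.26 eV
```

The small 0.26 eV electron barrier at the Ag/Ga2O3 contact is consistent with the statement that UV-excited hot electrons in the Ag NPs can surmount it, while the large valence-band offset explains why holes cannot easily cross to the p-Si side under reverse bias.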

Regarding the silicon contribution, when the applied voltage is positive on the Ga2O3, the holes can move easily and the photocurrent response increases. However, if the voltage is negative, the holes are constrained and cannot cross the barrier to the p-Si side. Consequently, the larger number of electrons increases oxygen molecule adsorption and ionization [O2 + e<sup>−</sup> → O2<sup>−</sup>(ad)] [67,68]. The holes drift to the surface, accumulate, recombine with the adsorbed ionized oxygen, and release free oxygen molecules from the surface [O2<sup>−</sup>(ad) + h<sup>+</sup> → O2]. The remaining electrons become the majority carriers and contribute to an increase in the photocurrent through generation and recombination until an equilibrium is reached.

Nanowires offer a great opportunity to form a higher density of exposed surface states due to the dangling bonds at their surface. The oxygen trap states generated at the surface of the Ga2O3 nanowires have a large impact on device performance [2]. The detector can be easily and fully integrated on a chip with proper metal contacts, similar to graphene-based detectors [69]. Due to the large surface-to-volume ratio of the nanowires and the presence of the Ag NPs, the NW surface with trapped oxygen becomes highly sensitive.

**Figure 10.** Energy band diagram of the Ag NPs/Ga2O3 NWs/P+-Si p-n junction. (**a**) At the interface before contact. (**b**) Under UV illumination, the interband transition in the Ag NPs enhances the photosensitivity of the UV detection, and more photo-generated holes in the Ga2O3 NWs migrate to the surface by band bending.

#### **4. Conclusions**

Highly oriented, dense, and long β-Ga2O3 nanowires were grown on a P+-Si (100) substrate in the presence of a 5 nm Ag catalyst thin film by oxidation treatment at high temperature (1000 °C). Silver was shown to have a great impact in expediting the growth of the Ga2O3 nanowires while retaining their physical and chemical properties. The morphological, compositional, and electrical properties were explored, and the growth mechanism of the nanowires on the silicon substrate was discussed. During growth, the Ga2O3 nanowires are strongly influenced by silicon as an unintentional impurity that increases the n-type doping. The photoresponse under UV irradiation was excellent: the ratio of photocurrent to dark current (Iphoto/Idark) was measured to be around 3.07 × 10<sup>3</sup> at 10 V. The high photosensitivity could be attributed to the higher electron density in the Ga2O3 nanowires with Ag NPs. The carrier transport process showed a fast response. The energy band gap and the carrier dynamics at the interfaces were discussed. This synthesis can be optimized for sensing, electronic, and photonic applications.

**Author Contributions:** Conceptualization, B.A.; Methodology, B.A.; Resources, M.S.I.; Data Curation, B.A.; Writing—Original Draft Preparation, B.A.; Writing—Review & Editing, B.A., R.V. and M.S.I.; Supervision, M.S.I.; Project Administration, M.S.I.; Funding Acquisition, M.S.I.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors gratefully acknowledge the financial support of the Kuwait Institute for Scientific Research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
