*Article* **Two Novel Models for Traffic Sign Detection Based on YOLOv5s**

**Wei Bai 1, Jingyi Zhao 1, Chenxu Dai 1, Haiyang Zhang 2, Li Zhao 3, Zhanlin Ji 1,4,\* and Ivan Ganchev 4,5,6,\***

<sup>1</sup> College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China


**\*** Correspondence: zhanlin.ji@ncst.edu.cn (Z.J.); ivan.ganchev@ul.ie (I.G.)

**Abstract:** Object detection and image recognition are some of the most significant and challenging branches in the field of computer vision. The prosperous development of unmanned driving technology has made the detection and recognition of traffic signs crucial. Affected by diverse factors such as light, the presence of small objects, and complicated backgrounds, the results of traditional traffic sign detection technology are not satisfactory. To solve this problem, this paper proposes two novel traffic sign detection models, called YOLOv5-DH and YOLOv5-TDHSA, based on the YOLOv5s model with the following improvements (YOLOv5-DH uses only the second improvement): (1) replacing the last layer of the 'Conv + Batch Normalization + SiLU' (CBS) structure in the YOLOv5s backbone with a **t**ransformer self-attention module (T in the YOLOv5-TDHSA's name), and also adding a similar module to the last layer of its neck, so that the image information can be used more comprehensively, (2) replacing the YOLOv5s coupled head with a **d**ecoupled **h**ead (DH in both models' names) so as to increase the detection accuracy and speed up the convergence, and (3) adding a **s**mall-object detection layer (S in the YOLOv5-TDHSA's name) and an **a**daptive anchor (A in the YOLOv5-TDHSA's name) to the YOLOv5s neck to improve the detection of small objects. Based on experiments conducted on two public datasets, it is demonstrated that both proposed models perform better than the original YOLOv5s model and three other state-of-the-art models (Faster R-CNN, YOLOv4-Tiny, and YOLOv5n) in terms of the mean average precision (*mAP*) and *F1 score*, achieving *mAP* values of 77.9% and 83.4% and *F1 score* values of 0.767 and 0.811 on the TT100K dataset, and *mAP* values of 68.1% and 69.8% and *F1 score* values of 0.71 and 0.72 on the CCTSDB2021 dataset, respectively, for YOLOv5-DH and YOLOv5-TDHSA.
This was achieved, however, at the expense of both proposed models having a bigger size, a greater number of parameters, and a slower processing speed than YOLOv5s, YOLOv4-Tiny, and YOLOv5n, surpassing only Faster R-CNN in this regard. The results also confirmed that the incorporation of the T, S, and A improvements into YOLOv5s leads to further enhancement, represented by the YOLOv5-TDHSA model, which is superior to the other proposed model, YOLOv5-DH, which avails of only one of the YOLOv5s improvements (i.e., DH).

**Keywords:** computer vision; object detection; traffic sign detection; you only look once (YOLO); attention mechanism; feature fusion

**MSC:** 68W01; 68T01

### **1. Introduction**

The detection and recognition of traffic signs play essential roles in the fields of assisted driving and automatic driving. Traffic signs are not only the main sources for drivers to obtain the necessary road information, but they also help adjust and maintain traffic flows [1]. However, in real-life scenarios, the influence of complex weather conditions and the existence of various categories of objects on the road, a large proportion of which are small objects, have brought great challenges to the research on automatic detection and recognition of traffic signs.

**Citation:** Bai, W.; Zhao, J.; Dai, C.; Zhang, H.; Zhao, L.; Ji, Z.; Ganchev, I. Two Novel Models for Traffic Sign Detection Based on YOLOv5s. *Axioms* **2023**, *12*, 160. https://doi.org/10.3390/axioms12020160

Academic Editor: Oscar Humberto Montiel Ross

Received: 28 December 2022; Revised: 29 January 2023; Accepted: 31 January 2023; Published: 3 February 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There were two traffic sign detection and recognition techniques in the early days: one based on color features and the other based on shape features. Later, hybrid techniques emerged, e.g., [2], which considered both the color and geometric information of traffic signs during the feature extraction. Noise reduction and morphological processing made it easier to process images based on shape, using the geometric information of the triangles, circles, or squares commonly found in traffic signs, along with the RGB color information, in order to identify the images containing traffic signs. Although such a technique can detect the presence of traffic signs in images, it cannot distinguish between different classes of traffic signs.

With the emergence of deep learning, some models based on it have been applied for image classification and object detection, showing excellent performance, such as the two-stage detectors, represented by, e.g., the region-based convolutional neural networks (R-CNNs), and the single-stage detectors, represented by, e.g., the You Only Look Once (YOLO) versions. R-CNN [3] was the first model applying convolutional neural networks (CNNs) for object detection. R-CNN generates candidate boxes first, before detection, to reduce the information redundancy, thus improving the detection speed. However, it zooms and crops images, resulting in a loss of original information. SPP-net [4] defined a spatial pyramid pooling (SPP) layer in front of the fully connected layer, which allowed one to input images of an arbitrary size and scale, thus not only breaking the constraint of fixed sizes of input images but also reducing the computational redundancy. Fast R-CNN [5] changed the original serial structure of R-CNN into a parallel structure and absorbed the advantages of SPP-net, which allowed it not only to accelerate the object detection but also to improve the detection accuracy. However, if a large number of invalid candidate regions are generated, this leads to a waste of computing power, whereas a small number of candidate regions results in missed detections. To address these problems, Ren et al. proposed the concept of region proposal networks (RPNs) [6], which generate candidate regions through neural networks to solve the mismatch between the generated candidate regions and the real objects. However, these two-stage models were not superior in training and detection speed, so single-stage models, represented by the YOLO family, came into existence [7]. By predicting, from the feature map of the input image, the category probabilities and the bounding box coordinates of the entire image in a single pass, YOLO casts object detection as a simple regression problem.
The algorithm only runs once, which of course reduces the accuracy, but allows achieving a higher processing speed than the two-stage object detectors, thus making it suitable for real-time detection of objects. The first version of YOLO, YOLOv1 [8], divides each given image into a grid system. Each grid detects objects by predicting the number of bounding boxes of the objects in the grid. However, if small objects in the image appear in clusters, the detection performance suffers. The second version, YOLOv2 [9], preprocesses the batch normalization based on the feature extraction network of DarkNet19 to improve the convergence of the network. Later, YOLOv3 [10] added logistic regression to predict the score of each bounding box. It also introduced the method of Faster R-CNN of giving priority to only one bounding box. As a result, YOLOv3 can detect some small objects. However, the predicted boxes of YOLOv3 do not always fit the ground truth well. YOLOv4 [11] uses weighted residual connections (WRC), cross mini-batch normalization (CmBN), self-adversarial training (SAT), and other methods, which allows it to not only keep a suitable training and detection speed but also achieve better detection accuracy. YOLOv5 passes each batch of training data through a data loader, which performs three types of data enhancement: zooming, color space adjustment, and mosaic enhancement. Among the five models produced to date based on YOLOv5, this paper proposes improvements to the YOLOv5s model, which uses two cross-stage partial connections (CSP) structures (one for the backbone network and the other for the neck) and a weighted non-maximum suppression (NMS) [12] to improve the detection accuracy of the occluded objects in images.

The two-stage object detectors, such as R-CNN, SPP-net, and Fast R-CNN mentioned above, are not suitable for real-time detection of objects due to their relatively low detection speed. As single-stage object detectors, the YOLO versions are obviously better than the two-stage detectors in terms of the detection speed achieved. However, their detection accuracy is generally lower. To tackle this problem, this paper proposes two novel YOLOv5s-based traffic sign detection models, called YOLOv5-DH and YOLOv5-TDHSA, with the following improvements to YOLOv5s (YOLOv5-DH uses only the second improvement below), which constitute the main contributions of the paper:


Based on results obtained from experiments conducted on two public datasets (TT100K and CCTSDB2021), the proposed YOLOv5-DH and YOLOv5-TDHSA models outperform the original YOLOv5s model along with three other state-of-the-art models (Faster R-CNN, YOLOv4-Tiny, YOLOv5n), as shown further in the paper.

The rest of the paper is organized as follows. Section 2 introduces the attention mechanisms, feature fusion networks, and detection heads commonly used in object detection models. Section 3 presents the main representatives of the two-stage and single-stage object detection models. Section 4 explains the YOLOv5s improvements used by the proposed models, including the transformer self-attention mechanism, the decoupled head, the small-object detection layer, and the group of adaptive anchor boxes. Section 5 describes the conducted experiments, and presents and discusses the obtained results. Finally, Section 6 concludes the paper.

### **2. Background**

### *2.1. Attention Mechanisms*

Attention is a data processing mechanism used in machine learning and extensively applied in different types of tasks such as natural language processing (NLP), image processing, and object detection [13]. The squeeze-and-excitation (SE) attention mechanism aims to assign different weights to each feature map and focuses on more useful features [14]. SE pools the input feature map globally, then uses a fully connected layer and an activation function to adjust the feature map, thus obtaining the weight of each feature, which is multiplied with the input feature at the end. The disadvantage of SE is that it only considers the channel information and ignores the spatial location information. The convolutional block attention module (CBAM) solves this problem by first generating different channel weights, and then compressing all feature maps into one feature map to calculate the weight of the spatial features [15]. Currently, self-attention [16] is one of the most widely used attention mechanisms due to its strong feature extraction ability and its support for parallel computing. The transformer self-attention mechanism, used by the YOLOv5-TDHSA model proposed in this paper, can establish a global dependency relationship and expand the receptive field of images, thus obtaining more features of traffic signs.
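The SE steps described above (global pooling, two fully connected layers, sigmoid gating, channel-wise scaling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two projection matrices are random stand-ins for learned weights, and the reduction ratio of 4 is an assumed value.

```python
import numpy as np

def se_attention(x, reduction=4, rng=np.random.default_rng(0)):
    """Minimal squeeze-and-excitation sketch on a (C, H, W) feature map.
    The two fully connected layers use random weights purely for
    illustration; in a real network they are learned."""
    c = x.shape[0]
    w1 = rng.standard_normal((c // reduction, c)) * 0.1  # illustrative weights
    w2 = rng.standard_normal((c, c // reduction)) * 0.1

    s = x.mean(axis=(1, 2))                      # squeeze: global average pooling -> (C,)
    e = np.maximum(w1 @ s, 0.0)                  # excitation: FC + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ e)))    # FC + sigmoid -> per-channel weights in (0, 1)
    return x * weights[:, None, None]            # scale each channel of the input

x = np.random.default_rng(1).standard_normal((16, 8, 8))
y = se_attention(x)
```

Because the sigmoid gate lies in (0, 1), each channel of the output is a damped copy of the input, which is exactly the "focus on more useful features" behaviour described above.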

### *2.2. Multi-Scale Feature Fusion*

The feature pyramid network (FPN) [17] utilized in Faster R-CNN and Mask R-CNN [18] is shown in Figure 1a. It uses the features of the five stages of the ResNet convolution groups C2–C6, among which C6 is obtained by directly applying a 1 × 1 max-pooling with stride 2 on C5. The feature maps P2–P6 are obtained after the FPN fusion, as follows: P6 is equal to C6; P5 is obtained through a 1 × 1 convolution followed by a 3 × 3 convolution; and P2–P4 are each obtained through a 1 × 1 convolution, a fusion with the 2 × upsampled feature of the layer above, and a 3 × 3 convolution.

**Figure 1.** Different feature fusion structures.

The FPN in YOLOv3 is shown in Figure 1b. The features of C3, C4, and C5 are used. The features from C5 to P5 first pass through five layers of convolution, and then through one layer of 3 × 3 convolution. The features of P4 are obtained by connecting M5 (through 1 × 1 Conv + 2 × Upsample) and C4 through five layers of convolution, and one layer of 3 × 3 convolution. The features of P3 are obtained by connecting M4 (through 1 × 1 Conv + 2 × Upsample) and C3 through five layers of convolution, and one layer of 3 × 3 convolution.

The feature extraction network of YOLOv5 uses a 'FPN + Path Aggregation Network (PAN)' [19] structure, as shown in Figure 1c. PAN adds a bottom-up pyramid behind the FPN as a supplement. FPN conveys the strong semantic features from top to bottom, while PAN conveys strong positioning features from bottom to top. The specific operation of PAN includes first copying the last layer M2 of FPN as the lowest layer P2 of PAN, and then fusing M3 with the downsampled P2 to obtain P3. P4 is obtained through a feature fusion of M4 and downsampled P3. However, the feature extraction network does not work well for the detection of small objects. The feature fusion utilized by the YOLOv5-TDHSA model, proposed in this paper, is based on a small-object detection layer, making the detection of small objects more accurate. This is described in more detail in Section 4.3.
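The top-down (FPN) and bottom-up (PAN) paths described above can be sketched as a shape-flow toy model. This is an illustration only: fusion is shown as plain channel concatenation, upsampling as nearest-neighbour repetition, and downsampling as strided subsampling, whereas the real YOLOv5 neck uses learned convolutions after each fusion to restore channel counts (the channel numbers below are also assumptions).

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling on a (C, H, W) map
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # strided subsampling standing in for a stride-2 convolution
    return x[:, ::2, ::2]

# Backbone outputs at three scales (channel counts are illustrative)
c3 = np.zeros((64, 80, 80))
c4 = np.zeros((128, 40, 40))
c5 = np.zeros((256, 20, 20))

# FPN: top-down path conveys strong semantic features
m5 = c5
m4 = np.concatenate([c4, upsample2x(m5)], axis=0)
m3 = np.concatenate([c3, upsample2x(m4)], axis=0)

# PAN: bottom-up path added behind the FPN conveys positioning features
p3 = m3
p4 = np.concatenate([m4, downsample2x(p3)], axis=0)
p5 = np.concatenate([m5, downsample2x(p4)], axis=0)
```

The key point the sketch demonstrates is that each output resolution (80 × 80, 40 × 40, 20 × 20) ends up fusing information from both coarser and finer levels.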

### *2.3. Detector Head*

Since the head of YOLOv1 only generates two detection boxes for each grid, it is not suitable for dense or small-object detection tasks. Its generalization ability is weak when the size ratio of same-type objects is uncommon. The head of YOLOv2 improves the network structure and also adds anchor boxes. YOLOv2 removes the last fully connected layer of YOLOv1, and uses convolution and anchor boxes to predict the detection box. However, since using convolution to downsample the feature map results in a loss of fine-grained features, the model's detection of small objects is poor. Consequently, a passthrough layer structure has been introduced in the head of YOLOv2 to divide the feature map into four parts and preserve the fine-grained features. The head of YOLOv3 introduces a multi-scale detection logic and utilizes a multi-label classification idea on the basis of YOLOv2. The loss function has been optimized as well. YOLOv4 adopts a multi-anchor strategy, different from YOLOv3: any anchor box whose intersection over union (IoU) [20] exceeds the threshold is regarded as a positive sample, thus ensuring that the positive samples ignored by YOLOv3 are added in YOLOv4 to improve the detection accuracy of the model. The output of YOLOv5 has three prediction branches, and the grid of each branch has three corresponding anchors. Instead of the IoU maximum-matching method, YOLOv5 calculates the width–height ratio of the bounding box to the anchor of the current layer. If the ratio is greater than the set parameter value, the matching degree is considered poor and the box is treated as background. The coupled detection head of YOLOv5s performs both the recognition and positioning tasks on a feature map simultaneously. However, these tasks have different focuses, making the final recognition accuracy low. The 'decoupled head' idea allows one to separate these two tasks and achieve better performance. Therefore, the models proposed in this paper use a decoupled head instead of the original YOLOv5s coupled head, which is described in more detail in Section 4.2.

### **3. Related Work**

Over the past 20 years, object detection models can be divided into two categories: (1) traditional models (before 2012), such as V-J detection [21,22], HOG detection [23], DPM [24], etc., and (2) deep learning (DL) models, beginning with AlexNet [25]. The following subsections briefly present the DL object detection models, divided into two-stage and one-stage models, whose development route is illustrated in Figure 2.

**Figure 2.** The development route of the DL object detection models.

### *3.1. Two-Stage Object Detection Models*

Krizhevsky et al. proposed AlexNet as a CNN framework when participating in (and winning first place at) the ImageNet LSVRC 2012 competition. This model triggered the rapid rise of deep learning.

Later, R-CNN emerged for object detection. However, R-CNN unifies the size of all candidate boxes, which causes a loss of the image content and affects the detection accuracy. Based on R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, Mask R-CNN, and other models have been developed subsequently.

SPP-net was proposed in 2014. It inserts a spatial pyramid pooling layer between the CNN layers and the fully connected layer, which solves the loss of image content in R-CNN caused by resizing all candidate boxes to the same size. In order to find the location of each area in the feature map, the location information is added after the convolution layer. However, the time-consuming selective search (SS) [26] method is still used to generate the candidate areas.

On the basis of R-CNN, Fast R-CNN adds an RoI (region of interest) pooling layer and reduces the number of model parameters, thus greatly increasing the processing speed. Drawing on the method of SPP-net, a CNN is used to process the whole input image, and the serial structure of R-CNN is changed to a parallel structure, so that classification and regression can be carried out simultaneously and the detection is accelerated.

In order to solve the problem that Fast R-CNN uses the SS method to generate candidate areas, Faster R-CNN uses an RPN to directly generate candidate areas, which enables the neural network to complete the detection task in an end-to-end fashion [27].

Based on Faster R-CNN, Mask R-CNN uses a fully convolutional network (FCN). The model operates in two steps: (1) generating the candidate regions through an RPN, and (2) extracting the RoI features from candidate regions using RoIAlign (region of interest alignment) to obtain the probability of object categories and the location information of prediction boxes.

The two-stage object detection models are not suitable for real-time object detection because they require multiple detection and classification processes, which lowers the detection speed.

### *3.2. One-Stage Object Detection Models*

### 3.2.1. YOLO

YOLO's training and detection are carried out in a single network. Object detection is regarded as a process of solving a regression problem: once the input image passes through inference, the location information of the objects and the probabilities of their categories can be obtained [28]. Therefore, YOLO is particularly outstanding in terms of detection speed. Different versions of YOLO have been proposed to date. Based on its fifth version, YOLOv5, five models have been produced, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The YOLOv5-DH and YOLOv5-TDHSA models, described in this paper, propose improvements to the YOLOv5s model, whose network structure is shown in Figure 3. A Focus structure is used at the beginning of the backbone to take the value of every other pixel in an image, producing four independent feature slices, which are then stacked. At that point, the width and height information is concentrated into the channel dimension, and the number of input channels is expanded four times.

**Figure 3.** The structure of the YOLOv5s model.

YOLOv5s uses Mixup [29] and Mosaic for data enhancement, where Mosaic splices four images to enrich the background of the detected object. The data of the four images are processed at one time during a batch normalization computation.
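The Mosaic splicing described above can be sketched as a toy function. This is a deliberately simplified illustration: real Mosaic picks a random splice centre and randomly scales, crops, and shifts each of the four images, whereas here the centre is fixed at the middle and the crops are trivial.

```python
import numpy as np

def mosaic4(imgs, out=640):
    """Toy Mosaic sketch: splice four equally sized images into one canvas.
    The splice centre is fixed at the middle for simplicity; the real
    augmentation randomizes it along with per-image scaling and cropping."""
    h = w = out // 2
    canvas = np.zeros((out, out, 3), dtype=imgs[0].dtype)
    canvas[:h, :w] = imgs[0][:h, :w]      # top-left quadrant
    canvas[:h, w:] = imgs[1][:h, :w]      # top-right quadrant
    canvas[h:, :w] = imgs[2][:h, :w]      # bottom-left quadrant
    canvas[h:, w:] = imgs[3][:h, :w]      # bottom-right quadrant
    return canvas

# four flat-colour stand-in images, one per quadrant
imgs = [np.full((320, 320, 3), i, dtype=np.uint8) for i in range(4)]
m = mosaic4(imgs)
```

Splicing four images this way enriches the background of the detected objects and lets one batch-normalization step see the statistics of four images at once, as noted above.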

In the backbone part, the model extracts features from the input image. The extracted features, through three feature layers, are used for the next network construction.

The main task of the neck part is to strengthen feature extraction and feature fusion, so as to combine feature information of different scales. In the Path Aggregation Network (PANet) structure, upsampling and downsampling operations are used to achieve feature extraction. When the input size is 640 × 640 pixels, the maximum scale of output feature is 80 × 80 pixels, so the minimum size of the detection frame is 8 × 8 pixels. However, when there are many smaller objects in the dataset, this will affect the detection accuracy. The proposed improvements of YOLOv5s in this regard are described in Section 4.3.

In the head part, the three feature layers, which have been strengthened, are regarded as a collection of feature points. This part judges whether the feature points have objects corresponding to them. The YOLOv5s detection head is a coupled head, which performs both the identification and localization tasks on the same feature map. However, recognition and localization are two different tasks. Therefore, this paper proposes a branch structure to carry out the recognition and localization tasks separately. This improvement to the YOLOv5s structure is described in more detail in Section 4.2.

There have been some improvements of YOLOv5 recently proposed for traffic sign and traffic light recognition. For instance, Chen et al. [30] introduced a Global-CBAM attention mechanism for embedding into the YOLOv5s backbone in order to enhance its feature extraction ability, and achieved a good balance between the channel attention and spatial attention for improving the target recognition. Due to this, the overall accuracy of the model was improved, especially for small-sized target recognition, and the mean average precision (*mAP*) achieved was 6.68% higher than that before the improvement.

In order to solve the problem of using YOLOv5s for the recognition of small-sized traffic signs, Liu et al. [31] proposed to replace the original DarkNet-53 backbone of YOLOv5s with the MobileNetV2 network for feature extraction, selecting Adam as the optimizer. This reduced the number of parameters by 65.6% and the computation amount by 59.1%, while improving the *mAP* by 0.129.

Chen et al. [32] added additional multi-scale features to YOLOv5s to make it faster and more accurate in capturing traffic lights when these occupy a small area in images. In addition, a loop was established to update the parameters using a gradient of loss values. This led to a *mAP* improvement (from 0.965 to 0.988) and a detection time reduction (from 3.2 ms inference/2.5 ms NMS to 2.4 ms inference/1.0 ms NMS per image).

### 3.2.2. SSD

The Single Shot MultiBox Detector (SSD) [33] is a one-stage object detection model proposed after YOLOv1. In order to remedy YOLOv1's weakness in small-object detection, SSD uses feature maps of different sizes and prior boxes of different sizes to further improve the regression rate and accuracy of the predicted box. The proportion of the prior box size to the image is calculated as follows:

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1}(k - 1),\tag{1}$$

where *k* ∈ [1, *m*], *m* denotes the number of feature maps, and *Smax* and *Smin* denote the maximum and minimum values of the ratio, respectively.
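As a worked instance of Equation (1): with the commonly cited SSD defaults *Smin* = 0.2 and *Smax* = 0.9 over *m* = 6 feature maps (these default values are an assumption taken from the SSD paper, not from the text above), the scales grow linearly from 0.2 to 0.9.

```python
def prior_scales(m, s_min=0.2, s_max=0.9):
    """Prior-box scale for each of the m feature maps, following Equation (1).
    s_min and s_max default to the commonly used SSD values."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

scales = prior_scales(6)
# linearly spaced: 0.2, 0.34, 0.48, 0.62, 0.76, 0.9 (up to floating-point noise)
```

Each scale is then multiplied by the input image size to obtain the side length of the prior boxes on the corresponding feature map.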

### **4. Proposed Improvements to YOLOv5s**

This section describes the YOLOv5s improvements used by the models proposed in this paper. The decoupled head (DH) improvement is used by both proposed models, YOLOv5-DH and YOLOv5-TDHSA, whereas the other two improvements are used only by YOLOv5-TDHSA.

### *4.1. Transformer Self-Attention Mechanism*

The transformer model was proposed by the Google team in June 2017 [34]. It has not only become the preferred model in the NLP field, but also showed strong potential in the field of image processing. The transformer abandons the sequential structure of Recurrent Neural Networks (RNNs) and adopts a self-attention mechanism to enable the model to parallelize training and make full use of the global information of training data.

The core mechanism of the transformer model is the self-attention depicted in Figure 4. A regular attention mechanism first calculates an attention distribution over all input information and then obtains the weighted average of the input according to this distribution. Self-attention maps the input features to three new spaces for representation, namely Query (Q), Key (K), and Value (V). The correlation between Q and K is calculated, after which a *SoftMax* function normalizes the scores and sharpens the differences between them to enhance the attention. The resulting weight coefficients are then used to form a weighted sum of V, yielding the attention value. Mapping the features to three spatial representations avoids the problems encountered when features are mapped to only one space. For example, if Q1 and Q2 were directly used to calculate the correlation, there would be no difference between the correlation of Q1 with Q2 and that of Q2 with Q1, and the expression ability of the attention mechanism would become weak. If K is introduced to calculate the correlation with the original data, the difference between the Q1–K2 and Q2–K1 correlations can be reflected, which also enhances the expression ability of the attention mechanism. Since the input of the next step is the obtained attention weight, it is not appropriate to reuse Q or K; thus, the third space, V, is introduced. Finally, the attention value is obtained through the weighted summation.
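The Q/K/V computation above corresponds to scaled dot-product self-attention, which can be sketched as follows. The three projection matrices here are random stand-ins for learned weights, and the head dimension of 8 is an assumed value for illustration.

```python
import numpy as np

def self_attention(x, d_k=8, rng=np.random.default_rng(0)):
    """Scaled dot-product self-attention sketch for an (n, d) feature sequence.
    The three projection matrices are random stand-ins for learned weights."""
    d = x.shape[1]
    wq, wk, wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv                # map input to the Q, K, V spaces
    scores = q @ k.T / np.sqrt(d_k)                 # correlation between Q and K
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability for SoftMax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # SoftMax over each row
    return weights @ v                              # weighted sum of V = attention value

x = np.random.default_rng(1).standard_normal((5, 16))  # 5 tokens, 16 features each
out = self_attention(x)
```

Note that `scores` is not symmetric: the Q1–K2 entry generally differs from the Q2–K1 entry, which is exactly the asymmetry the paragraph above argues for.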

**Figure 4.** The module structure of the transformer self-attention mechanism.

However, the transformer model would significantly increase the amount of computation, resulting in higher training costs. The feature dimension is the smallest when the image features are transferred to the last layer of the network. At this moment, the influence on training the model would be the smallest if the transformer is added. Therefore, the proposed YOLOv5-TDHSA model uses the transformer only as a replacement of the CBS at the last layer of the backbone of the original YOLOv5s model, and also adds the transformer to the last layer of its neck.

### *4.2. Decoupled Head*

After performing analytical experiments indicating that the coupled detection head may harm YOLO's performance, the authors of [35] recommend replacing the original YOLO's head with a decoupled one. This idea is taken on board by the models proposed in this paper to reduce the number of parameters and network depth, thus improving the model training speed and reducing the feature losses.

During the object detection, it is necessary to output the category/class and position information of the object. The decoupled head uses two different branches to output the category and position information separately, as the recognition and positioning tasks have different focuses. The recognition focuses more on the existing class to which the extracted features are closer. The positioning focuses more on the location coordinates of the ground truth box so as to correct the parameters of the bounding box. YOLO's coupled head completes the two tasks of recognition and localization with a single convolution on one feature map. Therefore, it does not perform as well as the decoupled head D1 shown in Figure 5, which is used by the models proposed in this paper. However, the decoupling process increases the number of parameters, thus affecting the training speed of the model. Therefore, in order to reduce the number of parameters, the feature first goes through a 1 × 1 convolution layer to reduce the dimension and then through two parallel branches with two 3 × 3 convolution layers. The first branch is used to predict the category. Since there are 45 categories in the TT100K dataset used in this paper, the channel dimension becomes 45 after a convolution operation and the processing of the *Sigmoid* activation function [36]. The second branch is mainly used to determine whether the object box is a foreground or background. As a result, the channel dimension becomes 1 after the convolution operation and *Sigmoid* activation function. There is also a third branch used to predict the coordinate information (x, y, w, h) of the object box. Therefore, after the convolution operation, the channel dimension becomes 4. Finally, the three outputs are integrated into 20 × 20 × 50 feature information through *Concat* for the next operation.
The decoupled heads D2, D3, and D4, shown in Figure 6, follow the same steps to generate feature information of 40 × 40 × 50, 80 × 80 × 50, and 160 × 160 × 50, respectively. The proposed YOLOv5-DH model only uses D1, D2, and D3 to replace the 'Head' part of the original YOLOv5s model (cf. Figure 3).
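The branch structure of D1 described above (a 1 × 1 stem, then class, objectness, and box branches whose outputs are concatenated into 45 + 1 + 4 = 50 channels) can be sketched as follows. This is an illustration, not the trained model: the 1 × 1 convolutions are modelled as per-pixel channel projections with random weights, the two 3 × 3 layers of each branch are omitted for brevity, and the input channel count and hidden width are assumed values.

```python
import numpy as np

def decoupled_head(feat, n_classes=45, rng=np.random.default_rng(0)):
    """Sketch of the decoupled head D1 on a (C, 20, 20) feature map."""
    c = feat.shape[0]
    hidden = 64  # assumed width after the dimension-reducing 1x1 stem
    stem = rng.standard_normal((hidden, c)) * 0.1        # 1x1 conv reducing the dimension
    w_cls = rng.standard_normal((n_classes, hidden)) * 0.1  # class branch -> 45 channels
    w_obj = rng.standard_normal((1, hidden)) * 0.1          # foreground/background -> 1 channel
    w_reg = rng.standard_normal((4, hidden)) * 0.1          # box branch (x, y, w, h) -> 4 channels

    h = np.einsum('oc,chw->ohw', stem, feat)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    cls = sigmoid(np.einsum('oc,chw->ohw', w_cls, h))
    obj = sigmoid(np.einsum('oc,chw->ohw', w_obj, h))
    reg = np.einsum('oc,chw->ohw', w_reg, h)
    return np.concatenate([reg, obj, cls], axis=0)       # Concat -> 50 x 20 x 20

out = decoupled_head(np.random.default_rng(1).standard_normal((256, 20, 20)))
```

The same structure applied to the 40 × 40, 80 × 80, and 160 × 160 feature maps yields the D2–D4 outputs listed above.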

**Figure 5.** The structure of the decoupled head D1 used by the proposed models.

**Figure 6.** The structure of the proposed YOLOv5-TDHSA model.

### *4.3. Small-Object Detection Layer and Adaptive Anchor*

During the detection of traffic signs, the changing distance between the shooting equipment and the object makes the size of traffic signs in the collected images different, which has a certain impact on the detection accuracy [37]. YOLOv5s addresses this problem in the form of PANet. Taking an input image size of 640 × 640 pixels as an example, the feature information output through the original model is 80 × 80 × 255, 40 × 40 × 255, and 20 × 20 × 255, respectively. At this time, the grid sizes of the generated detection box are 8 × 8 pixels, 16 × 16 pixels, and 32 × 32 pixels, respectively. However, when there is a large number of objects smaller than 8 × 8 pixels in the dataset, the detection performance for these small objects is unacceptable. Furthermore, the feature pyramid pays more attention to the extraction and optimization of the underlying features. As the depth of the network increases, some features at the top level will be lost, reducing the accuracy of the object detection.
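The grid and cell sizes quoted above follow directly from the detection strides, as this small arithmetic sketch shows (the stride values 8/16/32 are the standard YOLOv5 ones named in the text):

```python
def grid_and_cell_sizes(input_size=640, strides=(8, 16, 32)):
    """For each detection stride: the output grid resolution and the size of
    one grid cell, i.e., the smallest region a detection box snaps to."""
    return [(input_size // s, s) for s in strides]

for grid, cell in grid_and_cell_sizes():
    print(f"grid {grid}x{grid} -> cell {cell}x{cell} px")
# grid 80x80 -> cell 8x8 px; grid 40x40 -> cell 16x16 px; grid 20x20 -> cell 32x32 px
```

Adding a stride-4 branch, as Section 4.3 does, would extend this list with a 160 × 160 grid whose cells are 4 × 4 pixels.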

To improve the detection of small objects, a branch structure is added to the PANet of YOLOv5s, while the input image size remains unchanged, and the neck part adds a 160 × 160 × 128 feature information output. In other words, the feature map continues to expand by performing convolution and upsampling on the feature map after layer 17. Meanwhile, the 160 × 160 pixels feature information obtained from layer 19 is fused at layer 20 with the layer 2 feature from the backbone to make up for the feature loss during feature transmission. The addition of a small-object detection layer in the network eases the difficulty of small-object detection. At the same time, it combines the features of the top level with those of the bottom level to supplement the features lost during transmission, thus improving the detection accuracy.

The network structure after the addition of the small-object detection layer is shown in Figure 6. A branch is added to connect layer 2 and layer 19 (the red solid line part). In this case, the added fourth output size is 160 × 160 × 128. After the head decoupling, the feature information size is 160 × 160 × 50. The minimum size of the generated detection box is 4 × 4 pixels, which improves the detection of small objects.

The original YOLOv5s network model has only three detection layers and, hence, three groups of anchor boxes corresponding to the feature maps at three different resolutions. Each group contains three different anchors, so a total of nine anchors is used to detect large, medium, and small objects. However, the YOLOv5-TDHSA model, proposed in this paper, deepens the network and adds an output layer of feature information. It uses a group of 12 anchor boxes, added to the original YOLOv5s model, to calculate the feature map at the new resolution. The ratio between each anchor and the width and height of each ground-truth box is calculated, and the K-Means and genetic learning algorithms are used to obtain the best possible recall (BPR). A BPR greater than 0.98 indicates that the four generated groups of anchor boxes are suitable for the custom dataset.
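The anchor-generation step described above can be illustrated with the following sketch. It clusters ground-truth box sizes with a plain Lloyd's K-Means and evaluates the best possible recall (BPR) with a width/height-ratio matching criterion; the ratio threshold of 4 follows YOLOv5's convention, and the genetic refinement stage mentioned in the text is omitted here, so this is an assumption-laden simplification rather than the exact YOLOv5 routine.

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=50, seed=0):
    """Simplified Lloyd's K-Means on (width, height) pairs to propose
    k anchors; YOLOv5 additionally refines the result with a genetic
    algorithm, which is omitted in this sketch."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to its nearest anchor (Euclidean in w-h space)
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = wh[labels == j].mean(0)
    return centers

def best_possible_recall(wh, anchors, thr=4.0):
    """Fraction of ground-truth boxes matched by at least one anchor,
    using a width/height ratio criterion (worst-side ratio < thr)."""
    r = wh[:, None, :] / anchors[None, :, :]        # (n, k, 2)
    worst = np.maximum(r, 1.0 / r).max(-1)          # worst-side ratio
    return (worst.min(1) < thr).mean()
```

A BPR close to 1.0 on a custom dataset indicates the generated anchors cover the ground-truth box shapes well.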

The addition of the small-object detection layer and the group of adaptive anchor boxes allows us to significantly improve the detection accuracy of the proposed YOLOv5-TDHSA model, as demonstrated in the next section.

### **5. Experiments**

### *5.1. Datasets*

Two public datasets were used in the experiments conducted for the performance comparison of models. The first one was the Tsinghua-Tencent 100K Chinese traffic sign detection benchmark [38], denoted as TT100K in [39]. It includes 100,000 high-definition images with large variations in illuminance and weather conditions, of which 10,000 are annotated and contain 30,000 traffic sign instances in total, each theoretically belonging to one of 221 Chinese traffic sign categories. The images are taken from the Tencent Street View Map. Sample images are shown in Figure 7. However, the distribution of categories in this dataset is seriously imbalanced, and some categories have no corresponding instances at all. Therefore, in the conducted experiments, similarly to [39], only categories with more than 100 traffic sign instances were used, resulting in 45 categories spread over 9170 images.

**Figure 7.** Sample images of the TT100K dataset.

The other dataset used in the experiments was the CCTSDB2021 Chinese traffic sign detection benchmark [40], which was built on the CCTSDB2017 dataset [41,42] by adding 5268 annotated images of real traffic scenes and replacing images containing easily detected traffic signs with more difficult samples from complex and changing detection environments. Three traffic sign classes are distinguished in CCTSDB2021, namely warning, mandatory, and prohibitory signs, as shown in Figure 8. There are 17,856 images in total, including 16,356 in the training set and 1500 in the test set. However, the weather environment attribute, which poses a great challenge to object detection models, is present only in the images of the test set and not in those of the training set. Therefore, only these 1500 images, which make the detection of the traffic signs contained in them more difficult, were used in the experiments.

**Figure 8.** Sample images of the CCTSDB2021 dataset, containing (**A**) warning traffic signs; (**B**) mandatory traffic signs; (**C**) prohibitory traffic signs.

In the experiments, as shown in Table 1, the 9170 TT100K images and 1500 CCTSDB2021 images were separately divided (using the same ratio) into a training set (60% of the total number of images), a validation set (20%), and a test set (20%). The corresponding number of labels in each of these three sets is shown in Table 1.

**Table 1.** Splitting the datasets into training, validation, and test sets.


### *5.2. Experimental Environment*

In the training process, the initial learning rate was set to 0.01, and a cosine annealing strategy was used to reduce it. Training ran for 300 epochs with the batch size set to 32. The experiments were conducted on a PC with a Windows 10 operating system, an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, and an NVIDIA GeForce RTX3090 GPU with 24 GB of video memory, using CUDA 11.1 for training acceleration, the PyTorch 1.8.1 deep learning framework, and an input image size of 640 × 640 pixels, as shown in Table 2.

**Table 2.** Experimental environment's parameters.


### *5.3. Evaluation Metrics*

Evaluation metrics commonly used for the performance evaluation of object detection models include *precision*, *average precision* (*AP*), *mean average precision* (*mAP*), *recall*, *F1 score*, and *processing speed* measured in frames per second (fps).

*Precision* refers to the proportion of the true positive (*TP*) samples in the prediction results, as follows:

$$precision = \frac{TP}{TP + FP} \tag{2}$$

where *TP* denotes the number of images containing detected objects with IoU > 0.5, that is, the number of images containing positive samples that are correctly detected by the model; *FP* (false positive) represents the number of images containing detected objects with IoU ≤ 0.5.

*Recall* refers to the proportion of correct predictions in all positive samples, as follows:

$$recall = \frac{TP}{TP + FN},\tag{3}$$

where *FN* (false negative) represents the number of images wrongly detected as not containing objects of interest.

The *average precision* (*AP*) is the area enclosed by the *precision*–*recall* curve and the X axis, calculated as follows:

$$AP = \int\_0^1 p(r) dr,\tag{4}$$

where *p*(*r*) denotes the precision function of recall *r*.

*F1 score* is the harmonic average of *precision* and *recall*, with a maximum value of 1 and a minimum value of 0, calculated as follows:

$$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}.\tag{5}$$

The mean average precision (*mAP*) is the mean *AP* value over all classes of objects, calculated as follows:

$$mAP = \frac{\sum AP}{N\_{classes}},\tag{6}$$

where *N<sub>classes</sub>* denotes the number of classes.
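The metrics above can be computed directly from detection counts. The sketch below (with hypothetical helper names, not the authors' evaluation code) implements Equations (2), (3), (5) and (6), and approximates the integral in Equation (4) by trapezoidal integration over sampled precision-recall points.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Equations (2), (3) and (5) from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Equation (4): area under the precision-recall curve,
    approximated here by trapezoidal integration."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    """Equation (6): mean AP over all classes."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, 8 true positives with 2 false positives and 2 false negatives give a precision, recall, and *F1 score* of 0.8 each.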

### *5.4. Results*

Based on the two datasets, experiments were conducted to compare the performance of the proposed YOLOv5-DH and YOLOv5-TDHSA models to four state-of-the-art models, namely Faster R-CNN, YOLOv4-Tiny, YOLOv5n, and YOLOv5s. The size and number of parameters of the models are shown in Table 3, and the duration of a single experiment conducted with each model is shown in Table 4. On the two datasets, TT100K and CCTSDB2021, five separate experiments were performed with each of the compared models. In each experiment, the same data were utilized for all models, generated by randomly splitting the used dataset into a training set, a validation set, and a test set, as per Table 1. The results obtained for each model were averaged over the five experiments to serve as the final evaluation of the model performance.

**Table 3.** The size and number of parameters of compared models.



**Table 4.** Single experiment duration of compared models.

Tables 5–10 show the *mAP* and *F1 score* results obtained in each experiment, conducted on the TT100K dataset, for each of the compared models. Table 11 shows the averaged *mAP* and *F1 score* results over the five experiments, along with the processing speed achieved, measured in frames per second (fps). The obtained results, shown in Table 11, demonstrate that on the TT100K dataset, both proposed models (YOLOv5-DH and YOLOv5-TDHSA) outperform all four state-of-the-art models in terms of *mAP* and *F1 score*, at the expense of having a bigger size, a greater number of parameters, and a slower processing speed (surpassing only Faster R-CNN). Of the two proposed models, YOLOv5-TDHSA is superior to YOLOv5-DH in terms of both evaluation metrics (*mAP* and *F1 score*).

**Table 5.** Results of Faster R-CNN on TT100K dataset.


**Table 6.** Results of YOLOv4-TINY on TT100K dataset.


**Table 7.** Results of YOLOv5n on TT100K dataset.


**Table 8.** Results of YOLOv5s on TT100K dataset.


**Table 9.** Results of YOLOv5-DH on TT100K dataset.



**Table 10.** Results of YOLOv5-TDHSA on TT100K dataset.

**Table 11.** Results of compared models on TT100K dataset.


Tables 12–17 show the *mAP* and *F1 score* results obtained in each experiment, conducted on the CCTSDB2021 dataset, for each of the compared models. Table 18 shows the averaged *mAP* and *F1 score* results over the five experiments, along with the processing speed achieved. The obtained results, shown in Table 18, demonstrate that both proposed models (YOLOv5-DH and YOLOv5-TDHSA) outperform all four state-of-the-art models in terms of *mAP* and *F1 score* on this dataset as well, at the expense of having a bigger size, a greater number of parameters, and a slower processing speed (surpassing only Faster R-CNN). Of the two proposed models, YOLOv5-TDHSA is again superior to YOLOv5-DH in terms of both evaluation metrics (*mAP* and *F1 score*).

**Table 12.** Results of Faster R-CNN on CCTSDB2021 dataset.



**Table 13.** Results of YOLOv4-TINY on CCTSDB2021 dataset.


**Table 14.** Results of YOLOv5n on CCTSDB2021 dataset.



**Table 15.** Results of YOLOv5s on CCTSDB2021 dataset.

**Table 16.** Results of YOLOv5-DH on CCTSDB2021 dataset.


**Table 17.** Results of YOLOv5-TDHSA on CCTSDB2021 dataset.


**Table 18.** Results of compared models on CCTSDB2021 dataset.


### **6. Discussion**

The incorporation of the proposed improvements into YOLOv5s resulted in overall better traffic sign detection. This was confirmed by a series of experiments conducted to evaluate and compare the performance of the proposed models (YOLOv5-DH and YOLOv5-TDHSA) to that of YOLOv5s and three other state-of-the-art models, namely Faster R-CNN, YOLOv4-Tiny, and YOLOv5n, based on two datasets (TT100K and CCTSDB2021). The obtained results clearly demonstrate that both proposed models outperform all four models in terms of the *mean average precision* (*mAP*) and *F1 score*.

Although both proposed models are better than the two-stage detection Faster R-CNN model, in terms of the model's size, number of parameters, and processing speed, they still have some shortcomings in this regard compared with the one-stage detection models (YOLOv4-Tiny, YOLOv5n, YOLOv5s). Therefore, in the future, some lightweight modules will be introduced into the proposed YOLOv5-TDHSA model (which is superior to the other proposed model YOLOv5-DH) in order to reduce its size and number of parameters, and increase its processing speed.

To check if the proposed models are significantly different statistically from the compared state-of-the-art models, we applied the (non-parametric) Friedman test [43,44] with the corresponding post-hoc Bonferroni–Dunn test [45,46], which are regularly used for the comparison of classifiers (more than two) over multiple datasets.

First, using the Friedman test, we measured the performance of the models used in the experiments described in the previous section across both datasets. Basically, the Friedman test shows whether the measured average ranks of the models are significantly different from the expected mean rank, by checking the null hypothesis (stating that all models perform the same and the observed differences are merely random), based on the following formula:

$$T\_{\chi^2} = \frac{12N}{k(k+1)} \left( \sum\_{i=1}^k r\_i^2 - \frac{k(k+1)^2}{4} \right),\tag{7}$$

where *k* denotes the number of models, *N* denotes the number of datasets, and *r<sub>i</sub>* represents the average rank of the *i*-th model. In our case, *k* = 6 and *N* = 2.

Instead of Friedman's *T<sub>χ²</sub>* statistic, we used the less conservative Iman–Davenport statistic [47], which is distributed according to the F-distribution with (*k* − 1) and (*k* − 1)(*N* − 1) degrees of freedom, as follows:

$$T\_F = \frac{(N-1)T\_{\chi^2}}{N(k-1) - T\_{\chi^2}} \,. \tag{8}$$

Using (8), we calculated the following values: *TF* = 34 for *F1 score* and *TF* = ∞ for *mAP*. As both these values are greater than the critical values of 3.45 and 5.05 for six models and two datasets, with confidence levels of α = 0.10 and α = 0.05, respectively, we rejected the null hypothesis and concluded that there are significant differences between the compared models.

Next, we proceeded with a post-hoc Bonferroni–Dunn test, in which the models were compared only to a control model and not between themselves [44,48]. In our case, we used the proposed YOLOv5-TDHSA model as the control model. The advantage of the Bonferroni–Dunn test is that it is easier to visualize because it uses the same critical difference (CD) for all comparisons, calculated as follows [48]:

$$CD = q\_{\alpha} \sqrt{\frac{k(k+1)}{6N}},\tag{9}$$

where *q<sub>α</sub>* denotes the critical value for α/(*k* − 1). When *k* = 6, *q<sub>α</sub>* = 2.326 for α = 0.10 and *q<sub>α</sub>* = 2.576 for α = 0.05 [48]. The corresponding *CD* values, calculated according to (9), are equal to 4.352 and 4.819, respectively. Figure 9 shows the CD diagrams based on *F1 score* and *mAP*. As can be seen from Figure 9, the proposed YOLOv5-TDHSA model is significantly superior to Faster R-CNN on both evaluation metrics for both confidence levels, and achieves at least comparable performance to that of YOLOv4-Tiny on both evaluation metrics for both confidence levels, and to that of YOLOv5n on *F1 score* for both confidence levels. It is not surprising that the Bonferroni–Dunn test found YOLOv5-DH and YOLOv5-TDHSA similar to YOLOv5s, as both proposed models are based on it. Having incorporated only one YOLOv5s improvement, YOLOv5-DH is naturally reported by the Bonferroni–Dunn test as more similar to YOLOv5s than YOLOv5-TDHSA is.
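The statistics used above can be reproduced in a few lines. This is an illustrative sketch of Equations (7)–(9), not the authors' evaluation script; with *k* = 6 and *N* = 2 it recovers the *CD* values of 4.352 and 4.819 quoted in the text.

```python
import math

def friedman_stat(avg_ranks, n_datasets):
    """Equation (7): Friedman's chi-square statistic from average ranks."""
    k = len(avg_ranks)
    return (12 * n_datasets / (k * (k + 1))
            * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4))

def iman_davenport(t_chi2, n_datasets, k):
    """Equation (8): the F-distributed Iman-Davenport correction."""
    return (n_datasets - 1) * t_chi2 / (n_datasets * (k - 1) - t_chi2)

def critical_difference(q_alpha, k, n_datasets):
    """Equation (9): CD for the Bonferroni-Dunn post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# k = 6 models, N = 2 datasets:
# critical_difference(2.326, 6, 2) -> approx. 4.352 (alpha = 0.10)
# critical_difference(2.576, 6, 2) -> approx. 4.819 (alpha = 0.05)
```

Note that if both datasets rank the six models identically (average ranks 1 through 6), Equation (7) reaches its maximum of *N*(*k* − 1) = 10, making the denominator of Equation (8) zero; this is consistent with the *T<sub>F</sub>* = ∞ reported for *mAP*.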

**Figure 9.** Critical difference (CD) comparison of YOLOv5-TDHSA (the control model) against other compared models with the Bonferroni–Dunn test, based on (**a**) *F1 score* with confidence level α = 0.05, CD = 4.819; (**b**) *F1 score* with confidence level α = 0.10, CD = 4.352; (**c**) *mAP* with confidence level α = 0.05, CD = 4.819; (**d**) *mAP* with confidence level α = 0.10, CD = 4.352 (any two models not connected by a thick black horizontal line are considered to have significant performance differences between each other).

### **7. Conclusions**

We have proposed two novel models for accurate traffic sign detection, called YOLOv5-DH and YOLOv5-TDHSA, based on the YOLOv5s model with additional improvements. Firstly, a transformer self-attention module with stronger expression abilities was used in YOLOv5-TDHSA to replace the last layer of the 'Conv + Batch Normalization + SiLU' (CBS) structure in the YOLOv5s backbone. A similar module was added to the last layer of the YOLOv5-TDHSA's neck, so that the image information can be used more comprehensively. The features are mapped to three new spaces for representation, thus improving the representation ability of the feature extraction, and the multi-head mechanism realizes the effect of multi-channel feature extraction. Thus, the transformer increases the diversity of the similarity computation between inputs and improves the feature extraction ability. Secondly, a decoupled detection head was used in both proposed models to replace the YOLOv5s coupled head, which is responsible for recognition and positioning on a feature map. As these two tasks have different focuses, resulting in a misalignment problem, the decoupled head uses two parallel branches, one responsible for category recognition and the other for positioning, which improves the detection accuracy. However, as the decoupled head is not as fast as the coupled head, and to limit the increase in the number of model parameters, the dimension was reduced through a 1 × 1 convolution before the decoupling to achieve a balance between speed and accuracy. Thirdly, for YOLOv5-TDHSA, a small-object detection layer was added to the YOLOv5s backbone and connected to the neck. At the same time, upsampling was used on the feature map of the neck to further expand it. Supplemented by a group of adaptive anchor boxes, this new branch structure not only eases the small-object detection performed by YOLOv5-TDHSA, but also compensates for the feature losses caused by feature transmission with increasing network depth.

Experiments conducted on two public datasets demonstrated that both proposed models outperform the original YOLOv5s model and three other state-of-the-art models (Faster R-CNN, YOLOv4-Tiny, YOLOv5n) in terms of the mean average precision (*mAP*) and *F1 score*, achieving *mAP* values of 77.9% and 83.4% and *F1 score* values of 0.767 and 0.811 on the TT100K dataset, and *mAP* values of 68.1% and 69.8% and *F1 score* values of 0.71 and 0.72 on the CCTSDB2021 dataset, respectively, for YOLOv5-DH and YOLOv5-TDHSA. The results also confirm that the incorporation of the T and SA improvements into YOLOv5s leads to further enhancement and a better-performing model (YOLOv5-TDHSA), which is superior to the other proposed model (YOLOv5-DH) that avails of only one YOLOv5s improvement (i.e., DH).

**Author Contributions:** Conceptualization, Z.J., W.B.; methodology, W.B.; validation, J.Z., C.D.; formal analysis, H.Z., L.Z.; writing—original draft preparation, W.B.; writing—review and editing, I.G.; supervision, I.G.; project administration, J.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This publication has emanated from research conducted with the financial support of the National Key Research and Development Program of China under grant no. 2017YFE0135700, the Tsinghua Precision Medicine Foundation under grant no. 2022TS003, and the MES by grant no. D01-168/28.07.2022 for NCDSC, part of the Bulgarian National Roadmap on RIs.

**Data Availability Statement:** The data used in this study are openly available as per [38] and [40].

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Article* **Barrier Options and Greeks: Modeling with Neural Networks**

**Nneka Umeorah 1,\*, Phillip Mashele 2, Onyecherelam Agbaeze <sup>3</sup> and Jules Clement Mba <sup>4</sup>**


**Abstract:** This paper proposes a non-parametric technique for option valuation and hedging. Here, we replicate the extended Black–Scholes pricing model for exotic barrier options and their corresponding Greeks using a fully connected feed-forward neural network. Our methodology involves benchmarking experiments, which yield an optimal set of neural network hyperparameters that effectively prices the barrier options and facilitates the extraction of their option Greeks. We compare the results from the optimal NN model to those produced by other machine learning models, such as the random forest and the polynomial regression; the output highlights the accuracy and the efficiency of our proposed methodology for this option pricing problem. The results equally show that the artificial neural network can effectively and accurately learn the extended Black–Scholes model from a given simulated dataset, and this concept can similarly be applied in the valuation of complex financial derivatives without analytical solutions.

**Keywords:** barrier options; Black–Scholes model; polynomial regression; random forest regression; machine learning; artificial neural network; option Greeks; data analysis

**MSC:** 91G20; 91G30; 62J05; 68T07

### **1. Introduction**

The concepts, techniques and applications of artificial intelligence (AI) and machine learning (ML) in solving real-life problems have become increasingly practical over the past years. The general aim of machine learning is to 'learn' from data and make predictions using a variety of techniques. In the financial industry, these techniques offer a more flexible and robust predictive capacity than the classical mathematical and econometric models, and they provide significant advantages to financial decision makers and market participants regarding recent trends in financial modeling and data forecasting. The core applications of AI in finance are risk management, algorithmic trading, and process automation [1]. Hedge funds and broker dealers utilize AI and ML to optimize their execution. Financial institutions use the technologies to estimate credit quality and evaluate their market insurance contracts. Both private and public sectors use these technologies to detect fraud, assess data quality, and perform surveillance. ML techniques are generally classified into supervised and unsupervised systems. A fully recognized branch of supervised ML techniques is deep learning, which equips machines with the practical algorithms needed to comprehend fundamental principles and detect patterns in significant portions of data. Neural networks, the cornerstone of these deep learning techniques, evolved and developed in the 1960s. In quantitative finance, neural networks are applied to the optimization of portfolios, financial model calibrations [2], high-dimensional futures [3], market prediction [4], and exotic options pricing with local stochastic volatility [5].

**Citation:** Umeorah, N.; Mashele, P.; Agbaeze, O.; Mba, J.C. Barrier Options and Greeks: Modeling with Neural Networks. *Axioms* **2023**, *12*, 384. https://doi.org/10.3390/ axioms12040384

Academic Editor: Oscar Humberto Montiel Ross

Received: 19 December 2022 Revised: 29 March 2023 Accepted: 7 April 2023 Published: 17 April 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

For the methodology employed in this paper, artificial neural networks (ANNs) are learning systems built from clusters of artificial neurons forming a fully connected network. One key aspect of an ANN is its ability to 'learn' to perform a specific task when fed a given dataset. ANNs attempt to replicate a mechanism observable in nature, drawing their inspiration from the structure and function of the brain: the brain resembles a huge network of fully interconnected nodes (neurons, i.e., cells) joined by links, biologically referred to as synapses. Non-linearity is introduced to the network within these neurons through the non-linear activation functions applied, which allows the network to approximate functions reasonably well. One significant benefit of ANNs is that they are 'universal approximators': they can fit any continuous function, including functions with non-linear features, without assuming any mathematical relationship connecting the input and output variables. Essentially, ANNs are also fully capable of approximating the solutions to partial differential equations (PDEs) [6], and they easily permit parallel processing, which facilitates evaluation on graphics processing units (GPUs) [7]. This universal approximation capability is a result of their typical architecture, training and prediction process.

Meanwhile, owing to the practical significance of financial derivatives, the use of these instruments has risen sharply in recent years. This has led to the development of sophisticated economic models that try to capture the dynamics of the markets, and there has also been a drive to propose faster, more accurate and more robust models for the valuation process. The pricing of these financial contracts has significantly helped to manage and hedge risk exposure in finance and business, improve market efficiency and provide arbitrage opportunities for sophisticated market participants. The conventional techniques of option valuation are theoretical, resulting in the formulation of analytical closed forms for some of these option types, while others rely heavily on numerical approximation techniques, such as Monte Carlo simulations, finite difference methods, finite volume methods, binomial tree methods, etc. These theoretical formulas are mainly based on assumptions about the behavior of the underlying security prices, constant risk-free interest rates, constant volatility, etc., and they have been duly criticized over the years. However, modifications have been made to the Black–Scholes model, giving rise to models such as the mixed diffusion/pure jump models, displaced diffusion models, stochastic volatility models, constant elasticity of variance diffusion models, etc. On the other hand, neural networks (NNs) have proved to be an emerging computing technique that offers a modern avenue to explore the dynamics of financial applications, such as derivative pricing [8].

Recent years have seen a huge application of AI and ML, as they have been utilized in diverse financial fields, contributing significantly to financial institutions, the financial market, and financial supervision. Li [9] summarized the development of AI and ML and analyzed their impact on financial stability and the micro-macro economy. In finance, AI has been utilized greatly in predicting future stock prices, the concept being to build AI models that utilize ML techniques, such as reinforcement learning or neural networks [10]. A similar stock price prediction was conducted by Yu and Yan [11], who used the phase-space reconstruction method for time series analysis in combination with a deep NN long- and short-term memory networks model. Regarding the application of neural networks to option pricing, some of the earliest research can be found in Malliaris and Salchenberger [12]. They compared the performance of the ANN in pricing the American-style OEX options (that is, options defined on Standard and Poor's (S&P) 100) and the results from the Black–Scholes model [13] with the actual option prices listed in the *Wall Street Journal*. Their results showed that in-the-money call options were valued significantly better when the Black–Scholes model was used, whereas the ANN techniques favored the out-of-the-money call option prices.

In pricing and hedging financial derivatives, researchers have incorporated the classical Black–Scholes model [13] into ML to ensure robust and more accurate pricing techniques. Klibanov et al. [14] used the method of quasi-reversibility and ML to predict option prices in conjunction with the Black–Scholes model. Fang and George [15] proposed valuation techniques for improving the accuracy rate of Asian options by using the NN in connection with the Levy approximation. Hutchinson et al. [16] further priced the American call options defined on S&P 500 futures by comparing three ANN techniques with the Black–Scholes pricing model. Their results showed the superiority of all three ANNs over the classical Black–Scholes model. Other comparative studies of the ANN versus the Black–Scholes model cover the pricing of the following: European-style call options (with dividends) on the Financial Times Stock Exchange (FTSE) 100 index [17], American-style call options on Nikkei 225 futures [8], Apple's European call options [18], S&P 500 index call options with the addition of neuro-fuzzy networks [19], and call options written on the Deutscher Aktienindex (DAX) German stock index [20]. Similar works on pricing and hedging options using ML techniques can be found in [21–25].

Other numerical techniques, such as the PDE-based and the DeepBSDE-based (BSDE: backward stochastic differential equations) methods, have also been employed in valuing barrier options. For instance, Le et al. [26] solved the corresponding option pricing PDE using the continuous Fourier sine transform and extended the concept to pricing rebate barrier options. Umeorah and Mashele [27] employed the Crank–Nicolson finite difference method to solve the extended Black–Scholes PDE describing rebate barrier options and to price the contracts. The DeepBSDE concept, initially proposed by Han et al. [28], converts a high-dimensional PDE into a BSDE, with the intention of reducing the dimensionality constraint, and recasts the solution of the PDE problem as a deep-learning problem. Further implementations of BSDE-based numerical methods with deep-learning techniques for the valuation of barrier options can be found in [29,30].

Generally, the concept of an ANN can be described at three levels: the neurons, the layers and the whole architecture. The neuron, the fundamental processing unit, performs three basic operations: summation of the weighted inputs, addition of a bias to the input sum, and computation of the output value via an activation function. This activation function is applied after the weighted linear combination, at the end of each neuron, to ensure the non-linearity effect. The layers consist of an input layer, one or more hidden layers and an output layer. Each layer comprises several neurons, and stacking up various layers constitutes the entire ANN architecture. As the data transmission signals pass from the input layer through the middle layers to the output layer, the ANN serves as a mapping function between the input-output pairs [2]. After training the ANN in option pricing, computing the in-sample and out-of-sample options based on the ANN becomes straightforward and fast [31]. Itkin [31] highlighted this by pricing and calibrating European call options using the Black–Scholes model.
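The three basic neuron operations described above (weighted input sum, bias addition, activation) can be sketched in a few lines. Layer sizes and parameter names here are illustrative, not from the paper, and the activation choice (tanh) is only one common option.

```python
import numpy as np

def dense(x, W, b, activation=np.tanh):
    """One fully connected layer: weighted sum of the inputs, plus a
    bias, passed through a non-linear activation function."""
    return activation(x @ W + b)

def forward(x, params):
    """Minimal feed-forward pass (input -> hidden -> linear output),
    serving as a mapping between input-output pairs."""
    h = dense(x, params["W1"], params["b1"])          # hidden layer
    return dense(h, params["W2"], params["b2"],
                 activation=lambda z: z)              # linear output
```

Stacking further `dense` calls between the input and output layers yields the deeper architectures used later in the paper.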

This research is an intersection of machine learning, statistics and mathematical finance, as it employs recent financial technology to predict option prices. To the best of our knowledge, this ML approach to pricing rebate and zero-rebate barrier options has received little attention. Therefore, we aim to fill this niche by introducing the option pricing concept to exotic options. In the experimental section of this work, we simulate the barrier options dataset using the analytical form of the extended Black–Scholes pricing model. This is a major limitation of this research, and the choice was due to the non-availability of real data. (A similar synthetic dataset was used by [32], who constructed barrier option data based on the LIFFE standard European option price data by implementing the Rubinstein and Reiner analytic model; these datasets were used to price up-and-out barrier call options with a neural net model.) We further show and explain how fully connected feed-forward neural networks can be applied to the fast and robust pricing of derivatives. We tuned different hyperparameters and used the optimal ones in the modeling and training of the NN. The performance of the optimal NN is benchmarked against other ML models, such as the random forest regression model and the polynomial regression model. Finally, we show how the barrier options and their Greeks can be trained and valued accurately under the extended Black–Scholes model. The major contributions of this research are classified as follows:


The format of this paper is presented as follows: In Section 1, we provide a brief introduction to the topic and outline some of the related studies on the applications of ANN in finance. Section 2 introduces the concept of the Black–Scholes pricing model, together with the extended Black–Scholes pricing models for barrier options and their closed-form Greeks. Section 3 focuses on the machine learning models, such as the ANN, as well as its applications in finance, random forest regression and the polynomial regression models. In Section 4, we discuss the relevant results obtained in the course of the numerical experiments, and Section 5 concludes our research study with some recommendations.

### **2. Extended Black–Scholes Model for Barrier Options**

The classical Black–Scholes model developed by Fischer Black and Myron Scholes is an arbitrage-free mathematical pricing model used to estimate the dynamics of financial derivative instruments. The model was initially designed to capture the price of European-style options defined under the risk-neutral measure. As a mathematical model, certain assumptions, such as the log-normality of underlying prices, constant volatility, a frictionless market, continuous trading without dividends applied to stocks, etc., are made for the Black–Scholes model to hold [13]. Though the Black–Scholes model has been criticized over the years because some of its underlying assumptions do not hold in real-world scenarios, it continues to underpin recent works [33–36]. Additionally, Eskiizmirliler et al. [37] numerically solved the Black–Scholes equation for European call options using feed-forward neural networks. In their approach, they constructed a function dependent on a neural network solution which satisfied the given boundary conditions of the Black–Scholes equation. Chen et al. [38] proposed a Laguerre neural network to solve the generalized Black–Scholes PDE numerically. They experimented with this technique on European options and generalized option pricing models.

On the other hand, the valuation of exotic derivatives, such as barrier options, has been extensively studied by many authors, mainly by employing a series of numerical approximation techniques. Barrier options are typically priced using Monte-Carlo simulations, since their payoffs depend on whether the underlying price has (or has not) crossed the specified barrier level. Closed-form solutions can equally be obtained analytically using the extended Black–Scholes models [39], which we implement as the exact-price benchmark in this work. The structure of the model is described below.

### *2.1. Model Structure*

Generally, the Black–Scholes option pricing formula models the dynamics of an underlying asset price *S* as a continuous time diffusion process given below:

$$dS(t) = S(t)\left(r\,\mathrm{d}t + \sigma\,\mathrm{d}B(t)\right),\tag{1}$$

where *r* is the risk-free interest rate, *σ*, the volatility and *B*(*t*) is the standard Brownian motion at the current time *t*. Suppose *V*(*S*, *t*) is the value of a given non-dividend paying European call option. Then, under the pricing framework of Black and Scholes, *V*(*S*, *t*) satisfies the following PDE:

$$\frac{\partial V(S,t)}{\partial t} + rS\frac{\partial V(S,t)}{\partial S} + \frac{\sigma^2 S^2}{2} \frac{\partial^2 V(S,t)}{\partial S^2} - rV(S,t) = 0,\tag{2}$$

subject to the following boundary and terminal conditions:

$$V(0, t) = 0, \forall \, t \in [0, T] \tag{3}$$

$$V(S, t) = S - Ke^{-r(T-t)} \text{ for } S \to \infty, \tag{4}$$

$$V(S,T) = \max\{S(T) - K, 0\},\tag{5}$$

where *K* is the strike price and *T* is the time to expiration.

Since the barrier options are the focus of this study, the domain of the PDE in Equation (2) reduces to D = {(*S*, *t*) : *B* ≤ *S* < ∞; *t* ∈ [0, *T*]} with the introduction of a predetermined level known as the barrier *B*, and that feature distinguishes them from the vanilla European options. The boundary and terminal conditions above remain the same, with the exception of Equation (3), which reduces to *V*(*B*, *t*) = 0 for the zero-rebate and *V*(*B*, *t*) = *R* for the rebate barrier option. (In this paper, we shall consider the rebate paid at knock-out. The other type is the rebate paid at expiry, in which case Equation (3) becomes $V(B, t) = Re^{-r(T-t)}, \forall\, t \in [0, T]$.) The barrier options are either activated (knock-in options) or extinguished (knock-out options) once the underlying price attains the barrier level. The direction of the knock-in or the knock-out also determines the type of barrier option being considered, as these options are generally classified into up-and-in, up-and-out, down-and-in, and down-and-out barrier options. This paper will consider the down-and-out (DO) barrier options, both with and without rebates. For this option style, the barrier level is normally positioned below the underlying, and when the underlying moves in such a way that the barrier is triggered, the option becomes void and nullified (zero rebate). However, when the barrier is triggered and the option knocks out with a specified payment compensation made to the option buyer by the seller, then we have the rebate barrier option. Under the risk-neutral pricing measure *Q*, the price of the down-and-out (DO) barrier option is given as

$$V(S,t) = \mathbb{E}^Q\left[ e^{-r(T-t)} (S_T - K)^+ \, \mathbb{I}\Big\{ \min_{0 \le t \le T} S_t > B \Big\} \right],\tag{6}$$

and the solution to the above is given in the following theorem.

**Theorem 1.** *Extended Black–Scholes for a DO call option (note that Equations (7) and (8) occur when the strike price $K \ge B$; for $K < B$, we substitute $K = B$ into $d_1$ and $d_3$) is given by [39]*

$$V(S,t) = S\, N(d_1) - Ke^{-r\tau} N(d_2) - \left[ S \left( \frac{B}{S} \right)^{2\eta} N(d_3) - Ke^{-r\tau} \left( \frac{B}{S} \right)^{2\eta - 2} N(d_4) \right], \tag{7}$$

$$\text{for } d_1 = \frac{\log\left(\frac{S}{K}\right) + \left(r + \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}}, \quad d_3 = \frac{\log\left(\frac{B^2}{SK}\right) + \left(r + \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}}, \quad d_5 = \frac{\log\left(\frac{B}{S}\right) + \left(r + \frac{\sigma^2}{2}\right)\tau}{\sigma\sqrt{\tau}},$$

*where $\tau = T - t$, $d_{2,4} = d_{1,3} - \sigma\sqrt{\tau}$, $\eta = (2r + \sigma^2)(2\sigma^2)^{-1}$ and $N(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}\, \mathrm{d}y$ is the cumulative standard normal distribution function. In the presence of a rebate R, the option value becomes*

$$V_R(S,t) = V(S,t) + R\left[\left(\frac{B}{S}\right)^{2\eta - 1}N(d_5) + \left(\frac{S}{B}\right)N(d_5 - 2\eta\sigma\sqrt{\tau})\right]. \tag{8}$$
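As a cross-check of Theorem 1, Equations (7) and (8) can be implemented directly with the standard library alone. The sketch below (function and variable names are our own, not from the paper) prices a down-and-out call with an optional rebate paid at knock-out, assuming $K \ge B$ and $S > B$ (the option has not yet knocked out):

```python
from math import exp, log, sqrt, erf

def N(x):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def do_barrier_call(S, K, B, r, sigma, tau, R=0.0):
    """Down-and-out call, Equations (7)-(8), with rebate R paid at
    knock-out. Assumes K >= B and S > B (not yet knocked out)."""
    v = sigma * sqrt(tau)
    eta = (2.0 * r + sigma**2) / (2.0 * sigma**2)
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * tau) / v
    d3 = (log(B * B / (S * K)) + (r + 0.5 * sigma**2) * tau) / v
    d5 = (log(B / S) + (r + 0.5 * sigma**2) * tau) / v
    d2, d4 = d1 - v, d3 - v
    price = (S * N(d1) - K * exp(-r * tau) * N(d2)
             - S * (B / S)**(2 * eta) * N(d3)
             + K * exp(-r * tau) * (B / S)**(2 * eta - 2) * N(d4))
    if R > 0.0:  # rebate term of Equation (8)
        price += R * ((B / S)**(2 * eta - 1) * N(d5)
                      + (S / B) * N(d5 - 2 * eta * v))
    return price
```

As the barrier tends to zero, the knock-out becomes irrelevant and the value approaches the vanilla Black–Scholes call price, which gives a convenient sanity check of the implementation.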

### *2.2. Option Greeks*

These refer to the sensitivities of option prices with respect to different pricing parameters. The knowledge and application of option Greeks can equip investors with risk-minimization strategies applicable to their portfolios. Such knowledge is as vital as hedging portfolio risk using any other risk management tool. For options that have an analytical form based on the Black–Scholes model or other closed-form models, the Greeks or sensitivities are normally derived from these formulas. In the absence of analytical option values, numerical techniques are employed to extract the Greeks. The Greeks below are adopted from [40], and we only consider the delta ($\Delta_{DO}$), gamma ($\Gamma_{DO}$) and vega ($\nu_{DO}$).

### 2.2.1. Delta

This measures the sensitivity of option values to changes in the underlying price. The delta of the DO call option behaves like the delta of the European call option when the option is deep in-the-money, and it becomes very complicated as the underlying price approaches the barrier level:

$$\frac{\partial V(S,t)}{\partial S} = N(d_1) - \left(\frac{B}{S}\right)^{2\eta - 2} \left\{ -\frac{B^2}{S^2} N(d_3) + \frac{2\eta - 2}{S} \left( \frac{B^2}{S} N(d_4) - Ke^{-r\tau} N(d_3) \right) \right\},$$

where *d*1, *d*<sup>3</sup> and *d*<sup>4</sup> are given in Theorem 1.

### 2.2.2. Gamma

This measures the sensitivity of delta to a change in the underlying price, or the second partial derivative of the option value with respect to the underlying price:

$$\begin{split} \frac{\partial^2 V(S,t)}{\partial S^2} &= -\frac{\phi(d_1)}{\sigma S \sqrt{\tau}} - \left(\frac{B}{S}\right)^{2\eta - 2} \left\{ \frac{(2\eta - 2)(8\eta - 7)}{S} \left(\frac{B^2}{S} N(d_4) - Ke^{-r\tau} N(d_3)\right) \right. \\ &\quad \left. + \frac{B^2}{S^2} \left(2N(d_3) + \frac{\phi(d_3)}{\sigma \sqrt{\tau}}\right) + 2(2\eta - 2)\, \frac{B^2}{S^2} N(d_3) \right\}, \end{split}$$

where $d_1$, $d_3$ and $d_4$ are given in Theorem 1; also, $\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$ is the probability density function of the standard normal distribution.

### 2.2.3. Vega

This measures the sensitivity of option values to changes in volatility. It is calculated as

$$\frac{\partial V(S,t)}{\partial \sigma} = S\sqrt{\tau}\,\phi(d_1) - \left(\frac{B}{S}\right)^{2\eta-2} \left\{ \sqrt{\tau}\, Ke^{-r\tau}\phi(d_4) - \frac{4r}{\sigma^3} \left(\frac{B^2}{S} N(d_4) - Ke^{-r\tau}N(d_3)\right) \ln \frac{B}{S} \right\},$$

where $d_1$, $d_3$ and $d_4$ are given in Theorem 1; also, $\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$ is the probability density function of the standard normal distribution.
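Since the closed-form Greeks above are adopted from [40] and are easy to mistype, a central finite-difference approximation of the zero-rebate price in Equation (7) offers an independent numerical check. The sketch below (names are ours, not the paper's) bumps the inputs of a self-contained pricer; it is an illustrative cross-check, not the paper's method:

```python
from math import exp, log, sqrt, erf

def N(x):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def do_call(S, K, B, r, sigma, tau):
    """Zero-rebate down-and-out call of Equation (7), K >= B, S > B."""
    v = sigma * sqrt(tau)
    eta = (2.0 * r + sigma**2) / (2.0 * sigma**2)
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * tau) / v
    d3 = (log(B * B / (S * K)) + (r + 0.5 * sigma**2) * tau) / v
    return (S * N(d1) - K * exp(-r * tau) * N(d1 - v)
            - S * (B / S)**(2 * eta) * N(d3)
            + K * exp(-r * tau) * (B / S)**(2 * eta - 2) * N(d3 - v))

def fd_greeks(S, K, B, r, sigma, tau, h=1e-3):
    """Central finite differences for delta, gamma and vega."""
    up = do_call(S + h, K, B, r, sigma, tau)
    mid = do_call(S, K, B, r, sigma, tau)
    dn = do_call(S - h, K, B, r, sigma, tau)
    delta = (up - dn) / (2.0 * h)
    gamma = (up - 2.0 * mid + dn) / (h * h)
    vega = (do_call(S, K, B, r, sigma + h, tau)
            - do_call(S, K, B, r, sigma - h, tau)) / (2.0 * h)
    return delta, gamma, vega
```

In the small-barrier limit the finite-difference values should recover the familiar vanilla Black–Scholes Greeks, which makes the sanity check easy to automate.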

### **3. Machine Learning Models**

Machine learning models, such as the ANN, polynomial regression model and random forest regression models, form the methodology in this research. Here, we briefly describe each of them and their financial application as they relate to the rebate barrier options problem. The numerical experiments are performed on an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz processor, 16 GB RAM, 64-bit Windows 10 operating system and x64-based processor.

### *3.1. Artificial Neural Networks*

This subsection utilizes the concept of ANN in the approximation of functions which describe the financial model in perspective. It will highlight the whole network environment and the multi-layer perceptron (MLP) idea. In connection with the application of ANN to option pricing, the concept lies first in the generation of the financial data (barrier option pricing data) and then employing the ANN to predict the option prices according to the trained model.

### 3.1.1. Network Environment

The research computations and the data processing are implemented using Python (version 3.9.7), which is an open-source programming tool. The ANN employed in the data analysis and construction of the model, as well as in the training and validation, is implemented with Keras (https://keras.io/about/, accessed on 6 April 2023), a deep learning application programming interface running on top of the machine learning platform TensorFlow (version 2.2.0).

### 3.1.2. Multi-Layer Perceptron

An MLP is a feed-forward ANN category comprising a minimum of three layers: the input layer, the hidden layer and the output layer. An MLP with as few as one hidden layer can approximate a large class of non-linear and linear functions with arbitrary accuracy and precision. Except for the input nodes, every other node consists of neurons triggered by non-linear activation functions. During the training phase, an MLP employs a supervised learning technique known as backpropagation, and in this section, we use the backpropagation network method, which is by far the most widespread neural network type.

Mathematically, consider an MLP network configuration with first and second hidden layers $h_k^{(1)}$ and $h_k^{(2)}$, respectively, and input units $x_k$, where $k$ indexes the units. The non-linear activation function is written as $f(\cdot)$, and we denote $f^{(1)}(\cdot)$, $f^{(2)}(\cdot)$ and $f^{(3)}(\cdot)$ separately, since the network layers can have different activation functions, such as the sigmoid (defined by $f(z) = 1/(1 + \exp(-z))$, where $z$ is the input to the neuron), the hyperbolic tangent (defined by $f(z) = 2\,\text{sigmoid}(2z) - 1$), the rectified linear unit (ReLU, defined by $f(z) = \max[0, z]$), etc. The weights of the network are denoted by $w_{jk}$, the activation output value by $y_j$, and the bias by $b_j$, where $j$ indexes the units in each layer. Thus, we have the following representation:

$$\begin{aligned} h_j^{(1)} &= f^{(1)}\left(\sum_k w_{jk}^{(1)} x_k + b_j^{(1)}\right) \\ h_j^{(2)} &= f^{(2)}\left(\sum_k w_{jk}^{(2)} h_k^{(1)} + b_j^{(2)}\right) \\ y_j &= f^{(3)}\left(\sum_k w_{jk}^{(3)} h_k^{(2)} + b_j^{(3)}\right). \end{aligned}$$
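The three-layer representation above can be written as a minimal numpy sketch. The layer sizes and random weights here are arbitrary illustrations, not the tuned architecture reported later in Section 3.1.5:

```python
import numpy as np

def relu(z):
    """Rectified linear unit, f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic sigmoid, f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers):
    """Apply h = f(W h + b) layer by layer, as in the equations above."""
    h = x
    for W, b, f in layers:
        h = f(W @ h + b)
    return h

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(16, 7)), np.zeros(16), relu),       # 7 pricing inputs
    (rng.normal(size=(16, 16)), np.zeros(16), relu),      # second hidden layer
    (rng.normal(size=(1, 16)), np.zeros(1), lambda z: z), # linear output
]
y = mlp_forward(rng.normal(size=7), layers)
```

The final lambda is an identity activation, which is the usual choice for a regression output node such as an option price.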

### 3.1.3. The Hyperparameter Search Space and Algorithm

This section further explains the hyperparameter optimization techniques, which aim to search for the optimal algorithm needed for our optimization problem. It is essential to note that the parameters of the NN are internal configuration variables that the models can learn; examples are the weights and the biases. In contrast, the hyperparameters are external and cannot be learned from the data but are used to control the learning process and the structure of the NN. These are set before the training process, and some examples include the activation function, batch size, number of epochs, learning rate, etc. The choice of hyperparameters hugely affects the accuracy of the network. As a result, different optimization methods, such as manual search, Bayesian optimization, random search and grid search, have been developed. We employ the Keras Tuner framework (https://keras.io/keras_tuner/, accessed on 6 April 2023), which encompasses several algorithms, such as random search, hyperband and Bayesian optimization. For these three search algorithms, we choose the validation loss as the objective, with a maximum search configuration of six trials. The following variables define the search space of our NN architecture:


The activation functions are used in each layer, except the output layer. The network is trained with 45 epochs, a batch size of 256 and an early stopping callback on the validation loss with patience = 3. Since the option pricing model is a regression problem, our primary objective is to keep the mean squared error (MSE) of the predicted prices to a minimum. The essence of training a neural network is to minimize the errors obtained during the regression analysis, and this is done by selecting a set of weights in both the hidden and the output nodes. Thus, to evaluate the performance of our ANN, we consider the MSE as the loss function used by the network and the mean absolute error (MAE) as the network metric, which are given, respectively, as follows:

$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(V_i(S,t) - \hat{V}_i(S,t)\right)^2$$

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|V_i(S,t) - \hat{V}_i(S,t)\right|,$$

where $N$ is the number of observations, $V_i(S,t)$ are the exact option values and $\hat{V}_i(S,t)$ are the predicted option values. Finally, we alternate the activation functions, optimizers, batch normalization and dropout rates to investigate the effect of the network training on the option valuation and to avoid overfitting the models.
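Both metrics are one-line numpy transcriptions of the formulas above (the function names are ours):

```python
import numpy as np

def mse(v, v_hat):
    """Mean squared error between exact and predicted option values."""
    return np.mean((np.asarray(v) - np.asarray(v_hat)) ** 2)

def mae(v, v_hat):
    """Mean absolute error between exact and predicted option values."""
    return np.mean(np.abs(np.asarray(v) - np.asarray(v_hat)))
```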

### 3.1.4. Data Splitting Techniques for the ANN

Data splitting is a fundamental aspect of data science, especially for developing data-based models. The dataset is divided into training and testing sets, and an additional set, known as the validation set, can also be created. The training set is used mainly for training, and the model is expected to learn from this dataset while optimizing its parameters. The testing set contains the data used to measure the fitted model's performance. The validation set is mainly used for model evaluation during training. If the difference between the training set error and the validation set error is large, there is a case of over-fitting, as the model has high variance. This paper considers supervised learning, in which the model is trained to predict the outputs of an unspecified target function. This function is represented by a finite training set $F$ consisting of inputs and corresponding desired outputs: $F = \{[\vec{a}_1, \vec{x}_1], [\vec{a}_2, \vec{x}_2], \cdots, [\vec{a}_n, \vec{x}_n]\}$, where $n$ is the number of 2-tuples of input/output samples.

### Train–Test Split

This paper considers a train–test split of 80:20 and a further 80:20 split of the new training data to account for a validation dataset. Thus, 80% of the whole dataset accounts for the training set and 20% for the test set. Additionally, 80% of the training set is used as the actual training dataset and the remaining 20% for validation. After training, the final model should correctly predict the outputs and generalize to unseen data. Failure to accomplish this leads to over-training, and these two crucial conflicting demands between accuracy and complexity are known as the bias–variance trade-off [41,42]. A common approach to balancing this trade-off is to use the cross-validation technique.
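The nested 80:20 split just described (64% train, 16% validation, 20% test overall) can be sketched as follows; the function name and seed are illustrative choices, not the paper's:

```python
import numpy as np

def train_val_test_split(X, y, seed=42):
    """80:20 train/test split, then 80:20 of the training part
    for validation: 64%/16%/20% of the data overall."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(0.2 * len(X))
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(0.2 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```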

### *K*-Fold Cross Validation

The *k*-fold cross-validation (CV) is a strategy for partitioning data with the intent of constructing a more generalized model and estimating the model performance on unseen data. Denote the validation (testing) set as $F_{te}$ and the training set as $F_{tr}$. The algorithm (Algorithm 1) is shown below.

### **Algorithm 1** Pseudocode for the *k*-fold cross validation

Input the dataset $F$, the number of folds $k$ and the error function (MSE)

- Randomly split $F$ into $k$ independent subsets $F_1, F_2, \cdots, F_k$ of the same size.
- For $i = 1, 2, \cdots, k$: $F_{te} \leftarrow F_i$ and $F_{tr} \leftarrow F \setminus \{F_i\}$.
- Fit and train the model on $F_{tr}$ and evaluate the model performance on $F_{te}$ periodically: $R_{te}(i) = \text{Error}(F_{te})$.
- Terminate model training when the $R_{te}(i)$ stop criterion is satisfied.
- Evaluate the overall model performance using $R_{te} = \frac{1}{k}\sum_{i=1}^{k} R_{te}(i)$.
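Algorithm 1 translates into a few lines of numpy; `fit` and `error` are caller-supplied stand-ins (our naming) for the training routine and the error function:

```python
import numpy as np

def k_fold_cv(X, y, k, fit, error, seed=0):
    """Randomly split the data into k folds, train on k-1 folds,
    evaluate on the held-out fold, and average the k errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        te = folds[i]                                            # F_te <- F_i
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # F_tr <- F \ F_i
        model = fit(X[tr], y[tr])
        scores.append(error(y[te], model(X[te])))                # R_te(i)
    return np.mean(scores)  # R_te = (1/k) * sum of R_te(i)
```

For instance, a mean-value predictor evaluated on a constant target yields zero cross-validated error.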

### 3.1.5. Architecture of ANN

This research considers a fully connected MLP NN for the option valuation, consisting of eight input nodes (corresponding to the extended Black–Scholes rebate option parameters) and one output node (the option value); the hidden layers and nodes are tuned. There are two main models, classified by the data-splitting technique: Model A (train–test split) and Model B (5-fold cross-validation split). Each model is further subdivided into three according to the hyperparameter search algorithm. Thus, Models A1, A2 and A3 represent the models from the train–test data split for the hyperband, random search and Bayesian optimization algorithms, respectively. Similarly, Models B1, B2 and B3 represent the models from the *k*-fold cross-validation data split for the hyperband, random search and Bayesian optimization algorithms, respectively. Finally, Tables 1 and 2 present the post-tuning search details and the optimal model hyperparameters for the NN architecture, respectively.



Table 1 compares the search time taken by each of the algorithms in tuning the hyperparameters. We observed that the hyperband algorithm is highly efficient regarding the search time for both the train–test split and the *k*-fold cross-validation models; its search time is generally lower than those of the random search and Bayesian optimization algorithms. Furthermore, the Bayesian optimization provided the lowest MAE score and required fewer trainable parameters than the hyperband and random search algorithms. This characteristic is observable for both Models A and B. From the tuning, we can see that the Bayesian optimization effectively optimizes the hyperparameters, producing the lowest MAE, though it has the disadvantage of a higher search time. In contrast, the hyperband algorithm is optimal in terms of search time, despite having a higher MAE score. In the results section, the final comparison of optimality is made in terms of the deviation from the actual values when all the models are used in the pricing process.


### *3.2. Random Forest Regression*

Random forest combines tree predictors in such a way that each tree in the ensemble is contingent on the values of a randomly sampled vector selected from the training set. This sampled vector is independent and identically distributed across all the trees in the forest [43]. The random forest regressor uses averaging to improve its predictive ability and accuracy.

Let $f(\mathbf{x}; \beta_n)$ be the collection of tree predictors, where $n = 1, 2, \cdots, N$ denotes the number of trees. Here, $\mathbf{x}$ is the observed input vector from the random vector $\mathbf{X}$, and the $\beta_n$ are independent and identically distributed random vectors. The random forest prediction is given by

$$\bar{f}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} f(\mathbf{x}; \beta_n),$$

where $\bar{f}(\mathbf{x})$ is the unweighted average over the collection of tree predictions $f(\mathbf{x}; \beta_n)$. As the number of trees increases, the tree structure converges. This convergence explains why the random forest does not overfit; instead, a limiting value of the generalization (or prediction) error is produced [43,44]. Thus, as $n \to \infty$, the law of large numbers ensures that

$$\mathbb{E}_{\mathbf{X},Y}\big[Y - \bar{f}(\mathbf{X})\big]^2 \to \mathbb{E}_{\mathbf{X},Y}\big[Y - \mathbb{E}_{\beta}[f(\mathbf{X};\beta)]\big]^2.$$

Here, $Y$ is the outcome. The training data are assumed to be drawn independently from the joint distribution of $(\mathbf{X}, Y)$. In this research, we use the 80:20 train–test split technique to divide the whole dataset into a training set and a testing set. Using the RandomForestRegressor() from the scikit-learn ML library, we initialize the regression model, fit the model, and predict the target values.
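A minimal sketch of this workflow on a stand-in target (an arbitrary smooth function of two features here, not the paper's option-price data) also verifies that the forest prediction is the unweighted average over the trees, as in the formula above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])  # stand-in target for option prices

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# The forest prediction equals the unweighted average of the tree predictions
tree_avg = np.mean([t.predict(X_te) for t in model.estimators_], axis=0)
```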

### *3.3. Polynomial Regression*

Polynomial regression is a specific type of linear regression model which models the relationship between the independent variable and the dependent variable as an *n*th-degree polynomial. In this research, we first create the polynomial features object using PolynomialFeatures() from the scikit-learn ML library and indicate the preferred polynomial degree. We next use the 80:20 train–test split technique to divide these new polynomial features into training and testing datasets. Finally, we construct the polynomial regression model, fit the model and predict the responses.
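A minimal sketch of these steps on an artificial quadratic target (our stand-in; the paper fits option prices) is:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(400, 1))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 2  # known quadratic target

X_poly = PolynomialFeatures(degree=2).fit_transform(X)  # columns [1, x, x^2]
X_tr, X_te, y_tr, y_te = train_test_split(X_poly, y, test_size=0.2,
                                          random_state=1)
reg = LinearRegression().fit(X_tr, y_tr)
```

Since the target is exactly quadratic and noiseless, the fitted model should reproduce it almost perfectly on the held-out data.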

### **4. Results and Discussion**

### *4.1. Data Structure and Description*

For the ANN model input parameters, we generated 100,000 sample data points and then used Equation (8) to obtain the exact price of the rebate barrier call options. These random observations train, test and validate an ANN model that mimics the extended Black–Scholes equation. We consider both the train–test split and the cross-validation split on the dataset and then measure their impact on the loss function minimization and the option values. The generated samples consist of eight variables, $(S, K, B, R, T, \sigma, r, V_R)$, which are sampled uniformly, except the option price $V_R$, following the specifications and logical ranges of each of the input variables (see Table 3). During the training process, we fed the ANN the training samples with the inputs $(S, K, B, R, T, \sigma, r)$, where $V_R$ is the expected output. In this phase, the ANN 'learns' the extended Black–Scholes model from the generated dataset, and the testing phase follows, from which the required results are predicted. Meanwhile, under the Black–Scholes framework, we assume that the stock prices follow a geometric Brownian motion, and we used GBM($x = 150$, $r = 0.04$, $\sigma = 0.5$, $T = 1$, $N = 100{,}000$) for the random simulation. Table 3 below shows the extended Black–Scholes parameters used to generate the data points, whereas Table 4 gives a sample of the generated data. The ranges for the rebate, strike and barrier are drawn from the uniform random distribution and multiplied by the simulated stock price to obtain the final range.

**Table 3.** Extended Black–Scholes range of parameters—rebate barrier.



**Table 4.** Sample training data for rebate barrier option pricing model.
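The sampling procedure just described can be sketched in numpy. The GBM parameters are those stated in the text; the uniform multiplier ranges below are illustrative assumptions of ours, standing in for the exact ranges listed in Table 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Terminal GBM stock samples with the stated parameters
x0, r0, sig0, T0 = 150.0, 0.04, 0.5, 1.0
Z = rng.standard_normal(n)
S = x0 * np.exp((r0 - 0.5 * sig0**2) * T0 + sig0 * np.sqrt(T0) * Z)

# Illustrative uniform multipliers and ranges (assumed here, not Table 3's)
K = S * rng.uniform(0.8, 1.2, n)   # strike around the stock level
B = S * rng.uniform(0.5, 0.9, n)   # barrier below the stock (down-and-out)
R = S * rng.uniform(0.0, 0.1, n)   # rebate
T = rng.uniform(0.1, 1.0, n)
sigma = rng.uniform(0.1, 0.6, n)
r = rng.uniform(0.01, 0.06, n)

features = np.column_stack([S, K, B, R, T, sigma, r])
# the target V_R then comes from evaluating Equation (8) row by row
```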

Statistics and Exploratory Data Analysis

In this section, we aim to summarize the core characteristics of our option dataset by analyzing and visualizing them. The descriptive statistics, which summarize the distribution shape, dispersion and central tendency of the dataset, are presented in Table 5. The following outputs were obtained: the number of observations, mean, standard deviation, minimum, maximum and quartiles (25%, 50%, 75%) of the dataset. We observed that the distribution of the simulated stock is left-skewed, since the mean is less than the median, whereas the distributions of the option values, strike price and barrier levels are right-skewed.


**Table 5.** Descriptive statistics for the rebate barrier.

In Figure 1, we consider the visualization using the seaborn library, in connection with the pairplot function, to plot a symmetric combination of two main figures: the scatter plot and the kernel density estimate (KDE). The KDE plot is a non-parametric technique mainly used to visualize the nature of the probability density function of a continuous variable. In our case, we limit these KDE plots to the diagonals. We focus on the relationship of the stock, strike, rebate and barrier with the extended Black–Scholes price (OptionV) for the rebate barrier options. From the data distribution for the feature columns, we notice that the sigma, time and rate columns could be ignored. This is because the density distribution shows that these features are essentially uniform, and the absence of any variation makes it very unlikely that they improve the model performance. If we considered this problem as a classification problem, then no split on these columns would increase the entropy of the model.

Conversely, if this were a generative model, then there would be no prior to update, given a uniform posterior distribution. Additionally, the model will learn a variate of these parameters since, by definition of the exact option price function (referred to as OptionV), these are the parameters which can take on constant values. Another method to consider would be to use the sine of these parameters as inputs to the model instead of the actual values. We observed from our analysis that this concept works, but there is no significant improvement in model performance, which can be investigated in further research.

### *4.2. Neural Network Training*

The first category (the training dataset) is employed to fit the ANN model by estimating the weights and the corresponding biases. The model at this stage observes and 'learns' from the dataset to optimize the parameters. In contrast, the other (the test dataset) is not used for training but for evaluating the model. This dataset category explains how effective and efficient the overall ANN model is and the prediction capability of the model. Next, prior to model training, we perform data standardization to improve the performance of the proposed NN algorithm. The StandardScaler function of the scikit-learn Python library was used to standardize the distribution of values by ensuring that the distribution has a unit variance and a zero mean. During the compilation stage, we plot the loss (MSE) and the evaluation metric (accuracy) values for both the training and validation datasets. We observe that the error difference between the training and the validation dataset is not large, and as such, there is no case of over- or under-fitting of the ANN models. Once the 'learning' phase of the model is finished, the prediction phase sets in. The performance of the ANN model is measured and analyzed in terms of the MSE and the MAE. Table 6 gives the evaluation metrics for both the out-of-sample prediction (testing dataset) and the in-sample prediction (training dataset).
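The standardization step amounts to fitting the mean and standard deviation on the training split only, and then reusing them on every other split. A numpy equivalent of StandardScaler (our sketch, not the paper's code) is:

```python
import numpy as np

def standardize(train, other):
    """Fit z-score parameters on the training data and apply the
    same transform to another split (mirrors StandardScaler)."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    return (train - mu) / sd, (other - mu) / sd
```

Fitting only on the training data avoids leaking test-set statistics into the model.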

**Figure 1.** Visualization plot.

**Table 6.** Model evaluation for testing and training data (shows no over- or underfitting).


Table 6 shows the model evaluation comparison for the train/test loss and accuracy. It is observed that the test loss is greater than the training loss, and the test accuracy is greater than the training accuracy, for all the models. The differences in error sizes are not significant, and thus the chances of having an overfitting model are limited. Figures 2–5 show the training and validation (test) loss and MAE values for all the models when they are fitted and trained with epochs = 45, batch size = 256, and verbose = 1. We visualize these graphs to ascertain whether there was any case of overfitting, underfitting or a perfect fit of the model. In underfitting, the NN model fails to model the training data and learn the problem sufficiently, leading to poor performance on both the training dataset and the holdout sample. Overfitting occurs mainly in complex models with many parameters, which happens when the model aims to capture all data points present in a specified dataset. In all the cases, we observe that the models show a good fit, as the training and validation losses decrease to a stability point with an infinitesimal gap between the final loss values. However, the loss values for Model B3, followed by Model B2, are the most optimal in providing the best fit for the algorithm.

**Figure 2.** Train/test MAE values for Models A1, A2 and A3; (**a**) MAE—Model A1; (**b**) MAE— Model A2; (**c**) MAE—Model A3.

**Figure 3.** Train/Test MAE values for Models B1, B2 and B3; (**a**) MAE—Model B1; (**b**) MAE—Model B2; (**c**) MAE—Model B3.

**Figure 4.** Train/Test LOSS values for Models A1, A2 and A3; (**a**) LOSS—Model A1; (**b**) LOSS—Model A2; (**c**) LOSS—Model A3.

**Figure 5.** Train/Test LOSS values for Models B1, B2 and B3; (**a**) LOSS—Model B1; (**b**) LOSS— Model B2; (**c**) LOSS—Model B3.

Next, we display the plots obtained after compiling and fitting the models. The prediction is performed on the unseen data or the test data using the trained models. Figures 6 and 7 give the plot of the predicted values against the actual values, the density plot of the error values and the box plot of the error values for all six models.

**Figure 6.** Option values visualization for Models A1, A2, A3; (**a**) Model A1: Regression plot; (**b**) Model A1: Histogram plot; (**c**) Model A1: Box plot; (**d**) Model A2: Regression plot; (**e**) Model A2: Histogram plot; (**f**) Model A2: Box plot; (**g**) Model A3: Regression plot; (**h**) Model A3: Histogram plot; (**i**) Model A3: Box plot.

The box plot enables visualization of the skewness and how dispersed the solution is. Model A2 behaved poorly, as can be observed from the wide dispersion of its solution points, and the model did not fit properly. For a perfect fit, the data points are expected to concentrate along the 45° red line, where the predicted values are equal to the actual values. This observation applies to Models A2 and A3, as there was no perfect alignment in their regression plots. We could retrain the neural network to improve this performance, since each training run can have different initial weights and biases. Further improvements can be made by increasing the number of hidden units or layers or by using a larger training dataset. For the purposes of this research, we already performed the hyperparameter tuning, which addresses most of the above suggestions. To this end, we focus on Model B, another training algorithm.

Models B3 and B1 provide a good fit compared to the other models, though there are still some deviations around the regression line. The dispersion of their solution data points is also smaller than in the other models. Interestingly, the solution data points of Models B1 and B3 are skewed to the left, as can be seen in the box plots. This could be one reason for their high performance compared to Models A1, A2, and A3, which are positively skewed; this behavior is worth investigating in future work.

**Figure 7.** Option values visualization for Models B1, B2, B3; (**a**) Model B1: Regression plot; (**b**) Model B1: Histogram plot; (**c**) Model B1: Box plot; (**d**) Model B2: Regression plot; (**e**) Model B2: Histogram plot; (**f**) Model B2: Box plot; (**g**) Model B3: Regression plot; (**h**) Model B3: Histogram plot; (**i**) Model B3: Box plot.

Table 7 shows the error values in terms of the MSE, MAE, mean squared logarithmic error (MSLE), mean absolute percentage error (MAPE), and the *R*<sup>2</sup> (coefficient of determination) regression score. It also compares the models in terms of their computation speed, measured in seconds. Mathematically, the MSLE and MAPE are given as

$$\begin{aligned} \text{MSLE} &= \frac{1}{N} \sum_{i=1}^{N} \left[\log_e(1 + V_i(S, t)) - \log_e(1 + \hat{V}_i(S, t))\right]^2, \\ \text{MAPE} &= \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{V_i(S, t) - \hat{V}_i(S, t)}{V_i(S, t)} \right|, \end{aligned}$$

where *N* is the number of observations, *V<sub>i</sub>*(*S*, *t*) are the exact option values and *V*ˆ<sub>*i*</sub>(*S*, *t*) are the predicted option values. For the MAPE, all values lower than the threshold of 20% are considered 'good' in terms of forecasting capacity [45]. Thus, all the models have good forecasting scores, with Model A1 possessing a highly accurate forecast ability. Similarly, the MSLE measures the percentile difference between the log-transformed predicted and actual values; the lower, the better. All the models gave relatively low MSLE values, with Models A1 and B1 giving the lowest.
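As an illustration, the two metrics above can be computed directly with NumPy. This is a minimal sketch: the `actual` and `predicted` arrays are hypothetical values, not taken from the paper's dataset.

```python
import numpy as np

def msle(v_true, v_pred):
    """Mean squared logarithmic error between actual and predicted option values."""
    return np.mean((np.log1p(v_true) - np.log1p(v_pred)) ** 2)

def mape(v_true, v_pred):
    """Mean absolute percentage error, expressed in percent."""
    return 100.0 * np.mean(np.abs((v_true - v_pred) / v_true))

# Hypothetical actual vs. predicted option values
actual = np.array([1.2, 2.5, 3.1, 4.0])
predicted = np.array([1.1, 2.6, 3.0, 4.2])
print(round(mape(actual, predicted), 2))  # MAPE in percent
```

Note that `np.log1p(x)` computes log(1 + *x*) in a numerically stable way, matching the 1 + *V* terms inside the MSLE definition.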

From Table 7, the *R*<sup>2</sup> score measures the capacity of the model to predict an outcome in the linear regression setting. Models B1 and B3 gave the highest positive values compared to the other models, and these high *R*<sup>2</sup> values indicate that these models are a good fit for our options data. It is also noted that, for well-performing models, the greater the *R*<sup>2</sup>, the smaller the MSE. Model B3 gave the smallest MSE, with the highest *R*<sup>2</sup>, compared to the worst-performing Model A2, which had the largest MSE and the smallest *R*<sup>2</sup> score. The MAE measures the average distance between the predicted and the actual data, and lower MAE values indicate higher accuracy.


**Table 7.** Error values and computation time for various NN models.

Finally, we report the speed of the NN models in terms of their computation times, as shown in Table 7. The computation time taken to execute the algorithm encompasses the data-splitting stage, standardization of the set variables, ANN model compilation, training, fitting, evaluation and prediction of the results. As noted for Models A1 and A2, the use of the Sigmoid and Tanh activation functions accounted for higher computation times, due to the exponential functions they require. Model A1 performed worst in terms of computation time, and Model B3 the best, accounting for a 66.56% decrease in time. We observe that the computation time is reduced when the *k*-fold cross-validation split is implemented prior to ANN model training, compared to the traditional train–test split: a further 41.62% decrease was observed when the average computation time of the Model B variants was compared against that of the Model A variants.
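The two data-splitting schemes compared above can be set up with scikit-learn as follows. This is a sketch under stated assumptions: the feature matrix, target, sample size and fold count are illustrative, not the paper's actual dataset or configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

rng = np.random.default_rng(42)
X = rng.random((1000, 5))   # hypothetical option-pricing features
y = rng.random(1000)        # hypothetical option values

# Traditional hold-out split (the scheme used by the Model A variants)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# k-fold cross-validation split (the scheme used by the Model B variants)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```

With 5 folds, every observation serves as test data exactly once, which is what allows the cross-validated models to reuse one standardization and compilation pass across folds.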

The overall comparison of the tuned models is presented in Figure 8. Here, we rank the performance of each MLP model with regard to the ST:TP ratio (the search time per trainable parameter), the algorithm computation time, and the error measures, namely the *R*<sup>2</sup> score, MAE and MSE. The ranking is ascending, with 1 denoting the most preferred model and 6 the least preferred. From the results, and regardless of the search time per trainable parameter, we observe that Model B3 is optimal, followed by Model B1, while the lowest-performing model is Model A2. Hence, we conclude that the models using the *k*-fold data split performed significantly better in the valuation of rebate barrier options under the extended Black–Scholes framework.

**Figure 8.** Ranking of models for optimality.

### *4.3. Analysis of Result Performance*

One avenue to demonstrate the accuracy of our proposed model is to test the architecture on a non-simulated dataset of rebate barrier options. At present, we are unable to obtain such real market data due to inaccessibility, which is one of the limitations of this research. However, we compare the NN results with those of other machine learning models, namely polynomial regression and random forest regression, on the same dataset. Both techniques are capable of capturing non-linear relationships among variables.

Polynomial regression provides flexibility when modeling non-linearity. Improved accuracy can be obtained when higher-order polynomial terms are incorporated, which makes it easier to capture complex patterns in the dataset. It is also very fast compared to both our proposed NN methodology and random forest regression (Table 8). In this work, we present only the results obtained using the 2nd-, 3rd- and 4th-degree polynomial regressions. We observed that, in terms of accuracy, polynomials of higher degree gave more accurate results and a significant reduction in their error components.
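A polynomial regression of increasing degree can be sketched with scikit-learn as below. The feature (a moneyness-like quantity) and the payoff-like target are hypothetical stand-ins for the paper's simulated barrier-option data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(500, 1))                         # hypothetical moneyness S/K
y = np.maximum(X[:, 0] - 1.0, 0.0) + 0.05 * rng.normal(size=500) # noisy payoff-like target

# Fit 2nd-, 3rd- and 4th-degree polynomial regressions and record in-sample R^2
scores = {}
for degree in (2, 3, 4):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    scores[degree] = r2_score(y, model.predict(X))
    print(degree, round(scores[degree], 4))
```

Because the lower-degree feature sets are nested inside the higher-degree ones, the in-sample *R*<sup>2</sup> cannot decrease as the degree grows, which mirrors the accuracy gain (and the overfitting risk) discussed above.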

However, one of the issues facing polynomial regression is model complexity: when the polynomial degree is high, the chance of overfitting is significantly higher. Thus, we face a trade-off between accuracy and overfitting. Random forest regression, on the other hand, combines multiple random decision trees, with each tree trained on a subset of the data. We built random forest regression models using 10, 30, 50, and 70 decision trees, fitted them to the barrier options dataset, predicted the target values, and computed the error components. Finally, we compared these two models to the optimal NN model (Model B3), with the results given in Table 8.
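The random forest sweep over 10, 30, 50, and 70 trees can be sketched as follows. Again, the features and target are hypothetical placeholders for the barrier options dataset, and the MSE is computed in-sample purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(size=(800, 4))                       # hypothetical option features
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=800)   # hypothetical non-linear target

# Fit one forest per tree count and record its error component
mses = {}
for n_trees in (10, 30, 50, 70):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    forest.fit(X, y)
    mses[n_trees] = mean_squared_error(y, forest.predict(X))
    print(n_trees, round(mses[n_trees], 5))
```

Each tree is trained on a bootstrap subset of the data, which is the "subset of data" mechanism referred to above.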


**Table 8.** Error values and computation time for Model B3, polynomial regression and random forest regression.

<sup>1</sup> We consider polynomials of order ≥ 4 to be higher-order because of the increase in their complexity. The accuracy of the 4th-order polynomial regression is actually higher than that of our proposed model, but the former suffers from the overfitting that comes with higher-degree polynomial regression. Additionally, the N/A entries in the MSLE cells are due to negative values in the prediction set, for which the logarithm is undefined.

Increasing the number of decision trees leads to more accurate results; Oshiro et al. (2012) suggested that the number of trees should lie between 64 and 128, which makes it feasible to obtain a good balance between the processing time, memory usage and the AUC (area under the curve) [46]. We observed this behavior in our research. The model was tested on 80, 100, 120, 140, 160, 180, and 200 decision trees, and we obtained the following *R*<sup>2</sup> regression scores (computation times): 0.9924 (34 s), 0.9928 (52 s), 0.9929 (62 s), 0.9925 (75 s), 0.9929 (83 s), 0.9929 (89 s) and 0.9926 (102 s), respectively. The optimal number of decision trees lies between 110 and 120, with an *R*<sup>2</sup> score of 0.9929; any value below 110 gives a less accurate result, while any value above 120 does not lead to significant performance gains and only increases the computational cost.

Table 8 compares the performance of our optimal NN model to the random forest and polynomial regressions, measured by the error values and the computational time. The NN model performed better than the random forest regression regardless of the number of decision trees used, as is evident from the results in Table 8. On the other hand, the polynomial regressions of the 2nd and 3rd orders underperformed compared to the NN model, while maximum accuracy was obtained with the higher orders (≥ 4). However, the higher orders pose complexity issues that our optimal NN model does not face. More theoretical understanding is needed to explain this phenomenon, which the current research does not account for.

### *4.4. Option Prices and Corresponding Greeks*

To compute the zero-rebate DO option prices and their corresponding Greeks, we simulated another dataset (1,000,000 samples) in accordance with the extended Black–Scholes model; Table 9 gives a subset of the full dataset after cleansing.


**Table 9.** Data subset of option values and Greeks.

For the NN application, we used the hyperparameters of Model B3 to construct the NN architecture and to train and predict the option values and their corresponding Greeks. The risks associated with barrier options are complicated to manage and hedge due to their path-dependent exotic nature, which is more pronounced as the underlying approaches the barrier level. Of the Greeks, we focus on predicting the delta, gamma and vega using the optimal NN model, and the following results were obtained.
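The paper obtains the Greeks from the closed-form extended Black–Scholes solution. Purely as an illustration of how delta, gamma and vega could be approximated numerically from any price surface V(S, σ), including a learned one, a central finite-difference sketch follows; the toy price function is a hypothetical stand-in.

```python
def greeks_fd(price_fn, S, sigma, h=1e-3):
    """Central finite-difference approximations of delta, gamma and vega
    for an arbitrary price function V(S, sigma)."""
    delta = (price_fn(S + h, sigma) - price_fn(S - h, sigma)) / (2 * h)
    gamma = (price_fn(S + h, sigma) - 2 * price_fn(S, sigma)
             + price_fn(S - h, sigma)) / h**2
    vega = (price_fn(S, sigma + h) - price_fn(S, sigma - h)) / (2 * h)
    return delta, gamma, vega

# Sanity check on a toy price surface V(S, sigma) = S^2 * sigma,
# whose exact Greeks are delta = 2*S*sigma, gamma = 2*sigma, vega = S^2
toy = lambda S, sigma: S**2 * sigma
d, g, v = greeks_fd(toy, S=2.0, sigma=0.3)
print(round(d, 4), round(g, 4), round(v, 4))
```

Central differences are second-order accurate in the step size *h*, so for smooth regions away from the barrier they recover the Greeks well; near the barrier, where gamma can change sign sharply, a smaller *h* or the closed-form expressions would be preferred.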

Figure 9 shows the plot of the predicted and actual values of the DO option prices, together with the delta, gamma and vega values. In terms of the option value, the DO call behaves like a European call when the option is deep in-the-money, because the impact of the barrier is not felt at that phase. The option value decreases and tends to zero as the underlying price approaches the barrier, since the probability of the option being knocked out is very high. The in-the-money feature is equally reflected in the delta and gamma, as they remain unchanged when the barrier is far away from the underlying: the delta is one, and the gamma is zero.

Gammas for this option style are typically large when the underlying price is in the neighborhood of the strike price or near the barrier, and lowest for out-of-the-money or knocked-out options. From Figure 9c, gamma tends to switch from positive to negative without the position switching from long to short. The gamma values are usually larger than those of the standard call option. These extra features pose a great challenge to risk managers during portfolio rebalancing. Lastly, vega measures the sensitivity of the option value to the underlying volatility: the change in option value for a 1% change in implied volatility. Vega declines as the option approaches the knock-out phase; it falls when the option is out-of-the-money or deep in-the-money, and it is at its maximum when the underlying is around the strike price. Overall, Figure 9a–d displays how accurately Model B3 predicts the option values and their Greeks, as little or no discrepancy is observed in each dual plot.

**Figure 9.** Option values and Greeks; (**a**) DO option value; (**b**) DO delta; (**c**) DO gamma; (**d**) DO vega.

### **5. Conclusions and Recommendations**

This research suggested a more efficient and effective means of pricing barrier call options, both with and without a rebate, by applying ANN techniques to the closed-form solutions of these option styles. Barrier options belong to the exotic financial options whose analytical solutions are based on the extended Black–Scholes pricing models. Analytical solutions rest on assumptions that are often invalid in the real world, and these limitations make them imperfect for the effective valuation of financial derivatives. Hence, through the findings of this research, we were able to show that neural networks can be employed efficiently in the computation and prediction of unbiased prices for both rebate and non-rebate barrier options. This study showed that it is possible to utilize an efficient approximation method, via ANNs, for estimating exotic option prices, which are more complex and often require expensive computation. The research provides in-depth insight into the practicability of deep learning techniques in derivative pricing, made viable through the statistical and exploratory data analysis and the analysis of model training provided.

In this research, we conducted benchmarking experiments on the NN hyperparameter tuning using the Keras interface and used different evaluation metrics to measure the performance of the NN algorithm. We then estimated the optimal NN architecture, which prices the barrier options effectively in connection with the data-splitting techniques. We compared six models in terms of their data split and their hyperparameter search algorithm. The optimal NN model was constructed using the cross-validation data split and the Bayesian optimization search algorithm, and this combination was more efficient than the other models proposed in this research. Next, we compared the results from the optimal NN model to those produced by other ML models, such as the random forest and the polynomial regression; the output highlights the accuracy and the efficiency of our proposed methodology for this option pricing problem.

Finally, hedging and risk management of barrier options are complicated due to their exotic nature, especially as the underlying nears the barrier. Our research extracted the barrier option prices and their corresponding Greeks with high accuracy using the optimal hyperparameters. The predicted and actual results showed little or no difference, which demonstrates our proposed model's effectiveness. As a future research direction, more theoretical underpinning is needed in connection with the evaluation/error analysis of all the models proposed in this research. Another limitation of this work is the use of a fully simulated dataset; implementing these techniques on a real dataset would help estimate their effectiveness. A third limitation lies in the convergence analysis of the proposed NN scheme, which future research will address. In addition, more research can be conducted to value these exotic barrier options from the partial differential equation (PDE) perspective, that is, solving the corresponding PDE of this model using ANN techniques, and extending the pricing methodology to other exotic options, such as Asian or Bermudan options.

**Author Contributions:** These authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data supporting this study's findings are available from the corresponding author upon reasonable request.

**Acknowledgments:** This work commenced while the first author was affiliated with the Center for Business Mathematics and Informatics, North-West University, Potchefstroom and the University of Johannesburg, both in South Africa. The authors wish to acknowledge their financial support in collaboration with the Deutscher Akademischer Austauschdienst (DAAD).

**Conflicts of Interest:** The authors declare that they have no competing interests.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
