*YOLOv4 Algorithm*

The You Only Look Once version 4 (YOLOv4) algorithm is a state-of-the-art object detection algorithm that processes an entire image in a single pass and directly predicts the bounding boxes and class probabilities for all objects in the image. It uses a convolutional neural network (CNN) to extract features from the input image and then applies a series of convolutional and fully connected layers to those features in order to predict the class probabilities and bounding box coordinates for each object [18].

The YOLOv4 algorithm predicts object classes and bounding box coordinates by dividing the input image into a grid of cells and predicting the class probabilities and bounding box offsets for each cell [19]. Specifically, for each cell in the grid, the algorithm predicts:

- the bounding box offsets *bx*, *by*, *bw*, and *bh* for each anchor box;
- an objectness score indicating how likely the box is to contain an object;
- the class probabilities for each object class.

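For reference, the YOLO family's standard decoding of these per-cell offsets applies a sigmoid to the center offsets and an exponential scaling to the anchor-box priors. The function and argument names below are illustrative, not taken from the paper:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, grid_size):
    """Convert raw per-cell offsets into box center/size in [0, 1] image coordinates.

    cx, cy are the cell's column/row indices; pw, ph are the anchor-box priors.
    (Illustrative sketch of the standard YOLO-family decoding.)
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cx) / grid_size   # box center x, relative to image width
    by = (sigmoid(ty) + cy) / grid_size   # box center y, relative to image height
    bw = pw * math.exp(tw)                # box width, scaled from the anchor prior
    bh = ph * math.exp(th)                # box height, scaled from the anchor prior
    return bx, by, bw, bh
```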
These predictions are made by a series of convolutional and fully connected layers in the YOLOv4 network. The network architecture is based on a variant of the DarkNet architecture, which consists of multiple convolutional layers followed by max-pooling layers and ends with multiple fully connected layers [20]. The final layer of the network outputs a tensor of shape (grid size) × (grid size) × (number of anchor boxes) × (5 + number of classes), where the 5 refers to the objectness score and the coordinates *bx*, *by*, *bw*, and *bh* [21].
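The output shape described above can be made concrete with a small helper (the function name and the example numbers are illustrative, not from the paper):

```python
def output_tensor_shape(grid_size, num_anchors, num_classes):
    """Shape of the final YOLO output tensor for one detection scale.

    The 5 accounts for the objectness score plus bx, by, bw, bh.
    """
    return (grid_size, grid_size, num_anchors, 5 + num_classes)

# For example, a 13x13 grid with 3 anchor boxes and 80 classes (as in COCO)
# yields a (13, 13, 3, 85) output tensor.
```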

The YOLOv4 algorithm then uses non-maximum suppression to remove redundant bounding boxes predicted for the same object [22]. Specifically, for each class, it applies non-maximum suppression to the set of predicted bounding boxes whose objectness scores exceed a certain threshold. This threshold is usually set to a value between 0.5 and 0.7, depending on the desired balance between precision and recall [23].
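Greedy per-class non-maximum suppression can be sketched as follows; this is a minimal plain-Python illustration of the general technique, not the paper's implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much; repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Boxes of the same class that overlap an already-kept, higher-scoring box by more than `iou_threshold` are discarded as duplicates.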

The YOLOv4 algorithm can be trained using a loss function that measures the error between the predicted and ground-truth bounding boxes and class probabilities [24]. The loss function consists of two components: a localization loss that penalizes errors in the predicted bounding box coordinates, and a classification loss that penalizes errors in the predicted class probabilities. The localization loss is typically computed using the mean squared error (MSE) between the predicted and ground-truth bounding box coordinates, while the classification loss is typically computed using the cross-entropy loss between the predicted and ground-truth class probabilities [25].
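The two loss components can be sketched as below. This is a simplified illustration of MSE localization plus cross-entropy classification for already-matched predictions; the `lambda_coord` weighting term and all names are assumptions, not the paper's exact formulation:

```python
import math

def yolo_loss(pred_boxes, true_boxes, pred_probs, true_labels, lambda_coord=5.0):
    """Simplified two-part loss over matched predictions.

    pred_boxes / true_boxes: lists of (bx, by, bw, bh) tuples.
    pred_probs: per-prediction class-probability lists; true_labels: class indices.
    lambda_coord is an assumed weighting between the two components.
    """
    n = max(len(pred_boxes), 1)

    # Localization loss: mean squared error over the four box coordinates.
    loc = sum(
        (p - t) ** 2
        for pb, tb in zip(pred_boxes, true_boxes)
        for p, t in zip(pb, tb)
    ) / n

    # Classification loss: cross-entropy against the ground-truth class.
    eps = 1e-9  # guard against log(0)
    cls = -sum(
        math.log(probs[label] + eps)
        for probs, label in zip(pred_probs, true_labels)
    ) / n

    return lambda_coord * loc + cls
```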

Algorithm 1 shows the main steps of the YOLOv4 algorithm.

#### **Algorithm 1** YOLOv4 object detection algorithm
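The published listing is not reproduced here; based on the steps described in this section, the inference pipeline can be sketched as follows, where `network`, `decode`, and `nms` are placeholder callables standing in for the CNN forward pass, the per-cell decoding step, and per-class non-maximum suppression:

```python
def yolov4_detect(image, network, decode, nms, conf_threshold=0.5):
    """High-level sketch of the detection steps described in this section.

    Not the paper's exact Algorithm 1; the callables are placeholders.
    """
    # 1. Run the CNN over the whole image in a single forward pass.
    raw = network(image)  # shape: grid x grid x anchors x (5 + classes)

    # 2. Decode every cell/anchor slot into candidate detections.
    detections = decode(raw)

    # 3. Discard candidates whose objectness score is below the threshold.
    detections = [d for d in detections if d["objectness"] >= conf_threshold]

    # 4. Remove redundant overlapping boxes for the same object.
    return nms(detections)
```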
