3.3.1. Object Detection

To handle human–object interactions, a key step is recognizing objects in an RGB image. For this, we propose an object recognition algorithm based on a neural network [37]. The You Only Look Once (YOLO) model processes images in real time at 45 frames per second. A smaller version of the network, Fast YOLO, processes 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on the background. YOLO frames detection as a regression problem. It divides the image into an S × S grid and, for each grid cell, simultaneously predicts B bounding boxes, a confidence score for each box, and C class probabilities. Each box is described by five values (four box coordinates and one confidence), so the predictions are encoded as an S × S × (B × 5 + C) tensor. YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only two boxes and can have only one class. This spatial constraint limits the number of nearby objects that the model can predict. Figure 3 shows the steps of detecting objects in an RGB image.
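
The encoding and the spatial constraint described above can be illustrated with a minimal Python sketch. The values S = 7 and C = 20 are assumptions taken from the commonly cited PASCAL VOC configuration of YOLO, not from this section; B = 2 follows the text. The helper names `prediction_tensor_shape` and `responsible_cell` are hypothetical and serve illustration only.

```python
def prediction_tensor_shape(S, B, C):
    """Shape of the output tensor: S x S x (B * 5 + C).

    Each grid cell holds B boxes, each with 5 values
    (x, y, w, h, confidence), plus C class probabilities.
    """
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, S):
    """Grid cell (row, col) containing an object's centre.

    cx, cy are the centre in normalised image coordinates in [0, 1].
    The result is clamped so that a value of exactly 1.0 stays in the grid.
    """
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col


# Assumed configuration: S = 7 and C = 20 are illustrative, B = 2 is from the text.
S, B, C = 7, 2, 20
print(prediction_tensor_shape(S, B, C))  # (7, 7, 30)

# Two nearby objects whose centres fall into the same grid cell.
first = responsible_cell(0.50, 0.50, S)
second = responsible_cell(0.55, 0.52, S)
print(first, second, first == second)  # (3, 3) (3, 3) True
```

In this sketch both objects map to cell (3, 3). Because a cell can have only one class, the two objects would compete for the same prediction slot if they belonged to different classes, which is the limitation on nearby objects noted above.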

### *Electronics* **2020**, *9*, 1888

**Figure 3.** The YOLO detection system: (1) resizes the input image to 448 × 448; (2) runs a single convolutional network on the image; and (3) thresholds the resulting detections by the model's confidence.
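
Step (3) of the figure can be sketched as follows. This is a simplified illustration, not the implementation used in this work. It assumes that each detection carries a box confidence and a list of conditional class probabilities, and that the class-specific score is their product. The dictionary keys, the function name `threshold_detections`, and the threshold value 0.25 are hypothetical choices.

```python
def threshold_detections(detections, threshold=0.25):
    """Keep detections whose class-specific score reaches the threshold.

    Each detection is a dict with (hypothetical) keys:
      'box'         -> (x, y, w, h)
      'confidence'  -> box confidence in [0, 1]
      'class_probs' -> list of conditional class probabilities
    """
    kept = []
    for det in detections:
        probs = det["class_probs"]
        # Index of the most probable class for this detection.
        best = max(range(len(probs)), key=probs.__getitem__)
        # Class-specific score: box confidence times class probability.
        score = det["confidence"] * probs[best]
        if score >= threshold:
            kept.append({"box": det["box"], "class_id": best, "score": score})
    return kept


# Made-up network outputs, for illustration only.
raw = [
    {"box": (0.5, 0.5, 0.2, 0.3), "confidence": 0.9, "class_probs": [0.1, 0.8, 0.1]},
    {"box": (0.1, 0.1, 0.1, 0.1), "confidence": 0.2, "class_probs": [0.6, 0.3, 0.1]},
]
for det in threshold_detections(raw):
    print(det["class_id"], round(det["score"], 2))  # 1 0.72
```

Only the first detection survives: its score is 0.9 × 0.8 = 0.72, while the second scores 0.2 × 0.6 = 0.12 and is discarded. A complete pipeline would typically also apply non-maximum suppression to remove duplicate boxes, which is omitted here.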
