*3.3. Method of Object Detection (SSD)*

Wei Liu et al. [36] devised the single shot multibox detector (SSD), a one-stage method in which a neural network (VGG-16) extracts feature maps on which classification and regression are performed to detect the target objects. It adopts the regression concept of YOLO and predicts the location of each target class by regression. Similar to the anchor mechanism in Faster R-CNN, prior boxes are established on the features extracted by the backbone network. Feature maps of several scales are used for prediction: large feature maps detect small targets and small feature maps detect large targets. Convolution kernels are applied to these feature maps to predict the class scores and coordinate offsets of a series of default bounding boxes.
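To make the idea of coordinate offsets concrete, the following minimal sketch decodes a predicted offset vector into a box relative to one default box. It assumes the center-size parameterization described in the original SSD paper and leaves out the variance scaling that some implementations add; the function name `decode_box` is hypothetical and not taken from this study.

```python
import math

def decode_box(default_box, offsets):
    """Apply predicted offsets to a default box.

    Both the default box and the result are (cx, cy, w, h).
    Assumption: center-size encoding as in the original SSD paper,
    without the variance factors used by some implementations.
    """
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w      # shift center by a fraction of the box width
    cy = d_cy + t_cy * d_h      # shift center by a fraction of the box height
    w = d_w * math.exp(t_w)     # scale width by a log-space factor
    h = d_h * math.exp(t_h)     # scale height by a log-space factor
    return (cx, cy, w, h)
```

With all offsets equal to zero, the decoded box is the default box itself, so the network only has to learn small corrections around each prior.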

VGG-16 serves as the backbone of the SSD structure. The fully connected layers of VGG-16 are converted into convolution layers: fc6 becomes a 3 × 3 convolution layer, Conv6, and fc7 becomes a 1 × 1 convolution layer, Conv7, while the pooling layer pool5 is changed from 2 × 2 with stride = 2 to 3 × 3 with stride = 1. Four convolution layers are added; the detection layer of the first feature map is Conv4\_3, followed by Conv8\_2, Conv9\_2, Conv10\_2, and Conv11\_2 [36,39]. Their sizes are shown in Figure 12.
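As a rough illustration of how the multi-scale feature maps translate into default boxes, the sketch below counts the boxes per detection layer. The feature-map sizes and boxes per location are assumptions taken from the SSD300 configuration in the original SSD paper (which also predicts from Conv7), not from Figure 12, and the class count passed to `head_channels` is purely illustrative.

```python
# (layer name, feature-map side length, default boxes per location)
# Assumption: SSD300 values from the original SSD paper, not from Figure 12.
SSD300_LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]

def boxes_per_layer(layers):
    """Number of default boxes contributed by each square feature map."""
    return {name: side * side * k for name, side, k in layers}

def head_channels(k, num_classes):
    """Output channels of a prediction convolution at one location:
    k default boxes, each with num_classes scores and 4 offsets."""
    return k * (num_classes + 4)

counts = boxes_per_layer(SSD300_LAYERS)
total = sum(counts.values())
print(counts)
print("total default boxes:", total)
```

Under these assumptions, the largest map (Conv4\_3) contributes most of the default boxes, which is consistent with large feature maps being responsible for small targets.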

**Figure 11.** Contents of the XML file of a tagged job site image.

**Figure 12.** Single shot multibox detector model structure.

The scale and aspect ratio of the default boxes on each feature map must be chosen. Every cell of the feature map is scanned to generate the corresponding default boxes (Figure 13). During training, the ground truth boxes in the image are matched against the default boxes, and the best-fitting boxes are selected based on intersection over union (IoU). The ratio of positive to negative samples is kept close to 1:3. The loss function is a weighted sum of the localization error and the confidence error. Data augmentation is carried out via horizontal flipping, random cropping, color distortion, and random sampling of patches. During prediction, the top-k prediction boxes with the highest confidence are retained, and non-maximum suppression (NMS) is then used to filter out predictions that overlap significantly. The predictions that remain are the final result [36].
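The matching and filtering steps described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in this study: boxes are assumed to be (x1, y1, x2, y2) tuples, the function names are hypothetical, and the thresholds (0.5 for matching, 0.45 and top-k = 200 for NMS) are the values reported in the original SSD paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(defaults, truths, threshold=0.5):
    """For each default box, return the index of its matched ground truth,
    or -1 if it is a negative sample (assumed threshold: IoU > 0.5)."""
    matches = [-1] * len(defaults)
    if not defaults:
        return matches
    for i, d in enumerate(defaults):
        best = threshold
        for j, t in enumerate(truths):
            overlap = iou(d, t)
            if overlap > best:
                best, matches[i] = overlap, j
    # Every ground truth also keeps its single best default box.
    for j, t in enumerate(truths):
        best_i = max(range(len(defaults)), key=lambda i: iou(defaults[i], t))
        matches[best_i] = j
    return matches

def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    """Keep the top_k highest-scoring boxes, then greedily drop any box
    that overlaps an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])` keeps the first and third boxes, because the second overlaps the first with an IoU of about 0.68. In practice NMS is applied per class.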

**Figure 13.** Single shot multibox detector target feature detection process.
