**4. Evaluation**

#### *4.1. Experiment Settings*

Our proposed method can be flexibly modified to recognize different target CBs, which allows it to cope with the frequently changing target CBs in smart retail solutions. To evaluate it, we used our collected laboratory dataset [21] and the public MERL dataset [7]. The method recognizes the target CBs in the input videos, and we report the f1-score of each CB as the accuracy metric. Since the videos in the two datasets were taken in different environments, switching between them can be regarded, to some extent, as a change in retail environments. To recognize the different target CBs in the two datasets, we only changed a few parameters of our designed algorithms and the predefined primitive patterns. By observing the accuracy of our method on both datasets while keeping the required modifications small, we can infer its flexibility to some degree.

The inputs of our method are trajectory coordinates, which must be obtained with an object detection and tracking model. However, wrong tracking results from such models would feed wrong inputs into our method and likely produce wrong outputs. To eliminate the influence of different object detection models on the evaluation results, we tracked the annotated bounding boxes with a Kalman filter and the Hungarian algorithm [22] to obtain the input trajectories. In addition, although some tracking models can predict the trajectories of occluded objects, occluded trajectories were not annotated in our evaluation. Regarding the output CB annotations, we annotated the target CB in each frame of our laboratory dataset; as the MERL dataset is public, we used its original CB annotations. All experiments on both datasets were run on the same Windows 11 machine with 16 GB of RAM, an Intel i7-12700K CPU (3.6 GHz), and an NVIDIA GeForce RTX 3060 Ti GPU (8 GB). The program was written in Python 3.9 with PyTorch 1.12 as the ML framework; the third-party libraries used included numpy 1.22 and scipy 1.8.
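The following is a minimal sketch of how annotated boxes could be linked into trajectories with a constant-velocity Kalman filter and Hungarian assignment via scipy's `linear_sum_assignment`. The state model, noise values, and IoU gate are illustrative assumptions, not the settings of our implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment   # Hungarian algorithm

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

class Track:
    """Constant-velocity Kalman filter over the box centre (cx, cy)."""
    def __init__(self, box):
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        self.x = np.array([cx, cy, 0.0, 0.0])                   # position + velocity
        self.P = np.eye(4) * 10.0                               # state covariance
        self.F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)  # motion model
        self.H = np.eye(2, 4)                                   # we observe position only
        self.R = np.eye(2)                                      # measurement noise
        self.box = box
        self.trajectory = [(cx, cy)]                            # coordinates fed to our method

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + np.eye(4) * 0.01

    def update(self, box):
        z = np.array([(box[0] + box[2]) / 2, (box[1] + box[3]) / 2])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.box = box
        self.trajectory.append((self.x[0], self.x[1]))

def track_frame(tracks, boxes, iou_gate=0.3):
    """Associate one frame's annotated boxes to existing tracks by IoU."""
    for t in tracks:
        t.predict()
    if tracks and boxes:
        cost = np.array([[1.0 - iou(t.box, b) for b in boxes] for t in tracks])
        rows, cols = linear_sum_assignment(cost)
        used = set()
        for r, c in zip(rows, cols):
            if 1.0 - cost[r, c] >= iou_gate:
                tracks[r].update(boxes[c])
                used.add(c)
        boxes = [b for i, b in enumerate(boxes) if i not in used]
    tracks.extend(Track(b) for b in boxes)    # start a new track for unmatched boxes
    return tracks
```

Calling `track_frame` once per frame on the annotated boxes yields one `trajectory` list per object, which is the form of input our method consumes.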

#### *4.2. Our Laboratory Dataset*

We collected this dataset at a public activity, where 19 randomly selected participants were asked to simulate shopping in front of a shelf one by one. The dataset includes 19 top-view videos, one per subject, with a resolution of 640 × 480. Each video is about 30–60 s long at 10 FPS and contains only one subject. Figure 2 shows some examples of the annotated target CBs. We built a laboratory retail environment and installed a top-view RGB camera to obtain an occlusion-free view. Each participant was asked to interact with the products on the shelf and was required to take at least one product from it. There were four products of different shapes and sizes: a boxed juice, a deodorant spray, a stainless-steel bottle, and a pack of wet tissues. The products were not visible while they were on the shelf. The data were collected while our proposed method was being demonstrated at the public activity. The participants were not asked to sign any confidentiality agreement, and their faces were exposed to the camera. As a result, we unfortunately cannot publish the dataset until all private information has been removed, for example by masking the faces.

Since our proposed method takes trajectory coordinates as inputs, we annotated the bounding boxes of the person, the hands, and the four products in each frame. We then used the tracker with a Kalman filter and the Hungarian algorithm [22] to obtain each object's trajectory as input. Regarding the output CBs, we selected the eight CBs listed in Table 3. The first six cover most of the target CBs used in existing methods; however, after annotating them we found that many frames still remained unannotated, so we added two further CBs to cover those frames. We used approximate definitions for some CBs, such as "browse", because they enabled the reuse of primitives with almost no loss of accuracy.
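To illustrate how target CBs can be tied to reusable primitives, the sketch below expresses a few CBs as sets of frame-level primitives. The primitive names and the rule format are hypothetical and do not reproduce the exact patterns listed in Table 3.

```python
# Illustrative only: these primitive names and this rule format are assumptions,
# not the exact primitive patterns defined in Table 3.
PRIMITIVE_PATTERNS = {
    # CB name : primitives that must all hold in the same frame
    "walking": {"person_moving"},
    "viewing": {"person_static", "person_outside_shelf"},
    "browse":  {"person_static", "holding_product"},
    "select":  {"hand_inside_shelf", "hand_static"},
}

def recognize_frame(active_primitives):
    """Return the first CB whose required primitives are all active in this frame."""
    for cb, required in PRIMITIVE_PATTERNS.items():
        if required <= active_primitives:   # subset test: all primitives present
            return cb
    return "other"
```

With such a rule table, swapping target CBs mainly amounts to editing the pattern entries rather than retraining a model.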

**Figure 2.** Example of annotated CB in our laboratory dataset.

Figure 3 shows the confusion matrix of our laboratory dataset. For each CB, the matrix reports two sub-columns: the frame count and its percentage of the row's frames. Figure 4 shows the f1-score and related statistics for our laboratory dataset. The total average of each column is a frame-weighted average, i.e., the sum over rows of each row's value multiplied by its frame percentage. The total average f1-score of our method was 89.35%, which is an acceptable result. The f1-score for most CBs was also acceptable, except for "viewing", "walking", and "touch". For "viewing", whose precision was only 68.18%, the confusion matrix reveals the reason. Some "viewing" frames were recognized as "select" or "browse": the ambiguous boundary caused the wrong predictions of "select", while the mismatch between the annotated meaning of "viewing" and our CB definition led to the wrong predictions of "browse". Since our proposed method currently cannot recognize the target's orientation or track the target's eyes, our CB definition approximates "viewing" as staying static outside the shelf, whereas the annotation of "viewing" means the target is standing still and looking at the shelf. The low precision of "viewing" also indicates that many "browse" frames were recognized as "viewing". The two definitions differ only in whether the target is holding a product, and products are usually occluded in "browse" frames, which caused the wrong "viewing" outputs.


**Table 3.** Primitive patterns in our laboratory dataset.


**Figure 3.** Confusion matrix of our laboratory dataset.


**Figure 4.** Results of F1-score of our laboratory dataset.

With respect to "walking", some of its frames were recognized as "browse". When the target walks while holding a product, the boundary between "browse" and "walking" is ambiguous. The CB definitions in Table 3 distinguish them by whether the target is moving while holding a product: "browse" means holding a product while staying static. We used a single threshold on the object's moving speed to detect moving versus staying, which was not accurate enough for fully correct detection; some moving frames were detected as static, which led to the wrong recognition. The same issue explains the low recall of "walking". A minimal sketch of such a speed-threshold check is given below.
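The sketch below shows one way the single-threshold move/stay decision could be made from the trajectory coordinates. The threshold value, window length, and function signature are illustrative assumptions, not the tuned values from our experiments.

```python
import numpy as np

def is_static(trajectory, fps, speed_threshold=30.0, window=5):
    """Decide stay vs. move from the last `window` trajectory points.

    `speed_threshold` is in pixels per second; the value and the window length
    are illustrative, not the parameters tuned in our experiments.
    """
    if len(trajectory) < window:
        return True
    pts = np.asarray(trajectory[-window:], dtype=float)
    # mean per-frame displacement converted to pixels per second
    speed = np.linalg.norm(np.diff(pts, axis=0), axis=1).mean() * fps
    return speed < speed_threshold
```

Because a single threshold must serve both slow walking and genuine standing still, frames near the boundary are inevitably misclassified, which is the error mode discussed above.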

In the case of "touch", there was only one instance in the dataset. It is defined as the customer putting a hand inside the shelf but taking nothing out of it. Its low recall occurred because some "touch" frames were wrongly recognized as "pick" when the relevant object was occluded. In addition, Figure 3 shows that "browse" accounted for more frames than any other CB. We therefore considered discriminating sub-behaviors within "browse" to make the distribution of CBs more uniform.

According to the above results, our method showed acceptable accuracy on the laboratory dataset. The individual CBs with low f1-scores are expected to improve once their CB definitions are made more precise. To evaluate our method's ability to discriminate CBs, we predefined different primitive patterns that split the CB "select" according to whether one hand or both hands were used. This indicates that our method can deal with CB discrimination to some extent. Concerning the evaluation of flexibility, we measured the time required to apply our method to different datasets. For the laboratory dataset, we spent about an hour tuning the five parameters of the three designed algorithms and two to three hours defining the primitive patterns in Table 3. We then annotated the CBs in each frame, working about five hours per day; the annotation took about one week in total. Since annotation is not required when applying our method, the annotation time is reported only as a reference for the modification time of ML-based methods.

#### *4.3. MERL Dataset*

The MERL shopping dataset [7] is a public dataset consisting of 106 top-view videos with a resolution of 920 × 680, each about two minutes long at 30 FPS. The 41 subjects were asked to shop in a retail store setting. Figure 5 presents some examples of the annotated CBs in the dataset. For the input trajectory coordinates, we annotated the person and hand bounding boxes in each frame based on the results of the pose estimation model HigherHRNet [23] pretrained on the COCO dataset [24], and manually annotated the product bounding boxes. Owing to limited time, we finished the bounding box annotations of only 46 videos for the evaluation. As for the laboratory dataset, we used the same tracker with a Kalman filter and the Hungarian algorithm [22] to obtain the input trajectories. A sketch of how person and hand boxes can be derived from pose keypoints is given below.
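The sketch below illustrates how person and hand bounding boxes might be derived from COCO-format pose keypoints such as those output by HigherHRNet. The padding sizes, confidence threshold, and helper names are assumptions for illustration, not the exact procedure used to prepare our annotations.

```python
import numpy as np

# COCO keypoint indices for the wrists; box sizes below are illustrative choices,
# not the values used to prepare the MERL annotations.
COCO_WRISTS = (9, 10)   # left wrist, right wrist

def person_box(keypoints, pad=10):
    """Bounding box around all sufficiently confident keypoints of one person."""
    visible = keypoints[keypoints[:, 2] > 0.3, :2]   # keypoints: (17, 3) array of (x, y, score)
    if len(visible) == 0:
        return None
    x1, y1 = visible.min(axis=0) - pad
    x2, y2 = visible.max(axis=0) + pad
    return x1, y1, x2, y2

def hand_boxes(keypoints, half_size=25):
    """Fixed-size boxes centred on each detected wrist, used as a proxy for the hands."""
    boxes = []
    for idx in COCO_WRISTS:
        x, y, score = keypoints[idx]
        if score > 0.3:
            boxes.append((x - half_size, y - half_size, x + half_size, y + half_size))
    return boxes
```

The resulting boxes can then be fed to the same Kalman filter and Hungarian tracker described in Section 4.1 to produce the input trajectories.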

For the output CBs, we used the CB annotations included in the dataset. It provides annotations for five CBs, which we defined with our proposed method as presented in Table 4. Among the five, we excluded "hand in shelf" from the evaluation because a random check of the annotations showed that many of its ground-truth instances were not annotated.


**Table 4.** Primitive patterns in the MERL dataset.

**Figure 5.** Example of annotated CB in MERL dataset.

Figure 6 shows the confusion matrix of the MERL dataset; as in Figure 3, each CB's column reports the frame count and its percentage of the row's frames. Figure 7 shows the f1-score and related statistics, with the total average calculated as in Figure 4. The average f1-score of our method was 79.66%, which is acceptable considering that only the CB definitions were changed. Among the four target CBs, our method achieved only about 60% precision for "reach to shelf" and "retract from shelf". We found that this was caused by a difference in where the CB boundary is defined. Specifically, our definition of "reach to shelf" differs from that of the MERL dataset: we define the boundary with a moving-speed threshold, so our method starts recognizing "reach to shelf" from the frame in which the hand is already moving, whereas the MERL dataset defines the start of "reach to shelf" as the moment one intends to reach, before the hand has moved. Thus, our recognition results always differed from the annotations by a few frames, and the same applies to the low precision of "retract from shelf". Since the errors for both CBs stem from this definition difference, we consider our method to have successfully recognized every "reach to shelf" and "retract from shelf" instance, shifted only by a few frames. This also implies that we could improve our method by recognizing intention in future research.


**Figure 6.** Confusion matrix of MERL dataset.


**Figure 7.** Results of F1-score of MERL dataset.

Beyond recognition accuracy, Table 5 compares the required modifications and the estimated time needed when applying our approach and a machine learning-based approach to a different dataset. For our proposed method, we changed the five parameters (*threshold<sub>idle</sub>*, *len<sub>window1</sub>*, *threshold<sub>follow</sub>*, *len<sub>window2</sub>*, *timeout*) of the three algorithms designed in Section 3, which mainly cope with the change in the person's scale in the video frames, and re-defined the primitive patterns for the new target CBs. As shown in Table 5, all of these modifications took about 3–4 h in our experiments. A hedged sketch of such a parameter change is given below.

**Table 5.** Flexibility: Modifications for dataset change adaptation.
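To illustrate the scale of this modification, the sketch below groups the five parameters into a single per-dataset configuration. The numeric values and the brief descriptions of what each parameter controls are assumptions for illustration only; they are not the tuned values from our experiments.

```python
# Illustrative only: the values and the per-parameter descriptions are assumptions,
# not the settings tuned for the experiments reported above.
LAB_PARAMS = dict(
    threshold_idle   = 30.0,   # assumed: speed (px/s) below which the target is treated as static
    len_window_1     = 5,      # assumed: smoothing window (frames) for the person trajectory
    threshold_follow = 50.0,   # assumed: distance (px) for associating a hand with a product
    len_window_2     = 10,     # assumed: smoothing window (frames) for the hand trajectory
    timeout          = 20,     # assumed: frames before an inactive primitive is dropped
)

# Switching to the MERL dataset mainly means rescaling the same five parameters
# to the new resolution and frame rate; everything else stays unchanged.
MERL_PARAMS = dict(LAB_PARAMS, threshold_idle=45.0, threshold_follow=75.0, timeout=60)
```

Keeping the dataset-dependent settings in one small configuration like this is what keeps the adaptation effort in the range of hours rather than months.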


For ML-based methods, the main modification is re-annotation. Since the time required for data re-collection and model tuning varies greatly from one dataset change to another, we currently lack sufficient reference data to estimate it. However, as we annotated both datasets ourselves for the accuracy calculation, the time spent on re-annotation allows us to estimate the modification time at about 2–3 months.

In conclusion, since our method cannot be fine-tuned in the way ML-based methods can, it sacrifices some accuracy to obtain flexibility. Nonetheless, the large difference in modification time indicates that this trade-off is justified, and the considerably enhanced flexibility could have practical value in the context of CBR.
