**3. Proposal**

In this paper, we design a unit called a primitive, which is a kind of partitioned CB. Our CBR process consists of object tracking, primitive recognition, and CBR by matching the recognized primitives against predefined patterns of primitives. Since the innovative part of our approach is CBR through combinations of primitives, we apply existing methods for object tracking. The workflow of our approach is shown in Figure 1. First, an existing method tracks objects in the input video captured by in-store cameras. Then, the primitives in each frame are recognized based on the object trajectories. We predefine each CB as a pattern consisting of primitives. Finally, we match the recognized primitives against the predefined primitive patterns; a matched pattern is regarded as the corresponding CB. This section explains our proposed method in detail: how we design the primitives, how primitives are recognized, how CBs are customized using primitives, and how CBR is performed by pattern matching.

$$\text{Video} \longrightarrow \boxed{\text{Object Tracking}} \longrightarrow \boxed{\text{Primitive Recognition}} \longrightarrow \boxed{\text{Primitive Pattern Matching}} \longrightarrow \text{CB}$$

**Figure 1.** Proposal flow.

#### *3.1. Primitive*

The dictionary definition of a behavior is the accomplishment of a thing, usually over a period of time or in stages. We believe that this definition reveals the process by which the human brain recognizes a behavior from visual information: a behavior consists of several stages, and our brain recognizes the behavior by checking whether these stages occur in the correct order. In this paper, we refer to these stages as primitives; thus, a CB can be decomposed into one or more primitives. Table 1 lists the target CBs of existing methods and the primitives from our subjective decomposition of those CBs. We did not list one type of CB [18] in Table 1 because it recognizes a customer's emotion from facial expressions and speech text, which might breach customers' privacy. During the decomposition, we controlled the granularity to avoid the redundancy caused by over-decomposition. We found that the objects involved in the target CBs were body parts or products, and that there are two types of primitives: one describes an object's motion state, and the other describes the relationship between two objects. These findings determine what kind of information a primitive should carry and at what level of detail.

It is necessary to design an expression format for primitives. Generally, natural language is an efficient way to let others know that we understand a behavior. Therefore, we define a primitive as a sentence whose form follows natural language grammar. The syntax is:

$$\textit{subject}\ \textit{verb}\ \textit{object}\ \text{from}\ \textit{where}_{start}\ \text{to}\ \textit{where}_{end}\tag{1}$$

where the italic words are syntax elements that can be replaced by words from the vocabulary below. If *where*<sub>start</sub> = *where*<sub>end</sub>, the syntax can be simplified to *subject verb object where*. As the syntax shows, a primitive consists of *subject*, *verb*, *object*, and *where*, each of which has a corresponding vocabulary, as follows:

• *subject*: person, hand, product

• *verb*: move, stay, follow, face to


*Subject* and *object* refer to the name of an entity. *Verb* describes the movement of the *subject* or the relation between the *subject* and the *object*. *Where* denotes the place where the primitive happens. As our proposed method should cover a wide range of CBs, the vocabulary should consist of words commonly used in retail environments. Therefore, these words were selected based on our aforementioned findings from the existing methods in Table 1. Nevertheless, more words will become available as our research progresses. There are some constraints and options on the syntax to avoid ambiguous definition sentences, as below:


In sum, the syntax describes what an object does or what happens to it; with some verbs, it can also express the relationship between two objects. This design defines two kinds of primitives: motion primitives, which describe the motion of a single object, and relation primitives, which describe the relation between two objects. When more than two objects are involved, several relation primitives can be combined to describe a CB composed of multiple objects.
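As an illustration, the syntax in Equation (1) could be modeled as a small data structure. The class and field names below are our own sketch, not part of the proposal's implementation:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the primitive sentence in Section 3.1; the
# class and field names are our own illustration, not the paper's code.
@dataclass(frozen=True)
class Primitive:
    subject: str                      # e.g., "person", "hand", "product"
    verb: str                         # e.g., "move", "stay", "follow"
    obj: Optional[str] = None         # absent for motion primitives
    where_start: Optional[str] = None
    where_end: Optional[str] = None

    def sentence(self) -> str:
        """Render the primitive in the syntax of Equation (1)."""
        parts = [self.subject, self.verb]
        if self.obj:
            parts.append(self.obj)
        if self.where_start and self.where_end:
            if self.where_start == self.where_end:
                parts.append(self.where_start)  # simplified form
            else:
                parts += ["from", self.where_start, "to", self.where_end]
        return " ".join(parts)

# A motion primitive and a relation primitive:
print(Primitive("hand", "move", where_start="out of shelf",
                where_end="in the shelf").sentence())
# → hand move from out of shelf to in the shelf
print(Primitive("hand", "follow", obj="product").sentence())
# → hand follow product
```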


**Table 1.** Primitives in target CBs of current approaches.

However, although the proposed syntax suffices for our current research, its application range is limited by the design of *subject*, *verb*, *object*, and *where*. Although multi-object interactions can in theory be defined, each sentence expresses only a one-to-one relationship between two objects, so the number of sentences required to define multi-object relationships grows rapidly with the number of related objects. This is currently acceptable because at most two objects interact in our target CBs. In addition, since *where* records only a start and an end position, the syntax cannot describe complex motion, such as spiral movement.

#### *3.2. Primitive Recognition*

In this section, we describe how the syntax elements are recognized from object trajectories. Most CBs last several seconds, which at 30 fps corresponds to many frames, so the trajectories produced by the object-tracking method contain substantial redundancy. We therefore first perform trajectory segmentation to reduce this redundancy, and then recognize primitive elements from the segmented results.

Trajectory segmentation refers to compressing a trajectory into several segments that preserve most features of the trajectory. Current approaches [19,20] separate a trajectory based on the moving distance and direction of each vector in the trajectory. We therefore base our trajectory segmentation on the approximate trajectory partitioning (ATP) algorithm [19]. However, ATP is sensitive to direction changes, and in our case an object that frequently changes direction over short distances is probably idling; in this situation the algorithm should react only to changes in moving distance. Hence, we designed a thresholding algorithm on top of ATP, shown in Algorithm 1. The algorithm receives two inputs: a list of points *Kpts*<sub>ATP</sub> ← [*p*<sub>1</sub>, *p*<sub>2</sub>, *p*<sub>3</sub>, ..., *p*<sub>i</sub>, ..., *p*<sub>N</sub>] from the ATP output, where *p*<sub>i</sub> refers to the *i*-th element in *Kpts*<sub>ATP</sub> and *N* is the number of key-points from ATP, and a threshold *threshold*<sub>idle</sub>, which preserves only the key-points separated by a distance longer than *threshold*<sub>idle</sub>. Since the time complexities of ATP and Algorithm 1 are both *O*(*n*), the time complexity of the trajectory segmentation is *O*(*n*), where *n* is the length of the trajectory.
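A minimal Python sketch of the thresholding idea behind Algorithm 1, under our reading of the text (the function name and point representation are our own): a key-point is kept only if it is farther than *threshold*<sub>idle</sub> from the last kept key-point.

```python
import math

def filter_idle_keypoints(kpts_atp, threshold_idle):
    """Sketch of the thresholding step after ATP (our reading of
    Algorithm 1): drop ATP key-points whose distance to the last kept
    key-point is at most threshold_idle, so short back-and-forth
    (idling) motion does not produce extra segments."""
    if not kpts_atp:
        return []
    kept = [kpts_atp[0]]  # always keep the first key-point
    for pt in kpts_atp[1:]:
        dx = pt[0] - kept[-1][0]
        dy = pt[1] - kept[-1][1]
        if math.hypot(dx, dy) > threshold_idle:
            kept.append(pt)
    return kept

pts = [(0, 0), (1, 0), (1, 1), (10, 0), (10, 1), (20, 0)]
print(filter_idle_keypoints(pts, threshold_idle=5))
# → [(0, 0), (10, 0), (20, 0)]
```

The jitter around (0, 0) and (10, 0) is absorbed, while the long displacements survive as segment boundaries.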


In the primitive's syntax, *subject* and *object* are entity names that can be obtained directly from the trajectory information. The words "in the shelf/cart" and "out of the shelf/cart" for *where* can be acquired directly from the coordinates of the trajectory. Therefore, only the *verb* needs to be recognized from the trajectories. Algorithm 2 describes the recognition of "move" and "stay". These two words are a pair of antonyms meaning that an object is moving faster than a certain speed or staying still. The input segmented trajectory *ST* ← [*p*<sub>1</sub>, *p*<sub>2</sub>, *p*<sub>3</sub>, ..., *p*<sub>i</sub>, ..., *p*<sub>M</sub>] is the trajectory processed by the segmentation algorithm, where *p*<sub>i</sub> refers to the *i*-th point in *ST* and *M* is the number of points in *ST*. *threshold*<sub>idle</sub> is reused in this algorithm to detect whether an object is moving. To improve robustness to noise, we apply a window of length *len*<sub>window1</sub> to filter the noise. The output *verb*<sub>1</sub> is one of the words "move" and "stay", i.e., the recognition result for the current frame. The time complexity is *O*(*n*), where *n* is the smaller of the length of the segmented trajectory and *len*<sub>window1</sub>.

#### **Algorithm 2:** Verb Recognition(move, stay)

**Input:** List of points *ST* ← [*p*<sub>1</sub>, *p*<sub>2</sub>, *p*<sub>3</sub>, ..., *p*<sub>i</sub>, ..., *p*<sub>M</sub>], Integer *threshold*<sub>idle</sub>, Integer *len*<sub>window1</sub>
**Output:** String *verb*<sub>1</sub>

**1** *index* ← *M*;
**2** *results* ← [];
**3 while** *results.length* ≤ *len*<sub>window1</sub> **do**
**4** *pt*<sub>start</sub> ← *p*<sub>index−1</sub>;
**5** *pt*<sub>end</sub> ← *p*<sub>index</sub>;
**6** *vec* ← *pt*<sub>end</sub> − *pt*<sub>start</sub>;
**7** *distance* ← *vec*.*x*² + *vec*.*y*²;
**8 if** *distance* ≤ *threshold*<sub>idle</sub> **then**
**9** *results*.Add(1);
**10 else**
**11** *results*.Add(0);
**12 end**
**13** *index* ← *index* − 1;
**14 if** *index* = 1 **then**
**15 Break**;
**16 end**
**17 end**
**18** *sum* ← 0;
**19 foreach** *result in results* **do** *sum* ← *sum* + *result*;
**20 if** *sum* > *results.length*/2 **then**
**21** *verb*<sub>1</sub> ← "stay";
**22 else**
**23** *verb*<sub>1</sub> ← "move";
**24 end**
**25 return** *verb*<sub>1</sub>;
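Algorithm 2 can be sketched in Python roughly as follows. The function name is ours; as in the pseudocode, the squared displacement is compared directly against *threshold*<sub>idle</sub>, and a majority of below-threshold (idle) votes yields "stay":

```python
def recognize_move_stay(st, threshold_idle, len_window1):
    """Sketch of Algorithm 2: vote over the most recent len_window1
    segment vectors of the segmented trajectory st. A displacement
    whose squared magnitude is at most threshold_idle counts as an
    "idle" vote; a majority of idle votes yields "stay"."""
    votes = []  # 1 = below-threshold displacement (idle), 0 = moving
    index = len(st) - 1
    while len(votes) < len_window1 and index >= 1:
        vx = st[index][0] - st[index - 1][0]
        vy = st[index][1] - st[index - 1][1]
        votes.append(1 if vx * vx + vy * vy <= threshold_idle else 0)
        index -= 1
    if not votes:  # trajectory too short to vote on
        return "stay"
    return "stay" if sum(votes) > len(votes) / 2 else "move"
```

For example, a trajectory with large recent displacements is classified as "move", while one with only small displacements is classified as "stay".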

Algorithm 3 shows the recognition of the *verb* "follow", which means the *subject* is moving or staying together with the *object*. The inputs are two objects' segmented trajectories *ST*<sub>1</sub> ← [*p*<sub>11</sub>, *p*<sub>12</sub>, *p*<sub>13</sub>, ..., *p*<sub>1i</sub>, ..., *p*<sub>1M</sub>] and *ST*<sub>2</sub> ← [*p*<sub>21</sub>, *p*<sub>22</sub>, *p*<sub>23</sub>, ..., *p*<sub>2i</sub>, ..., *p*<sub>2M</sub>], where *p*<sub>ji</sub> refers to the *i*-th point in trajectory *ST*<sub>j</sub> and *M* is the number of points in each segmented trajectory. *threshold*<sub>follow</sub> is used to detect whether one object is close to another. As in Algorithm 2, a parameter *len*<sub>window2</sub> is passed to the algorithm for denoising. The output *verb*<sub>2</sub> is "follow" or *null*, i.e., the recognition result for the current frame. The time complexity is *O*(*n*), where *n* is the smaller of the length of the segmented trajectories and *len*<sub>window2</sub>. Furthermore, the *verb* "face to" means the *subject* is facing the *object*. Since recognizing it requires detecting the orientation of the body or head, which our method does not currently support, we omit it in this paper and consider it in future work.

#### **Algorithm 3:** Verb Recognition(follow)

**Input:** List of points *ST*<sub>1</sub> ← [*p*<sub>11</sub>, *p*<sub>12</sub>, *p*<sub>13</sub>, ..., *p*<sub>1i</sub>, ..., *p*<sub>1M</sub>], List of points *ST*<sub>2</sub> ← [*p*<sub>21</sub>, *p*<sub>22</sub>, *p*<sub>23</sub>, ..., *p*<sub>2i</sub>, ..., *p*<sub>2M</sub>], Integer *threshold*<sub>follow</sub>, Integer *len*<sub>window2</sub>
**Output:** String *verb*<sub>2</sub>

**1** *index* ← *M*;
**2** *results* ← [];
**3 while** *results.length* ≤ *len*<sub>window2</sub> **do**
**4** *pt*<sub>1</sub> ← *p*<sub>1,index</sub>;
**5** *pt*<sub>2</sub> ← *p*<sub>2,index</sub>;
**6** *vec* ← *pt*<sub>2</sub> − *pt*<sub>1</sub>;
**7** *distance* ← *vec*.*x*² + *vec*.*y*²;
**8 if** *distance* ≤ *threshold*<sub>follow</sub> **then**
**9** *results*.Add(1);
**10 else**
**11** *results*.Add(0);
**12 end**
**13** *index* ← *index* − 1;
**14 if** *index* = 0 **then**
**15 Break**;
**16 end**
**17 end**
**18** *sum* ← 0;
**19 foreach** *result in results* **do** *sum* ← *sum* + *result*;
**20 if** *sum* ≤ *results.length*/2 **then**
**21** *verb*<sub>2</sub> ← *null*;
**22 else**
**23** *verb*<sub>2</sub> ← "follow";
**24 end**
**25 return** *verb*<sub>2</sub>;
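A corresponding Python sketch of Algorithm 3 (names are ours; as in the pseudocode, the squared inter-object distance is compared directly against *threshold*<sub>follow</sub>, and a majority of close frames yields "follow"):

```python
def recognize_follow(st1, st2, threshold_follow, len_window2):
    """Sketch of Algorithm 3: over the most recent len_window2 frames,
    count how often the two tracked objects are close (squared distance
    at most threshold_follow); a majority of close frames yields
    "follow", otherwise None."""
    votes = []  # 1 = objects close in this frame, 0 = apart
    index = min(len(st1), len(st2)) - 1
    while len(votes) < len_window2 and index >= 0:
        vx = st2[index][0] - st1[index][0]
        vy = st2[index][1] - st1[index][1]
        votes.append(1 if vx * vx + vy * vy <= threshold_follow else 0)
        index -= 1
    if votes and sum(votes) > len(votes) / 2:
        return "follow"
    return None
```

For instance, a hand trajectory running parallel and close to a product trajectory yields "follow", while two distant trajectories yield no relation.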

#### *3.3. Define CB by Primitives*

With our designed primitives, we can customize a wide range of CBs as combinations of primitives. Since the primitives were designed with reference to the target CBs of existing methods, we apply them to define those target CBs. The clothes-related CBs are excluded because they are not common in ordinary retail stores and are too complex for our proposal. We define the CBs in Table 1 by primitives, as shown in Table 2. The symbol "→" specifies the primitives' chronological order: primitives preceding the symbol are assumed to occur first. Since a product is occluded while it is on the shelf in our implementation, a precise definition of "touch the shelf" is difficult to formulate; we therefore define it broadly, as in the primitive pattern in Table 2.
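For illustration, such a definition could be stored as an ordered list of primitive sentences. The CB name and sentences below are hypothetical examples of the "→" ordering, not the actual definitions in Table 2:

```python
# Hypothetical illustration of a CB defined as a chronological primitive
# pattern (the "→" ordering); these sentences are our own examples,
# not the paper's actual Table 2 definitions.
CB_PATTERNS = {
    "pick up product": [
        "hand move from out of shelf to in the shelf",  # reach into shelf
        "hand follow product",                          # grasp the product
        "hand move from in the shelf to out of shelf",  # take it out
    ],
}

def pattern_for(cb_name):
    """Look up the ordered primitive pattern for a CB name."""
    return CB_PATTERNS.get(cb_name, [])
```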


**Table 2.** Define target CBs by primitives.

#### *3.4. Primitive Pattern Matching*

The recognized primitives are stored in a sequence to retain their chronological order. Once any primitive has been recognized in the current frame, our method matches the primitive sequence against the predefined primitive patterns, and any matched pattern is considered a recognized CB. Algorithm 4 details the pattern matching. Forward matching in chronological order consumes a great deal of computational resources, because a separate matching state must be kept for each primitive pattern, so the running speed degrades as the running time grows. Therefore, we match the recognized primitives in reverse chronological order; that is, we start matching from the most recently recognized primitives, which saves considerable computational resources because no matching states need to be saved. The algorithm takes as inputs a sequence of recognized primitives, a predefined primitive pattern, and a number *timeout* that stops the algorithm when no primitives have matched within the recent *timeout* frames. The output is a Boolean value indicating whether the corresponding CB is matched. The time complexity is *O*(*n*), where *n* is the smaller of the lengths of *P*<sub>seq</sub> and *P*<sub>def</sub>.


#### **Algorithm 4:** Primitive Pattern Matching

**Input:** List of primitives *P*<sub>seq</sub>, List of primitives *P*<sub>def</sub>, Integer *timeout*
**Output:** Boolean *matched*

**1** *seqIndex* ← *P*<sub>seq</sub>.length;
**2** *defIndex* ← *P*<sub>def</sub>.length;
**3** *timeoutCounter* ← 0;
**4 while** (*seqIndex* > 0) AND (*timeoutCounter* ≤ *timeout*) **do**
**5 if** *P*<sub>seq</sub>[*seqIndex*] = *P*<sub>def</sub>[*defIndex*] **then**
**6** *defIndex* ← *defIndex* − 1;
**7 if** *defIndex* = 0 **then**
**8 return True**;
**9 end**
**10 else**
**11** *timeoutCounter* ← *timeoutCounter* + 1;
**12 end**
**13** *seqIndex* ← *seqIndex* − 1;
**14 end**
**15 return False**;
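Algorithm 4 can be sketched in Python as follows (the function name is ours; as in the pseudocode, the timeout counter accumulates non-matching steps and is not reset on a match):

```python
def match_pattern(p_seq, p_def, timeout):
    """Sketch of Algorithm 4: walk the recognized-primitive sequence
    p_seq in reverse chronological order, consuming the predefined
    pattern p_def from its tail; give up once `timeout` non-matching
    steps have accumulated or the sequence is exhausted."""
    seq_index = len(p_seq) - 1
    def_index = len(p_def) - 1
    timeout_counter = 0
    while seq_index >= 0 and timeout_counter <= timeout:
        if p_seq[seq_index] == p_def[def_index]:
            def_index -= 1
            if def_index < 0:       # whole pattern consumed: CB matched
                return True
        else:
            timeout_counter += 1    # a step without a match
        seq_index -= 1
    return False
```

Because the newest primitives are examined first, the full sequence never needs to be rescanned for each pattern, which is the resource saving described above.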
