Feature Extraction

We make use of low-level and mid-level visual information to predict the high-level features needed to determine the semantic composition of a logical story unit. In our approach, we use object, person and location tags as high-level features for detecting LSU boundaries. To obtain the object and person annotations, the latest version of the YOLO object detector [12], pre-trained on the COCO (Common Objects in Context) dataset [22], is used. The dataset comprises 1.5 million object instances covering 80 object classes. Along with the object detector, the place or location of each scene is predicted using the ResNet-50 CNN architecture, pre-trained on the Places365 dataset [11]. This dataset contains more than 10 million images in total, covering 400+ unique scene categories [23].
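
As an illustration only (not the exact pipeline of this work), the following sketch shows how per-frame object and place tags could be gathered, assuming the `ultralytics` YOLO package, a locally available Places365-trained ResNet-50 checkpoint named `resnet50_places365.pth`, and a list of the 365 scene labels; these names are assumptions and not part of the original setup.

```python
# Illustrative sketch of per-frame tag extraction (hypothetical file and
# checkpoint names; not the authors' exact code).
import torch
import torchvision.transforms as T
from torchvision.models import resnet50
from ultralytics import YOLO
from PIL import Image

yolo = YOLO("yolov8n.pt")            # COCO-pretrained detector (80 classes)
places = resnet50(num_classes=365)   # ResNet-50 head for the 365 Places categories
places.load_state_dict(torch.load("resnet50_places365.pth", map_location="cpu"))
places.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def frame_tags(image_path, scene_labels):
    """Return detected object tags and the top-1 scene label for one frame."""
    det = yolo(image_path, verbose=False)[0]
    objects = [det.names[int(c)] for c in det.boxes.cls]    # e.g. ['person', 'chair']
    img = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        logits = places(preprocess(img).unsqueeze(0))
    scene = scene_labels[int(logits.argmax(dim=1))]         # e.g. 'living_room'
    return objects, scene
```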

#### *3.2. Structure Mining: Shot Boundary Detection*

Once we have extracted the visual features of the video frames, we use them to estimate the similarity between frames. This, in turn, is used to predict the overall structure of the video, as shown in Figure 3. Broadcast videos generally have a frame rate of 24 fps; for computational efficiency, we process every sixth frame of the video (4 frames/s). We then cluster temporally similar frames to form shot and story units.
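
A minimal OpenCV sketch of this subsampling step is given below; the function name is illustrative only.

```python
# Keep every sixth frame of a ~24 fps video, i.e. roughly 4 frames per second.
import cv2

def sample_frames(video_path, step=6):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append((idx, frame))   # keep the original frame index for Eq. (4)
        idx += 1
    cap.release()
    return frames
```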

**Figure 3.** Overview of the framework for Shot Detection. Shot is defined as a group of continuous frames without a cut. To predict shot boundaries, the framework utilises only frame-level visual features from the given input video.

#### *Spatio-Temporal Visual Similarity Modelling*

In contrast to other approaches that use clustering for boundary detection, we construct a similarity matrix that jointly describes spatial similarity and temporal proximity. The generic element *Sij* defines the similarity between frames *i* and *j*, as shown in Equation (1).

$$S_{ij} = \exp\left(-\frac{d_1^2(\psi(x_i), \psi(x_j)) + \alpha \cdot d_2^2(x_i, x_j)}{2\sigma^2}\right) \tag{1}$$

where *ψ*(*xi*) and *ψ*(*xj*) are the lists of visual tags for the *i*th and *j*th frames, respectively. *d*₁² is the cosine distance between the tag representations of frames *xi* and *xj*, while *d*₂² is the normalised temporal distance between frames *xi* and *xj*. The parameter *α* tunes the relative importance of semantic similarity and temporal distance. The effect of *α* on the similarity matrix is shown in Figure 4.
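
The construction of the matrix can be sketched as follows, assuming the squared semantic distances *d*₁² and squared temporal distances *d*₂² (Sections 3.2.1 and 3.2.2) are already available as N × N arrays; the value of *σ* is not specified in the text, so a unit bandwidth is assumed here.

```python
# Joint similarity of Equation (1): S_ij = exp(-(d1^2 + alpha * d2^2) / (2 sigma^2)).
import numpy as np

def joint_similarity(d1_sq, d2_sq, alpha=5.0, sigma=1.0):
    """d1_sq, d2_sq: N x N arrays of squared semantic / temporal distances."""
    return np.exp(-(d1_sq + alpha * d2_sq) / (2.0 * sigma ** 2))
```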

**Figure 4.** Effect of *α* (from left to right 0, 5, and 10) on similarity matrix *Sij*. Higher values of *α* enforce temporal connections between nearby frames and increase the quality of the detected shots.

As shown in Figure 4, applying increasing values of *α* to the similarity matrix raises the similarities of adjacent frames, thereby boosting the temporal correlation of frames in the neighbourhood. At the same time, excessively high values of *α* boost the temporal correlation of only very close neighbouring frames, thereby failing to capture gradual shot changes. The final boundaries are created between frames that do not belong to the same cluster. An experiment was conducted with the videos of the RAI dataset, in which *α* was varied from 1 to 10 and its effect was studied. We found that an *α* value of 5 performed well on average for both gradual and sharp shot changes. Therefore, we use *α* = 5 in our shot boundary detection experiments, since it provides the right amount of local temporal similarity for boundary prediction.

As seen in Equation (1), the semantic composition-based frame-similarity estimation consists of the following two sub-parts:


#### 3.2.1. Semantic Similarity Scoring Scheme

We use the cosine similarity principle to measure inter-frame similarity; that is, we measure the cosine of the angle between the two frame vectors of interest. The cosine similarity between the *i*th and the *j*th frame is calculated by taking the dot product of the normalised tag vectors as follows:

$$\text{sim}(\mathbf{x}_i, \mathbf{x}_j) = \psi(\mathbf{x}_i) \cdot \psi(\mathbf{x}_j) \tag{2}$$

where *ψ*(*xi*) is the normalised vector based on the list of visual tags for frame *xi*. Computing this for all frame pairs yields a spatial similarity matrix. The similarity measure is converted into a distance measure using Equation (3):

$$d_1^2(\psi(\mathbf{x}_i), \psi(\mathbf{x}_j)) = 1 - \text{sim}(\mathbf{x}_i, \mathbf{x}_j) \tag{3}$$
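
As an illustration, the two equations can be realised with a simple bag-of-tags encoding; the encoding itself (a hypothetical L2-normalised count vector over a tag vocabulary) is an assumption, since the text does not specify how the tag lists are vectorised.

```python
# Equations (2) and (3) with a hypothetical bag-of-tags frame representation.
import numpy as np

def tag_vector(tags, vocabulary):
    """L2-normalised count vector over a tag vocabulary (dict: tag -> index)."""
    v = np.zeros(len(vocabulary))
    for t in tags:
        if t in vocabulary:
            v[vocabulary[t]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def semantic_distance_sq(psi_i, psi_j):
    """d1^2 = 1 - sim, where sim is the dot product of the normalised vectors."""
    return 1.0 - float(psi_i @ psi_j)
```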

An example of utilising the spatial similarity matrix to retrieve the top four similar frames from a video is shown in Figure 5.

**Figure 5.** An example of utilising the spatial similarity matrix to retrieve top four similar frames from a video. The video used is Season 5 Episode 21 of *FRIENDS* show.

#### 3.2.2. Temporal Model Analysis

As per Equation (1), the temporal proximity is modelled using *d*₂², the normalised temporal distance between frames *xi* and *xj*. The normalised temporal distance is defined by Equation (4):

$$d_2^2(\mathbf{x}_i, \mathbf{x}_j) = \frac{|f_i - f_j|}{l} \tag{4}$$

where *fi* and *fj* are the indices of frames *xi* and *xj*, respectively, and *l* is the total number of frames in the video.
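
A short NumPy sketch of Equation (4), using the original frame indices of the sampled frames, could look as follows.

```python
# Normalised temporal distance of Equation (4) for every pair of sampled frames.
import numpy as np

def temporal_distance_sq(frame_indices, total_frames):
    f = np.asarray(frame_indices, dtype=float)
    return np.abs(f[:, None] - f[None, :]) / float(total_frames)
```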

#### 3.2.3. Boundary Prediction

Based on Equation (1), the lower the value of *Sij*, the more dissimilar frames *xi* and *xj* are. Thus, we calculate the shot boundary by thresholding *Sij*. In our experiments, 0.4 was used as the threshold value. The entire shot boundary detection algorithm is shown in Algorithm 1.
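
Algorithm 1 is not reproduced here; under a simplified reading, in which a boundary is placed wherever the similarity between consecutive sampled frames drops below the threshold, the step can be sketched as follows.

```python
# Simplified boundary prediction: threshold the similarity of consecutive frames.
import numpy as np

def shot_boundaries(S, threshold=0.4):
    """Return indices i such that a shot boundary lies between frames i and i+1."""
    consecutive = np.diagonal(S, offset=1)   # S[i, i+1] for all i
    return [i for i, s in enumerate(consecutive) if s < threshold]
```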


#### *3.3. Similarity Estimation: Context Based Logical Story Unit Detection*

Based on our experiments, we have observed that typical broadcast content, such as a soap-opera episode or a news programme, often makes use of multiple camera angles pertaining to the same story unit. In more than 90% of the cases, these angles recur multiple times throughout the video. Therefore, as shown in Figure 6, the context-based similarity estimation begins with shot detection. Starting from these estimated shot boundaries, frame-level semantic descriptions are merged as follows:

$$L_{ij} = \frac{w_1 \cdot \text{place\_sim} + w_2 \cdot \text{obj\_sim}}{w_1 + w_2} \tag{5}$$

where *w*1 and *w*2 are the weights for the place and object descriptions, respectively. In our experiments, we give more importance to place descriptions than to object descriptions, mainly because state-of-the-art object detection models cannot predict all the objects in a frame, whereas the pre-trained place-detection model captures the overall context of the shot location and has therefore been deemed more important. Accordingly, we set *w*1 = 2 and *w*2 = 1 in all our experiments.
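
Equation (5) then reduces to a small weighted average; the sketch below uses the weights reported above and works equally on scalars or on NumPy arrays of shot-level similarities.

```python
# Weighted fusion of place and object similarities (Equation (5)).
def lsu_similarity(place_sim, obj_sim, w1=2.0, w2=1.0):
    return (w1 * place_sim + w2 * obj_sim) / (w1 + w2)
```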

The shot-level similarity measure is calculated based on the joint similarity estimated using Equation (5). An example of the similarity matrix of a video from RAI is shown in Figure 7. The final similarity matrix is used along with the re-identification algorithm to generate object timelines.

**Figure 6.** Overview of the LSU-detection module. Given the input video, the framework extracts audio–visual features to predict logical story unit boundaries based on semantic similarity between temporally coherent *shots*. The final decision boundary is based on thresholding the distance between consecutive *shots*.

**Figure 7.** Estimated shot similarity for RAI video 23353. The figure also shows key frames of a selected LSU (red box).
