#### *4.1. Dataset*

In this work, a thorough and objective performance evaluation was carried out for the three stages of the pipeline: shot boundary detection, LSU boundary detection, and object re-ID.

To evaluate the proposed approach for shot and LSU boundary detection, we tested the pipeline on the benchmark RAI dataset. This dataset is a collection of ten challenging broadcast videos from the Rai Scuola video archive, ranging from documentaries to talk shows, and contains both simple and complex transitions.

We evaluate our approach for object re-ID on randomly selected SOAP episodes. For a fair evaluation, we validate the approach on two different SOAP broadcast series, namely *New Girl* and *Friends*. We selected 10 episodes from Season 4 of *New Girl* and 10 episodes from Season 3 of *Friends* as our final dataset for object re-ID.

#### *4.2. Evaluation Metrics*

We evaluated the pipeline based on three tasks: (1) accuracy of the shot boundary detection; (2) accuracy of the LSU boundary detection; and (3) accuracy of the object re-ID algorithm.

For all experiments, we use precision, recall, and F1-score to evaluate our results. These measures are computed from the shots/LSUs matched with the ground truth. Furthermore, the results were graphically visualised and analysed to aid interpretation.

The precision measure refers to the fraction of predicted boundaries that are correct, whereas the recall measure denotes the fraction of ground-truth boundaries that are correctly retrieved. If *groundtruth* refers to the list of ground-truth boundaries and *prediction* refers to the list of automatically predicted boundaries, then precision and recall can be expressed as in Equation (10).

$$\begin{aligned} precision &= \frac{|\text{groundtruth} \cap \text{prediction}|}{|\text{prediction}|}\\ recall &= \frac{|\text{groundtruth} \cap \text{prediction}|}{|\text{groundtruth}|} \end{aligned} \tag{10}$$

The F-score, on the other hand, combines the precision and recall measures; it is the harmonic mean of the two. The traditional *Fshot* can be defined as follows:

$$F\_{\text{shot}} = 2 \cdot \frac{precision \cdot recall}{precision + recall} \tag{11}$$
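As an illustration, the following sketch (not part of the proposed pipeline) shows how Equations (10) and (11) can be computed from two lists of boundary indices; the exact-match criterion for boundaries is an assumption, since matching tolerances vary between benchmarks.

```python
# Illustrative sketch of Equations (10) and (11); assumes a boundary matches
# only when its index is identical in both lists (tolerance-based matching
# would require an extra step not shown here).

def boundary_scores(groundtruth, prediction):
    """Precision, recall and F_shot from two lists of boundary indices."""
    gt, pred = set(groundtruth), set(prediction)
    matched = gt & pred                                   # groundtruth ∩ prediction
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(gt) if gt else 0.0
    f_shot = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
    return precision, recall, f_shot


# Hypothetical example: three ground-truth boundaries, one miss, one false alarm.
print(boundary_scores([120, 310, 660], [120, 305, 660]))  # ≈ (0.667, 0.667, 0.667)
```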

As mentioned in earlier sections, precision, recall, and the F1 measure alone do not suffice to validate the accuracy of the LSU boundary detection algorithm. The reason is that humans and algorithms perceive story units differently: humans can relate changes in time and location to discontinuities in meaning, whereas an algorithm relies solely on visual dissimilarity to identify discontinuities. This semantic gap makes it impossible for algorithms to achieve fully correct detection results. Therefore, as suggested in [9], we use the coverage and overflow metrics to measure how well our LSU boundary detection algorithm, which relies on visual features, performs with respect to human-labelled LSUs. That is, in addition to precision, recall, and F1, we use the *coverage* and *overflow* measures to evaluate the number of frames that were correctly clustered together.

Coverage *C* measures the quantity of frames belonging to the same scene that are correctly grouped together, while overflow *O* evaluates to what extent frames not belonging to the same scene are erroneously grouped together. Formally, given the set of automatically detected scenes *s* = [*s*1, *s*2, ..., *sm*] and the ground-truth scenes *g* = [*g*1, *g*2, ..., *gn*], where each element of *s* and *g* is a set of shot indexes, the coverage of a ground-truth scene *gt* is proportional to the longest overlap between the detected scenes *si* and *gt*:

$$\text{coverage} = \frac{\max\_{i=1\ldots m} \#(s\_i \cap g\_t)}{\#(g\_t)} \tag{12}$$

$$\text{overflow} = \frac{\sum\_{i=1}^{m} \#(s\_i \setminus g\_t) \cdot \min(1, \#(s\_i \cap g\_t))}{\#(g\_{t-1}) + \#(g\_{t+1})} \tag{13}$$

*Fscene* combines the coverage and overflow measures; it is the harmonic mean of coverage and 1 − *overflow*. For coverage, values closer to 1 indicate better performance, whereas for overflow, values closer to 0 are better; thus we use 1 − *overflow* when calculating *Fscene*:

$$F\_{scene} = 2 \cdot \frac{coverage \cdot (1 - overflow)}{coverage + (1 - overflow)} \tag{14}$$
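For clarity, the sketch below is a minimal illustration (not the authors' implementation) of how Equations (12)–(14) can be evaluated for one ground-truth scene; the handling of the first and last ground-truth scene, where one neighbour is missing, is an assumption not specified above.

```python
# Illustrative sketch of Equations (12)-(14). Scenes are sets of shot indexes;
# the treatment of the first/last ground-truth scene (missing neighbour) is an
# assumption made for this example.

def coverage_overflow(detected, groundtruth, t):
    """Coverage and overflow for the ground-truth scene groundtruth[t]."""
    g_t = groundtruth[t]
    # Eq. (12): longest overlap of any detected scene with g_t, normalised by |g_t|.
    coverage = max(len(s & g_t) for s in detected) / len(g_t)

    # Eq. (13): shots spilling outside g_t from detected scenes that touch it,
    # normalised by the size of the neighbouring ground-truth scenes.
    spill = sum(len(s - g_t) * min(1, len(s & g_t)) for s in detected)
    denom = (len(groundtruth[t - 1]) if t > 0 else 0) + \
            (len(groundtruth[t + 1]) if t + 1 < len(groundtruth) else 0)
    overflow = spill / denom if denom else 0.0
    return coverage, overflow


def f_scene(coverage, overflow):
    """Eq. (14): harmonic mean of coverage and (1 - overflow)."""
    c, o = coverage, 1.0 - overflow
    return 2 * c * o / (c + o) if (c + o) > 0 else 0.0
```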

For the experiments pertaining to object re-ID, we use the *Accuracy* metric. Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total number of observations. In our scenario, predicted observations are labelled *True* if they are correct and *False* otherwise. Therefore, if the total number of *True* samples is denoted by *True* and the total number of *False* samples by *False*, then accuracy can be calculated as follows:

$$Accuracy = \frac{True}{True + False} \tag{15}$$
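A minimal sketch of Equation (15) is given below; representing the re-ID outcomes as a list of booleans is an assumption made purely for illustration.

```python
# Illustrative sketch of Equation (15): accuracy over re-ID outcomes,
# each labelled True (correct) or False (incorrect).

def accuracy(outcomes):
    """outcomes: list of booleans, one per re-identified object instance."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Hypothetical example: three correct re-identifications out of four.
print(accuracy([True, True, False, True]))  # 0.75
```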

#### **5. Results and Discussion**
