3.1. Model Overview
This paper proposes a new deep neural network model, VSGG-Net, for video scene graph generation (VidSGG). VSGG-Net comprises: object tracklet detection, which finds the spatio-temporal regions of objects of various sizes over the entire video range using a sliding window scheme; tracklet pair proposal, which selects only the object tracklet pairs with high relatedness among those appearing in the video; hierarchical context reasoning based on a spatio-temporal graph neural network, which extracts rich context between two objects; and object and relationship classification, which applies a class weighting technique to address the long-tailed relationship distribution problem. The structure of the proposed model is shown in Figure 2. The proposed model can be viewed as a single pipeline consisting of four stages: object tracklet detection (OTD), tracklet pair proposal (TPP), context reasoning (CR), and object and relationship classification (ORC).
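To make the data flow of this pipeline concrete, the following Python sketch shows how the OTD output can be represented as a complete graph over detected tracklets, the structure that the later stages progressively refine; the Tracklet and Graph containers here are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

@dataclass
class Tracklet:
    object_id: int
    start_frame: int
    end_frame: int

@dataclass
class Graph:
    nodes: List[Tracklet]
    edges: List[Tuple[int, int]]  # (subject node index, object node index)

def build_complete_graph(tracklets: List[Tracklet]) -> Graph:
    """OTD output: assume at least one relationship between every pair of tracklets."""
    edges = list(combinations(range(len(tracklets)), 2))
    return Graph(nodes=tracklets, edges=edges)

# The later stages progressively refine this graph:
# complete graph (OTD) -> sparse graph (TPP) -> spatio-temporal context graph (CR) -> scene graph (ORC).
```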
In the OTD stage of the proposed model, the video is not divided into segments of a fixed length. Instead, windows of different sizes are moved over the video using a sliding-window technique, and object tracklets of different durations are detected over the entire video range. A complete graph is then created by connecting the nodes representing these object tracklets, under the assumption that at least one binary relationship exists between every pair of objects in the video. However, as the number of objects appearing in the video, n, increases, the number of all possible object pairs that can be related, nC2 = n(n − 1)/2, grows quadratically. Therefore, a tracklet pair proposal that selects only the object tracklet pairs most likely to have a relationship is important for efficient VidSGG. In the TPP stage of the proposed model, the relatedness of each object tracklet pair is evaluated by combining pretrained neural net scoring and statistical scoring based on the dataset. Then, only pairs of object tracklets whose relatedness is higher than a certain level are selected to generate a sparse graph. In general, it is important to utilize various spatial and temporal contexts to determine a specific relationship between two objects appearing in a video. In the CR stage of the proposed model, a spatio-temporal contextualized graph containing abundant spatial and temporal contexts between object tracklets is derived through a hierarchical reasoning process using a spatio-temporal graph neural network. Finally, the feature representations of each object node and each relationship node of the spatio-temporal contextualized graph are used in the ORC stage of the proposed model, where the final video scene graph is generated by determining the object class and relationship type corresponding to each node. The notations used in this paper are summarized in Table 1.
3.2. Object Tracklet Detection and Pair Proposal
In the OTD stage of the proposed VSGG-Net model, an object detector is used to find the two-dimensional (2D) spatial regions of objects in each video frame. Based on these, object tracklets, the spatio-temporal regions in which each object appears in the video, must be detected. Instead of dividing the video into segments of a fixed size, a sliding window scheme [9] that moves multiple windows of different sizes across the entire video range is used to effectively detect object tracklets of various lengths. In the proposed model, a Faster-RCNN [17] with a ResNet101 [18] backbone is used as the object detector for frame-by-frame object detection. This object detector is used after being trained on the MS-COCO [16] and ILSVRC2016 [19] datasets. After object detection is performed for each frame, the same object is linked between neighboring frames over the entire range of the video to obtain object tracklets. The proposed model uses the Deep Sort [20] algorithm for this object tracking. After the basic object tracklets are detected over the entire range of the video, a sliding window technique is applied to find object tracklets of various lengths based on these basic object tracklets, using windows of various sizes starting with a minimum size of 30 frames. After detecting object tracklets of various lengths over the entire range of the video using the sliding window technique, a complete graph is generated, assuming that at least one relationship exists between every pair of detected object tracklets. Each node of this graph represents one object tracklet, and each edge represents the relationship between the two objects.
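As an illustration of the sliding-window step, the sketch below re-cuts a basic tracklet into sub-tracklets of several durations; only the 30-frame minimum window comes from the text, while the larger window sizes and the half-window stride are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tracklet:
    object_id: int
    start_frame: int
    end_frame: int

def windowed_tracklets(basic: Tracklet,
                       window_sizes: Tuple[int, ...] = (30, 60, 120)) -> List[Tracklet]:
    """Cut one basic tracklet into overlapping sub-tracklets of several durations."""
    out = []
    for w in window_sizes:
        stride = max(1, w // 2)                      # assumed 50% overlap between windows
        start = basic.start_frame
        while start + w - 1 <= basic.end_frame:
            out.append(Tracklet(basic.object_id, start, start + w - 1))
            start += stride
    # keep the full-length tracklet as well so long interactions are not lost
    out.append(basic)
    return out
```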
As the number n of object tracklets detected over the entire video range, rather than within each segment, is very large, the number of all possible object tracklet pairs, nC2 = n(n − 1)/2, places a heavy burden on the overall VidSGG task. Therefore, the object tracklet pair proposal (TPP) stage of the proposed model selects only the pairs of object tracklets having high relatedness after evaluating the relatedness of each pair. The relatedness scoring for each pair of object tracklets is performed by combining a trained neural network-based evaluation with dataset-based statistical scoring. Through this process, edges with low relatedness are excluded in the TPP stage from the complete graph obtained in the OTD stage, and a more compact sparse graph is generated. The object tracklet pair proposal mechanism of the proposed model, VSGG-Net, is schematically shown in Figure 3.
Prior to evaluating the relatedness of each object tracklet pair, temporal filtering (TF) is performed using a measure of the temporal overlap between the two object tracklets. Object tracklet pairs that do not overlap at all in time are excluded from the set of candidate object tracklet pairs, on the assumption that they cannot have any relationship. For object tracklet pairs that have passed temporal filtering, the relatedness between the two object tracklets is evaluated. To evaluate the relatedness between object tracklets, neural net scoring and statistical scoring are used together, as shown in Figure 3. Neural net scoring uses a neural network that determines the suitability of an object tracklet pair based on the class distribution of each of the two object tracklets. For example, when there are three object classes, “cat”, “plate”, and “vegetable”, the suitability of the pair (“vegetable”, “plate”) is determined to be higher than that of (“cat”, “plate”) by the pretrained neural network. The neural network used for relatedness scoring is composed of two fully-connected layers and is used after pretraining with the VidOR training dataset. If at least one relationship exists between two object tracklets on the scene graphs of the VidOR training dataset, the corresponding object tracklet pair is regarded as a positive example for neural network training; otherwise, it is regarded as a negative example. The neural net relatedness score for an object tracklet pair is calculated as in Equation (1), in which the class distribution of each object forming the pair is taken as input and passed through the fully connected layers.
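A minimal sketch of such a scorer is given below, assuming the two class distributions are concatenated and passed through two fully connected layers with a sigmoid output; the hidden width and the concatenation are assumptions, since Equation (1) itself is not reproduced here.

```python
import torch
import torch.nn as nn

class RelatednessScorer(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(2 * num_classes, hidden)   # subject + object class distributions
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, subj_dist: torch.Tensor, obj_dist: torch.Tensor) -> torch.Tensor:
        x = torch.cat([subj_dist, obj_dist], dim=-1)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x)))).squeeze(-1)

# Training uses pairs from the VidOR scene graphs: a pair with at least one annotated
# relationship is a positive example (label 1), any other pair is a negative example
# (label 0), so a binary classification loss such as nn.BCELoss() fits this setup.
```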
Another method of evaluating the relatedness between two object tracklets is statistical scoring. Each scene graph included in the VidOR training dataset can be viewed as a set of facts in the form of a triplet, such as <subject, relationship, object>. Statistical scoring evaluates the relatedness of two objects according to the frequency with which the two objects co-occur as a subject and an object in this set of facts. For statistical scoring, each fact is divided into a (subject, relation_predicate) pair and a (relation_predicate, object) pair to create two co-occurrence matrices, as shown in Figure 3. The matrix containing the relatedness scores for all possible (subject, object) pairs is calculated by multiplying these two matrices. By normalizing the score in this matrix for each object tracklet pair, the statistical relatedness score is obtained.
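The sketch below illustrates this computation under the assumption that the co-occurrence counts are simply accumulated from the training triplets and the resulting score matrix is normalized row-wise; the paper's exact normalization may differ.

```python
import numpy as np

def statistical_scores(facts, num_obj_classes: int, num_predicates: int) -> np.ndarray:
    """facts: iterable of (subject_class, predicate, object_class) index triplets."""
    subj_pred = np.zeros((num_obj_classes, num_predicates))
    pred_obj = np.zeros((num_predicates, num_obj_classes))
    for s, p, o in facts:
        subj_pred[s, p] += 1
        pred_obj[p, o] += 1
    scores = subj_pred @ pred_obj          # co-occurrence of (subject, object) via any predicate
    row_sum = scores.sum(axis=1, keepdims=True)
    return np.divide(scores, row_sum, out=np.zeros_like(scores), where=row_sum > 0)

# Example: the statistical score of the pair ("vegetable", "plate") is then
# statistical_scores(facts, C, P)[vegetable_idx, plate_idx].
```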
For each object tracklet pair, the final relatedness score is calculated by combining the neural net score and the statistical score, as shown in Equation (2). The weight in Equation (2) adjusts the relative contribution of the neural net score and the statistical score according to the reliability of the two scoring methods; for the VidOR dataset, this weight is set to 0.7 in the proposed model. Only object tracklet pairs whose final relatedness score is higher than the threshold value of 0.7 are proposed in this stage. In the TPP stage, all edges connecting two object tracklet nodes with no relationship or a very low relatedness are thus excluded from the complete graph to generate the sparse graph.
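A possible reading of this decision rule is sketched below; the weighted-sum form of Equation (2) is an assumption consistent with the description of a weight that trades off the two scores, while the 0.7 weight and 0.7 threshold are the values stated above.

```python
def final_relatedness(nn_score: float, stat_score: float, alpha: float = 0.7) -> float:
    """Combine neural net and statistical scores; alpha weights the neural net score."""
    return alpha * nn_score + (1.0 - alpha) * stat_score

def propose_pair(nn_score: float, stat_score: float,
                 temporal_overlap: float, threshold: float = 0.7) -> bool:
    """Temporal filtering first: pairs that never overlap in time are discarded."""
    if temporal_overlap <= 0.0:
        return False
    return final_relatedness(nn_score, stat_score) >= threshold
```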
3.3. Context Reasoning and Classification
In order to effectively discriminate between the various objects appearing in a video and their relationships, temporal contexts are needed in addition to various spatial contexts. In the CR stage of VSGG-Net, context reasoning based on a spatio-temporal graph neural network is applied to the sparse graph generated by the TPP stage to produce a context graph containing rich spatio-temporal contexts. Figure 4 shows the CR process of the proposed model.
As shown in Figure 4, the context reasoning process of VSGG-Net is largely composed of the following stages: spatio-temporal attention, visual context reasoning, and semantic context reasoning. Visual context reasoning uses the visual information of each object tracklet, whereas semantic context reasoning uses the class distribution of each object tracklet, which is the result of visual context reasoning. Therefore, the CR process of the proposed model is a hierarchical reasoning process consisting of lower-level visual context reasoning and higher-level semantic context reasoning. In addition, the CR at each level is performed iteratively using the pre-calculated spatial attention, the temporal attention, and a GCN [21]. Based on the spatial and temporal attentions, the GCN repeatedly updates each node with information from its neighboring nodes so that each node comes to contain sufficient spatio-temporal context. In the proposed model, a context graph is first constructed to perform CR. The context graph has two types of nodes, object nodes and relationship nodes, and three types of edges that connect the pairs (subject node, relationship node), (relationship node, object node), and (subject node, object node), respectively. Unlike the existing GCN [21], the spatio-temporal GCN of the proposed model not only uses the spatial attention and the temporal attention, but also exchanges information between the node pairs (subject node, relationship node) and (relationship node, object node), as well as (subject node, object node), through the three types of edges.
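For reference, the sketch below fixes one possible layout of such a context graph, with the two node types and the three edge types described above; the field names and tensor shapes are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import torch

@dataclass
class ContextGraph:
    object_feats: torch.Tensor                   # [num_objects, d_obj]
    relation_feats: torch.Tensor                 # [num_relations, d_rel]
    subj_rel_edges: List[Tuple[int, int]] = field(default_factory=list)  # (subject node, relationship node)
    rel_obj_edges: List[Tuple[int, int]] = field(default_factory=list)   # (relationship node, object node)
    subj_obj_edges: List[Tuple[int, int]] = field(default_factory=list)  # (subject node, object node)
```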
The detailed CR process at each level is as follows. First, each subject and object node of the initial visual context graph for visual context reasoning is filled with the I3D visual features of the object tracklet and the CNN visual features of the frames belonging to the tracklet range. Each relationship node is filled with the I3D visual features of the subject tracklet and the object tracklet that have the corresponding relationship, together with the relative features [1] of the (subject, object) pair. Equation (3) represents the three types of relative features used in the proposed model. In Equation (3), the relative features are defined in terms of the center coordinates of the bounding boxes of the subject tracklet and the object tracklet, and the sizes of the bounding boxes of the subject tracklet and the object tracklet, respectively.
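Since Equation (3) is not reproduced here, the sketch below uses a commonly assumed formulation of pairwise relative features (normalized center offset, size ratio, and center distance) computed from exactly the quantities listed above.

```python
import math

def relative_features(subj_center, subj_size, obj_center, obj_size):
    """centers are (x, y); sizes are (width, height) of the tracklet bounding boxes."""
    sx, sy = subj_center
    ox, oy = obj_center
    sw, sh = subj_size
    ow, oh = obj_size
    offset = ((ox - sx) / sw, (oy - sy) / sh)          # relative position
    size_ratio = (ow * oh) / (sw * sh)                 # relative scale
    distance = math.hypot(ox - sx, oy - sy)            # center distance
    return offset, size_ratio, distance
```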
When the initial visual context graph is created, the spatial attention and the temporal attention to be applied to each edge connecting two object nodes are calculated. The closer two object tracklets are located in space, the stronger the spatial attention on the edge between the corresponding object nodes should be. Likewise, the longer two object tracklets overlap in time, the stronger the temporal attention on the edge between the corresponding object nodes should be. Therefore, the spatial attention and the temporal attention are computed using Equations (4) and (5), respectively. Equation (4) is defined in terms of the distance between the centroids of the two object tracklets and the set of other object tracklets spatially adjacent to the subject tracklet. Equation (5) is defined in terms of the degree of temporal overlap between the two object tracklets and the set of other object tracklets temporally adjacent to the subject tracklet. Using the pre-calculated spatial attention and temporal attention, each subject node and object node in the spatio-temporal context graph is updated as shown in Equation (6), and each relationship node is updated as shown in Equation (7), reflecting the contexts of the neighboring nodes. In Equations (6) and (7), the three node terms denote a subject node, a relationship node, and an object node, respectively, and the attention term denotes the sum of the spatial attention and the temporal attention. The aggregation terms denote the information received from neighboring subject and object nodes and from neighboring relationship nodes, and the four weight matrices denote the weights between the subject–object, subject–relationship, relationship–object, and relationship–subject node pairs, respectively. As expressed in Equations (6) and (7), visual context reasoning is performed while passing through two spatio-temporal GCN layers.
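The sketch below shows one way the attention computation and the attention-weighted message passing could be organized; the softmax forms of Equations (4) and (5) and the single-linear-layer messages are assumptions, and only the overall structure (three edge types, spatial plus temporal attention) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_attention(distances: torch.Tensor) -> torch.Tensor:
    """Eq. (4) sketch: closer tracklets get larger attention; normalized over neighbors."""
    return F.softmax(-distances, dim=-1)

def temporal_attention(overlaps: torch.Tensor) -> torch.Tensor:
    """Eq. (5) sketch: longer temporal overlap gets larger attention; normalized over neighbors."""
    return F.softmax(overlaps, dim=-1)

class STGCNLayer(nn.Module):
    """One attention-weighted message-passing layer over the three edge types."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_so = nn.Linear(dim, dim)   # subject/object node -> subject/object node
        self.w_sr = nn.Linear(dim, dim)   # relationship node   -> subject node
        self.w_ro = nn.Linear(dim, dim)   # relationship node   -> object node
        self.w_rs = nn.Linear(dim, dim)   # subject/object node -> relationship node

    def forward(self, obj_feats, rel_feats, attn, rel_to_subj, rel_to_obj):
        # obj_feats: [N, d], rel_feats: [R, d], attn: [N, N] (spatial + temporal attention)
        # rel_to_subj / rel_to_obj: LongTensor [R], subject / object node index of each relation
        obj_msg = attn @ self.w_so(obj_feats)               # Eq. (6) sketch: from other object nodes
        rel_msg_s = torch.zeros_like(obj_feats).index_add_(0, rel_to_subj, self.w_sr(rel_feats))
        rel_msg_o = torch.zeros_like(obj_feats).index_add_(0, rel_to_obj, self.w_ro(rel_feats))
        new_obj = F.relu(obj_feats + obj_msg + rel_msg_s + rel_msg_o)
        pair_msg = self.w_rs(obj_feats[rel_to_subj] + obj_feats[rel_to_obj])   # Eq. (7) sketch
        new_rel = F.relu(rel_feats + pair_msg)
        return new_obj, new_rel
```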
When visual context reasoning is completed, an object classifier is applied to each object node and a relationship classifier to each relationship node of the visual context graph to calculate the class distribution to which the corresponding node belongs. To perform semantic context reasoning, a new semantic context graph with the same structure as the visual context graph is generated, and each of its nodes is initialized with the class probability distribution of the corresponding node in the visual context graph. Semantic context reasoning is then performed through two spatio-temporal GCN layers in the same way as visual context reasoning, using the same spatial attention and temporal attention as in visual context reasoning. The final spatio-temporal context graph obtained through this process thus has higher-level semantic context built on top of lower-level visual context.
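A short sketch of this hand-off is given below, assuming the object classifier produces a softmax distribution and the relationship classifier a per-class probability; the classifier modules themselves are hypothetical placeholders passed in by the caller.

```python
import torch
import torch.nn.functional as F

def init_semantic_graph(obj_feats, rel_feats, obj_classifier, rel_classifier):
    """Each semantic node starts as the class distribution of its visual counterpart."""
    obj_sem = F.softmax(obj_classifier(obj_feats), dim=-1)   # object class distributions
    rel_sem = torch.sigmoid(rel_classifier(rel_feats))       # per-relationship probabilities
    return obj_sem, rel_sem
```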
In the ORC stage of VSGG-Net, the objects constituting the video scene graph and the relationships between them are determined using the spatio-temporal context graph. In this stage, the object nodes and relationship nodes are classified into the most likely categories based on the information of each node in the spatio-temporal context graph. Figure 5 shows the process of classifying objects and relationships in the proposed model. Each object node in the spatio-temporal context graph passes through a softmax function and is labeled with the object class with the highest score. Each relationship node passes through a softmax function and is labeled with the top five relationship classes with the highest scores. For object classification, cross entropy is used as the loss function; for relationship classification, binary cross entropy is used. This allows the proposed model to assign multiple relationships to one object pair at the same time. In other words, the proposed model allows an object pair (“child”, “dog”) to have multiple relationships, such as child-caress-dog and child-next_to-dog, at the same time, as shown in Figure 5.
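The sketch below mirrors these decisions: argmax over the softmax for object nodes, top-five scores for relationship nodes, cross entropy for objects, and binary cross entropy for relationships; the top-k selection by raw score is an assumed implementation detail.

```python
import torch
import torch.nn.functional as F

def classify_nodes(obj_logits: torch.Tensor, rel_logits: torch.Tensor, k: int = 5):
    obj_labels = F.softmax(obj_logits, dim=-1).argmax(dim=-1)   # one class per object node
    rel_topk = rel_logits.topk(k, dim=-1).indices               # up to k relations per pair
    return obj_labels, rel_topk

# Training losses matching the description: cross entropy for objects,
# binary cross entropy (multi-label) for relationships.
def losses(obj_logits, obj_targets, rel_logits, rel_multi_hot):
    obj_loss = F.cross_entropy(obj_logits, obj_targets)
    rel_loss = F.binary_cross_entropy_with_logits(rel_logits, rel_multi_hot)
    return obj_loss, rel_loss
```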
In datasets such as VidOR and VidVRD, there are relationships that appear with high frequency, such as “next_to” and “in_front_of”, as well as many relationships that appear with low frequency, such as “cut” and “shake_hand_with”. Relationships with a low frequency of appearance inevitably have a lower recognition rate than those with a high frequency of appearance in the relationship classification process. In the ORC stage of VSGG-Net, a relationship class weighting technique, as shown in Equation (8), is applied to solve the long-tailed relationship distribution problem. This technique adjusts the weight of each relationship class in the loss function according to its frequency of appearance: the lower the frequency of appearance, the higher the weight of the relationship loss in the loss function. In Equation (8), the count term denotes the number of relationship instances corresponding to a relationship class in the training dataset, and the weight term denotes the weight of that relationship class in the loss function for learning the relationship classifier.
By applying this relationship class weighting technique, VSGG-Net can obtain a high classification performance even for relationships with a relatively low frequency of appearance.
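Since Equation (8) is not reproduced here, the sketch below uses an assumed inverse-frequency weighting with the same intent: the rarer a relationship class, the larger its weight in the relationship classification loss.

```python
import torch
import torch.nn.functional as F

def class_weights(counts: torch.Tensor) -> torch.Tensor:
    """counts[c] = number of training instances of relationship class c."""
    counts = counts.clamp(min=1).float()
    return counts.sum() / (counts.numel() * counts)    # inverse frequency, larger for rare classes

def weighted_relationship_loss(rel_logits, rel_multi_hot, counts):
    # pos_weight up-weights the positive term of rare relationship classes in the BCE loss
    return F.binary_cross_entropy_with_logits(
        rel_logits, rel_multi_hot, pos_weight=class_weights(counts))
```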