The human skeleton can be partitioned at different scales according to its limb segments (e.g., legs, torso). We therefore establish three scales for representing the human skeleton, namely, the torso scale, the limb scale, and the joint scale, and propose a multi-scale architecture that effectively exploits the skeleton information at these scales for feature extraction. Based on the designed joint hypergraph, we construct a multi-scale hypergraph convolution module to extract spatial information at each scale.
The construction of the multi-scale hypergraph convolution network is divided into three steps. First, the multi-scale segmentation operator is constructed. Then, the single-scale graph convolution and hypergraph convolution modules are built to extract spatial information at each scale. Finally, a single-scale hypergraph fusion operator is designed to fuse the information from the different scales.
3.2.1. Multi-Scale Segmentation Operator Construction
In contrast to previous approaches [35], which categorize joints on the basis of individual joints alone to obtain multi-scale maps, our method divides the human joints into three scales according to the relationships within the human skeleton: the torso scale, the limb scale, and the joint scale. The torso scale focuses on the global information of the human skeleton, the limb scale focuses on the overall connections among limb segments, and the joint scale focuses on the connectivity among individual joints. We construct a multi-scale spatial map, as shown in Figure 4. Specifically, the joint scale contains 17 joints, while the limb scale and torso scale consist of 11 and 5 joint points, respectively.
The human skeleton map contains rich connectivity among joints, and by aggregating multiple connected joints in proximity, different-scale skeleton maps representing global information can be obtained.
A common approach uses maximum pooling to obtain skeleton maps at different scales [35]: among neighboring connected joints, max pooling selects the joint carrying the most information as the representative, yielding a torso-scale map that contains global information. However, maximum pooling tends to discard the joints carrying less information. We therefore use average pooling to aggregate the information of adjacent connected joints. Compared with maximum pooling, average pooling attends to the information in every joint point, so the limb-scale and torso-scale skeleton maps retain more complete global information.
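As a concrete illustration, the sketch below (PyTorch; the joint grouping is hypothetical and purely illustrative, not the exact partition of Figure 4) aggregates groups of neighboring joints by average pooling and contrasts the result with max pooling:

```python
import torch

# Features for one frame: 17 joints, C = 3 channels (x, y, z).
x = torch.randn(17, 3)

# Hypothetical grouping of the 17 joints into 5 torso-scale parts
# (illustrative only; the paper's actual partition is given in Figure 4).
torso_groups = [
    [0, 1, 2],         # head/neck region
    [3, 4, 5],         # left arm
    [6, 7, 8],         # right arm
    [9, 10, 11, 12],   # left leg + hip
    [13, 14, 15, 16],  # right leg + hip
]

# Average pooling keeps a contribution from every joint in the group...
avg_pooled = torch.stack([x[g].mean(dim=0) for g in torso_groups])
# ...whereas max pooling keeps only the dominant response per channel.
max_pooled = torch.stack([x[g].max(dim=0).values for g in torso_groups])

print(avg_pooled.shape, max_pooled.shape)  # both: torch.Size([5, 3])
```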
Constructing the multi-scale segmentation operator is divided into two steps. First, we define the spatial graph convolution. Then, we design the scale transformation operator for scaling. The specific steps are as follows:
Step 1. Calculate spatial graph convolution.
We represent the human skeleton joints as a spatial graph, where the joints are used as graph nodes and the neighboring connections among the nodes are used as the edges of the graph, defining $G = (V, E)$, where $V$ is the set of joint nodes and $E$ is the set of skeletal edges; $G$ is the human skeleton graph containing $N$ joints in $T$ frames. We define the $k$-adjacency matrix $A_k$ as shown in Equation (5):

$$[A_k]_{i,j} = \begin{cases} 1, & d(v_i, v_j) = k, \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $d(v_i, v_j)$ is the length of the path between node $v_i$ and node $v_j$. To solve the problem of too little information, we superimpose the adjacency matrices $A_k$ obtained from different values of $k$. The calculation process is shown in Equation (6):

$$\tilde{A} = \sum_{k=0}^{K} A_k. \tag{6}$$
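As a minimal sketch of Equations (5) and (6) (PyTorch; the function name and edge-list format are our own illustration), the $k$-adjacency matrices can be built from hop distances and then superimposed:

```python
import torch

def k_adjacency(edges, num_joints, max_hop):
    """Build A_k for k = 0..max_hop from an undirected edge list and
    superimpose them (sketch of Eqs. (5)-(6))."""
    A = torch.zeros(num_joints, num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    # Hop-distance matrix d(v_i, v_j): the smallest k with a length-k walk.
    hop = torch.full((num_joints, num_joints), float("inf"))
    power = torch.eye(num_joints)
    for k in range(max_hop + 1):
        hop[(power > 0) & (hop == float("inf"))] = k
        power = power @ A
    # A_k has a 1 exactly where the path length equals k (Eq. (5)).
    A_ks = [(hop == k).float() for k in range(max_hop + 1)]
    # Superimpose the k-hop matrices to enlarge the receptive field (Eq. (6)).
    return torch.stack(A_ks).sum(dim=0)
```

For a skeleton edge list such as `[(0, 1), (1, 2)]`, `k_adjacency(edges, num_joints=3, max_hop=2)` returns a single matrix covering the 0-, 1-, and 2-hop neighborhoods.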
Meanwhile, considering the spatio-temporal graph $G$ as single-scale graph information, we take $X \in \mathbb{R}^{T \times N \times C}$ as the motion tensor and, based on the decomposability assumption, define a spatial graph convolution as shown in Equation (7):

$$X * G = (X *_{S} G_{S}) *_{T} G_{T}, \tag{7}$$

where $*$ denotes the graph convolution to be decomposed and $G_{S}$ and $G_{T}$ denote the spatial and temporal graph filters, respectively. Equation (7) indicates that the spatio-temporal graph convolution can be decomposed into a spatial and a temporal graph convolution. Based on Equation (7), the spatial convolution processes each data frame individually, operating as shown in Equation (8) for the $t$-th timestamped segment in $X$:

$$X_{t}^{\prime} = \sum_{k=0}^{K} A_{k} X_{t} W_{k}, \tag{8}$$

where $X_{t} \in \mathbb{R}^{N \times C}$ is the $t$-th fragment of $X$ and $W_{k}$ is the trainable weight matrix corresponding to the $k$-th order. The result of this process is the spatial feature $X^{\prime}$ obtained by the spatial convolution operator.
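The per-frame spatial convolution of Equation (8) can be sketched as follows (PyTorch; module and parameter names are illustrative, and the unnormalized adjacency matrices are used directly as a simplification):

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Per-frame spatial graph convolution in the spirit of Eq. (8):
    X'_t = sum_k A_k X_t W_k. A sketch; normalization is omitted."""

    def __init__(self, in_channels, out_channels, A_ks):
        super().__init__()
        # A_ks: (K+1, N, N) stack of k-adjacency matrices, kept fixed here.
        self.register_buffer("A_ks", A_ks)
        # One trainable weight matrix W_k per k-th order.
        self.weights = nn.ModuleList(
            [nn.Linear(in_channels, out_channels, bias=False)
             for _ in range(A_ks.shape[0])]
        )

    def forward(self, x):
        # x: (T, N, C) motion tensor; each frame is processed independently.
        out = 0
        for A_k, W_k in zip(self.A_ks, self.weights):
            # Aggregate k-hop neighbors, then apply the k-th order weights.
            out = out + A_k @ W_k(x)
        return out  # (T, N, out_channels)
```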
Step 2. Scale conversion operator.
In order to convert the obtained joint-scale spatial graph into any target scale, we propose a trainable average pooling operator. Let $X^{(1)}$ be the spatial data at the joint scale and $A^{(1)}$ be the spatial graph adjacency matrix; at the $s$-th spatial scale, the spatial pooling operator $P^{(s)}$ is expressed as shown in Equation (9):

$$P^{(s)} = \mathrm{softmax}\big(\phi(X^{(1)})\, W^{(s)}\big), \tag{9}$$

where $P^{(s)} \in \mathbb{R}^{N_1 \times N_s}$ can be obtained from the above equation, $\phi(\cdot)$ denotes the conversion of features from the temporal dimension to the spatial dimension, $W^{(s)}$ is the trainable weight matrix, and $\mathrm{softmax}(\cdot)$ is the softmax operation performed on each dimension. $P^{(s)}_{ij}$ denotes the assignment of the $i$-th joint of the joint scale to the $j$-th group of the $s$-th spatial scale. The original graph features and spatial adjacency matrix can be converted to any spatial scale by the scale conversion operator obtained above, as shown in Equations (10) and (11):

$$X^{(s)} = \big(P^{(s)}\big)^{\top} X^{(1)}, \tag{10}$$

$$A^{(s)} = \big(P^{(s)}\big)^{\top} A^{(1)} P^{(s)}. \tag{11}$$
After the above steps, the spatial features of the body parts at the $s$-th scale can be obtained by fusing the features of multiple body joints via Equation (10), and a new connectivity matrix for the coarsened $s$-th scale, representing the physical connections of the multi-joint sets at that scale, can be obtained via Equation (11).
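A minimal sketch of this trainable pooling operator (PyTorch; we take $\phi(\cdot)$ to be a temporal mean, which is one plausible reading of Equation (9), and all names are illustrative):

```python
import torch
import torch.nn as nn

class ScaleConversion(nn.Module):
    """Trainable average-pooling operator sketching Eqs. (9)-(11).
    Assumes phi(.) is a temporal mean; the paper's exact phi may differ."""

    def __init__(self, in_channels, num_groups):
        super().__init__()
        # W^{(s)}: projects per-joint features to group-assignment logits.
        self.W = nn.Linear(in_channels, num_groups, bias=False)

    def forward(self, x, A):
        # x: (T, N, C) joint-scale features; A: (N, N) adjacency matrix.
        z = x.mean(dim=0)                      # phi: temporal -> spatial, (N, C)
        P = torch.softmax(self.W(z), dim=-1)   # Eq. (9): (N, N_s) soft assignment
        x_s = P.transpose(0, 1) @ x            # Eq. (10): (T, N_s, C) pooled features
        A_s = P.transpose(0, 1) @ A @ P        # Eq. (11): (N_s, N_s) coarsened adjacency
        return x_s, A_s, P
```

Setting `num_groups` to 11 or 5 would correspond to converting the 17-joint graph to the limb scale and torso scale described above, respectively.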
3.2.2. Single-Scale Graph Convolution and Hypergraph Convolution Module Construction
To fully extract the spatial features of the 3D human skeleton at each scale, we propose a single-scale graph convolution module; its structure is shown in Figure 5. Taking the limb scale as an example, the trainable adjacency matrix in the single-scale graph convolution is $A_{\text{train}} \in \mathbb{R}^{N_s \times N_s}$, where $N_s$ is the number of limb-scale joints. In this matrix, all joints are connected to each other, and the connection weights must be obtained by training. During the training process, the weights between each pair of neighboring joint points in $A_{\text{train}}$ are adaptively adjusted, as calculated in Equation (12):

$$X^{\prime} = A_{\text{train}} X W, \tag{12}$$
where $X$ is the spatial feature matrix of the input graph and $W$ is the trainable parameter matrix; after this step, spatial features are extracted at the limb scale. Then, the obtained spatial feature matrix $X^{\prime}$ is fed into two parallel convolutional layers with kernel size 1 to obtain intermediate features, and the two sets of intermediate features are multiplied together to output the adaptive adjacency matrix, as computed in Equation (13), where $B$ is the batch size.
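The following sketch combines Equations (12) and (13) (PyTorch; `theta`/`phi`, the temporal mean, and the softmax normalization are our illustrative choices, not necessarily the paper's exact design):

```python
import torch
import torch.nn as nn

class SingleScaleGraphConv(nn.Module):
    """Sketch of Eqs. (12)-(13): a trainable adjacency plus an adaptive
    adjacency produced by two parallel 1x1 convolutions."""

    def __init__(self, in_channels, out_channels, num_joints, embed=16):
        super().__init__()
        # Fully connected trainable adjacency A_train (Eq. (12)).
        self.A_train = nn.Parameter(torch.randn(num_joints, num_joints) * 0.01)
        self.W = nn.Linear(in_channels, out_channels, bias=False)
        # Two parallel 1x1 convolutions producing intermediate features.
        self.theta = nn.Conv2d(in_channels, embed, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, embed, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T, N) batched skeleton features.
        # Eq. (12): propagate features through the trainable adjacency.
        feats = self.W(x.permute(0, 2, 3, 1))   # (B, T, N, C_out)
        feats = self.A_train @ feats            # mix joints, (B, T, N, C_out)
        # Eq. (13): multiply the two intermediate features to obtain an
        # adaptive, data-dependent adjacency for each sample in the batch.
        a = self.theta(x).mean(dim=2)           # (B, embed, N), pooled over T
        b = self.phi(x).mean(dim=2)             # (B, embed, N)
        A_adapt = torch.softmax(a.transpose(1, 2) @ b, dim=-1)  # (B, N, N)
        return feats, A_adapt
```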
Similarly, we design the single-scale hypergraph convolution module. During the training process, its computation is shown in Equation (14):

$$X^{\prime} = D_v^{-1/2} H D_e^{-1} H^{\top} D_v^{-1/2} X W, \tag{14}$$

where $H$ is the incidence (association) matrix of the hypergraph, $D_v$ and $D_e$ are the vertex and hyperedge degree matrices, $X$ is the feature matrix of the input hypergraph, and $W$ is the trainable parameter matrix; after this step, spatial features are extracted at the limb scale. After the above steps, the spatial feature matrix $X^{(s)}$ at the $s$-th scale can be obtained.
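A minimal sketch of such a hypergraph convolution, assuming the standard HGNN-style normalization $D_v^{-1/2} H D_e^{-1} H^{\top} D_v^{-1/2} X W$ (the paper's exact normalization may differ):

```python
import torch
import torch.nn as nn

class SingleScaleHyperConv(nn.Module):
    """Sketch of a hypergraph convolution in the spirit of Eq. (14),
    assuming the standard normalized propagation matrix."""

    def __init__(self, in_channels, out_channels, H):
        super().__init__()
        # H: (N, E) incidence matrix mapping joints to hyperedges.
        Dv = H.sum(dim=1)   # vertex degrees
        De = H.sum(dim=0)   # hyperedge degrees
        Dv_inv_sqrt = torch.diag(Dv.clamp(min=1e-6).pow(-0.5))
        De_inv = torch.diag(De.clamp(min=1e-6).pow(-1.0))
        # Precompute the normalized (N, N) propagation matrix.
        self.register_buffer(
            "G", Dv_inv_sqrt @ H @ De_inv @ H.t() @ Dv_inv_sqrt
        )
        self.W = nn.Linear(in_channels, out_channels, bias=False)

    def forward(self, x):
        # x: (T, N, C) features; propagate along hyperedges, then project.
        return self.G @ self.W(x)
```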
Comparing Equations (12) and (14), the trainable matrix in Equation (14) yields richer information about cross-joint interactions, because when we designed the hypergraph structure, the Laplacian matrix of the hypergraph (shown in Equation (4)) places more emphasis on the interactions between remote joint points.