Inspired by the application of the transformer to image classification tasks [24], we introduce a transformer-based architecture for hand gesture recognition. The proposed network uses only a self-attention mechanism and standard feed-forward neural networks, without any recurrent units or convolutional operations. The architecture of our proposed network is illustrated in Figure 1. The network is composed of three submodules: a content-adaptive sampler, a temporal attention-based encoder, and a classification MLP head.
3.1. Content-Adaptive Sampler
Gestures can be roughly decomposed into two components: appearance and motion. Understanding the motion in gesture videos relies heavily on long-range temporal information, and learning video representations that capture such information is a pivotal challenge for gesture recognition. To use the dynamic information from the whole video for video-level prediction, a sparse temporal sampling strategy is currently the dominant approach. The main idea is to divide a video evenly into k segments and then randomly select n frames from each segment. We visualized the sampling results of the sparse temporal sampling strategy in Figure 2. Figure 2a,b present the sampling results on the NVIDIA Dynamic Hand Gestures (NVGesture) dataset under two different settings of k and n. We found that when k is set too low, the sampling results do not effectively represent the whole video; when k is set too high, too many frames without gestures are sampled, increasing the input length of the model.
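For concreteness, a minimal Python sketch of the sparse temporal sampling baseline described above is given below; the function name and the example values of k and n are illustrative, not taken from our experiments.

import random

def sparse_temporal_sampling(num_frames, k, n, seed=None):
    # Divide a video of num_frames frames into k equal segments and
    # randomly select n frame indices from each segment.
    rng = random.Random(seed)
    segment_len = num_frames / k
    indices = []
    for s in range(k):
        start = int(round(s * segment_len))
        end = max(int(round((s + 1) * segment_len)), start + 1)
        pool = list(range(start, end))
        indices.extend(sorted(rng.sample(pool, min(n, len(pool)))))
    return indices

# Example: an 80-frame video, k = 8 segments, n = 2 frames per segment.
print(sparse_temporal_sampling(num_frames=80, k=8, n=2, seed=0))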
To sample the video sequence efficiently, we propose a dynamic sampling strategy based on content adaptation. We use a sliding window to perform gesture detection over the incoming video frames. As shown in Figure 1, frames without gestures are not sampled by the sampler, i.e., they cannot pass the sampler (shown as a light-colored crossed-out arrow). Once the sampler detects a gesture, it stops gesture detection and starts sampling evenly according to the temporal strategy until it reaches the established maximum number of frames. If a video contains fewer frames than the maximum set by the sampler, we pad the input sequence with zeros. The pseudo-code of our proposed content-adaptive sampling algorithm is given in Algorithm 1, which can be briefly described as follows: continuous gesture detection is first performed on the incoming frames while the index of the current frame is recorded; when a gesture is detected, gesture detection stops; the index of the gesture-containing frame and the temporal sampling step are then used to generate the final list of sampled frames. The input of Algorithm 1 is a gesture video and the output is a list of sampled frame indices. Figure 2c presents the sampling results of our adaptive sampler. Another role of the sampler is to act as a switch for the encoder: when a gesture is detected in the video sequence, the encoder is activated and starts receiving the input from the sampler.
Detecting hand gestures of various sizes is a challenge for hand detectors. In practice, we selected the single-shot multibox detector (SSD) [38] as our hand detector module because of its accuracy and real-time performance. Before the sampled frames are fed into the encoder, they need to be mapped into a sequence of tokens. In our work, we use a pre-trained DenseNet121 [10] as a 2D spatial feature extractor to map the sampled frames into meaningful features.
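As an illustration, a minimal PyTorch-style sketch of this tokenization step is shown below, assuming a recent torchvision is available; the global-average-pooling step and tensor shapes are illustrative assumptions rather than details of our implementation.

import torch
import torchvision

# Pre-trained DenseNet121 backbone; only the convolutional feature extractor is kept.
backbone = torchvision.models.densenet121(weights="DEFAULT").features
backbone.eval()

def frames_to_tokens(frames):
    # Map a batch of sampled frames (T, 3, 224, 224) to a sequence of
    # T feature tokens by global average pooling of the DenseNet121 feature maps.
    with torch.no_grad():
        feats = backbone(frames)          # (T, 1024, 7, 7)
        tokens = feats.mean(dim=(2, 3))   # (T, 1024): one token per frame
    return tokens

tokens = frames_to_tokens(torch.randn(16, 3, 224, 224))
print(tokens.shape)  # torch.Size([16, 1024])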
Algorithm 1 Content-adaptive sampler.
Input: gesture video data
Output: frame index list
 1: initialize [frame list] ← ∅, counter ← 0, flag ← True
 2: while cap is opened and flag do
 3:     if a gesture is not detected then
 4:         counter ← counter + 1
 5:         continue
 6:     else
 7:         flag ← False
 8:     end if
 9: end while
10: for each index i in range(n) do
11:     append i × step + counter to [frame list]
12:     i ← i + 1
13: end for
14: return [frame list]
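A minimal Python sketch of Algorithm 1 is given below; detect_gesture stands in for the hand detector described above and is a hypothetical callback, while n and step denote the maximum number of sampled frames and the temporal sampling stride.

def content_adaptive_sampling(video_frames, detect_gesture, n, step):
    # Skip leading frames that contain no gesture (Algorithm 1, lines 1-9),
    # then sample n frame indices evenly, spaced by `step`, starting from the
    # first gesture-containing frame (Algorithm 1, lines 10-14).
    counter = 0
    for frame in video_frames:
        if detect_gesture(frame):
            break              # gesture found: stop detection, start sampling
        counter += 1
    frame_list = [i * step + counter for i in range(n)]
    # Indices beyond the end of the video are dropped here; shorter sequences
    # are later padded with zeros before being fed to the encoder.
    return [idx for idx in frame_list if idx < len(video_frames)]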
3.2. Attention-Based Encoder
The architecture of the transformer encoder is shown in Figure 3. As shown in Figure 3a, the encoder consists of a stack of N identical layers; following the suggestions in reference [21], we set N to 6 in our work. Each layer consists of two submodules: a multi-head self-attention module and a position-wise fully connected feed-forward network. Each submodule is wrapped in a residual connection (denoted as “Add” in Figure 3), followed by layer normalization. The multi-head self-attention module uses a self-attention mechanism to compute a new representation of the input sequence; the feed-forward network further transforms each position of the resulting vector sequence with a fully connected feed-forward neural network.
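A minimal PyTorch-style sketch of one such encoder layer is given below; the model dimensions are illustrative, and plain multi-head attention is used here in place of the Longformer attention discussed later in this section.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention with residual connection ("Add") and LayerNorm.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network with residual connection and LayerNorm.
        return self.norm2(x + self.ffn(x))

# The encoder stacks N = 6 identical layers.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])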
Self-Attention. Mapping a query and a group of key-value pairs to an output is the main role of the attention function. In other words, a given query interacts with the keys to guide a biased selection of the values: if a key is closer to a given query, more attention weight is assigned to the value of that key. The self-attention mechanism treats the representation of each position in the sequence as the query and the representations of all positions as the keys and values. It computes a weighted sum of the values of all positions, where the weights, i.e., the attention weights, measure the degree of match between the current position and every position. Mathematically, given a query q and m key-value pairs $(k_1, v_1), \ldots, (k_m, v_m)$, the attention function f is instantiated as a weighted sum of the values, which can be defined as in Equation (1) [21]:

$$ f\big(q, (k_1, v_1), \ldots, (k_m, v_m)\big) = \sum_{i=1}^{m} \alpha(q, k_i)\, v_i \qquad (1) $$

where the query q, keys $k_i$, values $v_i$, and the output are all vectors. The attention weight of query q and key $k_i$ is obtained by mapping the two vectors to a scalar with an attention-scoring function a, followed by a softmax operation:

$$ \alpha(q, k_i) = \mathrm{softmax}\big(a(q, k_i)\big) = \frac{\exp\big(a(q, k_i)\big)}{\sum_{j=1}^{m} \exp\big(a(q, k_j)\big)} \qquad (2) $$
where a denotes the attention-scoring function. From Equation (2), it is clear that choosing a different attention-scoring function a leads to different attentional behavior. Additive attention and scaled dot-product attention are the two most commonly used attention-scoring functions. The scaled dot-product attention of a query q and a key k can be expressed as follows:

$$ a(q, k) = \frac{q^{\top} k}{\sqrt{d}} \qquad (3) $$

where d denotes the length of the query q and the key k. Since the time complexity of the self-attention operation of the original transformer model is $O(n^2)$, where n is the length of the input sequence, its ability to process long sequences is limited. To address this limitation, we use an efficient self-attention variant, the Longformer [37], as our attention module. Longformer combines local and global attention, and its time complexity scales linearly with the sequence length, which allows long-range dependencies to be handled much better.
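To make Equations (1)–(3) concrete, a minimal NumPy sketch of scaled dot-product attention for a single query against m key-value pairs is given below; the array sizes are illustrative.

import numpy as np

def scaled_dot_product_attention(q, K, V):
    # q: (d,) query; K: (m, d) keys; V: (m, d_v) values.
    # Returns the attention-weighted sum of the values, Equations (1)-(3).
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)          # a(q, k_i) = q.k_i / sqrt(d), Eq. (3)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax -> alpha(q, k_i), Eq. (2)
    return weights @ V                   # sum_i alpha(q, k_i) * v_i, Eq. (1)

q = np.random.randn(64)
K, V = np.random.randn(10, 64), np.random.randn(10, 64)
print(scaled_dot_product_attention(q, K, V).shape)  # (64,)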
Multi-Head Attention. Another important technique used in the transformer is the multi-head attention mechanism. To capture dependencies of various ranges within a sequence, the queries, keys, and values can be transformed with h sets of independently learned linear projections. As shown in Figure 3b, these h groups of transformed queries, keys, and values are then fed in parallel into scaled dot-product attention. Finally, the outputs of the h scaled dot-product attention operations are concatenated and transformed by another learned linear projection to produce the final output. Mathematically, given a query q, a key k, and a value v, each attention head $h_i$ $(i = 1, \ldots, h)$ can be computed as in Equation (4) [21]:

$$ h_i = f\big(W_i^{(q)} q,\; W_i^{(k)} k,\; W_i^{(v)} v\big) \qquad (4) $$

where the $W_i^{(\cdot)}$ are the weight parameters to be learned and f is the scaled dot-product attention function.
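A minimal NumPy sketch of Equation (4) is shown below; the random projection matrices stand in for the learned weights, and the final learned output projection mentioned above is omitted for brevity.

import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for one query (Equation (3)).
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def multi_head_attention(q, K, V, heads):
    # Equation (4): each head i applies its own projections (Wq_i, Wk_i, Wv_i)
    # before scaled dot-product attention; the head outputs are concatenated.
    outputs = [attention(Wq @ q, K @ Wk.T, V @ Wv.T) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs)

d, d_head, m, h = 64, 16, 10, 4
heads = [tuple(np.random.randn(d_head, d) for _ in range(3)) for _ in range(h)]
q, K, V = np.random.randn(d), np.random.randn(m, d), np.random.randn(m, d)
print(multi_head_attention(q, K, V, heads).shape)  # (64,) = h * d_head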
Positional Embeddings. Unlike RNNs, which process video frames one by one, self-attention is computed in parallel. Since videos are ordered sequences of frames, the transformer model needs to take order information into account. To use the sequence order information, the transformer injects relative or absolute positional information by adding positional encodings to the representation of the original input. There are various ways to obtain positional encodings: they can be learned or fixed directly [39]. In this work, we use fixed positional encodings based on sine and cosine functions of different frequencies, as shown in Equations (5) and (6):

$$ PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \qquad (5) $$

$$ PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \qquad (6) $$

where PE(·) denotes the positional encoding function, pos denotes the position, i indexes the dimension of the positional encoding vector, and d is a base parameter of the transformer that denotes the size of the hidden layer at each position. In practice, the positions of the frames within each video are encoded in this way, and the resulting positional embeddings are added to the precomputed CNN features.
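A minimal NumPy sketch of the sinusoidal encodings of Equations (5) and (6) is shown below; the sequence length and feature dimension are illustrative.

import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encodings are added element-wise to the precomputed per-frame CNN features.
pe = sinusoidal_positional_encoding(max_len=64, d=1024)
print(pe.shape)  # (64, 1024)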
MLP Head. As in BERT [27] and ViT [24], we prepend a special CLS token to each input sequence. After the input sequence has been propagated through the transformer layers, the hidden state associated with this CLS token is taken as the final representation of the entire input sequence. This CLS representation is then fed to a classification MLP head to obtain the final gesture prediction. The MLP block contains two linear layers with a GELU [40] nonlinearity and dropout [41] between them. To summarize, the input token representations are first processed by layer normalization, then encoded by the transformer, and finally the CLS token is sent to the MLP head to predict the result.
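A minimal PyTorch-style sketch of this classification step is given below; the hidden size, dropout rate, and number of gesture classes are illustrative assumptions.

import torch
import torch.nn as nn

d_model, num_classes = 512, 25

cls_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable CLS token
norm = nn.LayerNorm(d_model)
mlp_head = nn.Sequential(
    nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(d_model, num_classes),
)

def classify(frame_tokens, encoder):
    # frame_tokens: (B, T, d_model) per-frame features from the sampler/backbone.
    b = frame_tokens.size(0)
    x = torch.cat([cls_token.expand(b, -1, -1), frame_tokens], dim=1)
    x = norm(x)                 # layer normalization of the token representations
    x = encoder(x)              # the stack of transformer encoder layers
    return mlp_head(x[:, 0])    # gesture prediction from the CLS position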