Article

Multi-Head Structural Attention-Based Vision Transformer with Sequential Views for 3D Object Recognition

1 Tiandi (Changzhou) Automation Co., Ltd., Changzhou 213015, China
2 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
3 School of Artificial Intelligence, Beihang University, Beijing 100191, China
4 Beijing Huahang Institute of Radio Measurement, Beijing 100013, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3230; https://doi.org/10.3390/app15063230
Submission received: 19 January 2025 / Revised: 8 March 2025 / Accepted: 12 March 2025 / Published: 16 March 2025
(This article belongs to the Special Issue Communication Systems and Networks)

Abstract
Multi-view image classification tasks require the effective extraction of both spatial and temporal features to fully leverage the complementary information across views. In this study, we propose a lightweight yet powerful model, Multi-head Sparse Structural Attention-based Vision Transformer (MSSAViT), which integrates Structural Self-Attention mechanisms into a compact framework optimized for multi-view inputs. The model employs a fixed MobileNetV3 as a Feature Extraction Module (FEM) to ensure consistent feature patterns across views, followed by Spatial Sparse Self-Attention (SSSA) and Temporal Sparse Self-Attention (TSSA) modules that capture long-range spatial dependencies and inter-view temporal dynamics, respectively. By leveraging these structural attention mechanisms, the model achieves the effective fusion of spatial and temporal information. Importantly, the total model size is reduced to 6.1 M with only 1.5 M trainable parameters, making it highly efficient. Comprehensive experiments demonstrate the proposed model’s superior performance and robustness in multi-view classification tasks, outperforming baseline methods while maintaining a lightweight design. These results highlight the potential of MSSAViT as a practical solution for real-world applications under resource constraints.

1. Introduction

With the growing demand for understanding real-world 3D environments, application scenarios associated with 3D object recognition are receiving increasing attention, including 3D scene understanding [1,2] and 3D object recognition systems [3], which enable the task-specific comprehension of physical environments. Furthermore, fusing information from subsidiary sensors can improve the positioning accuracy of Ultra-Wideband (UWB) technology in scenarios affected by poor communication network signals [4]. Recently, deep learning has achieved strong results in 3D object recognition by aggregating features learned from multiple views, especially within lightweight algorithmic frameworks optimized for resource-limited hardware deployments such as edge devices and embedded vision systems. However, balancing the accuracy–efficiency trade-off in multi-view 3D recognition tasks remains a key research challenge.
Most multi-view-based studies focus on the feature extraction and fusion of unordered view images. However, when humans identify a 3D object, their cognitive process is not limited to observing from a single or several isolated viewpoints, especially for 3D objects that are difficult to distinguish. Instead, they usually observe the object from multiple angles and try to find the changing patterns of its key salient features across the continuous observation sequence, which serves as an important basis for recognizing the object category. Following this idea, the convolution operation, as a special feature extractor, can produce diverse feature descriptions through multi-channel convolution. In particular, basic feature extraction networks pretrained on large-scale image datasets can efficiently extract feature descriptions at different frequencies, providing a rich feature basis for subsequent recognition tasks. Moreover, the vision-based transformer structure, with its excellent global feature analysis capability, has opened up a new way to represent and learn continuous observation features [5,6]. It can effectively learn the changing patterns of features at different frequencies along the temporal dimension, thereby supporting high-precision 3D object recognition.
In this paper, we propose a Multi-head Sparse Structural Attention-based Vision Transformer (MSSAViT) that processes a sequence of consecutive views to achieve effective 3D object observation and classification. By capturing the 3D object along a well-designed trajectory, we acquire a series of 2D view images. These views are fed into a lightweight MobileNetV3-based [7] Convolutional Neural Network (CNN) module that extracts spatial feature maps for each individual view. Notably, rather than exclusively utilizing high-level features from the final fully connected layer, our methodology retains and processes hierarchical feature representations from intermediate network layers, ensuring the simultaneous preservation of fine-grained local details and semantically rich global information. Moreover, we carefully reviewed typical lightweight backbone feature extraction networks. At the same feature map sizes, the MobileNet series and SqueezeNet have relatively small parameter scales, although model size still grows with the feature map size and the number of channels. By comparison, the MobileNet series uses fewer channels in its feature maps than other models. Therefore, we chose the MobileNet series as the backbone feature extraction network and, considering its performance on ImageNet, finally selected MobileNetV3.
In [6], Kim M. et al. introduced a novel Vision Transformer (ViT) mechanism, named Structural Self-Attention (StructSA), which effectively leverages diverse structural patterns for visual representation learning. The extracted multi-view feature maps are processed by a group of enhanced StructSA modules incorporating Temporal Sparse Structural Attention (TSSA) and Spatial Sparse Structural Attention (SSSA) components to jointly model global relationships across sequential views and local correlations within individual spatial frames. The Sparse Structural Attention mechanism is embedded into these modules, leveraging learnable sparse masks to restrict the Query–Key connections. This design paradigm enhances model sensitivity to informative viewpoints while suppressing noise from irrelevant feature dimensions. Furthermore, a hierarchical multiscale feature extraction and fusion strategy is implemented to progressively integrate coarse-grained global context with fine-grained local patterns, thereby improving the discriminative capability for subtle inter-class variations in highly dynamic scenarios.
In our work, we introduce a novel self-attention mechanism to address the challenges of effectively modeling multi-view sequences for 3D object recognition. As shown in Figure 1, the self-attention mechanism operates bidirectionally across two key dimensions: spatially, it processes feature maps within the same frequency space of individual views to comprehensively model spatial relationships, while temporally, it analyzes sequential views in the shared feature space to capture dynamic inter-view patterns. By integrating Spatial and Temporal Self-Attention, structural attention ensures that features extracted from the same appearance feature space are coherently modeled across both dimensions, enhancing the network’s ability to learn discriminative representations for 3D object recognition. This dual-dimensional attention mechanism not only improves feature expressiveness but also establishes a computationally efficient framework for processing sequential views in dynamic scenarios.
We evaluated our MSSAViT method on the ModelNet datasets for 3D shape classification tasks. The experimental results demonstrate that MSSAViT achieves promising improvements in 3D shape classification and performance comparable to recent methods using various types of 3D data. The key contributions of our work are as follows:
  • We propose a lightweight multi-view classification model.
    We present CNN-ViT, a lightweight hybrid architecture integrating Structural Self-Attention mechanisms to jointly model spatial and temporal structural patterns in sequential view convolutional features. The modular design enables efficient feature extraction and fusion.
  • We improve Sparse Structured Self-Attention for computational efficiency.
    To address the computational limitations of self-attention mechanisms, we introduce an improved Sparse Structured Self-Attention module. By employing a learnable sparse mask to restrict Query–Key connections, the proposed module dramatically reduces computational complexity while maintaining high expressiveness. This design not only alleviates the computational burden but also prevents overfitting caused by redundant feature modeling, ensuring a balance between efficiency and accuracy.
  • We realize Multiscale Structural Feature Representation.
    In our work, the SSSA and TSSA modules progressively reduce the resolution of feature maps. This architectural design significantly reduces computational costs and parameter counts while aggregating multiscale feature representations. The resolution-reducing Structural Self-Attention mechanism enhances both the efficiency and representational capacity of the model, providing a compact yet expressive feature representation for multi-view classification tasks.
This paper is organized as follows. Section 2 systematically reviews related works in 3D shape classification recognition. Section 3 provides the detailed configuration of our proposed methodology. Section 4 illustrates implementation details and main results with other state-of-the-art methods. Further, Section 5 conducts a systematic parameter comparison while presenting ablation studies with quantitative metrics. Section 6 synthesizes the core contributions and critically analyzes technical limitations.

2. Related Works

Three-dimensional object recognition is a fundamental task in computer vision, with applications spanning robotics, autonomous driving, and augmented reality. Traditional methods, such as PointNet [8] and PointNet++ [9], process raw 3D point clouds to learn geometric features. While effective, these approaches can be sensitive to data sparsity and occlusions, and their computational complexity may limit scalability. To address these challenges, multi-view methods project 3D objects into 2D images captured from various perspectives. For instance, multi-view-based methods aggregate features across views using CNNs or ViTs, leveraging 2D image processing techniques to achieve notable accuracy [10]. However, the static aggregation of view-level features restricts the capacity to capture the dynamic dependencies that exist across different views.
Recent progress in deep-learning-based 3D object recognition with different representations, namely meshes, point clouds, voxels, and multi-view images, is reviewed below.
The first representation of a 3D object is the 3D mesh, a structure consisting of various polygons. The obvious obstacle is its irregular topology and arbitrary resolution, which cannot be fed directly into traditional deep learning models. To tackle this issue, Han et al. proposed two convolutional restricted Boltzmann machine-based deep learning methods to learn 3D local and global features [11,12]. Feng et al. presented a novel MeshNet to solve the complexity and irregularity problem of meshes by introducing face units and a feature-splitting mechanism; their experimental results indicated the effectiveness of this 3D shape representation [13]. In [14], Dai et al. proposed a deep learning approach that leverages a self-attention mechanism and an edge attention module to learn global features, aggregates features of adjacent edges to learn local features, abandons pooling layers, and adopts a spatial position encoding module.
The point cloud is another 3D geometric data format that cannot be processed directly with typical convolutional architecture. PointNet [8] and its enhanced variant PointNet++ [9] were pioneering frameworks designed to learn spatial point features directly from raw point clouds, which significantly advanced research in 3D object detection and recognition. Li et al. presented a self-organizing network (SONet) that takes the point cloud and a self-organizing map as the input to learn global features [15]. Point2Sequence [16] employed an RNN-based sequence model for local regions in point clouds to capture the correlation by aggregating multiscale areas with attention. Moreover, X-3D captures explicit local structural information from the input 3D space and leverages it to generate dynamic kernels, where weight sharing is adopted across all neighboring points within the same local region to enhance computational efficiency [17].
Since meshes and point clouds are irregular formats for 3D representation, most researchers typically transform them into regular 3D voxel grids or a set of 2D view images, which are acceptable formats for CNN models. To this end, 3D ShapeNet was proposed to represent a 3D geometric shape as a probabilistic distribution of binary variables on a 3D voxel grid [18]. Moreover, methods have been implemented directly on the 3D objects, which usually represent them as a collection of voxels in 3D Euclidean space [19] and then analyze them with a 3D-DL network to learn the features for recognition [20,21].
Advancements in temporal modeling have significantly influenced tasks involving sequential views, such as SV2SL1 [22], 3D2SV [23], and MVCLN [24]. In the video-based action recognition task, several methods exploit the structure of spatial cross-correlations from consecutive frames to estimate optical flow [25,26] or to learn motion features for action recognition [27,28]. In [29], a spatiotemporal enrichment module was proposed that aggregates spatial and temporal contexts with dedicated local patch-level and global frame-level feature enrichment sub-modules. Moreover, a relational self-attention module was proposed in [30] to dynamically compute attention weights through spatiotemporal self-correlations. In addition, video-based recognition and action detection methods, like I3D [31] and SlowFast [32], extend 3D CNNs to jointly model spatial and temporal dependencies, demonstrating state-of-the-art performance in video analysis. More recently, ViTs have revolutionized spatiotemporal modeling by leveraging attention mechanisms to capture long-range dependencies. Extensions of ViTs, such as TimeSformer [33] and Video Swin Transformer [34], incorporate spatiotemporal attention to further enhance performance on video tasks. Despite these advancements, applying such methods directly to 3D object recognition with sequential views remains challenging due to the high computational cost of attention mechanisms and the lack of effective multiscale feature integration.

3. Methodology

3.1. Sequential-View Capturing Schemes

In contrast to traditional view-based methods, which capture view images from different angles without regard to their sequential order, our work provides two schemes that capture a set of sequential view images along specific trajectories.
As illustrated in Figure 2, (a) shows a top-view scheme where cameras are placed around the 3D object, elevated 60 degrees from the ground plane. However, the top-view scheme has blind spots, so the whole structure of the 3D object cannot be captured. Therefore, Figure 2b provides another scheme that avoids blind spots: the surround-view scheme uses two crossed circular trajectories to cover a sufficient number of object faces. In both schemes, all cameras point toward the centroid of the 3D object.
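To make the two capturing schemes concrete, the sketch below generates camera positions and a shared look-at point. The 60-degree elevation follows the top-view description, whereas the radius, the number of views, and the elevations of the two surround circles are illustrative assumptions rather than the exact rendering setup.

```python
import numpy as np

def top_view_cameras(num_views=12, radius=2.0, elevation_deg=60.0):
    """Cameras on one circle, elevated above the ground plane and all pointing at
    the object centroid (assumed at the origin). Radius and view count are
    illustrative choices, not values taken from the paper."""
    elev = np.deg2rad(elevation_deg)
    azimuths = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    positions = np.stack([
        radius * np.cos(elev) * np.cos(azimuths),   # x
        radius * np.cos(elev) * np.sin(azimuths),   # y
        np.full(num_views, radius * np.sin(elev)),  # z: constant height
    ], axis=1)
    return positions, np.zeros(3)                    # look-at point = centroid

def surround_view_cameras(num_views=12, radius=2.0, elevations_deg=(30.0, -30.0)):
    """Two crossed circular trajectories (above and below the equator) so that no
    face of the object stays in a blind spot; the two elevations are hypothetical."""
    per_circle = num_views // 2
    circles = [top_view_cameras(per_circle, radius, e)[0] for e in elevations_deg]
    return np.concatenate(circles, axis=0), np.zeros(3)

cams, center = surround_view_cameras()
print(cams.shape)  # (12, 3)
```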

3.2. Network Architectures

3.2.1. Architecture of Proposed MSSAViT Model

In this subsection, we describe the architecture of MSSAViT for the recognition of 3D objects with sequential views. The whole architecture can be divided into two major parts: the extraction of single-view features with a 2D CNN backbone and the fusion of spatial and temporal features with the sparse structural attention modules. In this paper, the sequential-view-based network is named MSSAViT, and its whole architecture is shown in Figure 3.
As shown in Figure 3, the view images are rendered from the 3D space along a designed trajectory and sorted by capturing order. Following common practice, each view image is fed into the backbone model to extract single-view features, but we take the feature maps from a middle layer of the backbone instead of the final layer. In our work, to reduce computation, we employed MobileNetV3, abbreviated as MBNet in Figure 3. Although the local feature correlation among view images has received attention, most research has focused on high-level features and ignores the spatial and temporal richness of low-level feature maps. These low-level feature maps retain more local spatial information than the feature vectors at the final layer and can be concatenated in a specific order to build 4D pseudo-time feature maps. The SSSA and TSSA modules enhance feature extraction by leveraging structured attention mechanisms to capture spatial and temporal features, respectively. The SSSA module calculates the similarity between spatial positions to perform a weighted fusion of spatial information, emphasizing important regions and suppressing irrelevant ones, thereby helping the model better understand key spatial features. Meanwhile, the TSSA module extracts temporal features by computing similarities between different views, capturing dynamic variations and improving performance in multi-view scenarios. The combination of both modules enables the model to simultaneously capture spatial and temporal information, thus improving accuracy and robustness for multi-view inputs.

3.2.2. Feature Extraction Module

The Feature Extraction Module (FEM) serves as the foundational stage of the proposed framework, transforming multi-view sequential input images into a structured set of spatial feature maps by leveraging MobileNetV3. Then, the extracted features from this module are subsequently fed into the structural attention modules for further processing.
The input to the FEM is a sequence of 2D view images, denoted as $I = \{I_1, I_2, \dots, I_S\}$, where $I_t \in \mathbb{R}^{H \times W \times 3}$ represents the $t$-th view captured from the 3D object, $S$ is the total number of views, and $H \times W$ are the height and width of the input images, both equal to 224 in our work. These views are captured at regular intervals along a predefined trajectory around the 3D object, ensuring diverse perspectives.
Each input image $I_t$ is passed through the MobileNetV3 backbone to extract spatial feature maps $F_t \in \mathbb{R}^{C \times N \times N}$. Here, $C$ is the number of output channels representing distinct feature filters, and $N$ is the spatial dimension of the feature map, both of which depend on the chosen output layer of the basic feature extraction network. For a complete continuous viewing process, the temporal feature maps can be expressed as $F \in \mathbb{R}^{S \times C \times N \times N}$.
In this study, by fixing the parameters of MobileNetV3, every viewpoint or temporal frame is processed with identical weights, ensuring consistent feature extraction. This eliminates variability introduced by parameter updates during training, which is particularly important for tasks involving multi-view or temporal data. Consistent features enable better alignment and fusion of information across different perspectives, reducing noise and improving global pattern modeling. In addition, the alignment and concatenation of these features ensure consistency across views, enabling the model to capture both local and global dependencies in dynamic 3D object recognition tasks.
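A minimal PyTorch sketch of the FEM under these assumptions: torchvision's mobilenet_v3_large serves as the frozen backbone, and the stage before its final 1 × 1 expansion layer is taken as the output, yielding 160-channel 7 × 7 maps for 224 × 224 inputs (one of the FEM output sizes listed in Section 5.2). The exact cut point used by the authors may differ.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_large, MobileNet_V3_Large_Weights

class FeatureExtractionModule(nn.Module):
    """Frozen MobileNetV3 backbone applied independently to every view so that
    all views share identical weights and produce consistent feature patterns."""
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v3_large(weights=MobileNet_V3_Large_Weights.DEFAULT)
        self.features = backbone.features[:-1]   # stop before the 960-channel head
        for p in self.features.parameters():     # fixed extractor: no gradient updates
            p.requires_grad = False

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, S, 3, H, W) -> fold the view axis into the batch axis
        b, s, c, h, w = views.shape
        feats = self.features(views.reshape(b * s, c, h, w))
        return feats.reshape(b, s, *feats.shape[1:])  # (B, S, C, N, N)

# Example: 12 surround views of a single object
fem = FeatureExtractionModule().eval()
with torch.no_grad():
    out = fem(torch.randn(1, 12, 3, 224, 224))
print(out.shape)  # torch.Size([1, 12, 160, 7, 7])
```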

3.2.3. Sparse Attention Mechanism

In the classical transformer work, the attention weights are computed as the scaled dot product of $Q$ and $K$, normalized by the dimensionality of the vector space $d_k$:

$$A = \varepsilon\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $\varepsilon$ is a softmax function. To reduce computational complexity, a sparse attention mechanism is applied by introducing a randomly generated sparse mask. Let the Query tensor be $Q \in \mathbb{R}^{N_q \times d_k}$ and the Key tensor be $K \in \mathbb{R}^{N_k \times d_k}$, where $N_q$ and $N_k$ represent the number of positions for the Query and Key, respectively. Sparse mask generation can be mathematically expressed as follows: a random matrix $R \in \mathbb{R}^{N_q \times N_k}$ is generated with elements sampled from a uniform distribution, and a binary sparse mask $M \in \{0, 1\}^{N_q \times N_k}$ is then constructed based on a predefined sparsity threshold $\tau = 0.1$:

$$M_{ij} = \begin{cases} 1, & \text{if } R_{ij} < \tau \\ 0, & \text{otherwise} \end{cases}$$

The sparse mask $M$ defines which Query–Key pairs are valid for attention computation. Specifically, $M_{ij} = 1$ indicates that the Query at position $i$ corresponds to the Key at position $j$, while $M_{ij} = 0$ disables the connection, thereby enforcing sparsity.
The sparse attention weights are computed as

$$A_s = \varepsilon\!\left(\frac{QK^{\top}}{\sqrt{d_k}} \odot M\right)V$$
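A minimal sketch of this sparse attention step, assuming a PyTorch implementation. Setting the disabled Query–Key scores to −∞ before the softmax is one common way to realize the restriction expressed by $M$; the authors' exact masking point may differ.

```python
import torch

def sparse_attention(q, k, v, tau=0.1):
    """Scaled dot-product attention with a randomly generated binary sparse mask.
    q: (Nq, dk), k: (Nk, dk), v: (Nk, dv); tau is the sparsity threshold."""
    dk = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / dk ** 0.5        # (Nq, Nk) correlation
    mask = torch.rand_like(scores) < tau                # M_ij = 1 with probability tau
    scores = scores.masked_fill(~mask, float('-inf'))   # disable the remaining pairs
    weights = torch.softmax(scores, dim=-1)             # epsilon(.)
    weights = torch.nan_to_num(weights)                 # rows with no active Key -> zeros
    return weights @ v

out = sparse_attention(torch.randn(49, 64), torch.randn(49, 64), torch.randn(49, 64))
print(out.shape)  # torch.Size([49, 64])
```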

3.2.4. Temporal/Spatial Sparse Structural Attention Network

Building on the basic sparse attention mechanism, we introduce a Sparse Structural Attention Network (SSANet) that effectively incorporates rich structural patterns of Query–Key correlation into temporal and spatial contextual feature aggregation. As shown in Figure 4, SSANet consists of two steps: structural Query–Key attention and contextual value aggregation.
We assume that the input features of SSANet are $X \in \mathbb{R}^{S \times N \times N}$, where $S$ represents the number of views in the TSSA module; for the SSSA module, $S$ is replaced with $C$, the number of feature-map channels. In the following, we take the TSSA module as an example; the calculation flow of the SSSA module is the same.
In the traditional transformer structure, the attention result is output directly after the Query–Key correlation calculation, lacking the learning of structural features. To effectively leverage rich structural patterns, an improved Structural Self-Attention network is proposed. Let $U$ be the Query–Key correlation matrix:

$$U = \frac{QK^{\top}}{\sqrt{d_k}}, \quad U \in \mathbb{R}^{S \times S}$$

To capture the local temporal (or spatial) structural patterns, the Structural Query–Key Attention (SQKA) module is deployed on top of the Query–Key correlation by

$$U_d = \varepsilon\!\left(\mathrm{conv}\!\left(U^{\prime}, W_u\right)\right), \quad U_d \in \mathbb{R}^{D \times S \times S}$$

where $U^{\prime}$ is a channel-expanded copy of the correlation matrix $U$, constructed by adding an extra dimension to enable 2D convolution, $W_u \in \mathbb{R}^{M \times D}$ represents $D$ learnable convolutional kernels, and $\varepsilon$ denotes the softmax function applied to generate attention scores from the convolved features.
Meanwhile, the value contextual patterns can be calculated by

$$V_d = \varepsilon\!\left(\mathrm{conv}\!\left(V^{\prime}, W_v\right)\right), \quad V_d \in \mathbb{R}^{D \times S \times N_v}$$

where the operation on $V$ is the same as that on $U$, and $W_v \in \mathbb{R}^{M \times D}$ denotes $D$ convolutional kernels of size $M$. The $D$ convolutional kernels expand the computation into multi-head structural attention. Then, the contextual value aggregation matrix $C$ is obtained with the Einstein summation convention:

$$C_{ij} = \sum_{d=1}^{D}\sum_{k=1}^{S} U_{d,i,k}\, V_{d,k,j}, \quad C \in \mathbb{R}^{S \times N_v}$$

where $N_v = N \times N$ in this work, and $U_d \in \mathbb{R}^{D \times S \times S}$ generates temporal kernels that dynamically aggregate the local context of the value $V_{d,s,:}$ in the multi-head temporal space.
Then, the contextual value aggregation matrix $C$ is reshaped to $\mathbb{R}^{S \times N \times N}$ with a dimensional resetting layer. Finally, a residual connection links the original input features $X$ and the reshaped $C$, which helps address the vanishing-gradient and degradation problems in the self-attention transformer structure.
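The sketch below assembles these steps into one TSSA-style block in PyTorch. The linear projections used to obtain Q and K, the kernel size M, and the placement of the softmax on the value branch are illustrative assumptions where the text leaves the details open.

```python
import torch
import torch.nn as nn

class SparseStructuralAttention(nn.Module):
    """One TSSA-style block (sketch). Q and K are linear projections of the
    flattened input X in R^(S x N*N); the projection, kernel size M, and the
    softmax placement on the value branch are illustrative assumptions."""
    def __init__(self, n_v, d_k=64, heads_d=4, kernel_m=3, tau=0.1):
        super().__init__()
        self.to_q = nn.Linear(n_v, d_k)
        self.to_k = nn.Linear(n_v, d_k)
        self.d_k, self.tau = d_k, tau
        # D learnable 2D kernels expand the S x S correlation into D structural heads
        self.conv_u = nn.Conv2d(1, heads_d, kernel_m, padding=kernel_m // 2)
        # D 1D kernels applied along the flattened spatial axis of the value
        self.conv_v = nn.Conv1d(1, heads_d, kernel_m, padding=kernel_m // 2)

    def forward(self, x):                                # x: (S, N, N)
        s, n, _ = x.shape
        flat = x.reshape(s, n * n)                       # (S, Nv)
        q, k, v = self.to_q(flat), self.to_k(flat), flat
        u = q @ k.t() / self.d_k ** 0.5                  # (S, S) Query-Key correlation
        mask = (torch.rand_like(u) < self.tau).float()   # random binary sparse mask M
        u = u * mask                                     # disable most Query-Key pairs
        u_d = torch.softmax(self.conv_u(u[None, None]).squeeze(0), dim=-1)  # (D, S, S)
        v_d = torch.softmax(self.conv_v(v.unsqueeze(1)), dim=-1)            # (S, D, Nv)
        c = torch.einsum('dik,dkj->ij', u_d, v_d.permute(1, 0, 2))          # (S, Nv)
        return x + c.reshape(s, n, n)                    # residual connection to X

block = SparseStructuralAttention(n_v=7 * 7)
out = block(torch.randn(12, 7, 7))                       # 12 sequential views
print(out.shape)  # torch.Size([12, 7, 7])
```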

3.2.5. Model Structure Overview

The structure of this model consists of multiple modules, each processing the input multi-view features through different operations to achieve the final goal of classification. Table 1 summarizes the layer-wise input and output dimensions of the proposed network architecture.
Our model adopts a hierarchical structure, progressively processing the input feature maps through multiple modules to accomplish the multi-view object classification task. The first layer is the FEM, which transforms the input multi-view images into more compact feature representations. The following layers employ the SSSA and TSSA modules, which capture spatial and temporal features, respectively, while gradually reducing the spatial resolution of the feature maps and retaining their important information. Finally, the Classification Head uses the extracted features for object classification.
The input feature maps in the model undergo a transformation from S × 3 × H × W to S × C × N × N in the first layer. After multiple SSSA and TSSA layers, the spatial resolution of the feature maps progressively decreases from S × C × N × N to S × C × N/4 × N/4. The Classification Head then maps these features to the specific target classes to perform the classification task.
The hierarchical design of the model not only improves computational efficiency but also effectively captures the spatial–temporal feature relationships, enabling the model to perform well when processing multi-view inputs.
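To visualize the dimension flow only, the skeleton below replaces the SSSA/TSSA pairs with plain pooling stages so that the tensor shapes match the hierarchy described above; it is a shape-checking aid with hypothetical stage counts and sizes, not the actual model.

```python
import torch
import torch.nn as nn

class MSSAViTSkeleton(nn.Module):
    """Shape-flow skeleton: attention blocks are replaced by average pooling so the
    sizes follow S x C x N x N -> S x C x N/2 x N/2 -> S x C x N/4 x N/4 -> classes.
    The real SSSA/TSSA modules and the exact number of stages come from the paper."""
    def __init__(self, num_classes=10, s=12, c=160, n=8):
        super().__init__()
        self.stage1 = nn.AvgPool2d(2)   # stands in for an SSSA/TSSA pair, N -> N/2
        self.stage2 = nn.AvgPool2d(2)   # second pair, N/2 -> N/4
        self.head = nn.Linear(s * c * (n // 4) ** 2, num_classes)

    def forward(self, feats):           # feats: (B, S, C, N, N) from the FEM
        b, s, c, n, _ = feats.shape
        x = self.stage1(feats.reshape(b * s, c, n, n))
        x = self.stage2(x)
        return self.head(x.reshape(b, -1))

logits = MSSAViTSkeleton()(torch.randn(2, 12, 160, 8, 8))
print(logits.shape)  # torch.Size([2, 10])
```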

4. Experiments and Analysis

4.1. Dataset and Experimental Configuration

We conducted comparative experiments against state-of-the-art methods in 3D object recognition on ModelNet10 and ModelNet40. The ModelNet40 dataset contains 12,311 3D objects distributed across 40 categories, while its subset ModelNet10 includes 4839 objects categorized into 10 classes. Figure 5 illustrates representative images generated through the two sequential-view capturing schemes discussed in Section 3.1. In the comparative experiments, the average class accuracy and the average instance accuracy were employed as metrics for evaluating the performance of different methods. To ensure a fair comparison, ensemble methods were excluded from the evaluation, and our approach was benchmarked against existing methods based on distinct 3D data representations.

4.2. Definition of Loss with Dynamic Weight Adjustment

During the training process, we observed that certain classes within the ModelNet dataset exhibited significant inter-class similarity. This phenomenon posed substantial challenges for the model, as it struggled to distinguish between visually similar classes, leading to poor classification performance for these hard-to-separate categories. Specifically, the model tended to focus disproportionately on easier-to-classify samples, which exacerbated the misclassification of difficult samples and further skewed the learning process.
To address this issue, we introduced a dynamic weight adjustment strategy into the training pipeline. The primary objective of this strategy is to adaptively reweight the contribution of each class in the loss function based on its classification difficulty. By dynamically increasing the weights of harder-to-classify classes, the model is encouraged to allocate more capacity to learning discriminative features for these challenging categories. This approach mitigates the negative impact of inter-class similarity and ensures that the model pays appropriate attention to both easy and difficult classes throughout the training process. To address class imbalance and dynamically adjust the importance of each class during training, we introduced a weight adjustment mechanism based on the classification difficulty of each class. Let d i denote the classification difficulty of class i, where a higher value of d i indicates a more challenging class. The weight w i for class i is computed as follows:
$$w_i = \frac{1}{(d_i + \xi)^{\frac{1}{\tau}}}$$
where ξ is a small constant added to avoid division by zero, and τ is a hyperparameter that controls the sensitivity of the weight adjustment to class difficulty.
The weight function reduces the influence of easy-to-classify classes and increases the focus on harder-to-classify classes by applying a temperature scaling factor to the difficulty. This ensures that the model is guided to focus more on difficult classes during training.
To maintain a balanced learning rate across all classes, the weights are normalized such that the sum of all class weights equals the total number of classes $N$. The normalized weight $w_i^{\prime}$ for each class is calculated by

$$w_i^{\prime} = \frac{w_i}{\sum_{j=1}^{N} w_j} \times N$$
This normalization ensures that the relative importance of each class is maintained while adjusting for any imbalance in class representation. The final loss function used for training is the weighted cross-entropy loss, which incorporates the dynamically adjusted and normalized class weights. The weighted cross-entropy loss for a given class i is defined as
$$L(y, \hat{y}) = -\sum_{i=1}^{N} w_i^{\prime}\, y_i \log(\hat{y}_i)$$

where $y_i$ is the ground truth label for class $i$, $\hat{y}_i$ is the predicted probability for class $i$, and $w_i^{\prime}$ is the normalized weight for class $i$. By applying this weighted loss function, the model places greater emphasis on the harder-to-classify classes, as indicated by the dynamically adjusted weights. This mechanism helps the model focus on improving performance in classes where it is most likely to make errors, thereby enhancing overall classification accuracy.
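A sketch of the dynamic weighting and the weighted cross-entropy loss in PyTorch. How the per-class difficulty $d_i$ is measured during training is not specified above, so the difficulty values, the choice of $\tau$, and $\xi$ below are placeholders.

```python
import torch
import torch.nn.functional as F

def dynamic_class_weights(difficulty, tau=0.5, xi=1e-6):
    """w_i = 1 / (d_i + xi)^(1/tau), then normalized so the weights sum to the
    number of classes N, following the equations above."""
    w = 1.0 / (difficulty + xi) ** (1.0 / tau)
    return w / w.sum() * w.numel()

def weighted_cross_entropy(logits, targets, difficulty, tau=0.5):
    # F.cross_entropy applies the per-class weights to the log-probabilities,
    # which matches the weighted cross-entropy definition above.
    return F.cross_entropy(logits, targets, weight=dynamic_class_weights(difficulty, tau))

# Placeholder per-class difficulty scores for a 10-class problem
difficulty = torch.tensor([0.1, 0.2, 0.1, 0.6, 0.5, 0.1, 0.2, 0.7, 0.1, 0.1])
loss = weighted_cross_entropy(torch.randn(8, 10), torch.randint(0, 10, (8,)), difficulty)
print(loss.item())
```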

4.3. Experimental Configuration

In this work, we employed the Adam optimizer with weight decay as the primary optimization method. The Adam optimizer was chosen for its adaptive moment estimation properties, which help accelerate convergence by adjusting the learning rate for each parameter based on the first and second moments of the gradients. The learning rate was initialized at 3 × 10⁻³, and the weight decay was set to 1 × 10⁻⁴ to prevent overfitting by applying L2 regularization to the model parameters.
To improve convergence and avoid becoming stuck in local minima, we adopted the Cosine Annealing Warm Restarts strategy for the learning rate. This scheduler reduces the learning rate in a cosine fashion during each cycle, with periodic restarts to explore a wider range of solutions. We set the initial period, period multiplier, and minimum learning rate to 30, 3, and 1 × 10⁻⁵, respectively. This learning rate strategy is designed to enable fast exploration during early epochs while facilitating more fine-grained convergence in later stages of training through periodic restarts and learning rate fluctuations.
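These training settings map directly onto standard PyTorch components; the sketch below uses the stated hyperparameters, with a placeholder module standing in for the trainable part of the network.

```python
import torch

model = torch.nn.Linear(7840, 40)  # placeholder for the trainable part of MSSAViT

optimizer = torch.optim.Adam(model.parameters(), lr=3e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=30,        # initial period (in epochs) before the first restart
    T_mult=3,      # each subsequent period is three times longer
    eta_min=1e-5,  # minimum learning rate
)

for epoch in range(120):
    # ... one training epoch over the multi-view batches goes here ...
    scheduler.step()  # advance the cosine schedule once per epoch
```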
To evaluate the classification performance of the 3D object recognition model, two standard metrics were adopted [20]: the average class accuracy, defined as the mean classification accuracy over all categories, and the average instance accuracy, defined as the overall accuracy across all test samples.

4.4. Main Results

We first evaluated MSSAViT on the ModelNet10 dataset and compared its performance with other state-of-the-art methods. In Table 2, owing to the different view-capturing schemes, our method with top-view sequential images as input is denoted MSSAViT-TV, while MSSAViT-SV takes surround-view sequential images as the network input.
Table 2 compares the performances of the proposed MSSAViT models with those of state-of-the-art methods across point-based, voxel-based, and view-based representations for 3D object classification. Our models, MSSAViT-TV and MSSAViT-SV, achieve 97.41% and 98.40% average class accuracy, respectively, with the sequential view configuration (MSSAViT-SV) reaching a 98.46% instance accuracy, rivaling or surpassing all other approaches. Point-based methods like Point2Sequence (95.10% class accuracy) and SO-Net (93.90%) require dense point clouds (1024 × 3 or 2048 × 3), resulting in significant computational costs. Voxel-based methods such as VoxNet (92.00%) and LightNet (93.39%) suffer from spatial resolution loss during voxelization, limiting their performance. Hybrid approaches, such as FusionNet (93.11%), also fall short of view-based methods.
View-based models consistently outperform other representations. MVMSAN achieves a 98.42% class accuracy with 20 views but at a high computational cost, while VCGR-Net offers a better balance with a 97.25% accuracy using 12 views. Our MSSAViT-SV model matches the performance of MVMSAN but requires only 12 views, demonstrating a superior trade-off between accuracy and efficiency. The confusion matrices of both our methods are shown in Figure 6.
The success of MSSAViT-SV is attributed to its SSSA and TSSA mechanisms, which effectively capture spatial and sequential dependencies across views. Additionally, its lightweight architecture ensures high accuracy with significantly lower computational demands compared to multi-view methods like MVMSAN.
To further validate the performance of MSSAViT, we conducted more comparative experiments on the larger ModelNet40 dataset.
Table 3 highlights the performance comparison between our proposed models, MSSAViT-TV and MSSAViT-SV, and existing state-of-the-art methods across mesh-, point-, voxel-, and view-based representations for 3D object classification. Our models achieve 94.91% and 95.31% in average class accuracy, with 96.07% and 96.11% in average instance accuracy, respectively. These results demonstrate competitive or superior performance to existing methods, especially within view-based approaches.
Mesh-based methods, such as MeshNet, achieve moderate performance (91.90% class accuracy) but are constrained by the irregularity of mesh structures and computational complexity. Point-based methods, like Point2Sequence and PointNet++, provide solid accuracy (90.40% and 91.90%, respectively) but require dense point clouds, which can lead to increased memory and computational overhead. Voxel-based methods, such as VoxNet (83.00%) and LightNet (86.90%), suffer from quantization artifacts and the loss of spatial resolution during voxelization, limiting their effectiveness.
View-based approaches generally outperform mesh-, point-, and voxel-based representations due to their ability to capture rich spatial and structural features. Among these, MVCNN achieves a 90.10% class accuracy using 80 views, while MVMSAN improves to 95.68% with 20 views, marginally surpassing our method by 0.37% (95.68% vs. 95.31%). Despite the minor performance improvement, MVMSAN experiences a substantial increase in model size, expanding by 44.26% (from 6.1 million to 8.8 million parameters, as shown in Table 4), and incurs higher computational overhead. Similarly, MVDAN achieves 95.50% accuracy on 12 views, slightly exceeding our method by 0.19%, yet its parameter count is inflated dramatically by 20.31× compared to our lightweight architecture. Furthermore, VCGR-Net, MVContrast, MVCVT, DILF, and SelectiveMV all exhibit inferior performance to our approach, failing to achieve competitive accuracy even with comparable or higher computational complexity. Sequential-view methods, such as SV2SL (91.12%) and 3D2SV (91.51%), leverage temporal dependencies but fall short of capturing intricate structural details. The confusion matrices of both our methods are shown in Figure 7.

5. Discussion

5.1. Model Parameter Comparison

The comparison of parameter counts with classical methods is shown in Table 4; only the average instance accuracy was chosen here as the evaluation metric. The comparative results demonstrate the remarkable efficiency and effectiveness of our proposed models, MSSAViT-TV and MSSAViT-SV, in multi-view 3D object classification tasks. Both models outperformed the majority of existing approaches across the ModelNet10 and ModelNet40 datasets while maintaining a significantly lower parameter count. For instance, MSSAViT-SV achieved an accuracy of 98.57% on ModelNet10 and 96.11% on ModelNet40, matching MVMSAN on ModelNet10 (98.57%) and approaching it on ModelNet40 (96.96%) despite utilizing only 6.1 M parameters, compared to MVMSAN's more than 8.8 M. Furthermore, in comparison to methods like FusionNet (93.11% on ModelNet10, 90.80% on ModelNet40, 118 M parameters) and MVCNN (90.10% on ModelNet40, >130 M parameters), our models achieve superior performance with drastically reduced computational costs. This efficiency stems from the integration of a lightweight MobileNet backbone with our structural and temporal self-attention mechanisms, which effectively capture critical spatial and sequential dependencies. The results validate the potential of our approach to deliver high accuracy with minimal model complexity, making it well suited for real-world applications where computational resources are limited.

5.2. The Output Selection of FEM for Lightweight Model Design

In this study, MobileNetV3 was employed as a fixed feature extractor, responsible for extracting fundamental visual features from the input data without participating in the training process. This design ensures consistent feature extraction across all viewpoints, eliminating potential inconsistencies caused by parameter updates during training. Furthermore, the output size of the FEM can be flexibly adjusted among the following configurations: 7 × 7 × 160, 14 × 14 × 112, 28 × 28 × 40, and 56 × 56 × 24. The relationship between FEM output size, total parameters, and trainable parameters is summarized in Table 5.
The output size of the FEM determines the dimensions fed into the subsequent Sparse Structural Attention module, which significantly affects the total and trainable parameters of the model. In particular, SSANet relies heavily on the spatial size $N \times N$ of the input: the dimensions $N_q$, $N_k$, and $N_v$ of the Query, Key, and value are all equal to $N \times N$, so as the feature map size increases from 7 × 7 to 56 × 56, the quadratic growth of $N^2$ leads to a drastic increase in parameters. At 7 × 7, $N^2 = 49$ results in a lightweight model with only 6.1 M total parameters, but at 56 × 56, $N^2 = 3136$ causes the total parameters to surge to 152.4 M.
To strike a balance between performance and complexity, we selected 7 × 7 × 160 as the FEM configuration. Our model’s total parameter count is only 6.1 M, with 1.5 M trainable parameters, significantly reducing the computational and storage requirements.
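As a quick illustration of this scaling, the snippet below simply lists the number of attended positions $N_v = N \times N$ for each candidate FEM output size; any layer whose parameter count grows with $N_v$ inflates accordingly. This is a rough back-of-the-envelope aid, not the full parameter accounting reported in Table 5.

```python
# Number of token positions the attention operates over for each FEM output size.
# Layers whose parameters are proportional to N_v (e.g., projections over the
# flattened spatial axis) grow with the square of the feature-map side length.
for n in (7, 14, 28, 56):
    n_v = n * n
    print(f"N = {n:2d}  ->  N_v = N*N = {n_v:4d}  ({n_v / 49:.0f}x the 7x7 setting)")
```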

5.3. Ablation Study on Structural Attention Modules

To validate the effectiveness of the proposed Spatial Sparse Structural Attention (SSSA) and Temporal Sparse Structural Attention (TSSA) modules, we conducted comprehensive ablation experiments on the ModelNet dataset. The experiments evaluated three configurations: (1) an SSSA-only model with spatial attention only, (2) a TSSA-only model with temporal attention only, and (3) the full model with both the SSSA and TSSA modules. All experiments used the surround-view capturing scheme with identical training protocols.
As shown in Table 6, we conducted ablation experiments using the MSSAViT-SV model as the benchmark. On ModelNet10, MSSAViT-SV with only the SSSA module achieves 94.88% average class accuracy and 95.04% average instance accuracy, while MSSAViT-SV with only the TSSA module demonstrates superior performance with 96.79% and 97.25%, respectively. Notably, their synergistic integration in the full MSSAViT architecture yields significant enhancements, reaching the best accuracies of 98.40% in average class accuracy and 98.46% in average instance accuracy, confirming the complementary nature of spatial and temporal modeling. We find similar trends on the larger-scale ModelNet40 benchmark: the combined model outperforms the SSSA-only model by 3.26% in average class accuracy and 1.94% in average instance accuracy, and the TSSA-only model by 1.86% and 1.42%, respectively. This progressive improvement across both datasets demonstrates the scalability of handling fine-grained inter-class variations. The experiments establish that the TSSA module contributes more to semantic feature discrimination than the SSSA module, and that their joint optimization enables global temporal and local structural context aggregation beyond a simple linear combination of their individual effects.
We conducted an in-depth analysis of the impact of the SSSA and TSSA modules on the recognition results using ModelNet10 as an example. Figure 8 shows the confusion matrices of the SSSA-only and TSSA-only models. Comparative analysis revealed that the largest recognition errors were concentrated in the desk, dresser, and nightstand categories. In Figure 9, we present typical samples of these three categories. Their main characteristic is great similarity in appearance and structure, especially from the side and top views. However, the feature that distinguishes a desk from the other two is the hollowed-out area under the desktop, whereas the difference between a dresser and a nightstand lies in the table legs and the relationship between the tabletop and the main body (the tabletop is slightly wider than the main body). Through Spatial and Temporal Structural Attention, local and global structural semantic features can be learned jointly, and the ablation experiments confirmed the effectiveness of these two modules.

6. Conclusions

In this paper, we proposed MSSAViT, a lightweight yet effective model for 3D object classification that combines the SSSA and TSSA modules. By leveraging MobileNetV3 as a fixed feature extractor, our model ensures consistent and efficient feature representation across multiple views while significantly reducing computational complexity. The hierarchical attention mechanisms in SSSA and TSSA progressively refine feature maps by capturing critical spatial and temporal dependencies. With these innovations, our model achieves promising results, improving accuracy compared to some existing methods while maintaining a good balance between performance and efficiency. Overall, this work highlights the potential of structural attention in advancing lightweight and high-performance multi-view 3D classification.
However, our algorithm also has two main drawbacks: (1) the learning of temporal features with a fixed number of frames is not refined enough, and (2) the random operations in the sparse attention mechanism make it difficult for the model to converge quickly during training. In future work, we will investigate how to learn key temporal feature combinations from continuous views more flexibly; optimizing the sparse attention mechanism is another key direction for improvement.

Author Contributions

Conceptualization, J.B. and Q.K.; methodology, G.Z.; validation, K.L.; formal analysis, L.H.; investigation, Q.K.; resources, J.B.; data curation, G.Z.; writing—original draft preparation, J.B.; writing—review and editing, Q.K.; visualization, L.H. and G.Z.; supervision, Q.K.; project administration, Q.K.; funding acquisition, J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the International Science and Technology Cooperation Project (Grant No. 2021-2-GH004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the website at https://modelnet.cs.princeton.edu (accessed on 5 February 2023).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Jianjun Bao and Ke Luo are employees of Tiandi (Changzhou) Automation Co., Ltd. Guo Zhao is an employee of Beijing Huahang Institute of Radio Measurement. The authors declare that this study received funding from the International Science and Technology Cooperation Project. The funder had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
UWB: Ultra-Wideband;
CNN: Convolutional Neural Network;
ViTs: Vision Transformers;
MSSAViT: Multi-head Sparse Structural Attention-based Vision Transformer;
FEM: Feature Extraction Module;
SSSA: Spatial Sparse Self-Attention;
TSSA: Temporal Sparse Self-Attention;
StructSA: Structural Self-Attention;
SQKA: Structural Query–Key Attention.

References

  1. Wang, Q.; Song, J.; Du, C.; Wang, C. Online Scene Semantic Understanding Based on Sparsely Correlated Network for AR. Sensors 2024, 24, 4756. [Google Scholar] [CrossRef] [PubMed]
  2. Luo, H.; Zhang, J.; Liu, X.; Zhang, L.; Liu, J. Large-Scale 3D Reconstruction from Multi-View Imagery: A Comprehensive Review. Remote Sens. 2024, 16, 773. [Google Scholar] [CrossRef]
  3. Zhou, L.; Tan, J.; Fu, J.; Shao, G. Fast 3D Transmission Tower Detection Based on Virtual Views. Appl. Sci. 2025, 15, 947. [Google Scholar] [CrossRef]
  4. Al-Okby, M.F.R.; Junginger, S.; Roddelkopf, T.; Thurow, K. UWB-Based Real-Time Indoor Positioning Systems: A Comprehensive Review. Appl. Sci. 2024, 14, 11005. [Google Scholar] [CrossRef]
  5. Li, J.; He, X.; Zhou, C.; Cheng, X.; Wen, Y.; Zhang, D. ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 90–106. [Google Scholar]
  6. Kim, M.; Seo, P.H.; Schmid, C.; Cho, M. Learning Correlation Structures for Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 18941–18951. [Google Scholar]
  7. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.X.; Wang, W.J.; Zhu, Y.K.; Pang, R.M.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  8. Qi, C.R.; Su, H.; Mo, K.C.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  9. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114. [Google Scholar]
  10. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 17–18 December 2015; pp. 945–953. [Google Scholar]
  11. Han, Z.Z.; Liu, Z.B.; Vong, C.M.; Bu, S.H.; Li, X.L. Unsupervised 3D local feature learning by circle convolutional restricted Boltzmann machine. IEEE Trans. Image Process. 2016, 11, 5331–5344. [Google Scholar] [CrossRef]
  12. Han, Z.Z.; Liu, Z.B.; Han, J.W.; Vong, C.M.; Bu, S.H.; Chen, C.L.P. Mesh convolutional restricted Boltzmann machines for unsupervised learning of features with structure preservation on 3-D meshes. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2268–2281. [Google Scholar] [CrossRef]
  13. Feng, Y.T.; Feng, Y.F.; You, H.X.; Zhao, X.B.; Gao, Y. MeshNet: Mesh neural network for 3D shape representation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; pp. 8279–8286. [Google Scholar]
  14. Dai, J.; Fan, R.; Song, Y. MEAN: An attention-based approach for 3D mesh shape classification. Visual Comput. 2024, 40, 2987–3000. [Google Scholar] [CrossRef]
  15. Li, J.X.; Chen, B.M.; Lee, G.H. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 9397–9406. [Google Scholar]
  16. Liu, X.H.; Han, Z.Z.; Liu, Y.S.; Zwicker, M. Point2Sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; pp. 8778–8785. [Google Scholar]
  17. Sun, S.F.; Rao, Y.M.; Lu, J.W.; Yan, H.B. X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5074–5083. [Google Scholar]
  18. Wu, Z.R.; Song, S.R.; Khosla, A.; Yu, F.; Zhang, L.G.; Tang, X.O.; Xiao, J.X. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE International Conference on Computer Vision Workshops (CVPR), Boston, MA, USA, 11–12 June 2015; pp. 1912–1920. [Google Scholar]
  19. Hamdi, A.; Giancola, S.; Ghanem, B. Mvtn: Multi-view transformation network for 3d shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10 March 2021; pp. 1–11. [Google Scholar]
  20. Qi, S.; Ning, X.; Yang, G.; Zhang, L.; Long, P.; Cai, W.; Li, W. Review of multi-view 3D object recognition methods based on deep learning. Displays 2021, 69, 102053. [Google Scholar] [CrossRef]
  21. Wei, X.; Yu, R.; Sun, J. View-gcn: View-based graph convolutional network for 3d shape analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1850–1859. [Google Scholar]
  22. Han, Z.Z.; Shang, M.Y.; Liu, Z.B.; Vong, C.M.; Liu, Y.S.; Zwicker, M.; Han, J.W.; Chen, C.L.P. SeqViews2SeqLabels: Learning 3D global features via aggregating sequential views by RNN with attention. IEEE Trans. Image Process. 2019, 28, 658–672. [Google Scholar] [CrossRef]
  23. Han, Z.Z.; Lu, H.L.; Liu, Z.B.; Vong, C.M.; Liu, Y.S.; Zwicker, M.; Han, J.W.; Chen, C.L.P. 3D2SeqViews: Aggregating sequential views for 3D global feature learning by CNN with hierarchical attention aggregation. IEEE Trans. Image Process. 2019, 28, 3986–3999. [Google Scholar] [CrossRef]
  24. Liang, Q.; Wang, Y.; Nie, W.; Li, Q. MVCLN: Multi-View convolutional LSTM network for cross-media 3D shape recognition. IEEE Access 2020, 8, 139792–139802. [Google Scholar] [CrossRef]
  25. Alexey, D.; Philipp, F.; Eddy, I.; Philip, H.; Caner, H.; Vladimir, G.; Patrick, V.D.S.; Daniel, C.; Thomas, B. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 2758–2766. [Google Scholar]
  26. Yang, G.S.; Ramanan, D. Volumetric Correspondence Networks for Optical Flow. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8–14. [Google Scholar]
  27. Kong, Y.; Yun, F. Human action recognition and prediction: A survey. Int. J. Comput. Vision 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
  28. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef] [PubMed]
  29. Anirudh, T.; Sanath, N.; Salman, K.; Rao, M.A.; Fahad, S.K.; Bernard, G. Spatio-temporal relation modeling for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 19958–19967. [Google Scholar]
  30. Kim, M.J.; Kwon, H.; Wang, C.Y.; Kwak, S.; Cho, M. Relational self-attention: What’s missing in attention for video understanding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 8046–8059. [Google Scholar]
  31. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–27 July 2017; pp. 6299–6308. [Google Scholar]
  32. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  33. Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 813–824. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10 March 2021; pp. 10012–10022. [Google Scholar]
  35. Yang, Y.Q.; Feng, C.; Shen, Y.R.; Tian, D. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 19–21 June 2018; pp. 206–215. [Google Scholar]
  36. Maurana, D.; Scherer, S. Voxnet: A 3d convolutional neural network for real-time object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 922–928. [Google Scholar]
  37. Zhi, S.F.; Liu, Y.X.; Li, X.; Guo, Y.L. LightNet: A lightweight 3d convolutional neural network for real-time 3d object recognition. In Proceedings of the 10th Eurographics Workshop on 3D Object Retrieval (3DOR), Lyon, France, 23–24 April 2017. [Google Scholar]
  38. Sinha, A.; Bai, J.; Ramani, K. Deep learning 3D shape surfaces using geometry images. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 223–240. [Google Scholar]
  39. Wang, W.; Wang, X.; Chen, G.; Zhou, H. Multi-view SoftPool attention convolutional networks for 3D model classification. Front. Neurorobotics 2022, 16, 1029968. [Google Scholar] [CrossRef]
  40. Xu, R.; Mi, Q.; Ma, W.; Zha, H. View-relation constrained global representation learning for multi-view-based 3D object recognition. Appl. Intell. 2023, 53, 7741–7750. [Google Scholar] [CrossRef]
  41. Hegde, V.; Zadeh, R. Fusionnet: 3d object classification using multiple data representations. arXiv 2016, arXiv:1607.05695. [Google Scholar]
  42. Wang, C.; Cheng, M.; Sohel, F.; Bennamoun, M.; Li, J. NormalNet: A voxel-based CNN for 3D object classification and retrieval. Neurocomputing 2018, 323, 139–147. [Google Scholar] [CrossRef]
  43. Zhou, W.; Jiang, X.; Liu, Y.H. MVPointNet: Multi-view network for 3D object based on point cloud. IEEE Sens. J. 2019, 19, 12145–12152. [Google Scholar] [CrossRef]
  44. Wang, W.; Cai, Y.; Wang, T. Multi-view dual attention network for 3D object recognition. Neural Comput. Appl. 2022, 34, 3201–3212. [Google Scholar] [CrossRef]
  45. Wang, L.; Xu, H.; Kang, W. MVContrast: Unsupervised Pretraining for Multi-view 3D Object Recognition. Mach. Intell. Res. 2023, 20, 872–883. [Google Scholar] [CrossRef]
  46. Li, J.; Liu, Z.; Li, L.; Lin, J.; Yao, J.; Tu, J. Multi-view convolutional vision transformer for 3D object recognition. J. Vis. Commun. Image. Represent. 2023, 95, 103906. [Google Scholar] [CrossRef]
  47. Ning, X.; Yu, Z.; Li, L.; Li, W.; Tiwari, P. DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding. Inf. Fusion 2024, 102, 102033. [Google Scholar] [CrossRef]
  48. Alzahrani, M.; Usman, M.; Anwar, S.; Helmy, T. Selective Multi-View Deep Model for 3D Object Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 728–736. [Google Scholar]
  49. Ren, H.; Wang, J.; Yang, M.; Velipasalar, S. PointOfView: A Multi-modal Network for Few-shot 3D Point Cloud Classification Fusing Point and Multi-view Image Features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 784–793. [Google Scholar]
Figure 1. Spatial and Temporal Sparse Structural Self-Attention. Ordered observation and rendering of a 3D object model produces a sequence of continuous view images, and a 2D CNN backbone extracts a feature map from each view. The view feature maps are then channel-aligned so that the temporal features in each channel share the same frequency characteristics. Spatial structured feature learning is applied to the feature maps of different channels within the same frame, and temporal structured feature learning is applied to the feature maps of the same channel across frames, yielding joint spatial–temporal modeling of the sequential view images.
Figure 2. Proposed sequential-view capturing schemes: (a) top-view scheme; (b) surround-view scheme.
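As a rough aid to reproducing the rendering setup illustrated in Figures 2 and 5, the sketch below generates an ordered ring of camera positions on a viewing sphere for the surround-view case. The 12 views, the 30° elevation, and the radius are illustrative assumptions only, not the paper's exact rendering parameters (the top-view scheme is not sketched here).

```python
import numpy as np

def surround_view_cameras(num_views=12, elevation_deg=30.0, radius=2.0):
    """Ordered camera positions on a viewing sphere, all looking at the origin.
    The 12 views, 30-degree elevation and radius are assumptions for illustration."""
    azimuths = np.deg2rad(np.linspace(0.0, 360.0, num_views, endpoint=False))
    elev = np.deg2rad(elevation_deg)
    x = radius * np.cos(elev) * np.cos(azimuths)
    y = radius * np.cos(elev) * np.sin(azimuths)
    z = radius * np.sin(elev) * np.ones_like(azimuths)
    return np.stack([x, y, z], axis=1)   # shape: (num_views, 3), in viewing order

print(surround_view_cameras().round(3))
```

Because MSSAViT treats the rendered views as a sequence, what matters is the monotonically increasing azimuth (i.e., a consistent viewing order), not the absolute starting angle.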
Figure 3. Detailed architecture of the proposed MSSAViT. A prescribed observation scheme renders ordered, continuous view images, and MobileNet serves as the backbone to extract an intermediate feature map (rather than a feature vector) for each view; concatenating these maps yields a temporal multi-channel feature tensor that is fed into the core Multi-head Sparse Self-Attention network for multiple rounds of feature learning. In the SSSA module, dimension transformation operations restrict feature extraction to the channel-aligned feature maps under the same view across the S view feature maps; in the TSSA module, temporal–structural attention is learned across the S continuous views that share the same frequency features. After T rounds of such learning, the spatial–temporal representation is strengthened, and the fully connected layer outputs the category probabilities.
Figure 4. Temporal/Spatial Sparse Structural Attention. For an input feature map x, linear layers construct the Query (Q), Key (K), and Value (V) matrices following the classic transformer formulation, and the Q–K correlation matrix is obtained by matrix multiplication. The SQKA module then derives the multi-head structured Q–K features, while V is passed through a 2D convolutional layer to produce value features with the same number of heads. Multi-head contextual value aggregation yields an aggregated feature for each head, the head features are summed, and a skip connection to the input feature map x accelerates the model's convergence.
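The data flow in Figure 4 can be summarized in a short module. The sketch below is not the authors' implementation: the SQKA step is replaced by a plain top-k sparsified softmax, and all layer sizes are placeholder assumptions. Only the overall flow follows the caption: linear Q/K/V projections, a Q–K correlation matrix, a convolutional multi-head value path, per-head contextual aggregation, summation over heads, and a residual connection.

```python
import torch
import torch.nn as nn

class SparseStructuralAttentionSketch(nn.Module):
    """Data-flow sketch of the block in Figure 4 (not the paper's implementation).

    Tokens are T feature maps of size N x N (T = C channels for SSSA,
    T = S views for TSSA). The SQKA step is replaced here by a top-k
    sparsified softmax, which is an assumption for illustration only."""

    def __init__(self, num_tokens, map_size, num_heads=4, top_k=8):
        super().__init__()
        d = map_size * map_size
        self.num_heads, self.top_k = num_heads, top_k
        self.q_proj, self.k_proj, self.v_proj = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        # 2D convolution over the token axis produces one value tensor per head.
        self.v_conv = nn.Conv2d(num_tokens, num_tokens * num_heads, kernel_size=3, padding=1)

    def forward(self, x):                                    # x: (B, T, N, N)
        b, t, n, _ = x.shape
        tokens = x.flatten(2)                                # (B, T, N*N)
        q, k = self.q_proj(tokens), self.k_proj(tokens)
        attn = q @ k.transpose(1, 2) / (n * n) ** 0.5        # Q-K correlation, (B, T, T)
        # Stand-in for SQKA: keep only the top-k correlations per query token.
        mask = torch.full_like(attn, float("-inf"))
        mask.scatter_(-1, attn.topk(min(self.top_k, t), dim=-1).indices, 0.0)
        attn = torch.softmax(attn + mask, dim=-1)
        v = self.v_conv(self.v_proj(tokens).view(b, t, n, n))    # (B, H*T, N, N)
        v = v.view(b, self.num_heads, t, n * n)
        # Multi-head contextual value aggregation, then summation over the heads.
        out = (attn.unsqueeze(1) @ v).sum(dim=1).view(b, t, n, n)
        return out + x                                       # skip connection to the input

# Shape check with arbitrary sizes (12 tokens of 7 x 7 maps, chosen for illustration).
block = SparseStructuralAttentionSketch(num_tokens=12, map_size=7)
print(block(torch.randn(2, 12, 7, 7)).shape)                 # torch.Size([2, 12, 7, 7])
```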
Figure 5. Illustration of sequential-view capturing schemes.
Figure 6. Confusion matrices of (a) MSSAViT-TV and (b) MSSAViT-SV on ModelNet10.
Figure 7. Confusion matrices of (a) MSSAViT-TV and (b) MSSAViT-SV on ModelNet40.
Figure 8. Confusion matrices of (a) MSSAViT-TV (SSSA-only) and (b) MSSAViT-TV (TSSA-only) on ModelNet10.
Figure 9. Hard-to-classify samples.
Table 1. Overview of layer-wise model architecture.

No. | Layer/Module Name | Input Shape | Output Shape
1 | FEM | S × 3 × H × W | S × C × N × N
2 | SSSA | S × C × N × N | S × C × N × N
3 | Dimensional Resetting Layer | S × C × N × N | C × S × N × N
4 | TSSA | C × S × N × N | C × S × N × N
5 | Dimensional Resetting Layer | C × S × N × N | S × C × N × N
6 | SSSA | S × C × N × N | S × C × N/2 × N/2
7 | Dimensional Resetting Layer | S × C × N/2 × N/2 | C × S × N/2 × N/2
8 | TSSA | C × S × N/2 × N/2 | C × S × N/2 × N/2
9 | Dimensional Resetting Layer | C × S × N/2 × N/2 | S × C × N/2 × N/2
10 | SSSA | S × C × N/2 × N/2 | S × C × N/4 × N/4
11 | Dimensional Resetting Layer | S × C × N/4 × N/4 | C × S × N/4 × N/4
12 | TSSA | C × S × N/4 × N/4 | C × S × N/4 × N/4
13 | Classification Head | C × S × H × W | Number of Classes
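Read row by row, Table 1 alternates SSSA, a dimensional resetting (axis permutation) layer, and TSSA over three rounds while halving the map size twice. The snippet below only traces these tensor shapes, with pooling/identity stand-ins for FEM, SSSA, and TSSA; the sizes S = 12, C = 40, N = 28 are taken from the 12-view setting and the 28 × 28 × 40 row of Table 5 purely for illustration.

```python
import torch
import torch.nn.functional as F

def mssavit_shape_walkthrough(S=12, C=40, N=28, num_classes=40):
    """Trace the input/output shapes of Table 1 using stand-ins:
    average pooling for the SSSA downsampling, identity for TSSA."""
    x = torch.randn(S, C, N, N)                    # row 1, FEM output: S x C x N x N
    for out_n in (N, N // 2, N // 4):              # rows 2-12: three SSSA/TSSA rounds
        x = F.adaptive_avg_pool2d(x, out_n)        # SSSA stub: S x C x out_n x out_n
        x = x.permute(1, 0, 2, 3).contiguous()     # Dimensional Resetting: C x S x out_n x out_n
        # TSSA stub (rows 4/8/12) is shape-preserving, so nothing to do here.
        x = x.permute(1, 0, 2, 3).contiguous()     # reset again for the next round
        print(tuple(x.shape))                      # (12, 40, 28, 28), (12, 40, 14, 14), (12, 40, 7, 7)
    head = torch.nn.Linear(x.numel(), num_classes) # row 13: Classification Head stand-in
    return head(x.flatten()).shape

print(mssavit_shape_walkthrough())                 # torch.Size([40])
```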
Table 2. Comparative results on ModelNet10.

Methods | Representation Type | Points/View Number | Average Class Accuracy (%) | Average Instance Accuracy (%)
Point2Sequence [16] | Points | 1024 × 3 | 95.10 | 95.30
FoldingNet [35] | Points | - | - | 94.40
VoxNet [36] | Voxel | - | 92.00 | -
LightNet [37] | Voxel | - | 93.39 | -
Geometry image [38] | View | 1 | 88.40 | -
MVMSAN [39] | View | 20 | 98.42 | 98.57
VCGR-Net [40] | View | 12 | 97.24 | 97.79
SV2SL [22] | SeqView | 12 | 94.56 | 94.71
3D2SV [23] | SeqView | 12 | 94.68 | 94.71
FusionNet [41] | Voxel + View | -/20 | - | 93.11
NormalNet [42] | Voxel + Normal | - | 93.10 | -
SO-Net [15] | Points | 2048 × 3 | 93.90 | 94.10
MVPointNet [43] | Points + View | -/5 | 95.10 | 95.20
MSSAViT-TV (ours) | View | 12 | 97.41 | 97.58
MSSAViT-SV (ours) | View | 12 | 98.40 | 98.46
Table 3. Comparative results on ModelNet40.

Methods | Representation Type | Points/View Number | Average Class Accuracy (%) | Average Instance Accuracy (%)
MeshNet [13] | Mesh | - | 91.90 | -
Point2Sequence [16] | Point | 1024 × 3 | 90.40 | 92.60
FoldingNet [35] | Point | - | - | 88.40
PointNet [8] | Point | - | 86.20 | 89.20
PointNet++ [9] | Point | - | - | 91.90
VoxNet [36] | Voxel | - | 83.00 | -
LightNet [37] | Voxel | - | 86.90 | -
MVCNN [9] | View | 80 | 90.10 | -
Geometry image [38] | View | 1 | 83.90 | -
MVDAN [44] | View | 12 | 95.50 | 96.60
MVMSAN [39] | View | 6 | - | 96.84
MVMSAN [39] | View | 12 | - | 96.80
MVMSAN [39] | View | 20 | 95.68 | 96.96
VCGR-Net [40] | View | 12 | 93.33 | 95.62
MVContrast [45] | View | 12 | 90.24 | 92.54
MVCVT [46] | View | 12 | - | 95.40
DILF [47] | View | 6 | - | 94.80
SelectiveMV [48] | View | 1 | - | 88.13
PointOfView [49] | Point + View | -/6 | 92.17 | -
SV2SL [22] | SeqView | 12 | 91.12 | 93.31
3D2SV [23] | SeqView | 12 | 91.51 | 93.40
FusionNet [41] | Voxel + View | 20 | - | 90.80
MSSAViT-TV (ours) | View | 12 | 94.91 | 96.07
MSSAViT-SV (ours) | View | 12 | 95.31 | 96.11
Table 4. Comparison of parameter counts.

Methods | Representation Type | Backbone Network | Total Parameters | ModelNet10 Accuracy (%) | ModelNet40 Accuracy (%)
SV2SL [1] | SeqView | VGG | >130 M | 94.60 | 93.31
3D2SV [2] | SeqView | ResNet101 | >45 M | - | 93.40
MVCNN [4] | View | VGG | >130 M | - | 90.10
MVPointNet [38] | Points + View | VGG | >130 M | 95.20 | -
PVRNet [40] | Point + View | AlexNet | >60 M | - | 93.60
FusionNet [31] | Voxel + View | - | 118 M | 93.11 | 90.80
3DGAN [20] | Voxel | - | 11 M | 91.00 | 83.30
MVDAN [44] | View | VGG | >130 M | - | 96.60
MVMSAN [39] | View | ResNeSt14d | >8.8 M | 98.57 | 96.96
MVContrast [45] | View | ResNet18 | >11.7 M | - | 92.54
SelectiveMV [48] | View | ResNet152 | >60 M | - | 88.13
MSSAViT-TV (ours) | SeqView | MobileNet | 6.1 M | 98.45 | 96.07
MSSAViT-SV (ours) | SeqView | MobileNet | 6.1 M | 98.57 | 96.11
Table 5. Impact of FEM output sizes on model parameters.

Output Size of the FEM | Total Parameters | Trainable Parameters
7 × 7 × 160 | 6.1 M | 1.5 M
14 × 14 × 112 | 7.7 M | 3.1 M
28 × 28 × 40 | 19.3 M | 14.7 M
56 × 56 × 24 | 152.4 M | 147.9 M
Table 6. Quantitative comparison of different module combinations.

Method | ModelNet10 Average Class Accuracy (%) | ModelNet10 Average Instance Accuracy (%) | ModelNet40 Average Class Accuracy (%) | ModelNet40 Average Instance Accuracy (%)
MSSAViT-SV (SSSA-only) | 94.88 | 95.04 | 92.05 | 94.17
MSSAViT-SV (TSSA-only) | 96.79 | 97.25 | 93.45 | 94.69
MSSAViT (SSSA + TSSA) | 98.40 | 98.46 | 95.31 | 96.11
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
