Article

MeshNet-SP: A Semantic Urban 3D Mesh Segmentation Network with Sparse Prior

School of Geomatics Science and Technology, Nanjing Tech University, Nanjing 211800, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(22), 5324; https://doi.org/10.3390/rs15225324
Submission received: 8 September 2023 / Revised: 28 October 2023 / Accepted: 7 November 2023 / Published: 11 November 2023

Abstract

A textured urban 3D mesh is an important part of 3D real scene technology. Semantically segmenting an urban 3D mesh is a key task in the photogrammetry and remote sensing field. However, due to the irregular structure of a 3D mesh and redundant texture information, obtaining accurate and robust semantic segmentation results for an urban 3D mesh remains challenging. To address this issue, we propose a semantic urban 3D mesh segmentation network (MeshNet) with a sparse prior (SP), named MeshNet-SP. MeshNet-SP consists of a differentiable sparse coding (DSC) subnetwork and a semantic feature extraction (SFE) subnetwork. The DSC subnetwork learns low-intrinsic-dimensional features from raw texture information, which increases the effectiveness and robustness of semantic urban 3D mesh segmentation. The SFE subnetwork produces high-level semantic features from the combination of the geometric features of the mesh and the low-intrinsic-dimensional features of the texture information. The proposed method is evaluated on the SUM dataset. The results of the ablation experiments demonstrate that the low-intrinsic-dimensional features are the key to achieving accurate and robust semantic segmentation results. The comparison results show that the proposed method achieves competitive accuracies, with maximum increases of 34.5%, 35.4%, and 31.8% in mR, mF1, and mIoU, respectively.

1. Introduction

A textured urban 3D mesh, created mostly by dense image matching of oblique aerial images, is one of the final user products in the photogrammetry and remote sensing (PRS) community, and has been widely applied in city management [1], urban and rural planning [2], heritage protection [3], building damage assessment [4], estimation of the potential achievable solar energy of buildings [5], and so forth. Although 3D meshes have advantages in visualization over other 3D data (such as point clouds and voxels) [6], it is hard to use 3D meshes to conduct complex spatial analysis because they lack semantic information [7].
The semantic segmentation of 3D data (such as 3D point clouds and 3D meshes) is a central task in PRS. With the introduction of PointNet [8], a large number of deep learning methods for directly consuming unordered point clouds have emerged [9,10]. For example, in order to make a network provide high representativeness and remarkable robustness, Ma et al. [11] proposed an end-to-end feature extraction framework for 3D point-cloud segmentation that uses dynamic point-wise convolutional operations at multiple scales. Lai et al. [12] presented a stratified transformer for 3D point-cloud segmentation, which addressed the issue that existing methods failed to directly model long-range dependencies. Chibane et al. [13] provided a weakly supervised 3D semantic instance segmentation method (named Box2Mask) using bounding boxes. The core of Box2Mask is a deep model that directly votes for bounding box parameters, together with a clustering method specifically tailored to bounding box votes. However, there is comparatively limited research on the semantic segmentation of urban 3D meshes in the PRS field. The complexity and irregular geometric structure of 3D meshes make it challenging to perform convolution operations directly on them [6]. In the early days, a Markov random field-based random forest was used to perform semantic segmentation of 3D meshes, taking handcrafted features as inputs [14]. The handcrafted features included geometric features (elevation, planarity, and verticality) and photometric features (average color, standard deviation, and color distribution in the HSV color space). To the best of our knowledge, [14] was the first study that combined geometric and photometric features for the semantic segmentation of 3D meshes in the PRS community. With the development of deep learning technology, deep-learning-based methods for the semantic segmentation of 3D meshes have been proposed. According to the type of input data, these methods can be grouped into four classes: center of gravity point (CoGP)-based, voxel-based, mesh-based, and view-based methods. CoGP-based methods [6,15,16,17,18,19] use the center of gravity (CoG) per facet to denote the facet and generate CoG point clouds. Each CoG usually carries the geometric and photometric features of the corresponding facet. The CoG point clouds are then used as the input of a 1D CNN or a state-of-the-art point-cloud semantic segmentation network, such as PointNet++ [20], KPConv [21], etc. Voxel-based methods [22,23,24,25,26] convert the irregular 3D meshes into regular 3D grids, i.e., 3D voxels, and then apply 3D CNNs to these voxels. Different from CoGP-based and voxel-based methods, mesh-based methods [27,28,29,30,31,32,33,34,35] directly perform convolution on 3D meshes, using the topological information of the vertices/edges/facets within the meshes. In view-based methods [36,37,38], different virtual views of the 3D mesh are used to render multiple 2D channels for training an effective 2D semantic segmentation model; the per-view predictions generate features that are fused on the 3D mesh vertices to predict mesh semantic segmentation labels.
Among most of the methods mentioned above, texture information is one of the most important inputs and has a significant effect on improving the accuracy of semantic 3D mesh segmentation [7,14,16,17,39]. Although the above methods emphasize the importance of texture information, they do not take into account its sparse characteristics. Researchers have demonstrated that natural image data show a low-dimensional structure despite the high dimensionality of traditional pixel representations, and that the discriminative information of image data has sparse characteristics [40,41]. Moreover, it has been proven that deep networks can learn more easily from low-intrinsic-dimensional datasets, and that the learned models generalize better from training to test data [40]. However, the presence of noise in image data can increase its intrinsic dimensionality, which poses challenges for training deep networks and results in decreased performance. The added noise introduces additional variability and complexity, making it harder for the network to extract meaningful features and patterns, thereby hindering the learning process and negatively impacting the network’s performance.
Sparse coding is one of the methods for acquiring low-intrinsic-dimensional data from raw data [42,43]. The idea behind sparse coding is to find a concise and compact representation (i.e., a sparse representation) of the raw data by using a small number of basis functions or features. The goal is to capture the essential information and discard the redundant or irrelevant components, thereby reducing the dimensionality of the data. Thus, a sparse representation is a typically low-dimensional form of data that can effectively represent raw images regardless of the presence of noise. Classical sparse coding has been widely used in signal and image restoration/denoising tasks because of its ability to learn interpretable low-intrinsic-dimensional representations and its strong theoretical support [44]. Although deep-learning-based image restoration/denoising methods have surpassed the performance of classical sparse coding methods on modern image datasets (such as ImageNet and CIFAR), deep learning networks are still “black boxes” that are not clearly understood. Thus, research on integrating sparse coding with deep learning networks has attracted significant attention [45,46,47] due to their complementary advantages, and it is one of the prominent research directions for constructing end-to-end deep networks that integrate image denoising and high-level tasks in the computer vision (CV) field. However, to the best of our knowledge, end-to-end deep learning networks integrating a sparse prior for semantic 3D mesh segmentation have not yet been studied in the PRS field.
In this paper, a deep learning architecture is proposed for the semantic segmentation of urban 3D meshes. The contributions of our work can be summarized as follows.
(1)
Considering the importance of texture images for the semantic segmentation of urban 3D meshes, we propose a differentiable sparse coding (DSC) subnetwork to obtain low-intrinsic-dimensional features from texture images by using an unrolled optimization algorithm. Moreover, we propose a semantic feature extraction (SFE) subnetwork to extract high-level features that are used to predict a label for each facet.
(2)
We propose an end-to-end deep learning architecture (named MeshNet-SP) that integrates the DSC subnetwork and the SFE subnetwork to perform semantic segmentation of urban 3D meshes. An end-to-end training strategy is also proposed.
(3)
Comprehensive experiments are conducted to demonstrate that the proposed end-to-end deep learning architecture can achieve competitive results in semantic segmentation of urban 3D meshes despite the presence of noise in the texture images.

2. Method

In this section, we describe the proposed end-to-end architecture for semantic urban 3D mesh segmentation. We evaluate the proposed architecture in Section 3, together with ablated models with and without the differentiable sparse coding module. We assess the performance of the proposed model on SUM (a benchmark dataset of semantic urban meshes) [48], showing the competitive results of the proposed method.
The architecture of the proposed MeshNet-SP is shown in Figure 1; it jointly combines differentiable sparse coding (DSC) modules and semantic feature extraction (SFE) modules. The proposed MeshNet-SP takes both the geometrical information (such as the coordinates of the CoGs and the normals of the facets) and the texture information of an urban 3D mesh as input and outputs a label for each facet within the mesh. The DSC module exploits a sparse prior and a differentiable structure built from MLPs, which allows our model to extract low-intrinsic-dimensional features from raw texture images to improve semantic segmentation accuracy. The SFE module extracts high-level semantic features of the urban 3D mesh by constructing a graph structure from the mesh and performing convolution operations on the constructed graph.
We base the DSC module on an optimization algorithm  Λ  that solves the problem of getting sparse representations from raw texture images. Usually, the obtained sparse representations are low-intrinsic-dimensional data [42,43]. We express the joint representation and perception problem as a bi-level optimization problem
$$\min_{\mathcal{D},\,\omega} \; \mathcal{L}\big((\Lambda(y,\mathcal{D}),\, g),\, c,\, \omega\big) \quad \mathrm{s.t.} \quad \Lambda(y,\mathcal{D}) = \arg\min_{z} G(z, y, \mathcal{D}) \tag{1}$$
where $\Lambda$ minimizes a sparse representation problem $G$. The outputs of this DSC module are the sparse codes $\Lambda(y,\mathcal{D})$ of the raw texture information; these are concatenated with the geometrical information $g$ of the urban 3D mesh and fed into the subsequent SFE modules and the associated semantic segmentation loss $\mathcal{L}$, which computes the loss between the predicted labels and the ground-truth labels $c$. Here, the model parameters $\omega$ of semantic urban 3D mesh segmentation are absorbed into $\mathcal{L}$ as an argument.
For the nested objective $G$, we follow the sparse coding model as the architecture backbone. The sparse coding model aims to discover a latent sparse representation $z \in \mathbb{R}^{l}$ that can be utilized to reconstruct an input $y \in \mathbb{R}^{d}$ by using a learned decoder $\mathcal{D}$. We address the sparse representation problem by using an unrolled iterative optimization algorithm. To achieve this, we parameterize the encoder and decoder with unknown, learned parameters, and truncate the iterations to yield the operator $\Lambda$.
Any differentiable semantic 3D mesh segmentation method can be utilized in the proposed stack. In this paper, we develop a semantic 3D mesh segmentation network by stacking several semantic feature extraction modules. The semantic segmentation loss is a standard cross-entropy loss.

2.1. Differentiable Sparse Coding Module

2.1.1. Sparse Coding System with Variance Regularization

Usually, when both the sparse code z and the dictionary  D  are unknown, z and  D  can be obtained simultaneously by alternately performing the sparse coding algorithm and the dictionary learning algorithm.
Traditionally, sparse coding algorithms utilize an $\ell_1$ sparsity penalty and a linear decoder $\mathcal{D} \in \mathbb{R}^{d \times l}$ to conduct inference and identify a latent sparse representation $z \in \mathbb{R}^{l}$ of given texture information $y \in \mathbb{R}^{d}$. This representation is found by minimizing the energy function given below:
$$G(z, y, \mathcal{D}) = f(z) + g(z) = \frac{1}{2}\|y - \mathcal{D}z\|_2^2 + \lambda\|z\|_1 \tag{2}$$
In (2), the term $f(z) = \frac{1}{2}\|y - \mathcal{D}z\|_2^2$ represents the reconstruction error for the raw texture information obtained by using the sparse code $z$ and the dictionary $\mathcal{D}$. The term $g(z) = \lambda\|z\|_1$ is a regularization term that penalizes the sparse code $z$ using the $\ell_1$ norm, where $\lambda$ controls the sparsity level of the code: the larger $\lambda$ is, the sparser the code becomes. In essence, each sparse code $z$ selects a linear combination of the columns of the dictionary $\mathcal{D}$. Finding an optimal sparse code $z^*$ amounts to solving the optimization problem of (2), i.e.,
$$z^* = \arg\min_{z} G(z, y, \mathcal{D}) = \arg\min_{z} \big(f(z) + g(z)\big) \tag{3}$$
The dictionary learning algorithm uses a gradient-based optimization method to update the elements of the dictionary $\mathcal{D}$ by minimizing the MSE between the reconstructed texture information $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N\} \subset \mathbb{R}^{d}$ and the raw training texture information $Y = \{y_1, y_2, \ldots, y_N\} \subset \mathbb{R}^{d}$, using the sparse codes $Z^* = \{z_1^*, z_2^*, \ldots, z_N^*\} \subset \mathbb{R}^{l}$ obtained from the sparse coding algorithm, i.e., (4),
$$\mathcal{L}_D(\mathcal{D}, Z^*, Y) = \arg\min_{\mathcal{D}} \frac{1}{N}\sum_{i=1}^{N}\|y_i - \mathcal{D}z_i^*\|_2^2 \tag{4}$$
Typically, in order to prevent a collapse in the $\ell_1$ norm of the codes and successfully train a sparse coding system, it is necessary to regularize the dictionary $\mathcal{D}$ by bounding the Euclidean norms of its elements. However, it is hard to apply such normalization procedures to sparse coding systems in which the decoder is a non-linear multi-layer neural network. Similar to [49], we instead apply variance regularization to each latent code component to prevent a collapse in the $\ell_1$ norm of the codes. To this end, a regularization term $\gamma$ is added to the energy function in (2). The regularization term $\gamma$ ensures that the variance of every latent component across a mini-batch of codes stays above a pre-set threshold $T$. Thus, (2) is rewritten as
$$\tilde{G}(z, Y, \mathcal{D}) = \tilde{f}(z) + \tilde{g}(z) = \sum_{i=1}^{n}\frac{1}{2}\|y_i - \mathcal{D}z_{\cdot i}\|_2^2 + \sum_{j=1}^{l}\beta\big(T - \mathrm{var}(z_{j\cdot})\big)_+^2 + \sum_{i=1}^{n}\lambda\|z_{\cdot i}\|_1 \tag{5}$$
where
$\tilde{f}(z) = \sum_{i=1}^{n}\frac{1}{2}\|y_i - \mathcal{D}z_{\cdot i}\|_2^2 + \sum_{j=1}^{l}\beta\big(T - \mathrm{var}(z_{j\cdot})\big)_+^2$ and
$\tilde{g}(z) = \sum_{i=1}^{n}\lambda\|z_{\cdot i}\|_1$.
The first term of $\tilde{f}(z)$ is the sum of the reconstruction errors for the data samples $Y = \{y_1, \ldots, y_n\}$ based on each code $z_{\cdot i} \in \mathbb{R}^{l}$ within a mini-batch. The second term of $\tilde{f}(z)$ is the added regularization term $\gamma$ involving the variance of the latent component $z_{j\cdot} \in \mathbb{R}^{n}$ across the mini-batch. In the variance $\mathrm{var}(z_{j\cdot}) = \frac{1}{n}\sum_{i=1}^{n}(z_{ji} - u_j)^2$, $u_j$ is the mean of the $j$-th component across the mini-batch. In this paper, a fast iterative shrinkage threshold algorithm (FISTA) is applied to solve the optimization problem of (6). The details of FISTA can be found in [50].
$$\arg\min_{z}\tilde{G}(z, Y, \mathcal{D}) = \arg\min_{z}\big(\tilde{f}(z) + \tilde{g}(z)\big) \tag{6}$$
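The following is a minimal PyTorch sketch of the variance-regularized energy in (5), assuming column-major mini-batch tensors and a decoder callable; the tensor shapes and the default values of $\lambda$, $\beta$, and $T$ are illustrative assumptions rather than values from this paper.

```python
import torch

def variance_regularized_energy(Y, Z, decoder, lam=0.1, beta=1.0, T=1.0):
    """Energy of Eq. (5): reconstruction + hinged variance penalty + l1 sparsity.

    Y: (d, n) raw texture vectors, one column per sample in the mini-batch.
    Z: (l, n) latent codes, one column per sample.
    decoder: callable mapping (l, n) codes to (d, n) reconstructions.
    """
    recon_term = 0.5 * (Y - decoder(Z)).pow(2).sum()              # sum_i 1/2 ||y_i - D z_.i||_2^2
    var = Z.var(dim=1, unbiased=False)                            # var(z_j.) across the mini-batch
    var_term = beta * torch.clamp(T - var, min=0.0).pow(2).sum()  # sum_j beta (T - var(z_j.))_+^2
    l1_term = lam * Z.abs().sum()                                 # sum_i lambda ||z_.i||_1
    return recon_term + var_term + l1_term
```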

2.1.2. Architecture of Differentiable Sparse Coding Module

The architecture of the DSC module is shown in Figure 2. Given the texture information and a fixed decoder $\mathcal{D}$, the FISTA algorithm is applied to obtain a sparse code $z^*$ that best reconstructs the texture information using the elements of $\mathcal{D}$. The encoder $\varepsilon$ is trained to predict the sparse code $z^*$, which is the output of FISTA. Meanwhile, the decoder is trained by minimizing the mean square error (MSE) between the reconstructed texture information and the raw texture information, using the sparse code $z^*$. The details are described as follows.
Inspired by [51], the architecture of the encoder $\varepsilon$ is designed based on the unrolled FISTA shown in Algorithm 1. As shown in Figure 2, the encoder $\varepsilon$ has two multi-layer perceptron (MLP) layers $U \in \mathbb{R}^{d \times l}$ and $S \in \mathbb{R}^{l \times l}$, a bias term $b \in \mathbb{R}^{l}$, and non-linear ReLU functions. The encoder $\varepsilon$ is similar to a recurrent neural network and can be trained with mini-batch gradient descent to minimize the MSE between FISTA’s output $z^* \in \mathbb{R}^{l \times n}$ and the encoder’s output $z_\varepsilon$ computed from $Y = \{y_1, y_2, \ldots, y_N\} \subset \mathbb{R}^{d}$, i.e.,
$$\mathcal{L}_\varepsilon(\varepsilon, Z^*, Y) = \arg\min_{\varepsilon} \frac{1}{N}\sum_{i=1}^{N}\|\varepsilon(y_i) - z_{\cdot i}^*\|_2^2 \tag{7}$$
In this paper, a non-linear decoder $\mathcal{D}$, consisting of two MLP layers, a bias term following the first MLP layer, and a non-linear ReLU activation function (see Figure 2), serves as the dictionary in the sparse coding system. The first MLP layer maps the outputs of FISTA to hidden representations, and the second MLP layer maps the hidden representations to the reconstructed texture information. The non-linear decoder $\mathcal{D}$ is trained using gradient descent to minimize the MSE between the raw texture information and the reconstructed texture information (see (4)).
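A minimal sketch of such a two-layer non-linear decoder is given below; the hidden width is an assumption, as the text does not specify it, and whether the second layer carries its own bias is likewise assumed.

```python
import torch.nn as nn

class NonLinearDecoder(nn.Module):
    """Two MLP layers with a bias and ReLU after the first layer,
    acting as the learned dictionary (a sketch of the description above)."""
    def __init__(self, code_dim: int, hidden_dim: int, texture_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(code_dim, hidden_dim, bias=True)      # first MLP layer + bias term
        self.fc2 = nn.Linear(hidden_dim, texture_dim, bias=False)  # second MLP layer (bias assumed absent)
        self.act = nn.ReLU()

    def forward(self, z):
        # sparse code -> hidden representation -> reconstructed texture vector
        return self.fc2(self.act(self.fc1(z)))
```

Such a decoder would be trained by minimizing the MSE of (4) between its output and the raw texture information.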
Algorithm 1 Unrolled fast iterative shrinkage threshold algorithm.
Input: Texture information $y \in \mathbb{R}^{d}$; number of iterations $L$; parameters $U \in \mathbb{R}^{d \times l}$, $S \in \mathbb{R}^{l \times l}$, $b \in \mathbb{R}^{l}$.
Output: Sparse code $z_\varepsilon \in \mathbb{R}^{l}$.
1: $u = Uy + b$
2: $z_0 = \mathrm{ReLU}(u)$
3: for each $i \in [1, L]$ do
4:   $z_i = \mathrm{ReLU}(u + S z_{i-1})$
5: end for
6: return $z_\varepsilon = z_L$
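A direct PyTorch transcription of Algorithm 1 might look as follows; the default number of unrolled iterations is an illustrative assumption, since the text does not state the value of $L$.

```python
import torch
import torch.nn as nn

class UnrolledFISTAEncoder(nn.Module):
    """Encoder of Algorithm 1: MLP layers U and S, a bias b, and L unrolled
    ReLU iterations (a sketch; L = 3 is an illustrative choice)."""
    def __init__(self, texture_dim: int, code_dim: int, num_iters: int = 3):
        super().__init__()
        self.U = nn.Linear(texture_dim, code_dim, bias=False)  # MLP layer U
        self.S = nn.Linear(code_dim, code_dim, bias=False)     # MLP layer S
        self.b = nn.Parameter(torch.zeros(code_dim))           # bias term b
        self.num_iters = num_iters                             # number of iterations L

    def forward(self, y):
        u = self.U(y) + self.b            # line 1: u = U y + b
        z = torch.relu(u)                 # line 2: z_0 = ReLU(u)
        for _ in range(self.num_iters):   # lines 3-5: z_i = ReLU(u + S z_{i-1})
            z = torch.relu(u + self.S(z))
        return z                          # line 6: z_eps = z_L
```

During training, its output would be regressed onto the FISTA codes with the MSE loss of (7), e.g., `torch.nn.functional.mse_loss(encoder(y), z_star)`.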

2.2. Semantic Feature Extraction Module

The purpose of the cascade of the prior DSC module in our architecture is to obtain low-intrinsic-dimensional data, which are well-identifiable features for semantic urban 3D mesh segmentation, from raw texture images. The prior steps from Algorithm 1 must be flexible enough to learn the low-intrinsic-dimensional data from raw texture images but also project onto a subset according to the semantic urban 3D mesh segmentation loss $\mathcal{L}_s$. The obtained low-intrinsic-dimensional data (i.e., the sparse codes of the raw texture images) are concatenated with the geometrical information $g$ of the urban 3D mesh and fed into the subsequent SFE modules to obtain high-level semantic features. In this paper, we construct the SFE module using the CoGP-based method. The architecture of the SFE module is shown in Figure 3. Every facet within an urban 3D mesh is represented by a CoGP with features consisting of the corresponding sparse codes of the raw texture images, the coordinates of the CoGP, and the normal vector of the facet. In order to learn the local features of the urban 3D mesh, we apply the knn method to build a directed graph per CoGP, and propose an edge convolution method performed on the directed graphs. Finally, the outputs of the edge convolution are aggregated using a symmetric function. The details are described as follows.
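A minimal sketch of the knn graph construction over the CoGPs is given below; the use of coordinate-space Euclidean distances is an assumption, and each point is excluded from its own neighborhood.

```python
import torch

def knn_graph(coords: torch.Tensor, k: int) -> torch.Tensor:
    """Return (n, k) indices of the k nearest CoGPs for each of the n CoGPs.

    coords: (n, 3) CoGP coordinates of a mini-batch patch.
    """
    dists = torch.cdist(coords, coords)                    # (n, n) pairwise Euclidean distances
    knn = dists.topk(k + 1, dim=1, largest=False).indices  # k+1 smallest, including the point itself
    return knn[:, 1:]                                      # drop the self index (distance 0)
```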

Edge Convolution

Let $G = (\upsilon, \varsigma)$ be a directed graph constructed from the $i$-th CoGP $p_i$ and its $k$ neighboring CoGPs within a mini-batch of $n$ CoGPs, where $\upsilon$ and $\varsigma$ are the sets of vertices and edges of the directed graph $G$, respectively. The edge features between the $i$-th CoGP $p_i$ and its neighbors can be learned by a shared MLP. The edge convolution operation can be described mathematically as
$$\epsilon_{ij} = \mathrm{ReLU}\big(\varphi \cdot (f_j - f_i) + \psi \cdot f_i\big), \quad j = 1, \ldots, k \tag{8}$$
where $\epsilon_{ij}$ is the learned edge feature between the $i$-th CoGP $p_i$ and its $j$-th neighbor; $f_i$ and $f_j$ are the feature vectors of the $i$-th CoGP $p_i$ and the $j$-th neighbor, respectively; $\varphi = (\varphi_1, \ldots, \varphi_{m_o})$ and $\psi = (\psi_1, \ldots, \psi_{m_o})$ are learnable parameters; and $m_o$ is the output dimension of the shared MLP. In order to aggregate the $k$ edge features between the $i$-th CoGP $p_i$ and its $k$ neighboring CoGPs, we use a symmetric function, i.e., the maximum function, to perform the pooling operation. The pooling operation can be described mathematically as
$$f_i = \max_{j=1,\ldots,k}\big(\epsilon_{ij}\big) \tag{9}$$
where $f_i$ is the higher-level feature for the $i$-th CoGP.
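As a concrete illustration, a minimal PyTorch sketch of the edge convolution of (8) and the max aggregation of (9) follows; realizing the shared MLP with two linear layers for $\varphi$ and $\psi$ is one plausible choice, not a detail stated in the text.

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Edge convolution with max pooling over the k neighbors (a sketch of Eqs. (8)-(9))."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.phi = nn.Linear(in_dim, out_dim, bias=False)  # phi acts on (f_j - f_i)
        self.psi = nn.Linear(in_dim, out_dim, bias=False)  # psi acts on f_i

    def forward(self, feats: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
        # feats: (n, c) per-CoGP features; knn_idx: (n, k) neighbor indices
        neighbors = feats[knn_idx]                    # (n, k, c) features f_j of the k neighbors
        center = feats.unsqueeze(1)                   # (n, 1, c) feature f_i of the center CoGP
        edges = torch.relu(self.phi(neighbors - center) + self.psi(center))  # eps_ij, Eq. (8)
        return edges.max(dim=1).values                # symmetric max aggregation, Eq. (9)
```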
As shown in Figure 1, the proposed architecture contains three SFE modules. Their outputs are concatenated via skip connections, which helps alleviate the vanishing gradient problem. After the output of the last SFE module is obtained, the outputs of the three SFE modules are concatenated and fed into a SoftMax layer to predict the label of each facet. The cross-entropy loss is used as the prediction loss, and the objective is to minimize it, i.e.,
$$\mathcal{L}_C = \arg\min\Big(-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} s_{ic}\log(p_{ic})\Big) \tag{10}$$
where $N$ and $M$ are the numbers of samples in a mini-batch and of classes, respectively; $s_{ic}$ is an indicator that equals 1 when the true label of the $i$-th sample is class $c$ and 0 otherwise; and $p_{ic}$ is the predicted probability of the $i$-th sample belonging to class $c$.
The main term of the MeshNet-SP loss function is $\mathcal{L}_C$ defined in (10), which is complemented by the dictionary learning term $\mathcal{L}_D$ in (4) and the sparse coding term $\mathcal{L}_\varepsilon$ in (7). The complete loss function can be expressed as
$$\mathcal{L} = \mathcal{L}_C + \mathcal{L}_D + \mathcal{L}_\varepsilon \tag{11}$$
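A minimal sketch of this composite loss is shown below; the equal weighting of the three terms mirrors (11), and any additional balancing weights are not specified in the text.

```python
import torch.nn.functional as F

def meshnet_sp_loss(logits, labels, recon, textures, z_enc, z_fista):
    """Total loss of Eq. (11) = cross-entropy (10) + dictionary MSE (4) + encoder MSE (7)."""
    loss_c = F.cross_entropy(logits, labels)  # semantic segmentation term L_C
    loss_d = F.mse_loss(recon, textures)      # dictionary learning term L_D
    loss_e = F.mse_loss(z_enc, z_fista)       # sparse coding term L_eps
    return loss_c + loss_d + loss_e
```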

2.3. End-to-End Training Strategy

We propose a training strategy to speed up the training of MeshNet-SP. This strategy trains the DSC subnetwork and the SFE subnetwork sequentially and then fine-tunes the whole network end to end. After the sequential training, (1) is trained for a certain number of epochs with the parameters initialized to the values $\vartheta_1$ and $\omega_1$ obtained from the sequential stage. The gradient backpropagation from the SFE subnetwork to the DSC subnetwork follows the chain rule of derivatives, which allows the DSC subnetwork to output sparse codes that are useful for the semantic urban 3D mesh segmentation task. The proposed training strategy is more efficient than directly solving (1) because of the heavy computational load of the DSC subnetwork and the relatively slow convergence of the SFE subnetwork.
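Schematically, and under the epoch counts reported in Section 3.4, the schedule could be organized as below; the optimizer choice and the per-epoch routines `run_dsc_epoch`, `run_sfe_epoch`, and `run_joint_epoch` are hypothetical placeholders standing in for the actual training loops.

```python
import torch

# dsc, sfe: the two subnetworks (nn.Module); the three run_*_epoch callables are
# hypothetical stand-ins for one epoch of the corresponding training loop.
stages = [
    (50,  list(dsc.parameters()),                          run_dsc_epoch),    # 1) DSC subnetwork alone
    (500, list(sfe.parameters()),                          run_sfe_epoch),    # 2) SFE subnetwork on the DSC codes
    (50,  list(dsc.parameters()) + list(sfe.parameters()), run_joint_epoch),  # 3) end-to-end fine-tuning of (1)
]
for num_epochs, params, run_epoch in stages:
    optimizer = torch.optim.Adam(params, lr=1e-3)  # optimizer type assumed; lr from Section 3.4
    for _ in range(num_epochs):
        run_epoch(optimizer)  # in stage 3, gradients flow from the SFE back to the DSC via the chain rule
```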

3. Experiments

3.1. Dataset

The proposed MeshNet-SP is evaluated on the semantic urban meshes (SUM) benchmark dataset [48]. The SUM dataset covers an area of approximately 4 square kilometers in Helsinki, Finland, and consists of six classes: ground, vegetation, building, water, car, and boat. The textured mesh data were derived from oblique aerial images with a ground sample distance of 7.5 cm using ContextCapture software. The SUM dataset comprises 64 tiles. We randomly selected 12 tiles for training and another 12 tiles for testing from the SUM dataset provided by [48], taking into account the computer’s processing capacity. Figure 4 shows the spatial distribution of the selected training dataset (yellow areas) and testing dataset (blue areas). The distribution of each class in the training and testing datasets is illustrated in Figure 5, with a total of 7,479,164 faces (3,787,315 for training and 3,691,849 for testing). Figure 5 presents detailed statistics of the class frequencies in both datasets: ground (17.17%, 17.02%), vegetation (22.19%, 23.57%), building (54.21%, 56.63%), water (1.17%, 0.31%), car (2.95%, 2.22%), and boat (2.32%, 0.26%). Figure 5 also shows a significant imbalance among the classes in terms of their relative numbers of facets: the water class accounts for only around one percent or less of the facets in both datasets, while buildings occupy over half of all facets. This imbalance poses a great challenge to the semantic segmentation of textured meshes.

3.2. Experimental Design

We evaluate the proposed MeshNet-SP using three categories of experimental configurations. Table 1 shows the details of the different experiments. First, we evaluate our joint DSC and SFE subnetworks on the semantic segmentation of urban 3D meshes; the ablation studies show the importance of the low-intrinsic-dimensional data obtained by the proposed DSC subnetwork for improving the accuracy of semantic urban 3D mesh segmentation. Second, we assess the robustness of the proposed MeshNet-SP for semantic urban 3D mesh segmentation under varying levels of texture image noise. Third, we analyze the influence of the sparse code’s dimension on the accuracy of semantic urban 3D mesh segmentation.

3.3. Evaluation Metrics

In this paper, similar to [19,48], the overall accuracy (OA), Kappa coefficient (Kappa), mean precision (mP), mean recall (mR), mean F1 score (mF1), and mean intersection over union (mIoU) are used to quantitatively evaluate the proposed method. Considering that the triangle facets within a mesh have different sizes, we calculate these evaluation indices from the area of the triangle facets instead of their number. Mathematically, these evaluation indices can be expressed as
$$OA = \frac{\sum_{i=1}^{c} TP_i}{S}$$
$$Kappa = \frac{OA - p_e}{1 - p_e}, \qquad p_e = \frac{\sum_{i=1}^{c}(TP_i + FN_i)\times(TP_i + FP_i)}{S \times S}$$
$$mP = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FP_i}$$
$$mR = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FN_i}$$
$$mF1 = \frac{1}{c}\sum_{i=1}^{c}\frac{2 P_i R_i}{P_i + R_i}$$
$$mIoU = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FP_i + FN_i}$$
where $c$ is the number of classes; $S$ is the total area of the facets within the mesh; $TP_i$, $FP_i$, and $FN_i$ are the areas of true positives, false positives, and false negatives for the $i$-th class, respectively; and $P_i$ and $R_i$ are the precision and recall for the $i$-th class, respectively.
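An area-weighted per-class IoU, for example, can be computed as in the sketch below; how classes absent from both prediction and ground truth are handled (returned as NaN here) is an assumption.

```python
import numpy as np

def area_weighted_iou(pred, gt, area, num_classes):
    """Per-class IoU where TP/FP/FN are accumulated by facet area, not facet count.

    pred, gt: (n,) predicted and ground-truth labels per facet; area: (n,) facet areas.
    """
    ious = []
    for c in range(num_classes):
        tp = area[(pred == c) & (gt == c)].sum()  # area of true positives for class c
        fp = area[(pred == c) & (gt != c)].sum()  # area of false positives
        fn = area[(pred != c) & (gt == c)].sum()  # area of false negatives
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else np.nan)
    return np.array(ious)  # mIoU = np.nanmean(area_weighted_iou(...))
```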

3.4. Implementation Details

The proposed MeshNet-SP is implemented in PyTorch [52] on a 64-bit Windows 10 operating system. MeshNet-SP is trained and tested on a machine equipped with an NVIDIA Quadro RTX 4000 GPU with 8 GB of memory and two 16-core Intel(R) Xeon(R) Gold 5218 CPUs with a 2.3 GHz base frequency and 128 GB of RAM.
During the training and testing stages, the dropout ratio and learning rate are empirically set to 0.6 and 0.001, respectively. Considering the hardware processing capability, the training and testing datasets are split into small patches, and the batch size is set to six. In the training phase, the first 50 epochs are used to train the DSC subnetwork, followed by 500 epochs to train the SFE subnetwork, and the last 50 epochs are used for end-to-end fine-tuning. The dimension of the sparse code is set to 128. During training, the mean utilization and memory usage of the CPU and GPU are (9.0%, 47.4%) and (87.9%, 61.1%), respectively, and the maximum utilization and memory usage are (100.0%, 78.4%) and (94.0%, 63.3%), respectively. It should be noted that the results of each experimental configuration are based on a single run.

3.5. Results and Analysis

Table 2 summarizes the results of the ablation experiments with various configurations. From Table 2, we can see that, in each category of experiments, the proposed MeshNet-SP obtains the highest accuracy in all metrics, including the IoU per class, OA, Kappa, mP, mR, mF1, and mIoU. These results demonstrate the effectiveness and robustness of the proposed MeshNet-SP. In particular, the proposed MeshNet-SP maintains relatively high IoU for the difficult classes (car and boat) by extracting low-intrinsic-dimensional features. Specifically, we describe our findings from these ablation experiments below.
Compared to raw image data, the low-intrinsic-dimensional data are more useful for improving the precision of semantic 3D mesh segmentation. The proposed method, which incorporates the DSC subnetwork, is compared to the method without such a subnetwork; see the first and second rows of Table 1. Exps.1 and Exps.2 take both the geometric information (coordinates of CoGs and normals of facets) and the texture information of the urban 3D mesh as input. In Exps.1, the texture information is processed by the DSC subnetwork to produce low-intrinsic-dimensional data, which are then concatenated with the geometric information and used to train the SFE subnetwork. In Exps.2, the SFE subnetwork is trained directly on the raw texture information along with the geometric information. We observe that Exps.1 converges faster than Exps.2, which is consistent with the conclusion that deep networks learn more easily from low-intrinsic-dimensional datasets [40]. We also note that our joint network (i.e., Exps.1) obtains higher OA, Kappa, mP, mR, mF1, and mIoU than Exps.2, with increases of 1.45%∼3.36%. Figure 6 and Figure 7 show the partial visualization results and the normalized confusion matrices of Exps.1 and Exps.2, respectively. From Figure 6 and Figure 7, it can be seen that Exps.2 produces more prediction errors than Exps.1, especially in the water and car classes, whose recall decreases by 11% and 5%, respectively, compared to Exps.1. Moreover, according to the normalized confusion matrices in Figure 7, the water and car classes are mainly misclassified as the ground class, while the boat class is mainly misclassified as the building class. One reason may be that the geometric features of water are similar to those of the ground, i.e., both are flat over a local area. Another main reason is that the number of samples in the water, car, and boat classes is small, so their discriminative features are hard for a deep-learning-based method to learn effectively. To some extent, the proposed MeshNet-SP reduces the misclassification of these classes. These results validate that the low-intrinsic-dimensional data produced by the DSC subnetwork can efficiently improve the accuracy of semantic urban 3D mesh segmentation.
The proposed MeshNet-SP is more robust for semantic urban 3D mesh segmentation under varying levels of texture image noise. In order to evaluate the robustness of the proposed MeshNet-SP, four experiments (i.e., Exps.3∼Exps.6) are set up. In Exps.3 and Exps.4, the DSC subnetwork is applied to obtain low-intrinsic-dimensional data, while Exps.5 and Exps.6 are without such a subnetwork. The experiments involve adding varying levels of Gaussian noise, namely 1× standard deviation (1 σ ) and 2× standard deviation (2 σ ), to the raw texture images. The results in Table 2 and Figure 8 and Figure 9 validate the effectiveness of our proposed MeshNet-SP in obtaining low-intrinsic-dimensional data from raw/noisy texture images to improve the accuracy of semantic 3D mesh segmentation. We can see in Table 2 and Figure 8 that while the accuracy of the models without the DSC subnetwork drastically decreases over the 0 σ ∼2 σ  noise levels, the accuracy of our proposed MeshNet-SP remains stable. Specifically, from 0 σ  to 2 σ , the accuracy of the models without the DSC subnetwork decreases by ∼30%, and the Kappa coefficient decreases by ∼41%. The primary factor behind this decline is that noise in the texture image data increases its intrinsic dimensionality, whereas the proposed MeshNet-SP can still effectively obtain low-intrinsic-dimensional data through the DSC subnetwork. From the partial visualization results (see Figure 10) and the normalized confusion matrices (see Figure 9), we can see that as the noise level increases, the number of misclassifications in the water, car, and boat classes increases dramatically in Exps.5 and Exps.6 due to the absence of the DSC subnetwork. Meanwhile, in Exps.3 and Exps.4, which use the DSC subnetwork, the increase in misclassifications of the water, car, and boat classes remains relatively small as the noise level increases. These results validate that the low-intrinsic-dimensional data produced by the DSC subnetwork make MeshNet-SP more robust.
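One plausible reading of the 1 σ /2 σ  noise levels is sketched below, with the noise standard deviation taken as a multiple of the image’s own standard deviation; the exact noise recipe is not spelled out in the text, so this is an illustrative assumption.

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, level: float) -> np.ndarray:
    """Corrupt a texture image with zero-mean Gaussian noise whose standard deviation
    is `level` times the image's own standard deviation (level = 1.0 or 2.0)."""
    sigma = level * image.std()
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(image.dtype)
```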
Increasing the sparse code’s dimension does not always improve the performance of the proposed MeshNet-SP for semantic 3D mesh segmentation. In order to analyze the influence of the sparse code’s dimension on the accuracy of semantic urban 3D mesh segmentation, Exps.3, Exps.7, and Exps.8 are conducted, in which the dimension of the sparse code is 128, 256, and 512, respectively. From Table 2, we observe that the performance of the proposed MeshNet-SP decreases as the dimension of the sparse code increases. From the partial visualization results (see Figure 11) and the normalized confusion matrices (see Figure 12), we observe that the number of facets in the difficult classes (water, car, and boat) misclassified as ground also increases with the dimension of the sparse code. The main reason for this phenomenon is that the intrinsic dimensionality of a higher-dimensional sparse code may be greater than that of a lower-dimensional one. These results again validate that low-intrinsic-dimensional data can improve the performance of deep learning networks.

3.6. Comparison with Other Competing Methods

To evaluate the performance of the proposed MeshNet-SP, we compare it with seven current state-of-the-art 3D semantic segmentation methods that can process large-scale urban datasets, i.e., Wilk et al. [19], Gao et al. [48], Hu et al. [53], Thomas et al. [21], Landrieu et al. [54], Qi et al. 2017b [20], and Qi et al. 2017a [8]. Specifically, Wilk et al. [19] proposed a hybrid method for semantic urban mesh segmentation. The main idea of the hybrid method was to semantically segment the point clouds sampled from the urban mesh and the oblique images, using a fully convolutional neural network [55] and a pyramid scene parsing network (PSP-Net) [56], respectively, and then map the acquired labels of the point clouds and oblique images onto the mesh. Gao et al. [48] released the urban 3D mesh data used in this paper and proposed a pipeline for semantic 3D mesh segmentation. In the pipeline, they first used region growing to group triangle facets into homogeneous regions, and then extracted 11 types of geometric and radiometric features from those mesh segments. Finally, these geometric and radiometric features were concatenated into a 44-dimensional feature vector, which was used by a random forest (RF) classifier. The remaining compared methods (RandLA-Net [53], KPConv [21], SPG [54], PointNet++ [20], and PointNet [8]) are point-based approaches, in which points are sampled from the facets to generate point clouds.
The results of the comparison with the other competing methods are presented in Table 3, which clearly demonstrates the superior performance of the proposed method over all competing methods. Specifically, the MeshNet-SP method surpasses the method proposed by Gao et al. [48] by a significant margin of 10.0% in mR, 6.2% in mF1, and 2.5% in mIoU. Furthermore, it outperforms the remaining six methods with improvements ranging from 5.5% to 34.5% in mR, from 2.3% to 35.4% in mF1, and from −0.3% to 31.8% in mIoU. Moreover, from Table 3, we can see that boat and car are difficult classes for all methods because of their small sample sizes. In particular, the methods of Qi et al. (2017a) [8] and Qi et al. (2017b) [20] almost completely fail to recognize these two classes. Meanwhile, the proposed MeshNet-SP achieves the highest IoU for the boat and car classes.

4. Conclusions

The semantic segmentation of urban 3D meshes is a key task in the photogrammetry and remote sensing field. Urban 3D meshes are usually irregular, which makes it hard to use traditional convolutional networks to perform semantic segmentation, and they have abundant texture information, which is useful for improving the accuracy of semantic segmentation. Robustly obtaining high-accuracy semantic segmentation results for urban 3D meshes has therefore become a challenging issue. In this paper, we propose MeshNet-SP, consisting of a differentiable sparse coding (DSC) subnetwork and a semantic feature extraction (SFE) subnetwork. The DSC subnetwork is used to produce low-intrinsic-dimensional data from raw texture images; previous studies have shown that low-intrinsic-dimensional data help deep learning networks train more easily and achieve higher accuracy. The SFE subnetwork constructs a graph over the center of gravity points of the facets using the k nearest neighbors (knn) method and performs edge convolution on that graph to obtain high-level features used to predict a label per facet. The effectiveness and robustness of the proposed MeshNet-SP have been evaluated through ablation experiments. From the results of the ablation experiments, we observe that the low-intrinsic-dimensional data (sparse codes) produced by the DSC subnetwork are key to obtaining accurate and robust semantic segmentation results, but the accuracy is not proportional to the dimension of the sparse codes. Moreover, compared to other competing methods, the proposed MeshNet-SP achieves higher accuracy for semantic urban 3D mesh segmentation. Specifically, the maximum increase reaches 34.5%, 35.4%, and 31.8% in mR, mF1, and mIoU, respectively.

Author Contributions

Conceptualization, G.Z. and R.Z.; methodology, G.Z. and R.Z.; validation, R.Z.; writing—original draft preparation, R.Z.; writing—review and editing, G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Jiangsu Province (Grant Number BK20230338); National Natural Science Foundation of China (Grant Number 41601365) and the Key Laboratory Independent Research Foundation (Grant Number 2022-ZZKY-JJ-05-02).

Data Availability Statement

The public urban 3D mesh dataset released by Gao et al. [48] can be downloaded at https://3d.bk.tudelft.nl/projects/meshannotation/, (accessed on 5 October 2021).

Acknowledgments

We would like to thank the National Natural Science Foundation of China for providing the funding for this project. The authors appreciate Gao et al. [48] for releasing the public urban 3D mesh dataset.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Skondras, A.; Karachaliou, E.; Tavantzis, I.; Tokas, N.; Valari, E.; Skalidi, I.; Bouvet, G.A.; Stylianidis, E. UAV Mapping and 3D Modeling as a Tool for Promotion and Management of the Urban Space. Drones 2022, 6, 115. [Google Scholar] [CrossRef]
  2. Chen, Y.; Feng, M. Urban form simulation in 3D based on cellular automata and building objects generation. Build. Environ. 2022, 226, 109727. [Google Scholar] [CrossRef]
  3. Gong, Y.; Zhang, F.; Jia, X.; Huang, X.; Li, D.; Mao, Z. Deep Neural Networks for Quantitative Damage Evaluation of Building Losses Using Aerial Oblique Images: Case Study on the Great Wall (China). Remote Sens. 2021, 13, 1321. [Google Scholar] [CrossRef]
  4. Hong, Z.; Yang, Y.; Liu, J.; Jiang, S.; Pan, H.; Zhou, R.; Zhang, Y.; Han, Y.; Wang, J.; Yang, S.; et al. Enhancing 3D reconstruction model by deep learning and its application in building damage assessment after earthquake. Appl. Sci. 2022, 12, 9790. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Dai, Z.; Wang, W.; Li, X.; Chen, S.; Chen, L. Estimation of the Potential Achievable Solar Energy of the Buildings Using Photogrammetric Mesh Models. Remote Sens. 2021, 13, 2484. [Google Scholar] [CrossRef]
  6. Grzeczkowicz, G.; Vallet, B. Semantic Segmentation of Urban Textured Meshes Through Point Sampling. arXiv 2023, arXiv:2302.10635. [Google Scholar] [CrossRef]
  7. Lehner, H.; Dorffner, L. Digital geoTwin Vienna: Towards a digital twin city as Geodata Hub. J. Photogramm. Remote Sens. Geoinf. Sci. 2020, 88, 63–75. [Google Scholar] [CrossRef]
  8. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  9. Griffiths, D.; Boehm, J. A review on deep learning techniques for 3D sensed data classification. Remote Sens. 2019, 11, 1499. [Google Scholar] [CrossRef]
  10. Xie, Y.; Tian, J.; Zhu, X.X. Linking points with labels in 3D: A review of point cloud semantic segmentation. IEEE Geosci. Remote Sens. Mag. 2020, 8, 38–59. [Google Scholar] [CrossRef]
  11. Ma, L.; Li, Y.; Li, J.; Tan, W.; Yu, Y.; Chapman, M.A. Multi-scale point-wise convolutional neural networks for 3D object segmentation from LiDAR point clouds in large-scale environments. IEEE Trans. Intell. Transp. Syst. 2019, 22, 821–836. [Google Scholar] [CrossRef]
  12. Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; Jia, J. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 8500–8509. [Google Scholar]
  13. Chibane, J.; Engelmann, F.; Anh Tran, T.; Pons-Moll, G. Box2mask: Weakly supervised 3d semantic instance segmentation using bounding boxes. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October; Springer: Berlin/Heidelberg, Germany, 2022; pp. 681–699. [Google Scholar]
  14. Rouhani, M.; Lafarge, F.; Alliez, P. Semantic segmentation of 3D textured meshes for urban scene analysis. ISPRS J. Photogramm. Remote Sens. 2017, 123, 124–139. [Google Scholar] [CrossRef]
  15. Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; Bronstein, M.M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5115–5124. [Google Scholar]
  16. Tutzauer, P.; Laupheimer, D.; Haala, N. Semantic Urban Mesh Enhancement Utilizing a Hybrid Model. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, 4, 175–182. [Google Scholar] [CrossRef]
  17. Laupheimer, D.; Eddin, M.S.; Haala, N. The Importance of Radiometric Feature Quality for Semantic Mesh Segmentation. Wiss.-Tech. Jahrestag. Dgpf 2020, 29, 205–218. [Google Scholar]
  18. Laupheime, D.; Eddin, M.S.; Haala, N. On the association of LiDAR point clouds and textured meshes for multi-modal semantic segmentation. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 2, 509–516. [Google Scholar] [CrossRef]
  19. Wilk, Ł.; Mielczarek, D.; Ostrowski, W.; Dominik, W.; Krawczyk, J. Semantic urban mesh segmentation based on aerial oblique images and point clouds using deep learning. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 43, 485–491. [Google Scholar] [CrossRef]
  20. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  21. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  22. Riegler, G.; Osman Ulusoy, A.; Geiger, A. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3577–3586. [Google Scholar]
  23. Wang, P.S.; Liu, Y.; Guo, Y.X.; Sun, C.Y.; Tong, X. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Trans. Graph. 2017, 36, 1–11. [Google Scholar] [CrossRef]
  24. Wang, Z.; Lu, F. Voxsegnet: Volumetric cnns for semantic part segmentation of 3d shapes. IEEE Trans. Vis. Comput. Graph. 2019, 26, 2919–2930. [Google Scholar] [CrossRef] [PubMed]
  25. Hu, Z.; Bai, X.; Shang, J.; Zhang, R.; Dong, J.; Wang, X.; Sun, G.; Fu, H.; Tai, C.L. Voxel-mesh network for geodesic-aware 3D semantic segmentation of indoor scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 2022, 1–12. [Google Scholar] [CrossRef] [PubMed]
  26. Liu, Y.; Long, W.; Shu, Z.; Yi, S.; Xin, S. Voxel-Based 3D Shape Segmentation Using Deep Volumetric Convolutional Neural Networks. In Proceedings of the Advances in Computer Graphics: 39th Computer Graphics International Conference, CGI 2022, Virtual Event, 12–16 September 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 489–500. [Google Scholar]
  27. Yi, L.; Su, H.; Guo, X.; Guibas, L.J. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2282–2290. [Google Scholar]
  28. Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; Cohen-Or, D. Meshcnn: A network with an edge. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  29. Singh, V.V.; Sheshappanavar, S.V.; Kambhamettu, C. MeshNet++: A Network with a Face. In Proceedings of the ACM Multimedia, Virtual Event, 20–24 October 2021; pp. 4883–4891. [Google Scholar]
  30. Dong, Q.; Wang, Z.; Li, M.; Gao, J.; Chen, S.; Shu, Z.; Xin, S.; Tu, C.; Wang, W. Laplacian2mesh: Laplacian-based mesh understanding. IEEE Trans. Vis. Comput. Graph. 2023, 2023, 1–13. [Google Scholar] [CrossRef] [PubMed]
  31. Masci, J.; Boscaini, D.; Bronstein, M.; Vandergheynst, P. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Washington, DC, USA, 7–13 December 2015; pp. 37–45. [Google Scholar]
  32. Lahav, A.; Tal, A. Meshwalker: Deep mesh understanding by random walks. ACM Trans. Graph. 2020, 39, 1–13. [Google Scholar] [CrossRef]
  33. Lei, H.; Akhtar, N.; Mian, A. Picasso: A cuda-based library for deep learning over 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 13854–13864. [Google Scholar]
  34. Hu, S.M.; Liu, Z.N.; Guo, M.H.; Cai, J.X.; Huang, J.; Mu, T.J.; Martin, R.R. Subdivision-based mesh convolution networks. ACM Trans. Graph. 2022, 41, 1–16. [Google Scholar] [CrossRef]
  35. Knott, M.; Groenendijk, R. Towards Mesh-Based Deep Learning for Semantic Segmentation in Photogrammetry. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, 2, 59–66. [Google Scholar] [CrossRef]
  36. Kundu, A.; Yin, X.; Fathi, A.; Ross, D.; Brewington, B.; Funkhouser, T.; Pantofaru, C. Virtual multi-view fusion for 3d semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 518–535. [Google Scholar]
  37. Lawin, F.J.; Danelljan, M.; Tosteberg, P.; Bhat, G.; Khan, F.S.; Felsberg, M. Deep projective 3D semantic segmentation. In Proceedings of the Computer Analysis of Images and Patterns: 17th International Conference, CAIP 2017, Ystad, Sweden, 22–24 August 2017; Part I 17. Springer: Berlin/Heidelberg, Germany, 2017; pp. 95–107. [Google Scholar]
  38. Jaritz, M.; Gu, J.; Su, H. Multi-view pointnet for 3d scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  39. Zhang, R.; Zhang, G.; Yin, J.; Jia, X.; Mian, A. Mesh-based DGCNN: Semantic Segmentation of Textured 3D Urban Scenes. IEEE Trans. Geosci. Remote Sens. 2023. [Google Scholar] [CrossRef]
  40. Pope, P.; Zhu, C.; Abdelkader, A.; Goldblum, M.; Goldstein, T. The Intrinsic Dimension of Images and Its Impact on Learning. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021; pp. 1–17. [Google Scholar]
  41. Chen, Y.; Paiton, D.; Olshausen, B. The sparse manifold transform. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  42. Elad, M.; Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 2006, 15, 3736–3745. [Google Scholar] [CrossRef]
  43. Tropp, J.A. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inf. Theory 2006, 52, 1030–1051. [Google Scholar] [CrossRef]
  44. Mairal, J.; Bach, F.; Ponce, J. Sparse modeling for image and vision processing. Found. Trends® Comput. Graph. Vis. 2014, 8, 85–283. [Google Scholar] [CrossRef]
  45. Sun, X.; Nasrabadi, N.M.; Tran, T.D. Supervised deep sparse coding networks for image classification. IEEE Trans. Image Process. 2019, 29, 405–418. [Google Scholar] [CrossRef]
  46. Li, M.; Zhai, P.; Tong, S.; Gao, X.; Huang, S.L.; Zhu, Z.; You, C.; Ma, Y. Revisiting sparse convolutional model for visual recognition. Adv. Neural Inf. Process. Syst. 2022, 35, 10492–10504. [Google Scholar]
  47. Evtimova, K.; LeCun, Y. Sparse coding with multi-layer decoders using variance regularization. arXiv 2021, arXiv:2112.09214. [Google Scholar]
  48. Gao, W.; Nan, L.; Boom, B.; Ledoux, H. SUM: A benchmark dataset of semantic urban meshes. ISPRS J. Photogramm. Remote Sens. 2021, 179, 108–120. [Google Scholar] [CrossRef]
  49. Bardes, A.; Ponce, J.; LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. In Proceedings of the 10th International Conference on Learning Representations, ICLR, Virtual Event, 25–29 April 2022. [Google Scholar]
  50. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
  51. Gregor, K.; LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th international Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 399–406. [Google Scholar]
  52. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in pytorch. 2017. [Google Scholar]
  53. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 13 2020; pp. 11108–11117. [Google Scholar]
  54. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4558–4567. [Google Scholar]
  55. Dominik, W.; Bożyczko, M.; Tułacz-Maziarz, K. Deep learning for automatic LiDAR point cloud processing. Czas. Arch. Fotogram. Kartogr. i Teledetekcji 2021, 33, 13–22. [Google Scholar]
  56. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
Figure 1. The architecture of the proposed MeshNet-SP.
Figure 2. The architecture of the DSC module.
Figure 3. The architecture of the SFE module.
Figure 4. The spatial distribution of the selected training dataset (yellow area) and testing dataset (blue area). The training dataset and testing dataset contain 12 tiles, respectively.
Figure 5. Proportion of each class in the training and testing datasets: ground (17.17%, 17.02%), vegetation (22.19%, 23.57%), buildings (54.21%, 56.63%), water (1.17%, 0.31%), cars (2.95%, 2.22%), and boats (2.32%, 0.26%). Unlabeled faces are not considered.
Figure 6. Results of the semantic 3D mesh segmentation on the validation dataset with Exps.1 and Exps.2.
Figure 7. Normalized confusion matrix of Exps.1 and Exps.2.
Figure 8. Variation trend of semantic segmentation accuracy along with varying noise levels of the texture images.
Figure 9. Normalized confusion matrix of Exps.3∼Exps.6.
Figure 10. Results of the semantic 3D mesh segmentation on the validation dataset with Exps.3 ∼ Exps.6.
Figure 11. Results of the semantic 3D mesh segmentation on the validation dataset with Exps.7 and Exps.8.
Figure 12. Normalized confusion matrix of Exps.7 and Exps.8.
Table 1. Details of different configurations. Exps.i: experiment i, i = 1, 2, ⋯, 8.
# DSC SubnetworkTextured Image Noise LevelSparse Code Dimension
withwithout0.0  σ 1.0  σ 2.0  σ 128256512
Category 1 *Exps.1×××××
Exps.2×××××
Category 2Exps.3×××××
Exps.4×××××
Exps.5×××××
Exps.6×××××
Category 3Exps.7×××××
Exps.8×××××
* These experiments can be grouped into three categories. Category 1 evaluates the importance of the DSC subnetwork. Category 2 assesses the robustness of MeshNet-SP under varying textured image noise. Category 3 evaluates the influence of sparse code’s dimension on the accuracy of semantic urban 3D mesh segmentation.
Table 2. Accuracies (%) based on the area of triangle facets. The results reported in this table are the IoU per class, overall accuracy (OA), Kappa coefficient (Kappa), mean precision (mP), mean recall (mR), mean F1 score (mF1), and mean IoU (mIoU). Exps.i: experiment i, i = 1, 2, ⋯, 8.
| # | | Ground | Vegetation | Building | Water | Car | Boat | OA | Kappa | mP | mR | mF1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Category 1 | Exps.1 | 81.04 | 84.93 | 91.75 | 57.84 | 57.45 | 39.38 | 93.17 | 88.42 | 79.90 | 80.58 | 79.97 | 68.73 |
| | Exps.2 | 79.91 | 80.95 | 89.56 | 53.18 | 53.19 | 35.43 | 91.72 | 85.96 | 77.73 | 77.83 | 77.33 | 65.37 |
| Category 2 | Exps.3 | 80.93 | 84.95 | 91.71 | 59.56 | 58.23 | 32.88 | 93.09 | 88.37 | 77.97 | 81.63 | 79.13 | 68.04 |
| | Exps.4 | 79.66 | 82.29 | 88.15 | 59.38 | 54.52 | 10.44 | 90.74 | 84.79 | 73.32 | 81.46 | 72.78 | 62.41 |
| | Exps.5 | 73.67 | 48.37 | 74.55 | 36.27 | 27.37 | 10.96 | 79.85 | 64.63 | 64.06 | 56.28 | 58.57 | 45.20 |
| | Exps.6 | 64.59 | 36.37 | 54.73 | 21.63 | 7.83 | 2.64 | 65.32 | 45.20 | 48.87 | 43.38 | 42.97 | 31.30 |
| Category 3 | Exps.7 | 79.23 | 81.51 | 90.00 | 46.75 | 50.72 | 34.21 | 91.77 | 86.02 | 74.82 | 79.30 | 75.83 | 63.74 |
| | Exps.8 | 78.44 | 79.13 | 88.49 | 46.00 | 48.84 | 33.95 | 90.75 | 84.55 | 73.55 | 77.63 | 74.92 | 62.48 |
Table 3. Accuracy comparisons among different methods. The results reported in this table are the IoU per class, overall accuracy (OA), mean recall (mR), mean F1 score (mF1), and mean IoU (mIoU).
| # | Ground | Veg. | Building | Water | Car | Boat | OA | mR | mF1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| Gao et al. [48] | 83.3 | 90.5 | 92.5 | 86.0 | 37.3 | 7.4 | 93.0 | 70.6 | 73.8 | 66.2 |
| Wilk et al. [19] | 83.8 | 88.1 | 91.5 | 85.0 | 49.9 | 15.7 | 91.4 | 75.1 | 77.7 | 69.0 |
| Qi et al. (2017a) [8] | 56.3 | 14.9 | 66.7 | 83.8 | 0.0 | 0.0 | 71.4 | 46.1 | 44.6 | 36.9 |
| Qi et al. (2017b) [20] | 68.0 | 73.1 | 84.2 | 69.9 | 0.5 | 1.6 | 85.5 | 57.8 | 57.1 | 49.5 |
| Hu et al. [53] | 38.9 | 59.6 | 81.5 | 27.7 | 22.0 | 2.1 | 74.9 | 53.3 | 49.9 | 38.6 |
| Landrieu et al. [54] | 56.4 | 61.8 | 87.4 | 36.5 | 34.4 | 6.2 | 79.0 | 64.8 | 59.6 | 47.1 |
| Thomas et al. [21] | 86.5 | 88.4 | 92.7 | 77.7 | 54.3 | 13.3 | 93.3 | 73.7 | 76.7 | 68.8 |
| Ours | 81.04 | 84.9 | 91.8 | 57.8 | 57.5 | 39.4 | 93.2 | 80.6 | 80.0 | 68.7 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
