In this section, we describe in detail the network structure, the constructed 2D-3D consistent loss function, and the training strategy of the proposed 2D3D-DescNet.
3.1. 2D3D-DescNet
To learn the 2D and 3D local feature descriptors for 2D images and 3D point clouds, we propose a novel network, 2D3D-DescNet (shown in Figure 1), consisting of a feature extractor, a cross-domain image-wise feature map extractor, and a metric network. It should be noted that the inputs of 2D3D-DescNet are matching 2D image patches (P) and 3D point cloud volumes (V), while the non-matching samples are generated during the training process.
The pipeline of the GAN strategy and the consistent loss function is as follows (a code sketch of the full pipeline appears after the list):
Step 1: Input the 3D point cloud volume V and the 2D image patch P. The matching and non-matching 2D-3D pairs are constructed by the hard triplet margin loss (Equation (2)).
Step 2: Use the 3D encoder and the 2D encoder (the feature extractor module) to extract the 3D feature $f^{3D}$ and the 2D feature $f^{2D}$.
Step 3: Use the 3D decoder to reconstruct the 3D feature $f^{3D}$ as the point cloud $\hat{V}$. The loss function is the chamfer loss (Equation (1)).
Step 4: Concatenate the 3D feature $f^{3D}$ and the 2D feature $f^{2D}$ and send the result into the metric network to compute the similarity of the 3D point cloud volume V and the 2D image patch P. The loss function of the metric network is the cross-entropy loss (Equation (6)).
Step 5: Use the feature map decoder to reconstruct the 3D feature $f^{3D}$ and the 2D feature $f^{2D}$ as the image-wise feature maps $M^{3D}$ and $M^{2D}$.
Step 6: Use the discriminator to compute the similarity of the image-wise feature maps $M^{3D}$ and $M^{2D}$.
Step 7: Compute the adversarial loss, which consists of the discriminator loss (Equation (4)) and the generator loss (Equation (5)).
Step 8: Finally, obtain the cross-dimensional 2D and 3D feature descriptors $f^{2D}$ and $f^{3D}$.
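To make the data flow of these steps concrete, the following is a minimal PyTorch-style sketch of one forward pass. The module names (encoder2d, encoder3d, decoder3d, metric, map_decoder) and tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch

def forward_pass(P, V, nets):
    """Hypothetical forward pass of 2D3D-DescNet covering Steps 2-5.
    P: (B, 3, 64, 64) 2D image patches; V: (B, 1024, 6) point cloud volumes.
    `nets` maps component names to modules; all names are illustrative."""
    f2d = nets["encoder2d"](P)               # Step 2: 128-dim 2D descriptor
    f3d = nets["encoder3d"](V)               # Step 2: 128-dim 3D descriptor
    V_rec = nets["decoder3d"](f3d)           # Step 3: reconstruction for the chamfer loss
    sim = nets["metric"](torch.cat([f3d, f2d], dim=1))  # Step 4: 2-way similarity
    m2d = nets["map_decoder"](f2d)           # Step 5: 2D image-wise feature map
    m3d = nets["map_decoder"](f3d)           # Step 5: 3D map from the shared decoder
    return f2d, f3d, V_rec, sim, m2d, m3d    # Steps 6-7 use m2d/m3d in the GAN losses
```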
3.1.1. Feature Extractor
The feature extractor contains a 2D encoder and a 3D autoencoder to learn the 128-dimensional 2D and 3D feature descriptors from the input paired matching 2D image patches and 3D point cloud volumes. The size of the input 2D image patches is 64 × 64, and the input 3D point cloud volumes are sampled to 1024 points with coordinate and color information, i.e., $V \in \mathbb{R}^{1024 \times 6}$.
For the 2D encoder, the detailed structure is C(32,4,2)-BN-ReLU-C(64,4,2)-BN-ReLU-C(128,4,2)-BN-ReLU-C(256,4,2)-BN-ReLU-C(256,128,4), where C(n,k,s) denotes a convolution layer with n filters of kernel size k × k and stride s, BN is batch normalization, and ReLU is the non-linear activation function.
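A rough PyTorch sketch of this encoder follows; it assumes 3-channel input patches, a padding of 1 in each stride-2 convolution, and reads the final C(256,128,4) entry, which does not fit the C(n,k,s) pattern, as 128 filters of kernel size 4 producing the descriptor.

```python
import torch.nn as nn

def conv_bn_relu(c_in, c_out):
    # C(n,4,2)-BN-ReLU block; padding=1 halves the spatial size (assumption)
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

encoder2d = nn.Sequential(
    conv_bn_relu(3, 32),     # 64x64 -> 32x32 (3 input channels assumed)
    conv_bn_relu(32, 64),    # 32x32 -> 16x16
    conv_bn_relu(64, 128),   # 16x16 -> 8x8
    conv_bn_relu(128, 256),  # 8x8  -> 4x4
    nn.Conv2d(256, 128, 4),  # read as 128 filters, kernel 4 -> 1x1 output
    nn.Flatten(),            # -> 128-dim 2D descriptor
)
```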
As for the 3D autoencoder, the 3D encoder is PointNet [47], with the last fully connected layer changed to output a 128-dimensional descriptor. The detailed structure of the 3D decoder is FC(128,1024)-BN-ReLU-FC(1024,1024)-BN-ReLU-FC(1024,1024×6)-Tanh, where FC(p,q) denotes a fully connected layer mapping a p-dimensional feature vector to a q-dimensional feature vector, and Tanh is the non-linear activation function.
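A corresponding sketch of the 3D decoder, assuming the final 1024×6 output vector is reshaped into a 1024-point, 6-channel (xyz + rgb) volume:

```python
import torch.nn as nn

class Decoder3D(nn.Module):
    """Sketch of the 3D decoder: maps a 128-dim descriptor back to a
    1024-point cloud with 6 channels, per the FC stack listed above."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024 * 6), nn.Tanh(),   # Tanh bounds outputs to [-1, 1]
        )

    def forward(self, f3d):                      # f3d: (B, 128)
        return self.net(f3d).view(-1, 1024, 6)   # (B, 1024, 6) reconstructed volume
```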
3.1.2. Cross-Domain Image-Wise Feature Map Extractor
To overcome the dimension differences between 2D images and 3D point clouds, we propose to map the 2D images and 3D point clouds to a unified two-dimensional feature space, called cross-domain image-wise feature maps. (1) The 2D and 3D feature descriptors are learned by the feature extractor. (2) The cross-domain image-wise feature maps are generated by a shared decoder, named the feature map decoder, as shown in Figure 1. Essentially, there are two steps here: first, the 2D image passes through the 2D encoder to extract features, and the shared feature map decoder then produces a 2D feature map; second, the 3D point cloud passes through the 3D encoder to extract features, and the shared feature map decoder then produces a 3D feature map. The 2D and 3D feature maps are collectively referred to as the cross-domain image-wise feature maps. (3) The GAN is embedded into 2D3D-DescNet, and the discriminator distinguishes the source of the image-wise feature maps. Thus, during training, the image-wise feature maps of paired matching 2D image patches and 3D point cloud volumes become similar; since both maps are produced by the shared feature map decoder, the 2D and 3D feature descriptors learned by 2D3D-DescNet must also become similar.
Three-dimensional point clouds encapsulate rich two-dimensional information, making them a valuable resource for feature extraction. Through deep neural networks, 2D images can facilitate the transfer of these features, enhancing the overall understanding and analysis of the data. Specifically, the image-wise feature map derived from the 3D data captures 2D features that correspond closely to those obtained from the 2D data. This ensures that the image-wise feature maps generated from the 3D and 2D data encode the same feature information, promoting consistency in the data representation.
In this process, 2D image-wise feature maps are generated from 2D image patches: the 2D feature descriptors extracted from the patches are fed into the shared feature map decoder, which translates them into image-wise feature maps. Similarly, 3D image-wise feature maps are generated from the volumetric data of 3D point clouds: the shared feature map decoder takes the 3D feature descriptors as inputs and produces the corresponding feature maps. Using a shared feature map decoder for both the 2D and 3D data ensures that the resulting image-wise feature maps are consistent and comparable, facilitating robust feature transfer across the two domains.
In detail, the cross-domain image-wise feature map extractor contains a shared feature map decoder and a discriminator, as follows:
For the feature map decoder, the inputs are the paired 2D and 3D feature descriptors learned from the feature extractor, and the outputs are image-wise feature maps of size 32 × 32. The detailed structure is C(256,4,4)-BN-ReLU-C(128,4,2)-BN-ReLU-C(64,4,2)-BN-ReLU-C(32,4,2)-Sigmoid, where Sigmoid is the non-linear activation function.
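A sketch of the feature map decoder, assuming the listed layers are transposed convolutions with a padding of 1 in the stride-2 layers; under these assumptions, the 128-dim descriptor, viewed as a 1 × 1 map, is upsampled to 32 × 32:

```python
import torch.nn as nn

# Sketch of the shared feature map decoder; layer types and paddings are
# assumptions that reproduce the listed filter counts and strides.
map_decoder = nn.Sequential(
    nn.Unflatten(1, (128, 1, 1)),                           # (B,128) -> (B,128,1,1)
    nn.ConvTranspose2d(128, 256, 4, stride=4),              # 1x1  -> 4x4
    nn.BatchNorm2d(256), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),   # 4x4  -> 8x8
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),    # 8x8  -> 16x16
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Sigmoid(),                                           # map values to (0, 1)
)
```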
The discriminator, consisting of a 2D encoder and a fully connected network (FCN), receives the extracted image-wise feature maps and distinguishes which domain they come from. In detail, the structure of the 2D encoder is the same as that of the 2D encoder in the feature extractor, and the detailed structure of the FCN is FC(128,256)-ReLU-BN-Dropout-FC(256,128)-ReLU-BN-Dropout-FC(128,64)-ReLU-BN-Dropout-FC(64,32)-ReLU-BN-Dropout-FC(32,2)-SoftMax.
3.1.3. Metric Network
The metric network receives the concatenation of the 2D and 3D feature descriptors learned by the feature extractor and computes their similarity. Specifically, the structure of the metric network is the same as that of the FCN in the discriminator of the cross-domain image-wise feature map extractor, with SoftMax as the final non-linear activation function. The detailed structure of the metric network is as follows: FC(128,256)-ReLU-BN-Dropout-FC(256,128)-ReLU-BN-Dropout-FC(128,64)-ReLU-BN-Dropout-FC(64,32)-ReLU-BN-Dropout-FC(32,2)-SoftMax.
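A sketch of this shared FCN head follows. Note that the text lists the first layer as FC(128,256) although the metric network's input is the 256-dimensional concatenation of two 128-dimensional descriptors, so the sketch parameterizes the input dimension; the dropout rate is an assumption.

```python
import torch.nn as nn

def fcn_head(in_dim):
    """Sketch of the FC(.,256)-...-FC(32,2)-SoftMax head used by both the
    metric network and the discriminator; dropout p=0.5 is an assumption."""
    def block(p, q):
        return nn.Sequential(nn.Linear(p, q), nn.ReLU(inplace=True),
                             nn.BatchNorm1d(q), nn.Dropout(0.5))
    return nn.Sequential(block(in_dim, 256), block(256, 128),
                         block(128, 64), block(64, 32),
                         nn.Linear(32, 2), nn.Softmax(dim=1))

metric_net = fcn_head(256)   # input: concat of 128-dim 2D and 3D descriptors
disc_head  = fcn_head(128)   # input: 128-dim feature from the discriminator's 2D encoder
```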
3.3. 2D-3D Consistent Loss Function
The constructed 2D-3D consistent loss function consists of the chamfer loss, the hard triplet margin loss with SOS regularization, the adversarial loss, and the cross-entropy loss, as detailed below.
3.3.1. Chamfer Loss
To optimize the 3D autoencoder and preserve the structural information of the learned 3D features, the 3D decoder is used to reconstruct the point cloud from the learned 3D feature descriptors. In detail, we measure the discrepancy between the point sets via the chamfer distance to optimize the 3D autoencoder:

$$L_{chamfer} = \sum_{v \in V} \min_{\hat{v} \in \hat{V}} \lVert v - \hat{v} \rVert_2^2 + \sum_{\hat{v} \in \hat{V}} \min_{v \in V} \lVert v - \hat{v} \rVert_2^2 \quad (1)$$

where V and $\hat{V}$ are the input point cloud volume and the output reconstructed point cloud volume, each with 1024 points.
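A minimal implementation sketch of Equation (1); whether matching terms are summed or averaged over points is an assumption (averaging here):

```python
import torch

def chamfer_loss(V, V_hat):
    """Sketch of the chamfer distance in Equation (1).
    V, V_hat: (B, 1024, 6) input and reconstructed point cloud volumes."""
    d = torch.cdist(V, V_hat)                        # (B, 1024, 1024) pairwise L2 distances
    v_to_vhat = d.min(dim=2).values.pow(2).mean(1)   # each input point to its nearest output
    vhat_to_v = d.min(dim=1).values.pow(2).mean(1)   # each output point to its nearest input
    return (v_to_vhat + vhat_to_v).mean()
```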
3.3.2. Hard Triplet Margin Loss
To narrow the domain gap between the 2D images and 3D point clouds, we aim to make the learned 2D and 3D feature descriptors as close as possible in the Euclidean space so that they can be used for retrieval. Inspired by HardNet [48] and SOSNet [15], we use the hard triplet margin loss with a second-order similarity (SOS) regularization to constrain the 2D and 3D feature descriptors learned by 2D3D-DescNet.
Essentially, the hard triplet loss forces the matching 2D and 3D feature descriptors to become similar in the high-dimensional space, while the non-matching pairs are pushed apart. Moreover, selecting the hardest negatives mitigates the unstable performance caused by the randomness of negative sample sampling. In addition, the second-order similarity regularization expands the inter-class distance between descriptors of the same domain, thus improving the invariance of the jointly learned 2D and 3D local feature descriptors.
In detail, for a batch of training data $B = \{(P_i, V_i)\}_{i=1}^{n}$, where $P_i$ and $V_i$ are matching 2D image patches and 3D point cloud volumes and n is the number of samples in B, the matching 2D and 3D local feature descriptors $(f_i^{2D}, f_i^{3D})$ are computed through the 2D3D-DescNet; meanwhile, the closest non-matching 2D and 3D local descriptors, namely, $f_{k^*}^{2D}$ and $f_{j^*}^{3D}$, are constructed. Finally, the hard triplet margin loss with an SOS regularization is defined as follows:

$$L_{triplet} = \frac{1}{n}\sum_{i=1}^{n}\max\left(0,\ m + d\left(f_i^{2D}, f_i^{3D}\right) - \min\left(d\left(f_i^{2D}, f_{j^*}^{3D}\right),\ d\left(f_{k^*}^{2D}, f_i^{3D}\right)\right)\right) + L_{SOS} \quad (2)$$

$$L_{SOS} = \frac{1}{n}\sum_{i=1}^{n}\sqrt{\sum_{j \neq i}\left(d\left(f_i^{2D}, f_j^{2D}\right) - d\left(f_i^{3D}, f_j^{3D}\right)\right)^2} \quad (3)$$

where $d(x, y) = \lVert x - y \rVert_2$ denotes the Euclidean distance, m is the margin, and x and y denote the D-dimensional feature descriptors.
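A sketch of the hardest-in-batch triplet term of Equation (2); the SOS regularization of Equation (3) is omitted for brevity, and the margin value is an assumption:

```python
import torch

def hard_triplet_loss(f2d, f3d, margin=1.0):
    """Sketch of the hardest-in-batch triplet term of Equation (2).
    f2d, f3d: (n, 128) descriptors where row i of f2d matches row i of f3d."""
    d = torch.cdist(f2d, f3d)                        # (n, n) cross-domain distances
    pos = d.diagonal()                               # distances of matching pairs
    d_masked = d + torch.eye(d.size(0), device=d.device) * 1e9  # exclude positives
    hardest_neg = torch.minimum(d_masked.min(dim=1).values,  # closest wrong 3D per 2D
                                d_masked.min(dim=0).values)  # closest wrong 2D per 3D
    return torch.relu(margin + pos - hardest_neg).mean()
```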
3.3.3. Adversarial Loss
The 2D3D-DescNet is optimized using the minimax two-player game framework, a foundational concept in generative adversarial networks (GANs), which is composed of two neural networks: the generator and the discriminator. These networks are trained simultaneously in a competitive setting: the generator's goal is to produce data that are as realistic as possible, while the discriminator's task is to distinguish between real data and data generated by the generator. This adversarial process drives both networks to improve continuously.
In the context of 2D3D-DescNet, the generator and discriminator work together to align 2D image patches with 3D point cloud volumes. The binary cross-entropy (BCE) loss is used to judge the performance of both networks: it measures the generation ability of the generator and the discriminatory ability of the discriminator, and it is particularly suited to binary classification tasks whose output is a probability between 0 and 1. A mean squared error (MSE) loss, by contrast, is easy to compute but very sensitive to outliers; if outliers are present, the loss may be dominated by these values and ignore the other data points, which hampers training. More importantly, in the context of this paper, an MSE loss would only perform a simple pixel-wise comparison and could not judge the similarity of the 2D and 3D image-wise feature maps from a global perspective. We therefore evaluate the 2D and 3D image-wise feature maps with a discriminator instead of the mean squared error, and the discriminator loss is defined as follows:

$$L_{D} = -\frac{1}{n}\sum_{i=1}^{n}\left[\log D\left(G(V_i)\right) + \log\left(1 - D\left(G(P_i)\right)\right)\right] \quad (4)$$
where P and V are the inputs of the paired matching 2D image patches and 3D point cloud volumes; G and D are the generator and the discriminator, respectively; $\theta_D$ denotes the parameters of D, and $\theta_G$ denotes the parameters of the network framework except D; label 1 denotes the image-wise feature map of V, and label 0 denotes the image-wise feature map of P.
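A sketch of Equation (4), assuming the discriminator ends in a two-way SoftMax whose second column is read as the probability of the label-1 ("from V") class:

```python
import torch

def discriminator_loss(disc, m2d, m3d, eps=1e-8):
    """Sketch of Equation (4): BCE on the discriminator's 2-way SoftMax output.
    The maps are detached so only the discriminator receives gradients here."""
    p_real = disc(m3d.detach())[:, 1]   # image-wise feature maps of V -> label 1
    p_fake = disc(m2d.detach())[:, 1]   # image-wise feature maps of P -> label 0
    return -(torch.log(p_real + eps) + torch.log(1.0 - p_fake + eps)).mean()
```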
In this setting, the generator aims to create feature maps from 2D images that are so realistic that the discriminator cannot distinguish them from feature maps derived from 3D point clouds. The discriminator, on the other hand, strives to accurately identify whether the feature maps are from the 2D images or the 3D point clouds. This adversarial relationship pushes the generator to produce increasingly realistic feature maps, improving its ability to generate data that mimic the real 3D features.
The generator in 2D3D-DescNet also functions as a feature extractor. It comprises the feature extractor, the feature map decoder, and the 2D encoder in the discriminator, as illustrated in Figure 1. The optimization objective of the generator is to generate feature maps that can deceive the discriminator. The loss function of the generator is as follows:

$$L_{G} = -\frac{1}{n}\sum_{i=1}^{n}\log D\left(G(P_i)\right) + L'_{triplet}\left(F_P, F_V\right) \quad (5)$$
where $F_P$ and $F_V$ are the 128-dimensional features output by the 2D encoder in the discriminator D, and the definition of $L'_{triplet}$ is the same as that of the hard triplet margin loss. The triplet margin loss ensures that the features extracted from matching 2D and 3D data are closely aligned, while features from non-matching pairs are pushed apart. This helps in learning discriminative features for accurate matching.
The hard triplet margin loss $L'_{triplet}$ is defined as follows:

$$L'_{triplet} = \frac{1}{n}\sum_{i=1}^{n}\max\left(0,\ m + d\left(F_{P_i}, F_{V_i}\right) - \min\left(d\left(F_{P_i}, F_{V_{j^*}}\right),\ d\left(F_{P_{k^*}}, F_{V_i}\right)\right)\right)$$
This loss function encourages the network to minimize the distance between the features of matching pairs while maximizing the distance to the hardest negative samples. By incorporating this loss, the generator is encouraged to produce feature maps that are not only realistic but also well aligned with their corresponding 3D features.
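A sketch of Equation (5) under the same assumptions, reusing hard_triplet_loss from the sketch above:

```python
import torch

def generator_loss(disc, m2d, F_P, F_V, margin=1.0, eps=1e-8):
    """Sketch of Equation (5): the generator tries to get the 2D image-wise
    feature maps labeled as '3D', plus the triplet term on the 128-dim
    features F_P, F_V taken from the discriminator's 2D encoder."""
    p_fooled = disc(m2d)[:, 1]                        # probability of the label-1 ('3D') class
    adversarial = -torch.log(p_fooled + eps).mean()   # push the discriminator toward label 1
    return adversarial + hard_triplet_loss(F_P, F_V, margin)
```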
To summarize, 2D3D-DescNet leverages the adversarial training strategy of GANs to optimize the feature extraction process for matching 2D images and 3D point clouds. The use of BCE loss for both the generator and discriminator ensures that the network can effectively distinguish between real and generated data. The additional triplet margin loss helps in aligning features from 2D and 3D data, improving the discriminative power of the learned features. This combination of adversarial training and feature alignment techniques allows 2D3D-DescNet to achieve robust and accurate matching between 2D images and 3D point clouds, making it a powerful tool for tasks that require cross-dimensional data integration.
3.3.4. Cross-Entropy Loss
For the embedded metric network, metric learning is used to measure the similarity between the learned local 2D and 3D feature descriptors. Specifically, since the input samples of 2D3D-DescNet are all matching 2D image patches and 3D point cloud volumes, we introduce a strategy for constructing the positive and negative samples used to train the metric network. For a mini-batch, fixing a batch of 2D feature descriptors, first, the corresponding matching batch of 3D feature descriptors is constructed to provide positive samples; second, a non-matching batch of 3D feature descriptors, sampled from the other pairs within the mini-batch, is constructed to provide negative samples. Based on this strategy, the same number of positive and negative samples is constructed for metric learning.
The cross-entropy loss is used to optimize the metric network. In detail, n pairs of 2D-3D samples are fed into the network in one batch. $y_i$ is the 0/1 label that indicates whether the input pair $x_i = (f_i^{3D}, f_i^{2D})$, which is the concatenation of the 3D feature descriptor $f_i^{3D}$ and the 2D feature descriptor $f_i^{2D}$, is matching or not; label 1 means matching, and label 0 means non-matching. $\hat{y}_i = (\hat{y}_i^{0}, \hat{y}_i^{1})$ is the output of the metric network. The cross-entropy loss is defined according to the following formula:

$$L_{CE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{y}_i^{1} + (1 - y_i)\log \hat{y}_i^{0}\right] \quad (6)$$

For the input $x_i$, the two-dimensional vector $\hat{y}_i$ is computed as the similarity of the input pair $x_i$:

$$\hat{y}_i = \mathrm{SoftMax}\left(\mathrm{FCN}(x_i)\right) \quad (7)$$
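A sketch of Equations (6) and (7), using a roll-by-one shift within the mini-batch as one simple way to draw the non-matching 3D descriptors:

```python
import torch
import torch.nn.functional as F

def metric_ce_loss(metric_net, f2d, f3d, eps=1e-8):
    """Sketch of the cross-entropy training in Equations (6)-(7). Positives are
    matching (f3d, f2d) rows; negatives pair each 2D descriptor with a 3D
    descriptor rolled by one position within the mini-batch."""
    pos = torch.cat([f3d, f2d], dim=1)                   # label 1: matching pairs
    neg = torch.cat([f3d.roll(1, dims=0), f2d], dim=1)   # label 0: non-matching pairs
    x = torch.cat([pos, neg], dim=0)
    y = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))]).long().to(x.device)
    y_hat = metric_net(x)                                # (2n, 2) SoftMax probabilities
    return F.nll_loss(torch.log(y_hat + eps), y)         # cross-entropy on probabilities
```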
3.3.5. Total Loss
Finally, the 2D3D-DescNet is optimized according to the constructed 2D-3D consistent loss function, which has been described in detail above.
In detail, the 2D-3D consistent loss function is divided into two parts during optimization. First, for the 2D and 3D local feature descriptor learning, the chamfer loss, the hard triplet margin loss with SOS, the cross-entropy loss, and the generator part of the adversarial loss are optimized together, as follows:

$$L_{total} = \lambda_1 L_{chamfer} + \lambda_2 L_{triplet} + \lambda_3 L_{CE} + \lambda_4 L_{G} \quad (8)$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ denote the weights of the loss functions and were set to 1, 1, 0.5, and 0.25, respectively, in our experiments. Second, to ensure the performance of the discriminator, the parameters of the discriminator are updated once every 5 steps, while the other parameters of the 2D3D-DescNet are updated at every step; thus, the discriminator is optimized individually, as follows:

$$\min_{\theta_D} L_{D} \quad (9)$$
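A sketch of this two-part schedule, reusing the loss and forward-pass sketches above; the optimizer split, the data loader, and the pairing of the weights (1, 1, 0.5, 0.25) with the individual losses are assumptions:

```python
def train(nets, loader, opt_g, opt_d, steps_per_disc_update=5):
    """Sketch of the schedule: the joint loss of Equation (8) updates everything
    except the discriminator every step; the discriminator (Equation (9)) is
    updated once every `steps_per_disc_update` steps. `nets` is the same
    hypothetical dict used in forward_pass; opt_g covers all parameters except
    the discriminator, opt_d the discriminator only."""
    lam = {"chamfer": 1.0, "triplet": 1.0, "ce": 0.5, "gen": 0.25}
    for step, (P, V) in enumerate(loader):
        f2d, f3d, V_rec, sim, m2d, m3d = forward_pass(P, V, nets)
        F_P = nets["disc_encoder"](m2d)   # 128-dim features from the discriminator's 2D encoder
        F_V = nets["disc_encoder"](m3d)
        loss = (lam["chamfer"] * chamfer_loss(V, V_rec)
                + lam["triplet"] * hard_triplet_loss(f2d, f3d)
                + lam["ce"] * metric_ce_loss(nets["metric"], f2d, f3d)
                + lam["gen"] * generator_loss(nets["disc"], m2d, F_P, F_V))
        opt_g.zero_grad(); loss.backward(); opt_g.step()   # all but the discriminator
        if step % steps_per_disc_update == 0:              # discriminator every 5 steps
            d_loss = discriminator_loss(nets["disc"], m2d, m3d)  # detaches maps internally
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```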