Article

Supervised Contrastive Learning for 3D Cross-Modal Retrieval

1 Contents Convergence Research Center, Korea Electronics Technology Institute (KETI), Seoul 03924, Republic of Korea
2 Artificial Intelligence Research Center, Korea Electronics Technology Institute (KETI), Seongnam 13509, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(22), 10322; https://doi.org/10.3390/app142210322
Submission received: 1 October 2024 / Revised: 2 November 2024 / Accepted: 5 November 2024 / Published: 10 November 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
Interoperability between different virtual platforms requires the ability to search and transfer digital assets across platforms. Digital assets in virtual platforms are represented in different forms or modalities, such as images, meshes, and point clouds. The cross-modal retrieval of three-dimensional (3D) object representations is challenging due to the diversity of data representations, which makes discovering a common feature space difficult. Recent studies have focused on obtaining feature consistency within the same classes and modalities using cross-modal center loss. However, center features are sensitive to hyperparameter variations, making cross-modal center loss susceptible to performance degradation. This paper proposes a new 3D cross-modal retrieval method that uses cross-modal supervised contrastive learning (CSupCon) and the fixed projection head (FPH) strategy. Contrastive learning mitigates the influence of hyperparameters by maximizing feature distinctiveness. The FPH strategy prevents gradient updates in the projection network, enabling the focused training of the backbone networks. The proposed method shows mean average precision (mAP) gains of 1.17 and 0.14 in 3D cross-modal object retrieval experiments on the ModelNet10 and ModelNet40 datasets, respectively, compared to state-of-the-art (SOTA) methods.

1. Introduction

During the period of social distancing, metaverse platforms emerged, digitally replicating social experiences. Platforms such as Roblox, Fortnite, Decentraland, and Meta Horizon provided interactive virtual environments for social interaction. In the current post-pandemic world, the adoption and progress of the metaverse have slowed, but immersive virtual experiences are still provided through spatial computing use cases. Interoperability between different virtual platforms, in particular, the search and transfer of digital assets across platforms, is an active area of research [1]. This involves the need to search and analyze virtual objects in different modalities, such as rendered two-dimensional (2D) images, three-dimensional (3D) meshes, and 3D point clouds. Developments in deep learning-based multimodal feature extraction have enabled the 3D cross-modal retrieval of virtual objects. Three-dimensional cross-modal retrieval is the task of searching for specific objects across different modalities or formats. To obtain accurate search results, it is important to understand and differentiate the context of 3D information from the various modalities.
Object retrieval is a search and ranking task that analyzes the feature similarity of a given set of objects. Objects belonging to the same category or class display similar feature distributions. Traditionally, object retrieval has focused on efficiently increasing inter-class variance and decreasing intra-class variance. If inter-class variation is small, the class boundaries become less distinct. If intra-class variation is large, objects with distinct features belonging to the same class may be categorized as belonging to different classes. Therefore, training for object retrieval aims to simultaneously increase inter-class variance and decrease intra-class variance. In the multimodal environment, additional considerations need to be made for object retrieval. In cross-modal object retrieval, objects in different modalities with the same class or category should exhibit similar feature distributions. Therefore, inter-modal variance has to be decreased to extract similar distributed features between modalities.
Several studies have been carried out to address these issues in cross-modal retrieval. Center features are representative features for each class in the embedding space. There have been attempts to standardize center features to encourage feature convergence [2,3]. However, approaches relying on center features have several issues. First, center features are highly sensitive to hyperparameters, e.g., batch size, so it is important to select suitable hyperparameters. Second, they lack contrastive characteristics between classes and modalities. Cross-modal center loss enforces convergence only within each class and does not perform comparisons against other classes across modalities [4]. This reduces robustness to data variations during evaluation. As a result, the weakness of center features may lead to performance degradation in the evaluation stage, where the feature distributions diverge. The Robust Discriminative Learning with Noisy Labels for 2D-3D Cross-Modal Retrieval framework (RONO) [2] introduced noisy labels to extract features that are robust to variations during the training process, but the fundamental problems with center features remained unresolved. Other cross-modal methods aimed to represent feature distributions within a comprehensive context through label prediction via a projection head [5,6,7]. However, the projection head is only used during training and does not participate in the object retrieval process.
In this paper, we propose a novel 3D cross-modal retrieval method based on cross-modal supervised contrastive learning (CSupCon) and the fixed projection head (FPH) strategy. CSupCon employs contrastive learning with positive and negative samples per class, generated through data augmentation, to learn to discriminate among the different classes, and it mitigates the sensitivity issues of center loss. Furthermore, within the CSupCon process, we adopt a class margin named Softmargin, which enhances retrieval performance by making the similarity comparison more challenging. The FPH strategy reduces the influence of the projection head compared to conventional methods, enabling focused training of the backbone networks. Our paper contributes to the 3D cross-modal retrieval task as follows:
  • The proposed method employs contrastive learning in a supervised 3D cross-modal environment to maximize the difference between the features of different classes. It overcomes weaknesses of center loss, such as dependency on batch size, and enhances inter-class variance while effectively reducing inter-modal variance.
  • Through the FPH strategy, the influence of the projection head is decreased to allow better emphasis on the backbone network during the training phase using label prediction.
  • The proposed method uses augmented data from each modality, proactively adapting to diverse variations during the evaluation process.
  • The newly adopted Softmargin enhances feature learning to extract features with greater similarity within CSupCon.

2. Related Works

2.1. Contrastive Learning

Contrastive learning is a method that maximizes or minimizes similarities by comparing multiple positive and negative samples. It is commonly adopted in self-supervised learning scenarios where labels are not available. Due to this property, contrastive learning can be utilized in various tasks requiring classification and recognition. Contrastive learning methods such as the Simple Framework for Contrastive Learning of Visual Representations (SimCLR) [8] are predominantly applied to self-supervised learning tasks in the image modality. SimCLR proposes a contrastive learning method based on data augmentation, which treats augmented samples generated from the same image as positives and samples from other images in the batch as negatives during training. Subsequently, other approaches have been proposed to use contrastive properties more efficiently or to exclude negative samples entirely [9,10,11]. After SimCLR, some self-supervised cross-modal methods incorporated data augmentation to compare features across modalities. CrossPoint, which employs data augmentation to compare the relevance between the point cloud and image modalities, computes self-similarity on augmented point data and extracts distinctive object features [12]. The point features are compared with those from the image modality to perform 3D cross-modal classification. Similarly, multimodal contrastive training performs cross-modal computations using augmented data to compare inter- and intra-modal similarities [13]. Hence, contrastive learning-based methods are the most accessible option for self-supervised learning tasks where labels are absent. However, there have also been attempts to extend contrastive learning beyond self-supervised learning. Some supervised learning approaches employ contrastive learning to classify data embedded with high-dimensional features using its contrastive properties. Supervised contrastive learning (SupCon), for instance, leverages contrastive learning with ground truth labels [14]. It applies contrastive learning by utilizing positive samples drawn from objects of the same class. Hence, contrastive learning can be used in supervised, self-supervised, and cross-modal environments, and its advantages can be maximized depending on its usage.

2.2. Cross-Domain and -Modal Feature Learning

Cross-domain adaptation involves extracting shared, common features across different domains, making the task similar to cross-modal approaches. Various methods have been proposed to extract and utilize features in cross-domain environments. Q. Wang et al. proposed a method to train on labeled and unlabeled data from the target domain while preserving as much structural information from the source domain as possible [15]. Feature distributions in the source domain were analyzed to extract features from the target domain, consisting of labeled and unlabeled data, while preserving its structural similarity. C. Zhu et al. aimed to train on unlabeled data in the target domain using existing data from the source domain [16]. They proposed a method that utilizes the original backbone model, a model fine-tuned on the source domain, and a model fine-tuned on both the source and target domains to extract features and learn to cluster these features by class in the latent space. Cross-modal retrieval tasks have also been studied continuously. The most challenging aspect of cross-modal retrieval is the imbalanced feature distribution caused by the differences between modalities. Ultimately, the problem reduces to narrowing the heterogeneous gap between modalities and contextually integrating the features. Early supervised cross-modal retrieval methods mainly targeted image–text or image–point retrieval [5,6,7,17,18,19]. These studies aimed to address the aforementioned issues by performing contextual adjustments that extract discriminative features through semantic label prediction, producing similar feature distributions across different modalities. Simultaneously, they tried to decrease the distance between inter-modal embedding features, enabling fine-tuned adjustments at the backbone level. As a result, they were successful in representing similar features for the same object class across different modalities. Among these methods, Deep Supervised Cross-Modal Retrieval (DSCMR) mitigates the heterogeneous gap between the text and image modalities by acquiring and comparing features in both the label and embedding spaces [7]. Furthermore, Jing et al. proposed cross-modal center loss (CLF), which establishes anchor features called cross-modal center features to generate distinct feature clusters in multi-modal environments [3]. As a result, it leads to the convergence of all features from each modality, exhibiting feature clustering that is robust against modal variance. Later, RONO further extended CLF by including label noise [2]. RONO enhances distinguishability using noisy labels, creating features that are robust against various changes during evaluation. These previous approaches, CLF and RONO, aim to extract discriminative features and represent modality-coherent features in cross-modal environments, carrying out 3D object retrieval using label prediction and cross-modal center loss. For label prediction, feature projection is employed to obtain contextual feature similarity. However, since the projection head is not used in the retrieval stage, updating the projection head through backpropagation may degrade retrieval performance. Furthermore, the cross-modal center loss used to address inter-class and inter-modal variance relies heavily on center features, which may become biased depending on the hyperparameters. Therefore, this paper proposes a new method for 3D object retrieval using both supervised and contrastive learning in a cross-modal environment.

3. Proposed Method

The key aspect of three-dimensional (3D) cross-modal retrieval is maximizing inter-class variance while minimizing inter-modal variance. By maximizing inter-class variance, the contextual features of a class can be clearly distinguished from those of other classes. Simultaneously, by minimizing inter-modal variance, the modal features of the same object class can be represented similarly across modalities. In addition, data augmentation should be employed during training to prepare for data variations encountered during evaluation. Consequently, for supervised 3D cross-modal retrieval tasks, it is desirable to employ a contrastive learning approach in which features from the same classes are pulled toward each other while features from different classes are pushed away, for all modalities, using the augmented data.

3.1. Preliminaries

In this paper, we propose a new 3D cross-modal retrieval approach, cross-modal supervised contrastive learning (CSupCon), which expands on the ideas presented in supervised contrastive learning (SupCon) [14]. First, we explain the SupCon formulation in a single modality. Given a set of $N$ training samples, two different data augmentation methods are applied to each sample to generate a pair of views, resulting in $2N$ augmented samples in total. With this notation, the conventional self-supervised NT-Xent loss from SimCLR [8] can be defined as follows:
$$L_{nt} = \frac{1}{2N} \sum_{n=1}^{N} \left[ \ell(2n-1, 2n, N) + \ell(2n, 2n-1, N) \right] \quad (1)$$
The exponential similarity function E can be expressed as shown in Equation (2):
$$E_{i,j} = \exp\left(\mathrm{sim}(z_i, z_j)/\tau\right) \quad (2)$$
where $\mathrm{sim}$ and $\tau$ denote the cosine similarity function and the temperature parameter that adjusts the scale, respectively. Equation (3) shows the loss function $\ell$ that appears in Equation (1). The loss function is defined as follows:
$$\ell(i, j, N) = -\log \frac{E_{i,j}}{\sum_{k=1}^{2N} \mathbb{1}_{[i \neq k]} \cdot E_{i,k}} \quad (3)$$
where $z$ represents the features projected through the multilayer perceptron head, and $\mathbb{1}_{[\cdot]} \in \{0, 1\}$ is the indicator function, which returns 1 if the condition is satisfied and 0 otherwise. Moreover, if we refer to $Y = \{y_i\}_{i=1}^{N}$ as the ground truth labels, $\tilde{y}_i$ is the number of identical labels in the batch, given as $\tilde{y}_i = \sum_{j=1}^{N} \mathbb{1}_{[y_i = y_j]}$. Using the definitions above, the SupCon loss $L_{sc}$ can be formulated as follows:
$$L_{sc} = \sum_{i=1}^{2N} \frac{1}{2\tilde{y}_i - 1} \sum_{j=1}^{2N} \mathbb{1}_{[i \neq j]} \cdot \mathbb{1}_{[y_i = y_j]} \cdot \ell(i, j, N) \quad (4)$$
As described above, the NT-Xent loss uses multiple augmented views of the data, providing a contrastive learning approach applicable to self-supervised learning. SupCon adapts this approach to supervised learning and demonstrates improved performance by incorporating additional instances from the same class as positive samples during training.
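To make the formulation concrete, the following is a minimal PyTorch sketch of the SupCon objective in Equations (2)–(4), assuming the batch already contains the 2N augmented views and that cosine similarity is computed on L2-normalized features; the function and variable names are illustrative and do not come from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive (SupCon) loss over a batch of 2N augmented views (Eq. (4)).

    features: (2N, D) features, two augmented views per original sample.
    labels:   (2N,) ground-truth class labels aligned row-by-row with `features`.
    """
    z = F.normalize(features, dim=1)                      # cosine similarity via dot products
    logits = torch.matmul(z, z.T) / temperature           # sim(z_i, z_j) / tau, Eq. (2)
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # numerical stability
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    # Denominator of Eq. (3): sum over k != i of E_{i,k}.
    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))
    # Positives: same class, excluding the anchor itself (1[i != j] * 1[y_i = y_j]).
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    # Average -log prob over each anchor's positives (the 1/(2*y~_i - 1) factor), then the batch.
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```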

3.2. Cross-Modal Supervised Contrastive Learning

To perform retrieval across several modalities, we propose a new method named CSupCon, which builds upon SupCon. Cross-modal retrieval tasks require additional comparisons among the different modalities. Consequently, additional datasets are generated in each modality through data augmentation. Figure 1 illustrates this process. The proposed CSupCon loss, denoted as $L_{csc}$, is defined as follows:
$$L_{csc} = \sum_{i=1}^{2NM} \frac{1}{2\tilde{y}_i - 1} \sum_{j=1}^{2NM} \mathbb{1}_{[i \neq j]} \cdot \mathbb{1}_{[y_i = y_j]} \cdot \ell(i, j, NM) \quad (5)$$
where $M$ represents the number of modalities. Ultimately, the proposed approach generates augmented data, denoted as $d$ and $d'$, for each modality. The augmented data $d$ and $d'$ are combined into $D$ and processed by SupCon, which draws positive and negative samples from all modalities. The proposed CSupCon enables contrastive sampling regardless of modality, thereby enhancing inter-class variance and decreasing inter-modal variance. Furthermore, our proposed method defines relationships between the augmented data of different modalities (e.g., $d_i$ and $d'_j$), allowing robust adaptation to data variations during evaluation. In addition, most traditional methods employ the projected feature $z$ to compute and adjust features. However, we empirically found it more effective to directly utilize the embedding features $v$ produced by the encoder network in 3D cross-modal retrieval tasks. Hence, the exponential similarity function $E$ can be redefined as shown in Equation (6).
$$E_{i,j} = \exp\left(\mathrm{sim}(v_i, v_j)/\tau\right) \quad (6)$$
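The cross-modal extension can be sketched as a thin wrapper that pools the augmented embedding features of all M modalities into a single batch of 2NM samples before applying the contrastive objective of Equation (5). The sketch below reuses the hypothetical supcon_loss function from the previous sketch and operates directly on the embedding features v rather than on projected features z.

```python
import torch

def csupcon_loss(modal_views, labels, temperature=0.07):
    """Cross-modal SupCon (CSupCon, Eq. (5)): pool the augmented embedding features of
    every modality into one batch and apply the supervised contrastive objective, so
    positives and negatives are sampled regardless of modality.

    modal_views: list of M pairs (v, v_prime), each of shape (N, D), holding the
                 embedding features of the two augmented views of one modality.
    labels:      (N,) class labels shared by all modalities of the same object.
    """
    feats, labs = [], []
    for v, v_prime in modal_views:            # 2N features per modality -> 2NM overall
        feats.extend([v, v_prime])
        labs.extend([labels, labels])
    features = torch.cat(feats, dim=0)        # (2NM, D) embedding features v (no projection head)
    all_labels = torch.cat(labs, dim=0)       # (2NM,)
    return supcon_loss(features, all_labels, temperature)  # sketch from Section 3.1
```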

3.3. Label Prediction Using Fixed Projection Head

The projection head for semantic label prediction significantly impacts inter-class variance. The label prediction process projects low-level embedding features onto high-level semantic characteristics. Therefore, label prediction is a convenient way to unify contextual similarity across all modalities. However, the projection head affects the embedding features during the projection process. As a result, training for label prediction disturbs the training of the backbone network through the projection head. To prevent this during training, the proposed method employs a fixed projection head (FPH) strategy, which freezes the projection network as shown in Figure 2. The projection network is excluded from gradient updates during label prediction, and training can be focused entirely on the backbone network. In our strategy, the FPH network is a simple network composed of Linear–Rectified Linear Unit (ReLU)–Linear layers. Given a 1 × 512 input, it generates 1 × 40 output features.
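A minimal sketch of the FPH strategy in PyTorch is given below; the hidden width of 512 is an assumption, since the paper only specifies a Linear–ReLU–Linear head that maps a 1 × 512 input to a 1 × 40 output.

```python
import torch
import torch.nn as nn

# Sketch of the fixed projection head (FPH): Linear-ReLU-Linear, 512 -> 40 logits.
# The hidden width (512) is an assumption; the paper only specifies the layer types
# and the 1 x 512 input / 1 x 40 output shapes.
fph = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(inplace=True),
    nn.Linear(512, 40),
)
for p in fph.parameters():
    p.requires_grad = False        # "fixed": the projection network receives no gradient updates

v = torch.randn(8, 512, requires_grad=True)   # stand-in for backbone embedding features
logits = fph(v)                                # label-prediction gradients still flow back into v
```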

3.4. Loss Functions

The final training loss of the proposed 3D cross-modal retrieval method consists of three losses: the label prediction loss, the CSupCon loss, and the Mean Square Error (MSE) loss. Label prediction is employed to enhance inter-class variance and is given the highest priority. The CSupCon loss applies contrastive learning across the different modalities. The MSE loss is used to reduce inter-modal variance by pulling the modalities toward each other. To improve retrieval performance, we incorporate an additive margin named Softmargin into the training process of CSupCon to enforce a similarity margin. If the additive margin is denoted as $\eta$, the exponential similarity function $E$ is modified into $E'$ by applying $\eta$, as follows:
$$E'_{i,j} = \exp\left(\left(\mathrm{sim}(v_i, v_j) - \eta\right)/\tau\right) \quad (7)$$
The loss function $\ell$ is also modified into $\ell'$ by applying $\eta$, as shown in Equation (8):
$$\ell'(i, j, N) = -\log \frac{E'_{i,j}}{E'_{i,j} + \sum_{k=1}^{2N} \mathbb{1}_{[i \neq k]} \cdot E_{i,k}} \quad (8)$$
By employing $\eta$, we increase the difficulty of comparing similarities between features, thereby further enhancing the performance of contrastive learning. Therefore, the proposed approach computes $L_{csc}$ using the modified $\ell'$ in Equation (8).
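As a sketch, the Softmargin modification amounts to subtracting η from the positive-pair similarities before temperature scaling; restricting the margin to positive pairs and the default value of η are assumptions drawn from the form of Equations (7) and (8).

```python
import torch

def apply_softmargin(sim, pos_mask, eta=0.1, temperature=0.07):
    """Softmargin sketch (Eq. (7)): subtract the additive margin eta from positive-pair
    similarities before temperature scaling, i.e. E'_{i,j} = exp((sim(v_i, v_j) - eta) / tau).

    sim:      (B, B) raw cosine-similarity matrix of the embedding features.
    pos_mask: (B, B) boolean mask, True where i and j share the same class (i != j).
    """
    logits = sim - eta * pos_mask.float()   # harder positives; negatives are left unchanged
    return logits / temperature
```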
The loss function of the FPH strategy, $L_{fph}$, can be represented as the Mean Absolute Error (MAE), as shown in Equation (9), where $P$ represents a classifier that is identical to the projection network.
$$L_{fph} = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left| P(v_n^m) - y_n \right| \quad (9)$$
Lastly, the proposed method uses the Mean Square Error (MSE) between modalities to reduce inter-modal variation.
$$L_{mse} = \frac{1}{N \cdot (M-1)!} \sum_{s=1}^{M-1} \sum_{m=s+1}^{M} \sum_{n=1}^{N} \left\| v_n^s - v_n^m \right\|^2 \quad (10)$$
Finally, our entire loss is computed as follows:
$$L = \alpha L_{csc} + \beta L_{fph} + \gamma L_{mse} \quad (11)$$
The weights $\alpha$, $\beta$, and $\gamma$ are adjusted empirically.
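The three loss terms can be sketched as follows, with α = 10, β = 1, and γ = 1 as reported in Section 4.2. Comparing softmax probabilities with one-hot labels inside the MAE term is an assumption, since Equation (9) only specifies the absolute difference between the classifier output P(v) and the label; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def fph_loss(logits_per_modality, labels, num_classes):
    """L_fph (Eq. (9)): mean absolute error between the fixed projection head's output
    P(v) and the label, averaged over modalities and samples. Comparing softmax
    probabilities with one-hot labels is an assumption; the paper only states MAE."""
    onehot = F.one_hot(labels, num_classes).float()
    return sum(F.l1_loss(F.softmax(logits, dim=1), onehot)
               for logits in logits_per_modality) / len(logits_per_modality)

def inter_modal_mse(embeddings_per_modality):
    """L_mse (Eq. (10)): mean squared error between the embedding features of every
    pair of modalities, pulling same-object features of different modalities together."""
    M = len(embeddings_per_modality)
    pairs = [(s, m) for s in range(M - 1) for m in range(s + 1, M)]
    return sum(F.mse_loss(embeddings_per_modality[s], embeddings_per_modality[m])
               for s, m in pairs) / len(pairs)

def total_loss(l_csc, l_fph, l_mse, alpha=10.0, beta=1.0, gamma=1.0):
    """Overall objective (Eq. (11)); the default weights follow Section 4.2."""
    return alpha * l_csc + beta * l_fph + gamma * l_mse
```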

4. Experiments

In this section, we conduct ablation studies on the components of the proposed methods and explain the datasets used. In addition, quantitative experiments are performed to compare our method with state-of-the-art three-dimensional (3D) cross-modal retrieval approaches.

4.1. Datasets

ModelNet10 and ModelNet40 are 3D computer-aided design (CAD) object datasets with 10 and 40 categories, respectively [20]. ModelNet40 consists of 9840 training and 2468 test data. ModelNet10 includes 3991 training and 908 test data. The datasets used in the experiments include images, meshes, and point clouds which are processed by cross-modal center loss (CLF) [3]. The datasets are available at https://github.com/LongLong-Jing/Cross-Modal-Center-Loss (accessed on 27 September 2024).

4.2. Experimental Details

The parameters used in our experiments were set empirically as follows. The features used for training were 512-dimensional vectors obtained from the backbone networks, and training ran for 1000 epochs. The weights in Equation (11), $\alpha$, $\beta$, and $\gamma$, were set to 10, 1, and 1, respectively. All experiments were performed on multiple NVIDIA TITAN Xp GPUs. In all experiments, the image features were acquired from four viewpoints. In the data augmentation step, we applied weakly augmented data, which are almost identical to the input, and strongly augmented data. The image modality adopted random crop, random flip, and resizing. The mesh modality employed face jittering and a maximum of 1024 sampled faces. Lastly, the point cloud modality employed random translation, rotation, and jitter. Unlike the original training strategy of supervised contrastive learning (SupCon), we adopted only the representation feature learning stage of SupCon. We used ResNet [22], MeshNet [21], and DGCNN [23] as the backbone networks for the image, mesh, and point modalities, respectively.
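The augmentation pipelines described above can be sketched as follows; the crop size, rotation range, translation shift, and jitter scale are illustrative assumptions, as the paper does not report the exact values.

```python
import numpy as np
import torchvision.transforms as T

# "Strong" image augmentation: random crop, random flip, and resizing.
strong_image_aug = T.Compose([
    T.RandomResizedCrop(224),     # crop size is an illustrative assumption
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def strong_point_aug(points, max_angle=np.pi, shift=0.1, sigma=0.01):
    """Random rotation (about the vertical axis), translation, and jitter
    for an (N, 3) point cloud; the ranges are illustrative assumptions."""
    theta = np.random.uniform(-max_angle, max_angle)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    points = points @ rot.T
    points = points + np.random.uniform(-shift, shift, size=(1, 3))    # random translation
    points = points + np.random.normal(0.0, sigma, size=points.shape)  # jitter
    return points
```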

4.3. Ablation Study: Impact of Compositions

To verify the effectiveness of the proposed method, experiments were conducted on the ModelNet40 dataset and evaluated using the mean average precision (mAP) metric. The results demonstrate that the proposed method achieves better performance than the existing methods. To provide a detailed analysis, ablation studies addressing these aspects are presented in this section.
Impact of Batch Size: The proposed method, cross-modal supervised contrastive learning (CSupCon), was evaluated based on positive and negative samples within a batch. Contrastive learning tends to be less sensitive to the influence of batch size. To verify our assumption, we conducted experiments by varying the batch size. The results are illustrated in Table 1. While the center loss-based CLF [3] was heavily influenced by changes in batch size, our proposed contrastive learning-based method was relatively less influenced by batch size. Consequently, the proposed method can be employed regardless of batch size.
Impact of Loss Functions: To verify the influence of each objective function in the proposed method, we conducted an ablation study to analyze its significance. The results are presented in Table 2. Estimating the overall context using label prediction through the fixed projection head (FPH) strategy and performing contrastive learning on each data instance through CSupCon improved performance as expected. In addition, reducing inter-modal variation by adding the MSE loss also aided performance.
Impact of Fixed Projection Head: The proposed method uses the FPH strategy which fixes the projection head during label prediction. To prove the efficiency of the strategy, we performed an ablation study. Table 3 shows performance differences before and after applying the FPH strategy. The results show that the FPH strategy improved the extraction of distinguishable features.
Impact of Softmargin: The proposed method involves applying a margin during similarity comparison, to enhance performance by making the comparison more challenging. As shown in Table 4, the results were improved for retrieval tasks with Softmargin applied. Therefore, raising the difficulty level during similarity comparison using Softmargin improved the overall performance.

4.4. Experimental Results for Comparison

We conducted a comprehensive performance evaluation comparing the proposed method with previous approaches for 3D cross-modal retrieval tasks. To demonstrate the effectiveness of contrastive learning, we compared against two cross-modal center loss-based methods, CLF [3] and RONO [2].
Quantitative Comparison: Table 5 and Table 6 represent the results of quantitative comparisons. The proposed method achieved the highest mAP performance on the ModelNet10 and ModelNet40 datasets. Based on the comparison results, we can verify the following:
  • The proposed contrastive learning-based method exhibits better results than the center loss-based methods, as it is less sensitive to external elements. Therefore, the proposed method is suitable for 3D cross-modal retrieval.
  • Traditional methods based on center loss do not actively exclude other classes. On the other hand, our proposed method, CSupCon, based on contrastive learning, actively compares with other classes using data augmentation, obtaining better results during evaluation. As shown in Table 5 and Table 6, performance improvements in cross-modal retrieval are observed in four out of six tasks for ModelNet40 and six out of six tasks for ModelNet10. The highest mAPs for cross-modal retrieval tasks are underlined in the tables.

4.5. Qualitative Results

Visualizations of Experimental Results: To verify the proposed method visually, we present the resulting feature clusters and retrieved objects. Figure 3 shows the feature clusters of the test data in the ModelNet40 dataset, visualized using t-distributed stochastic neighbor embedding (t-SNE) [24].
Furthermore, Figure 4 presents the retrieval results in different modalities. As shown in the figure, it can be observed that the proposed method succeeds in cross-modal retrieval regardless of the modality. Based on these results, we can conclude that our method, based on contrastive learning, successfully performs retrieval.
Limitations: We have confirmed that our proposed method successfully performs cross-modal retrieval and achieves state-of-the-art performance. However, as shown in Figure 5, the retrieval also has unresolved issues. Learning is not as effective for classes with a small number of training data. Consequently, additional approaches are required to address the long-tailed issue commonly encountered in deep learning.

5. Conclusions

In this paper, we introduce a novel three-dimensional (3D) cross-modal retrieval method that uses cross-modal supervised contrastive learning (CSupCon) and the fixed projection head (FPH) strategy to learn distinguishable features across different modalities, leveraging the contrastive learning approach commonly used in single-modal environments. CSupCon extracts features that are robust for cross-modal evaluation by analyzing relationships among augmented data from different modalities. Moreover, CSupCon actively includes and excludes features of other classes via positive and negative sampling, whereas traditional methods do not. In addition, the FPH strategy allows for more focused learning of the backbone network compared to conventional methods. The experimental results show that the proposed method achieves the highest mAP compared to existing methods on the ModelNet10 and ModelNet40 datasets, verifying its effectiveness. Therefore, the proposed method can be beneficial for 3D cross-modal retrieval tasks in 3D virtual spaces. As future work, we plan to expand our research to real-world scenarios using online 3D asset datasets for additional performance evaluations. Additionally, we intend to incorporate various modalities, such as textual object descriptions, to facilitate cross-modal retrieval.

Author Contributions

Conceptualization, Y.-S.C.; methodology, Y.-S.C. and B.K.; software, Y.-S.C.; validation, Y.-S.C., B.K., H.-S.K. and Y.-S.P.; formal analysis, Y.-S.C. and B.K.; investigation, Y.-S.C. and B.K.; resources, Y.-S.P.; data curation, H.-S.K.; writing—original draft preparation, Y.-S.C.; writing—review and editing, Y.-S.C. and Y.-S.P.; visualization, Y.-S.C.; supervision, H.-S.K.; project administration, Y.-S.P.; funding acquisition, Y.-S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (Project Name: Open Metaverse Asset Platform for Digital Copyrights Management, Project Number: RS-2022-KC000812 (R2022020034), Contribution Rate: 100%).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, L.; Ni, S.-T.; Wang, Y.; Yu, A.; Lee, J.-A.; Hui, P. Interoperability of the metaverse: A digital ecosystem perspective review. arXiv 2024, arXiv:2403.05205. [Google Scholar]
  2. Feng, Y.; Zhu, H.; Peng, D.; Peng, X.; Hu, P. RONO: Robust Discriminative Learning with Noisy Labels for 2D-3D Cross-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 11610–11619. [Google Scholar]
  3. Jing, L.; Vahdani, E.; Tan, J.; Tian, Y. Cross-Modal Center Loss for 3D Cross-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3141–3150. [Google Scholar]
  4. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 499–515. [Google Scholar]
  5. Wang, B.; Yang, Y.; Xu, X.; Hanjalic, A.; Shen, H.T. Adversarial Cross-Modal Retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, New York, NY, USA, 23–27 December 2017; pp. 154–162. [Google Scholar]
  6. Zhang, C.; Song, J.; Zhu, X.; Zhu, L.; Zhang, S. HCMSL: Hybrid Cross-Modal Similarity Learning for Cross-Modal Retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–22. [Google Scholar] [CrossRef]
  7. Zhen, L.; Hu, P.; Wang, X.; Peng, D. Deep supervised cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 17–20 June 2019; pp. 10386–10395. [Google Scholar]
  8. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 1597–1607. [Google Scholar]
  9. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 15750–15758. [Google Scholar]
  10. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.; Azar, M.G.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 21271–21284. [Google Scholar]
  11. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 14–19 June 2020; pp. 9729–9738. [Google Scholar]
  12. Afham, M.; Dissanayake, I.; Dissanayake, D.; Dharmasiri, A.; Thilakarathna, K.; Rodrigo, R. CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 9892–9902. [Google Scholar]
  13. Yuan, X.; Lin, Z.; Kuen, J.; Zhang, J.; Wang, Y.; Maire, M.; Kale, A.; Faieta, B. Multimodal Contrastive Training for Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6991–7000. [Google Scholar]
  14. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 18661–18673. [Google Scholar]
  15. Wang, Q.; Breckon, T.P. Cross-domain structure preserving projection for heterogeneous domain adaptation. Pattern Recognit. 2022, 123, 108362. [Google Scholar] [CrossRef]
  16. Zhu, C.; Wang, Q.; Xie, Y.; Xu, S. Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation. Inf. Sci. 2024, 662, 120223. [Google Scholar] [CrossRef]
  17. Cheng, Q.; Tan, Z.; Wen, K.; Chen, C.; Gu, X. Semantic Pre-Alignment and Ranking Learning with Unified Framework for Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6503–6516. [Google Scholar] [CrossRef]
  18. Wei, Y.; Zhao, Y.; Lu, C.; Wei, S.; Liu, L.; Zhu, Z.; Yan, S. Cross-Modal Retrieval with CNN Visual Features: A New Baseline. IEEE Trans. Cybern. 2017, 47, 449–460. [Google Scholar] [CrossRef] [PubMed]
  19. Zeng, Z.; Xu, N.; Mao, W.; Zeng, D. An Orthogonal Subspace Decomposition Method for Cross-Modal Retrieval. IEEE Intell. Syst. 2022, 37, 45–53. [Google Scholar] [CrossRef]
  20. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D Shapenets: A Deep Representation for Volumetric Shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar]
  21. Feng, Y.; Feng, Y.; You, H.; Zhao, X.; Gao, Y. Meshnet: Mesh neural network for 3D shape representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8279–8286. [Google Scholar]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  23. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  24. Van Der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Traditional SimCLR [8] method using single modal data augmentations (left). The augmented data (marked with symbols ′ and ″) are adapted in supervised learning to aggregate representation features. Supervised contrastive learning (SupCon) [14] adapted SimCLR into supervised learning tasks in single modality (middle). Our method, CSupCon, applied contrastive learning to cross-modal tasks (right). The numbers represent different data instances. The rectangles and circles represent different modalities, and the different colors represent different classes. The blue and red lines indicate positive and negative instances.
Figure 2. Overview of the proposed method. In the feature extraction stage, two augmented instances, x′ and x″, are generated from input x, and embedding features v′ and v″ are extracted from each modality using its corresponding backbone network. The proposed method, cross-modal supervised contrastive learning (CSupCon), pushes the features of different classes apart and pulls the features of the same class together. On the other side, in the fixed projection head (FPH) strategy, the v features are used to predict semantic labels for classification.
Figure 3. The visualization result of feature clustering from the ModelNet40 test data.
Figure 4. The results of cross-modal retrieval using the proposed method from the ModelNet40 test data.
Figure 5. The result of cross-modal retrieval on the ModelNet40 test data by class. The illustration depicts sorted classes based on the amount of training data and their corresponding mAPs. In general, results are not favorable for classes with a small (less than 200 in this example) number of training data (i.e., classes included in the blue dotted rectangle).
Table 1. Ablation study to verify the experimental results according to batch sizes on the ModelNet40 dataset. The numbers in bold represent the highest average mAP in the evaluation.
| Retrieval Task | CLF (bs 24) | Ours (bs 24) | CLF (bs 48) | Ours (bs 48) | CLF (bs 96) | Ours (bs 96) | CLF (bs 128) | Ours (bs 128) |
|---|---|---|---|---|---|---|---|---|
| Image → Image | 63.56 | 90.07 | 85.64 | 90.08 | 90.23 | 90.43 | 90.30 | 90.37 |
| Image → Mesh | 73.22 | 89.72 | 86.94 | 89.45 | 89.59 | 89.73 | 89.80 | 89.81 |
| Image → Point | 72.08 | 90.25 | 85.59 | 89.68 | 89.04 | 90.08 | 88.30 | 90.69 |
| Mesh → Image | 88.44 | 88.35 | 88.91 | 89.13 | 88.51 | 88.95 | 89.10 | 88.15 |
| Mesh → Mesh | 68.81 | 89.10 | 86.50 | 89.46 | 88.11 | 89.04 | 87.30 | 88.78 |
| Mesh → Point | 84.60 | 89.13 | 86.67 | 89.12 | 87.37 | 88.84 | 88.20 | 89.01 |
| Point → Image | 82.44 | 89.47 | 85.44 | 89.19 | 87.04 | 89.86 | 88.70 | 89.92 |
| Point → Mesh | 67.46 | 89.69 | 84.67 | 89.12 | 87.11 | 89.62 | 88.30 | 89.88 |
| Point → Point | 83.56 | 90.18 | 86.62 | 89.33 | 87.58 | 90.12 | 88.10 | 90.88 |
| Average mAP | 76.02 | 89.55 | 86.33 | 89.40 | 88.29 | 89.63 | 88.68 | 89.72 |
Table 2. Experimental results of the objective functions. $L_1$, $L_2$, and $L_3$ represent the label prediction loss $L_{fph}$, the $L_{csc}$ loss, and the $L_{mse}$ loss, respectively. In all experiments, the batch size was set to 128.
| Retrieval Task | $L_1$ | $L_1 + L_2$ | $L_1 + L_2 + L_3$ |
|---|---|---|---|
| Image → Image | 88.28 | 89.95 | 90.37 |
| Image → Mesh | 87.31 | 89.55 | 89.81 |
| Image → Point | 86.77 | 90.07 | 90.69 |
| Mesh → Image | 86.09 | 88.47 | 88.15 |
| Mesh → Mesh | 86.37 | 89.29 | 88.78 |
| Mesh → Point | 85.22 | 89.05 | 89.01 |
| Point → Image | 83.44 | 89.59 | 89.92 |
| Point → Mesh | 82.75 | 89.82 | 89.88 |
| Point → Point | 82.33 | 90.31 | 90.88 |
| Average mAP | 85.40 | 89.57 | 89.72 |
Table 3. Ablation study to verify the performances using the FPH strategy on the ModelNet40 dataset. In all experiments, the batch size was set to 128.
| Retrieval Task | w/o FPH | w/ FPH |
|---|---|---|
| Image → Image | 89.32 | 90.37 |
| Image → Mesh | 89.04 | 89.81 |
| Image → Point | 88.83 | 90.69 |
| Mesh → Image | 88.88 | 88.15 |
| Mesh → Mesh | 89.50 | 88.78 |
| Mesh → Point | 88.61 | 89.01 |
| Point → Image | 89.26 | 89.92 |
| Point → Mesh | 89.43 | 89.88 |
| Point → Point | 89.58 | 90.88 |
| Average mAP | 89.16 | 89.72 |
Table 4. Ablation study to verify the performances w/ and w/o Softmargin. In all experiments, the batch size was set to 128.
| Retrieval Task | w/o Margin | w/ Margin |
|---|---|---|
| Image → Image | 89.67 | 90.37 |
| Image → Mesh | 89.20 | 89.81 |
| Image → Point | 90.12 | 90.69 |
| Mesh → Image | 87.49 | 88.15 |
| Mesh → Mesh | 88.32 | 88.78 |
| Mesh → Point | 88.59 | 89.01 |
| Point → Image | 89.41 | 89.92 |
| Point → Mesh | 89.71 | 89.88 |
| Point → Point | 90.51 | 90.88 |
| Average mAP | 89.22 | 89.72 |
Table 5. The quantitative comparison with previous cross-modal center loss-based methods with a 128 batch size. The number in bold represents the highest average mAP on the ModelNet40 dataset. The highest mAPs for cross-modal retrieval tasks are underlined.
| Retrieval Task | CLF | RONO | Ours |
|---|---|---|---|
| Image → Image | 90.30 | 91.10 | 90.37 |
| Image → Mesh | 89.80 | 90.10 | 89.81 |
| Image → Point | 88.30 | 89.10 | 90.69 |
| Mesh → Image | 89.10 | 89.90 | 88.15 |
| Mesh → Mesh | 87.30 | 90.10 | 88.78 |
| Mesh → Point | 88.20 | 88.30 | 89.01 |
| Point → Image | 88.70 | 89.10 | 89.92 |
| Point → Mesh | 88.30 | 89.40 | 89.88 |
| Point → Point | 88.10 | 89.10 | 90.88 |
| Average mAP | 88.68 | 89.58 | 89.72 |
Table 6. The quantitative comparison with previous cross-modal center loss-based methods with a 128 batch size. The number in bold represents the highest average mAP on the ModelNet10 dataset. The highest mAPs for cross-modal retrieval tasks are underlined.
| Retrieval Task | CLF | RONO | Ours |
|---|---|---|---|
| Image → Image | 90.30 | 91.30 | 91.72 |
| Image → Mesh | 90.70 | 90.60 | 91.47 |
| Image → Point | 89.50 | 89.80 | 91.64 |
| Mesh → Image | 88.90 | 89.60 | 90.93 |
| Mesh → Mesh | 91.60 | 91.90 | 91.19 |
| Mesh → Point | 90.00 | 90.40 | 91.20 |
| Point → Image | 88.70 | 89.50 | 91.35 |
| Point → Mesh | 89.30 | 90.30 | 91.64 |
| Point → Point | 88.50 | 89.20 | 92.04 |
| Average mAP | 89.72 | 90.29 | 91.46 |
