Article

TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding

1  Department of Communications Engineering, Graduate School of Engineering, Tohoku University, Sendai 980-8579, Japan
2  RIKEN AIP, Tokyo 103-0027, Japan
*  Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2024, 24(19), 6166; https://doi.org/10.3390/s24196166
Submission received: 5 August 2024 / Revised: 11 September 2024 / Accepted: 21 September 2024 / Published: 24 September 2024
(This article belongs to the Special Issue Object Detection via Point Cloud Data)

Abstract

Three-dimensional (3D) Scene Understanding achieves environmental perception by extracting and analyzing point cloud data, with wide applications including virtual reality, robotics, etc. Previous methods align 2D image features from a pre-trained CLIP model with 3D point cloud features to obtain open-vocabulary scene understanding ability. We argue that existing methods have two deficiencies: (1) the 3D feature extraction process ignores the challenges of real scenarios, i.e., point cloud data are often sparse or even incomplete; and (2) the training stage lacks direct text supervision, leading to inconsistency with the inference stage. To address the first issue, we employ a Masked Consistency training policy: during the alignment of 3D and 2D features, we mask some 3D features to force the model to understand the entire scene using only partial 3D features. For the second issue, we generate pseudo-text labels and align them with the 3D features during training. In particular, we first generate a description for each 2D image belonging to the same 3D scene and then use a summarization model to fuse these descriptions into a single scene-level description. Subsequently, we align 2D–3D features and 3D–text features simultaneously during training. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches.

1. Introduction

Three-dimensional (3D) Scene Understanding is a fundamental computer vision task whose aim is to perform point-wise classification on input point cloud data. It has various applications, including virtual reality, robot manipulation, and human–machine interaction [1,2]. Traditional fully supervised methods [3,4] rely on fully annotated data, which makes labeling time-consuming and leaves the models unable to recognize unseen categories. To reduce the heavy annotation costs, some methods [5,6] divide datasets into seen and novel classes for training and testing, respectively, as shown in Figure 1a. However, these methods can still only work with fixed categories; more specifically, they need to know what the novel classes are and cannot handle queries with arbitrary text input. Open-vocabulary 3D Scene Understanding is proposed to address these problems by enabling the model to recognize arbitrary input text rather than only closed-set dataset categories. The diversity of potential queries makes this task challenging and has prompted many studies [7,8,9] to explore it.
The existing state-of-the-art method OpenScene [7] leverages CLIP [10] to distill knowledge into the 3D domain via contrastive learning. CLIP is a model pre-trained on a large number of image–text pairs; OpenScene aligns the 3D features with 2D features extracted from the CLIP model, establishing a connection between 3D data and text, as illustrated in Figure 1b. OpenScene does not require any 3D–text data pairs for training, thus significantly reducing annotation costs.
After dissecting this previous work, we identified two shortcomings. Firstly, the 3D feature extraction process is not robust enough: during the training of OpenScene, the features of the complete 3D point cloud are used, whereas in real scenarios the point clouds of scenes are incomplete and partially missing. Secondly, OpenScene connects 3D features and text by aligning 2D and 3D features, lacking direct supervision from the text. Both limitations constrain the performance of OpenScene. As shown in Figure 1c, to address the first problem, we use Masked Consistency (MC, refer to Section 3.2) during training, where part of the features are randomly masked to force the model to understand the global scene using only partial 3D features. For the second problem, we first generate text pseudo labels for the point cloud and use them to assist the training process; this is the Textual Alignment method (TA, refer to Section 3.3), which allows the text to directly supervise the training of the 3D feature extractor.
In summary, the main contributions of this paper are three-fold:
  • We propose Masked Consistency training, making the extracted features more robust for real-world 3D Scene Understanding tasks.
  • We use generated text pseudo labels to assist the training process (Textual Alignment), enabling better interaction between the 3D features and the text features.
  • We conduct extensive experiments with the proposed method and achieve new state-of-the-art performance, demonstrating its effectiveness.

2. Related Work

2.1. Three-Dimensional Scene Understanding

Three-dimensional Scene Understanding focuses on understanding the semantic meaning of objects and environments from point clouds, with 3D semantic segmentation being a fundamental perception task for Scene Understanding. For semantic feature extraction and prediction from point clouds, existing methods have designed custom point convolutions applied to the raw point cloud [11,12,13]. Alternatively, some methods transform point clouds into 3D grids (voxels) and employ sparse convolutions [14], voxel-tailored networks [15], or transformers [16] for feature extraction. Despite achieving outstanding performance on closed-set benchmark datasets, these methods often struggle in open-world settings, i.e., when recognizing unseen categories that were not annotated during training.

2.2. Two-Dimensional (2D) Scene Understanding

Two-dimensional (2D) Scene Understanding models are designed to allow users to interact with input images through text. A pioneering work in this field is CLIP [10], which is trained on a vast dataset of image–text pairs, enabling it to understand the relationship between visual content and linguistic descriptions. CLIP's versatility allows it to perform various tasks, such as zero-shot image classification and cross-modal retrieval, without task-specific fine-tuning. To ensure good generalization, CLIP requires training on a large number of image–text pairs. Recently, various methods have been proposed to augment such models, including prompt learning [17,18], module fine-tuning [3,19], and knowledge distillation [20,21]. These works enable open-world tasks such as segmentation [3,19,22], detection [20,21], and more.

2.3. Zero-Shot Learning for 3D Point Clouds

Inspired by the success of 2D Scene Understanding models [10,23], some works have adopted similar strategies for zero-shot learning on 3D point clouds, which aims to classify 3D points without ground-truth labels. For instance, PLA [5] first uses a caption model to generate pseudo textual descriptions for the 2D images corresponding to 3D point clouds and then aligns these descriptions with 3D features in a contrastive learning fashion. However, challenges such as inaccurate predictions persist due to the coarse language supervision. RegionPLC [6] was proposed to address this; it introduces a pre-trained detection model to densely detect objects in the input 2D images, thereby obtaining richer text descriptions. However, during the training process, both PLA and RegionPLC divide the dataset into seen classes and unseen classes; ground truth is available for the former, while the latter uses the aforementioned methods to generate pseudo labels. The model then achieves point-wise classification through supervised training on both seen and unseen classes. Since these methods still rely on supervised training with 3D ground-truth labels (seen classes) or pseudo labels (unseen classes), they can only segment pre-defined classes. OpenScene [7] enables class-agnostic classification without requiring labeled 3D data. It first uses a 3D feature extractor to obtain embeddings for each 3D point, and then a frozen image encoder from the pre-trained CLIP is applied to generate 2D embeddings corresponding to each 3D point. OpenScene aligns the 2D and 3D features through contrastive learning. Since the 2D and text features have already been aligned during CLIP's training process, after OpenScene's 2D–3D alignment the 3D features also establish a connection with text, thereby enabling point-wise classification. However, OpenScene does not consider the inherent sparsity of point cloud data, and it also lacks direct supervision from textual information during training, resulting in suboptimal performance. The proposed method alleviates these two problems by introducing a masking policy and pseudo-textual descriptions during training, thereby achieving better scene understanding capabilities.

3. Method

Open-vocabulary 3D Scene Understanding models such as OpenScene use images as supervision, allowing the training process to proceed without any semantic-level annotations. However, the lack of textual information during training results in sub-optimal performance. In contrast, PLA introduces a language-driven training process to achieve open-vocabulary recognition, but its requirement for predefined categories limits the model's scalability. The proposed method therefore combines the strengths of multiple modalities (images, point clouds, and text) to train a model with enhanced robustness and performance that does not require predefined categories. Our framework builds upon OpenScene, so we first revisit it.

3.1. Revisiting OpenScene

As illustrated in Figure 1b, OpenScene first extracts point-wise 3D features $F^{3D}$ and 2D features $F^{2D}$ of a 3D scene from unlabeled 3D and 2D training data, respectively. We obtain $F^{3D}$ via:

$$F^{3D} = \epsilon_{\theta}^{3D}(P) \tag{1}$$

where $P \in \mathbb{R}^{M \times 3}$ represents the input point cloud of a scene with $M$ points, $\epsilon_{\theta}^{3D}: \mathbb{R}^{M \times 3} \to \mathbb{R}^{M \times C}$ is a trainable 3D encoder (MinkowskiNet18A [15]) producing per-point features, $C$ is the feature dimension, and $F^{3D} = \{f_1^{3D}, \ldots, f_M^{3D}\}$ denotes the 3D features of the $M$ points.

For each 3D point, the corresponding 2D pixels can be computed from the camera intrinsic matrix and the world-to-camera extrinsics. A frozen 2D encoder $\epsilon_{\theta}^{2D}$ extracts multi-view pixel-wise features $f_1, \ldots, f_K$ (with $K$ pixels associated with one point), and an average pooling operator $\mathcal{P}$ fuses them: $f^{2D} = \mathcal{P}(f_1, \ldots, f_K)$. Repeating this fusion for every point yields

$$F^{2D} = \{f_1^{2D}, \ldots, f_M^{2D}\} \in \mathbb{R}^{M \times C} \tag{2}$$

where $M$ is the number of points in the input point cloud, so $F^{2D}$ has the same length as $F^{3D}$ and the 2D features can be aligned with the 3D features. Note that OpenScene utilizes LSeg [3], a CLIP variant fine-tuned for the pixel-wise classification task; we follow the same setting for fair comparisons. The final step aligns the 2D and 3D modalities via contrastive learning between $F^{2D}$ and $F^{3D}$ with the objective function:

$$\mathcal{L} = 1 - \cos(F^{2D}, F^{3D}) \tag{3}$$

where $\cos$ denotes the cosine similarity. Since the 2D features $F^{2D}$ have already been aligned with the textual features $F^{Text}$ during CLIP pre-training, aligning the 3D features $F^{3D}$ with $F^{2D}$ via Equation (3) also makes $F^{3D}$ consistent with $F^{Text}$. This allows the model to achieve open-vocabulary scene understanding in 3D scenarios during testing, as shown in Figure 1b.
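To make the 2D–3D correspondence concrete, the following is a minimal sketch (not the authors' released code) of the projection and multi-view average pooling behind Equation (2). It assumes pinhole intrinsics, world-to-camera extrinsics, and per-view feature maps already produced by the frozen 2D encoder; all function and variable names are illustrative.

```python
import torch

def project_points(points, K, w2c):
    """Project M world-space points into one view.
    points: (M, 3); K: (3, 3) intrinsics; w2c: (4, 4) world-to-camera extrinsics."""
    M = points.shape[0]
    homo = torch.cat([points, torch.ones(M, 1, dtype=points.dtype)], dim=1)  # (M, 4)
    cam = (w2c @ homo.T).T[:, :3]                    # camera-space coordinates
    uvz = (K @ cam.T).T                              # perspective projection
    uv = uvz[:, :2] / uvz[:, 2:3].clamp(min=1e-6)    # pixel coordinates (u, v)
    return uv, cam[:, 2]                             # depth is used for a visibility check

def fuse_multiview_features(points, feat_maps, Ks, w2cs):
    """Average-pool 2D features over all views in which each point is visible (Equation (2)).
    feat_maps: list of (C, H, W) per-view feature maps from the frozen 2D encoder."""
    M, C = points.shape[0], feat_maps[0].shape[0]
    acc, cnt = torch.zeros(M, C), torch.zeros(M, 1)
    for feat, K, w2c in zip(feat_maps, Ks, w2cs):
        _, H, W = feat.shape
        uv, z = project_points(points, K, w2c)
        u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[valid] += feat[:, v[valid], u[valid]].T  # gather the pixel feature of each visible point
        cnt[valid] += 1
    return acc / cnt.clamp(min=1)                    # fused per-point 2D features F^2D
```

A full pipeline would additionally check occlusion against the depth map of each view; that step is omitted here for brevity.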

3.2. Three-Dimensional Feature for Masked Consistency Training

Point clouds in the real world are often sparse or even partially missing, so the model needs to understand the complete scene from incomplete 3D features. OpenScene fails to consider this characteristic of 3D data, leading to sub-optimal performance. Motivated by this observation, we propose a new training policy, named Masked Consistency (MC), to improve the model. Without bells and whistles, we randomly mask some features of $F^{3D}$ to obtain a new representation $F_{Mask}^{3D}$:

$$F_{Mask}^{3D} = F^{3D} \odot M_r \tag{4}$$

where $M_r$ is a randomly generated binary mask with the same spatial size as $F^{3D}$, $r \in [0, 1]$ is the masking ratio, i.e., the proportion of zeros (e.g., $M_{0.6}$ is a mask in which 60% of the entries are randomly set to zero and the remainder to one), and $\odot$ denotes the Hadamard product. The proposed MC encourages the model to predict complete scene information from partial point cloud features, making the trained model more generalized and robust. We validate the influence of different masking ratios on model performance in Section 4.3.4.
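A minimal sketch of the random masking in Equation (4) is shown below, assuming the per-point features are stored as an (M, C) tensor; whether the mask is drawn per point (as here) or per element is an implementation detail we assume rather than quote from the released code.

```python
import torch

def mask_features(f3d: torch.Tensor, r: float) -> torch.Tensor:
    """Randomly zero out a fraction r of the point-wise 3D features (Equation (4)).
    f3d: (M, C) features from the 3D encoder; r: masking ratio in [0, 1].
    The mask is drawn per point and broadcast over channels (an assumption)."""
    keep = (torch.rand(f3d.shape[0], 1, device=f3d.device) >= r).to(f3d.dtype)  # 1 = keep, 0 = mask
    return f3d * keep  # Hadamard product with M_r

# Example: mask 95% of the features, the best-performing ratio in Table 3.
f3d = torch.randn(2048, 512)
f3d_masked = mask_features(f3d, r=0.95)
```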

3.3. Text Feature for Textual Alignment Training

OpenScene aligns the features of 2D and 3D data during training. Specifically, the 3D representation uses the 2D data as an intermediary to connect with the text, thereby acquiring scene understanding capabilities. This training workflow lacks direct interaction between the 3D data and the text, leading to sub-optimal performance. To address this issue and enhance the model's effectiveness, we first generate corresponding pseudo-text features for the 3D point clouds and then directly align the 3D point clouds with these pseudo-text features.
We borrow the workflow from PLA [5] to generate pseudo-text descriptions for point clouds. Specifically, for a 3D scene paired with multi-view RGB images, we use ViT-GPT2 [24], a pre-trained image captioning model, to generate a textual description for each image:

$$t_i = G_{cap}(I_i) \tag{5}$$

where $G_{cap}$ is the pre-trained captioning model, and $I_i$ and $t_i$ are the $i$-th 2D input image and the corresponding generated caption, respectively.

Then, a pre-trained text summarizer, BART [25], is used to summarize all captions $t_i$ of a scene:

$$t = G_{sum}(t_1, \ldots, t_i, \ldots, t_N) \tag{6}$$

where $G_{sum}$ is the pre-trained caption summarizer, $t$ is the final scene-level caption describing the entire scene, and $N$ is the number of frames in the scene.
The captioner $G_{cap}$, as shown in Figure 2, has been pre-trained on a massive number of image–text pairs and exhibits strong generalization capabilities. As a result, it can generate rich semantic descriptions for 2D images, which include attributes of entities as well as the relationships between them. For example, in $t_1$, "a kitchen with a wooden table and chairs" corresponds to the first image caption. In this description, "wooden" describes the material, the article "a" and the plural suffix "-s" indicate quantity, "kitchen" specifies the room type, and "with" expresses the spatial relationship. Compared to simply representing visual information with single entities like "table" and "chairs", such a complete description provides richer semantic information and interaction relationships. Moreover, by summarizing the multi-view descriptions with $G_{sum}$, we retain the effective information and avoid potential semantic conflicts among multi-view descriptions. We validate the effectiveness of scene-level text descriptions for model performance in Section 4.3.
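As a rough illustration of Equations (5) and (6), the sketch below chains the ViT-GPT2 captioner [24] and a BART summarizer through Hugging Face pipelines. The specific summarization checkpoint ('facebook/bart-large-cnn') and the way captions are concatenated before summarization are our assumptions, not the authors' exact configuration.

```python
from transformers import pipeline

# G_cap: pre-trained image captioning model (reference [24]).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
# G_sum: pre-trained text summarizer; this particular BART checkpoint is an assumption.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def scene_caption(image_paths):
    """Generate per-view captions t_i and fuse them into a single scene-level caption t."""
    captions = [captioner(p)[0]["generated_text"] for p in image_paths]   # t_1, ..., t_N
    joined = ". ".join(captions)
    summary = summarizer(joined, max_length=60, min_length=10, do_sample=False)
    return summary[0]["summary_text"]                                     # scene-level caption t
```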
We then use the text encoder of a pre-trained CLIP [10], denoted $\epsilon_{\theta}^{Text}$, to encode the final caption and obtain the text feature $F^{Text}$:

$$F^{Text} = \epsilon_{\theta}^{Text}(t) \in \mathbb{R}^{512} \tag{7}$$

where $\epsilon_{\theta}^{Text}$ is frozen during training and testing, as shown in Figure 3. During training (Figure 3a), we use the scene-level caption generated in Figure 2 as the input of $\epsilon_{\theta}^{Text}$. For testing (Figure 3b), we prompt the input query with the template "a XX in a scene" and feed it into $\epsilon_{\theta}^{Text}$. For example, we use "a sofa in a scene" as the text input to identify the points of the sofa.
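At test time, the text branch can be sketched with OpenAI's CLIP package as follows. The CLIP variant (ViT-B/32, whose text features are 512-dimensional) is an assumption made to match Equation (7); the released implementation follows LSeg's text encoder and may differ.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # 512-dimensional text features

@torch.no_grad()
def encode_queries(class_names):
    """Encode open-vocabulary queries with the frozen CLIP text encoder, using the
    prompt template from the paper: 'a XX in a scene'."""
    prompts = [f"a {name} in a scene" for name in class_names]
    tokens = clip.tokenize(prompts).to(device)
    feats = model.encode_text(tokens)            # (num_queries, 512)
    return feats / feats.norm(dim=-1, keepdim=True)

# Point-wise classification: assign each point to the query with the highest cosine similarity,
# where f3d holds (M, 512) normalized features from the trained 3D encoder (illustrative).
# logits = f3d @ encode_queries(["sofa", "table", "door"]).T
# pred = logits.argmax(dim=-1)
```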

3.4. Textual Alignment and Masked Consistency via Contrastive Learning

As shown in Figure 3, after generating the multi-modal features, including the 2D image features $F^{2D}$, the text feature $F^{Text}$, and the masked 3D point cloud features $F_{Mask}^{3D}$, we align them through contrastive learning with the following objective functions:

$$\mathcal{L}_1 = 1 - \cos(F^{2D}, F_{Mask}^{3D}) \tag{8}$$

$$\mathcal{L}_2 = 1 - \cos(F^{Text}, F_{Mask}^{3D}) \tag{9}$$

$$\mathcal{L}_{TAMC} = \mathcal{L}_1 + \alpha \mathcal{L}_2 \tag{10}$$

where $\cos$ is the cosine similarity. $\mathcal{L}_1$ aligns the masked 3D features $F_{Mask}^{3D}$ with the 2D features $F^{2D}$, and $\mathcal{L}_2$ aligns $F_{Mask}^{3D}$ with the text feature $F^{Text}$. $\alpha$ is a weight that balances the losses of the different modalities; we discuss its influence in Section 4.3.2. Note that $\epsilon_{\theta}^{2D}$ and $\epsilon_{\theta}^{Text}$ are frozen during the training process and only $\epsilon_{\theta}^{3D}$ is trainable.
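A compact sketch of the combined objective in Equations (8)–(10) is given below; averaging the per-point cosine similarities and broadcasting the single scene-level text feature over all points are assumptions about details the paper does not spell out.

```python
import torch
import torch.nn.functional as F

def tamc_loss(f2d, f3d_masked, f_text, alpha=0.05):
    """L_TAMC = L_1 + alpha * L_2 (Equations (8)-(10)).
    f2d, f3d_masked: (M, C) per-point features; f_text: (C,) scene-level text feature."""
    # L_1: align masked 3D features with the fused 2D features, point by point.
    l1 = (1 - F.cosine_similarity(f2d, f3d_masked, dim=-1)).mean()
    # L_2: align masked 3D features with the scene-level text feature (broadcast over points).
    l2 = (1 - F.cosine_similarity(f_text.unsqueeze(0), f3d_masked, dim=-1)).mean()
    return l1 + alpha * l2  # alpha = 0.05 performed best in Table 2
```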
Because the image and text features are open-vocabulary, the output of the refined 3D model naturally lies in the same feature space after the multi-modal alignment. Therefore, TAMC does not require predefined categories as PLA [5] does; this is evident from the experimental setup, where we do not distinguish between base and novel classes as PLA does, and any arbitrary text input is acceptable. Moreover, the joint text–image feature space of $F^{3D}$ allows for 3D scene-level understanding given any textual prompt. Compared to OpenScene, TAMC's semantic understanding capability is superior, as validated in Section 4.2 through comparisons with state-of-the-art methods. Additionally, the Masked Consistency strategy enhances the model's recognition of irregular objects, as demonstrated in Section 4.3.3.

4. Experiments

4.1. Setups

4.1.1. Dataset

We use ScanNet [26] as the benchmark dataset, which contains indoor scene data annotated with 20 classes for the point-wise labeling task.

4.1.2. Metrics

We calculate the class-wise intersection over union (IoU) and accuracy (Acc.) and report their mean values, i.e., mean IoU (mIoU) and mean accuracy (mAcc), for evaluation.
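For reference, a minimal sketch of the class-wise IoU and accuracy computation from a confusion matrix; it follows the standard definitions rather than the authors' evaluation script.

```python
import numpy as np

def miou_macc(pred, gt, num_classes=20):
    """Compute mIoU and mAcc from point-wise predictions and ground-truth labels."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)                        # rows: ground truth, columns: prediction
    tp = np.diag(conf).astype(np.float64)
    iou = tp / (conf.sum(0) + conf.sum(1) - tp + 1e-12)   # per-class IoU = TP / (TP + FP + FN)
    acc = tp / (conf.sum(1) + 1e-12)                      # per-class accuracy = TP / (TP + FN)
    return iou.mean(), acc.mean()
```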

4.1.3. Model Structure

We keep identical settings to OpenScene for fair comparisons: LSeg [3], a CLIP variant for the image segmentation task, is utilized as the 2D model. LSeg consists of an image encoder and a text encoder, which generate image and text features, respectively. For the 3D encoder, we use MinkowskiNet18A [15].

4.2. Results

In Table 1, we evaluate the proposed method on the ScanNet [26] validation set and compare its performance against both fully supervised and zero-shot baselines. To ensure a fair comparison, we reproduce the results for OpenScene based on the official code released by the authors (https://pengsongyou.github.io/openscene (accessed on 20 September 2024)). Among zero-shot methods, TAMC achieves state-of-the-art performance, with mIoU and mAcc scores of 51.9% and 63.4%, respectively. These scores surpass those of the OpenScene baseline by 1.3 and 1.2 percentage points, demonstrating the effectiveness of the proposed method. Compared with fully supervised methods, TAMC outperforms earlier ones [27], but there is still a significant gap to the current state of the art [28]. This indicates that zero-shot methods have considerable potential for further progress in 3D understanding tasks.
Visual comparisons of semantic segmentation are shown in Figure 4. In the top row, the result highlighted in the red box indicates that TAMC predicts the bathtub (marked in pink) more accurately and completely. In the second row, a similar phenomenon can be observed, where our prediction for the door is more precise. In the third row, the proposed method corrects the wrong prediction of the baseline (the red door). These results suggest that, by incorporating the proposed Textual Alignment and Masked Consistency via contrastive learning, we achieve better and more robust scene understanding capabilities, leading to improved performance.

4.3. Ablation Studies and Analysis

4.3.1. Influence of Textual Alignment Training

To investigate the impact of Textual Alignment (TA) on model performance, we conducted comparative experiments. As shown in Table 2, the baseline (0) represents the original OpenScene, which achieves an mIoU score of 50.6%. By incorporating the 3D–text loss $\mathcal{L}_2$ and combining it with the 2D–3D loss $\mathcal{L}_1$ using the weight $\alpha$ (as detailed in Equation (10)), we observed a notable improvement in the model's performance. Note that, in order to exclude the influence of the proposed Masked Consistency training, we utilize the full 3D features without masking and align them with both the text and image features.

4.3.2. Balance Weights of Textual Alignment Training

We introduced the parameter $\alpha$ to balance the 2D–3D loss $\mathcal{L}_1$ and the 3D–text loss $\mathcal{L}_2$, which influences the model's final performance. As illustrated in Table 2, we experimented with different values of $\alpha$, finding that the model achieves optimal performance when $\alpha = 0.05$.

4.3.3. Influence of Masked Consistency Training

We discuss the impact of the proposed Masked Consistency (MC) in this section; the experimental results are shown in the MC column of Table 3. MC boosts performance under almost all masking ratios: except for the ratio of 0.1, every masking ratio leads to an improvement of 0.2 to 1.0 points in the mIoU score. The improvement stems from the proposed method's better understanding of sparse 3D data. Specifically, MC forces the model to understand the complete scene from only a partial point cloud during training, enabling it to exploit richer information during inference.

4.3.4. Masking Ratio Selection for Masked Consistency Training

For Masked Consistency (MC), varying the masking ratio causes performance shifts, according to the results shown in Table 3. In particular, when the masking ratio is set to 0.95, MC achieves the best mIoU score of 51.6%, which is 1.0 percentage points higher than the baseline model. When we set the masking ratio to 0.1, the mIoU score is 50.6%, on par with the OpenScene baseline. The experimental results indicate that larger masking ratios lead to greater improvements, which is consistent with observations in previous masked training policies for images [34], where large masking ratios are often used. Please note that we do not mask the 3D point clouds directly. Instead, we mask the 3D features $F^{3D}$, i.e., the output of $\epsilon_{\theta}^{3D}$ in Figure 3. $F^{3D}$ has the same spatial size as the input point cloud; this policy simplifies the engineering implementation and avoids potential misalignment during data preprocessing. Identical masking ratios may lead to different occlusions of the scene features because of the random masking strategy, thus introducing slight fluctuations in performance. With our masking policy, each embedding corresponds to a point but actually aggregates information from other points as well; thus, a masking ratio of 0.99 does not imply that merely 1% of the raw input 3D data are available, but rather that we only retain 1% of the extracted feature information. There are minor fluctuations (±0.2) in the mIoU score when varying the masking ratio between 0.6 and 0.95. This stability across different masking ratios highlights the robustness and adaptability of our method and aligns with the findings of MAE [34], a pre-training method designed for 2D images, whose fluctuations on the classification task are only ±0.3 when the masking ratio changes between 0.4 and 0.8.

4.3.5. The Training Policy

TAMC adopts a multi-stage training pipeline for optimal performance. We conducted several ablation experiments to study the order of the proposed Masked Consistency (MC) and Textual Alignment (TA). We define two training strategies: (1) MC → TA, in which we first train the model using MC and then fine-tune it with a combination of the 2D–3D and text–3D alignment losses (Equation (10)); (2) TA → MC, the opposite training order of (1). As shown in Table 3, the performance of TA → MC consistently surpasses that of MC → TA across different masking ratios. Specifically, with a masking ratio of 0.95 and the TA → MC strategy, the model achieves an mIoU score of 51.9%, while the best result for MC → TA is only 51.0%. Therefore, we choose the second training strategy as the final one, as sketched below.
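The chosen TA → MC schedule can be sketched as two sequential stages, reusing the tamc_loss and mask_features helpers sketched above. The epoch counts, the optimizer handling, and the assumption that the MC fine-tuning stage keeps only the masked 2D–3D alignment term are placeholders, not the paper's actual training recipe.

```python
import torch.nn.functional as F

def train_ta_then_mc(encoder_3d, loader, optimizer, stage1_epochs=50, stage2_epochs=20,
                     alpha=0.05, mask_ratio=0.95):
    """Two-stage TA -> MC schedule (the better order in Table 3).
    `loader` is assumed to yield (points, f2d, f_text) batches with pre-computed 2D/text features."""
    for _ in range(stage1_epochs):                   # Stage 1 (TA): full features, Equation (10)
        for points, f2d, f_text in loader:
            loss = tamc_loss(f2d, encoder_3d(points), f_text, alpha)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    for _ in range(stage2_epochs):                   # Stage 2 (MC): fine-tune with masked features
        for points, f2d, f_text in loader:
            f3d_masked = mask_features(encoder_3d(points), r=mask_ratio)
            # Assumed: only the masked 2D-3D alignment term in this stage.
            loss = (1 - F.cosine_similarity(f2d, f3d_masked, dim=-1)).mean()
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return encoder_3d
```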

4.3.6. The Analysis of Per-Class Results

Table 4 shows the results of class-wise IoU; our method demonstrates superior performance compared to the baseline in most categories, particularly in those with irregular shapes like chairs and other furniture. It also works well on categories that may appear irregular due to overlap with other objects, such as floors, walls, cabinets, tables, and bookshelves. However, it also shows sub-optimal performance in some categories. On one hand, this might be attributed to the scarcity of training data; for example, classes like ‘picture’ and ‘toilet’ occupy a very small proportion of the overall training dataset. On the other hand, some categories of ScanNet exhibit high semantic similarity, such as ‘curtain’ and ‘shower curtain,’ which poses significant challenges for precise point-wise classification.

5. Conclusions

In this paper, we proposed a novel contrastive learning framework based on Textual Alignment (TA) and Masked Consistency (MC) training for the open-vocabulary 3D Scene Understanding task. These components address the problems of existing methods, i.e., the lack of direct textual supervision and the neglect of the sparsity of 3D data, respectively. The framework masks part of the 3D features and then aligns them with 2D features and textual features through contrastive learning. The training process of TAMC incorporates direct interaction between 3D features and textual information while also forcing the model to understand the entire scene from partial 3D information. The proposed method therefore has better scene understanding capabilities and robustness, resulting in improved performance. Our experiments demonstrate the effectiveness of the proposed method, which outperforms the baseline and achieves a new state of the art. The proposed method currently generates textual information for the entire scene, which is coarse for point-wise classification tasks; generating more fine-grained text could be a promising direction for future work.

Author Contributions

Conceptualization, J.W., Z.W. and T.M.; methodology, J.W.; software, J.W. and Z.W.; validation, J.W. and Z.W.; formal analysis, J.W.; investigation, J.W., Z.W. and Y.F.; resources, T.M. and Z.W.; data curation, J.W.; writing—original draft preparation, J.W. and Z.W.; writing—review and editing, T.M. and S.O.; visualization, J.W.; supervision, T.M. and S.O.; project administration, S.O.; funding acquisition, S.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JST SPRING grant number JPMJSP2114.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This paper contains the links to the datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gómez, J.; Aycard, O.; Baber, J. Efficient detection and tracking of human using 3D LiDAR sensor. Sensors 2023, 23, 4720. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, Y.; Müller, S.; Stephan, B.; Gross, H.M.; Notni, G. Point cloud hand–object segmentation using multimodal imaging with thermal and color data for safe robotic object handover. Sensors 2021, 21, 5676. [Google Scholar] [CrossRef] [PubMed]
  3. Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-driven Semantic Segmentation. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  4. Vu, T.; Kim, K.; Luu, T.M.; Nguyen, T.; Yoo, C.D. Softgroup for 3D instance segmentation on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2708–2717. [Google Scholar]
  5. Ding, R.; Yang, J.; Xue, C.; Zhang, W.; Bai, S.; Qi, X. Pla: Language-driven open-vocabulary 3D scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7010–7019. [Google Scholar]
  6. Yang, J.; Ding, R.; Deng, W.; Wang, Z.; Qi, X. Regionplc: Regional point-language contrastive learning for open-world 3D scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 19823–19832. [Google Scholar]
  7. Peng, S.; Genova, K.; Jiang, C.; Tagliasacchi, A.; Pollefeys, M.; Funkhouser, T. Openscene: 3D scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 815–824. [Google Scholar]
  8. Zhu, X.; Zhang, R.; He, B.; Guo, Z.; Zeng, Z.; Qin, Z.; Zhang, S.; Gao, P. Pointclip v2: Prompting clip and gpt for powerful 3D open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 2639–2650. [Google Scholar]
  9. Yuan, H.; Li, X.; Zhou, C.; Li, Y.; Chen, K.; Loy, C.C. Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively. arXiv 2024, arXiv:2401.02955. [Google Scholar]
  10. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  11. Wu, W.; Qi, Z.; Fuxin, L. Pointconv: Deep convolutional networks on 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9621–9630. [Google Scholar]
  12. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  13. Xu, M.; Ding, R.; Zhao, H.; Qi, X. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3173–3182. [Google Scholar]
  14. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224–9232. [Google Scholar]
  15. Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar]
  16. Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; Jia, J. Stratified transformer for 3D point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8500–8509. [Google Scholar]
  17. Du, Y.; Wei, F.; Zhang, Z.; Shi, M.; Gao, Y.; Li, G. Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14084–14093. [Google Scholar]
  18. Feng, C.; Zhong, Y.; Jie, Z.; Chu, X.; Ren, H.; Wei, X.; Xie, W.; Ma, L. Promptdet: Towards open-vocabulary detection using uncurated images. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 701–717. [Google Scholar]
  19. Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; Bai, X. A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model. arXiv 2021, arXiv:2112.14757. [Google Scholar]
  20. Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv 2021, arXiv:2104.13921. [Google Scholar]
  21. Bangalath, H.; Maaz, M.; Khattak, M.U.; Khan, S.H.; Shahbaz Khan, F. Bridging the gap between object and image-level representations for open-vocabulary detection. Adv. Neural Inf. Process. Syst. 2022, 35, 33781–33794. [Google Scholar]
  22. Zhou, C.; Loy, C.C.; Dai, B. Extract free dense labels from clip. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 696–712. [Google Scholar]
  23. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  24. NLP Connect. Vit-Gpt2-Image-Captioning (Revision 0e334c7). 2022. Available online: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning (accessed on 20 September 2024).
  25. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
  26. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  27. Tatarchenko, M.; Park, J.; Koltun, V.; Zhou, Q.Y. Tangent convolutions for dense prediction in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3887–3896. [Google Scholar]
  28. Nekrasov, A.; Schult, J.; Litany, O.; Leibe, B.; Engelmann, F. Mix3D: Out-of-context data augmentation for 3D scenes. In Proceedings of the 2021 International Conference on 3D Vision (3DV), Online, 1–3 December 2021; pp. 116–125. [Google Scholar]
  29. Huang, J.; Zhang, H.; Yi, L.; Funkhouser, T.; Nießner, M.; Guibas, L.J. Texturenet: Consistent local parametrizations for learning from high-resolution signals on meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4440–4449. [Google Scholar]
  30. Dai, A.; Ritchie, D.; Bokeloh, M.; Reed, S.; Sturm, J.; Nießner, M. Scancomplete: Large-scale scene completion and semantic segmentation for 3D scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4578–4587. [Google Scholar]
  31. Schult, J.; Engelmann, F.; Kontogianni, T.; Leibe, B. Dualconvmesh-net: Joint geodesic and euclidean convolutions on 3D meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8612–8622. [Google Scholar]
  32. Hu, Z.; Bai, X.; Shang, J.; Zhang, R.; Dong, J.; Wang, X.; Sun, G.; Fu, H.; Tai, C.L. Vmnet: Voxel-mesh network for geodesic-aware 3D semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15488–15498. [Google Scholar]
  33. Lambert, J.; Liu, Z.; Sener, O.; Hays, J.; Koltun, V. MSeg: A composite dataset for multi-domain semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2879–2888. [Google Scholar]
  34. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
Figure 1. Comparison of previous methods and ours. (a) is the workflow of methods with class limitations, which need labeled 3D point cloud data of seen classes and evaluate the model on unlabeled (novel) classes. In comparison, both (b) OpenScene [7] and (c) our method follow the open-vocabulary setting, meaning that no annotated data are required during training. The difference is that the proposed training process includes supervision from the pseudo text label, and the masking training policy makes the extracted 3D features more robust, resulting in higher accuracy.
Figure 2. Caption generation. Multi-view images are fed into the image captioning model $G_{cap}$ to generate the corresponding captions $t_i$ of a scene with $N$ images; the text-summarization model $G_{sum}$ then summarizes $(t_1, \ldots, t_i, \ldots, t_N)$ to generate a scene-level caption $t$.
Figure 3. Method overview. (a) Training. Given a 3D point cloud, a set of posed images, and a scene-level caption, we train a 3D encoder $\epsilon_{\theta}^{3D}$ to produce masked point-wise 3D features $F_{Mask}^{3D}$ with two losses: $\mathcal{L}_1$ for the $F_{Mask}^{3D}$–$F^{2D}$ alignment and $\mathcal{L}_2$ for the $F_{Mask}^{3D}$–$F^{Text}$ alignment (refer to Section 3.4). (b) Testing. We use the cosine similarity between per-point features and text features to perform open-vocabulary 3D Scene Understanding tasks. 'An XX in a scene' serves as the input text prompt, where 'XX' represents the query text, which adopts a dataset class during the segmentation task.
Figure 4. Qualitative results on ScanNet [26]. From left to right: the 3D input and related 2D image, (a) the result of the baseline method (OpenScene [7]), (b) the proposed method, and (c) the ground-truth segmentation.
Table 1. Comparisons between TAMC and other SOTA methods on ScanNet [26]. † means results reproduced by us.

Method | mIoU | mAcc
Fully supervised methods
TangentConv [27] | 40.9 | -
TextureNet [29] | 54.8 | -
ScanComplete [30] | 56.6 | -
DCM-Net [31] | 65.8 | -
Mix3D [28] | 73.6 | -
VMNet [32] | 73.2 | -
MinkowskiNet [15] | 69.0 | 77.5
Zero-shot methods
MSeg-Voting [33] | 45.6 | 54.4
OpenScene † [7] | 50.6 | 62.2
TAMC | 51.9 | 63.4
Table 2. mIoU scores for various loss weights ($\alpha$ in Equation (10)) applied to Textual Alignment. For these ablation studies, we utilize the full 3D features to eliminate the effect of Masked Consistency.

Loss Weight | mIoU
baseline (0) | 50.6
0.001 | 50.6
0.005 | 51.0
0.01 | 50.4
0.05 | 51.1
0.1 | 49.5
0.5 | 30.1
1.0 | 16.5
Table 3. mIoU scores for different masking ratios. MC means masked multimodal consistency training alone; MC → TA means we first train the model with MC and then fine-tune it with the text–3D alignment loss; TA → MC means we first train the model with the text–3D alignment loss and then fine-tune it with the Masked Consistency training policy.

Masking Ratio | MC | MC → TA | TA → MC
0.1 | 50.6 | 50.0 | 51.3
0.3 | 50.8 | 49.7 | 51.5
0.6 | 51.4 | 49.7 | 51.4
0.9 | 51.5 | 50.5 | 51.7
0.95 | 51.6 | 50.1 | 51.9
0.99 | 51.3 | 51.0 | 51.1
Table 4. Per-class statistical comparison of open-world 3D semantic segmentation on ScanNet in terms of the IoU score. The proposed TAMC achieves higher accuracy than the baseline model across most classes.

Class | Baseline [7] | TAMC
Wall | 72.8 | 73.9
Floor | 86.6 | 88.7
Cabinet | 43.4 | 44.6
Bed | 70.3 | 71.2
Chair | 67.5 | 71.3
Sofa | 65.3 | 63.5
Table | 52.5 | 53.6
Door | 43.4 | 45.4
Window | 49.0 | 51.6
Bookshelf | 63.7 | 65.1
Picture | 19.9 | 17.1
Counter | 34.6 | 41.7
Desk | 45.0 | 45.1
Curtain | 52.0 | 53.7
Fridge | 39.4 | 42.0
Shower Curtain | 0.0 | 0.0
Toilet | 77.6 | 75.1
Sink | 49.6 | 50.3
Bathtub | 57.6 | 60.6
Other Furniture | 21.9 | 24.7

Share and Cite

Wang, J.; Wang, Z.; Miyazaki, T.; Fan, Y.; Omachi, S. TAMC: Textual Alignment and Masked Consistency for Open-Vocabulary 3D Scene Understanding. Sensors 2024, 24, 6166. https://doi.org/10.3390/s24196166
