*5.3. Experiments on Input Modality and Model Ablations*

We first used dataset s15\_o10 to evaluate the performance of various input modalities, encoders, and ensembles. We implemented four single-modality ablations: depth images with the MVCNN encoder, RGB images with the MVCNN encoder, RGB images with the GQN encoder, and PCD with the PointNet encoder. We also implemented five two-modality ablations, in which the RGB images were processed with either the MVCNN or GQN encoder, and two three-modality ensembles that differed only in their RGB encoder.
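The ensembles above combine features from several per-modality encoders before caption generation. A minimal late-fusion sketch of this idea is given below; the encoder internals are stubbed with toy pooling, and all function names and dimensions are hypothetical, not the paper's implementation:

```python
# Late-fusion sketch: each encoder maps its modality to a fixed-size feature
# vector, and the ensemble concatenates the vectors for the caption decoder.
# Encoder bodies are toy stand-ins (names/dimensions are hypothetical).

def encode_rgb(views):        # stand-in for MVCNN/GQN over multiview RGB
    return [sum(col) / len(col) for col in zip(*views)]  # mean-pool across views

def encode_depth(views):      # stand-in for MVCNN over multiview depth
    return [max(col) for col in zip(*views)]             # max-pool across views

def encode_pcd(points):       # stand-in for PointNet: symmetric (order-invariant) pooling
    return [max(coord) for coord in zip(*points)]        # max-pool over points

def fuse(*features):
    """Concatenate per-modality feature vectors into one vector for the decoder."""
    fused = []
    for f in features:
        fused.extend(f)
    return fused

# Toy inputs: two 4-D "views" per image modality, three 3-D points for PCD.
rgb = encode_rgb([[1, 2, 3, 4], [3, 2, 1, 0]])
depth = encode_depth([[0, 1, 0, 1], [1, 0, 1, 0]])
pcd = encode_pcd([[0.1, 0.2, 0.3], [0.3, 0.1, 0.2], [0.2, 0.3, 0.1]])

fused = fuse(rgb, depth, pcd)
assert len(fused) == len(rgb) + len(depth) + len(pcd)
```

The max-pooling in `encode_pcd` mirrors the design choice that makes PointNet-style encoders invariant to point ordering; swapping modalities in or out of `fuse` corresponds to the ablations above.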

Evaluation results on the test split of s15\_o10 in terms of conventional evaluation metrics are shown in the top 11 rows of Table 3. Among the four single-modality ablations, PCD with the PointNet encoder obtained the best performance, and depth with the MVCNN encoder obtained the lowest scores on all metrics. RGB images with the GQN encoder outperformed RGB images with the MVCNN encoder. PCD contains geometric and object edge information, which is an advantage in a task that requires recognizing object changes and detailed object attributes. GQN was trained to obtain a compact scene representation from multiview images, which likely made it better at correlating multiview information than MVCNN. Depth images contain no color information and object shapes are difficult to recover from them, making it challenging to understand a scene change from depth alone. Although depth images alone performed poorly, ensembles containing RGB (MVCNN or GQN encoder) and depth images outperformed RGB images alone. We attribute this to the geometric information in depth images, which is difficult to extract from RGB images. The two three-modality ensembles outperformed all single-modality ablations and all two-modality ensembles composed of their subsets. Scene change captioning performance can thus be enhanced by using both geometric and RGB information.
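The conventional captioning metrics in Table 3 (e.g., BLEU) measure n-gram overlap between a generated caption and the reference. As a minimal illustration, the sketch below computes clipped unigram precision, the core of BLEU-1 with the brevity penalty omitted; it is not the paper's exact evaluation code:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: each candidate word counts only as often
    as it appears in the reference (the core of BLEU-1, no brevity penalty)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

# "was" has no match in the reference, so 4 of 5 candidate words are credited.
p = unigram_precision("the cyan minicar was moved",
                      "the cyan minicar moved to the left")
# p == 0.8
```

Full BLEU additionally averages higher-order n-gram precisions and applies a brevity penalty, but the clipping step shown here is what prevents a caption from gaming the metric by repeating reference words.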

The caption correctness evaluation results are shown in the top 11 rows of Table 4. The ablation with RGB input (GQN encoder) outperformed the PCD ablation in terms of object correctness, whereas PCD obtained the highest change type prediction accuracy among the single-modality models. The abundant geometric and edge information in PCD is beneficial for change type prediction. We also found that models combining RGB with PCD (or depth) input obtained higher object correctness than the single-modality models. This result indicates that combining geometric and RGB information leads to a better understanding of detailed object information, which is critical for obtaining high performance in this task.
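The correctness evaluation in Table 4 checks whether a generated caption names the right object attributes and the right change type. A toy checker in that spirit is sketched below; the attribute and change-type vocabularies are hypothetical placeholders, not the dataset's actual label sets:

```python
# Hypothetical label sets for illustration only.
CHANGE_TYPES = {"moved", "added", "removed", "rotated", "replaced"}
ATTRIBUTES = {"cyan", "red", "blue", "minicar", "mug", "chair"}

def extract(caption):
    """Pull attribute words and change-type words out of a caption."""
    words = set(caption.lower().split())
    return words & ATTRIBUTES, words & CHANGE_TYPES

def score(generated, reference):
    """Object correctness: attribute sets match; change correctness: change types match."""
    gen_attr, gen_change = extract(generated)
    ref_attr, ref_change = extract(reference)
    return {"object_correct": gen_attr == ref_attr,
            "change_correct": gen_change == ref_change}

s = score("the cyan minicar was moved", "the cyan minicar moved to the left")
# both the attributes {cyan, minicar} and the change type {moved} match
```

Separating the two checks makes it possible to observe the asymmetry reported above: a model can name the correct object yet miss the change type, or vice versa.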

**Table 3.** Results of evaluation using s15\_o10 (top 11 rows), s15\_o85 (middle 6 rows), and s100\_o10 (bottom 6 rows). The highest scores are shown in bold.


**Table 4.** Results of change caption correctness evaluation using s15\_o10 (top 11 rows), s15\_o85 (middle 6 rows), and s100\_o10 (bottom 6 rows). The highest scores are shown in bold.


We show two example results in Figure 6. For the first example (object moved), all ensembles predicted the change captions correctly. The ablations with depth images and with RGB images (MVCNN encoder) correctly identified the related object attributes (cyan minicar) but failed to predict the correct change type (move). In contrast, the single-modality ablation with PCD predicted the correct change type but the wrong color. For the second example (distractor with no object change), all models with PCD and the ensemble with RGB (GQN encoder) and depth images produced correct change captions, whereas the other models predicted wrong change captions. These results indicate that combining modalities enhances performance and that PCD provides geometric information that is beneficial for predicting change type.

**Figure 6.** Example results for s15\_o10. From the top row: before-change RGB images observed from eight virtual cameras for first example; after-change RGB images; ground truth and generated captions for various models; before-change RGB images for second example; after-change RGB images (distractor with only camera position transformation); ground truth and generated captions. Correct captions are shown in green and false captions are shown in red.
