*5.4. Experiments on Dataset Complexity (Object Class Number)*

In this experiment, we evaluated model performance under a more complex set-up, s15\_o85, in which the number of object instances was 85 (compared with 10 in s15\_o10). We conducted experiments on four single-modality and two three-modality ablations.

The results for the conventional evaluation metrics are shown in the middle six rows of Table 3. All ablations performed worse than on s15\_o10. Notably, the performance of models with RGB images and the GQN encoder degraded significantly, which indicates that the GQN encoder is less robust to scene set-ups with high complexity. As with s15\_o10, PCD with PointNet performed the best among the single-modality inputs, and the ensembles outperformed the single modalities.

The caption correctness evaluation is shown in the middle six rows of Table 4. In terms of object correctness, all modalities performed worse than on s15\_o10. However, models with PCD, including both single modalities and ensembles, tended to be more robust for change type prediction. This result indicates that geometry and edge information makes change type prediction more consistent.
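To make the distinction between object correctness and change type correctness concrete, the following is a minimal sketch of how the two could be scored separately. It assumes each caption has already been reduced to a (change type, object) pair; the `ChangeLabel` class, the `caption_correctness` function, and the toy labels are hypothetical illustrations, not the evaluation code or data behind Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeLabel:
    """Hypothetical structured form of a change caption."""
    change_type: str  # e.g. "delete" (label vocabulary is an assumption)
    obj: str          # object identifier (names below are made up)


def caption_correctness(preds, truths):
    """Return the fraction of captions with a correct change type,
    a correct object, and both correct at once."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    n = len(truths)
    if n == 0:
        raise ValueError("at least one caption pair is required")

    pairs = list(zip(preds, truths))
    type_hits = sum(p.change_type == t.change_type for p, t in pairs)
    obj_hits = sum(p.obj == t.obj for p, t in pairs)
    both_hits = sum(p == t for p, t in pairs)
    return {
        "change_type": type_hits / n,
        "object": obj_hits / n,
        "both": both_hits / n,
    }


# Toy data only: four predictions scored against the same ground truth.
truth = ChangeLabel("delete", "mug")
preds = [
    ChangeLabel("add", "mug"),       # wrong change type, right object
    ChangeLabel("delete", "chair"),  # right change type, wrong object
    ChangeLabel("delete", "mug"),    # fully correct
    ChangeLabel("move", "mug"),      # wrong change type, right object
]
scores = caption_correctness(preds, [truth] * len(preds))
print(scores)
```

Scoring the two components separately is what allows a model to be robust in change type prediction while still degrading in object correctness, as described above.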

We show one example result (object deleted) in Figure 7 (top). All single modalities failed to give correct captions, whereas the two ensembles predicted the correct caption. The single-modality models with depth or RGB images gave the wrong change type, whereas the model with PCD correctly predicted the change type (delete). s15\_o85 is more challenging than s15\_o10 because it includes more objects (85 vs. 10). These results suggest that combining different modalities is effective for handling datasets with relatively high complexity.

**Figure 7.** Example results for s15\_o85 and s100\_o10. From the top row: before-change RGB images observed from eight virtual cameras for the example from s15\_o85; after-change RGB images; ground truth and generated captions for various models; before-change RGB images for the example from s100\_o10; after-change RGB images; ground truth and generated captions. Correct captions are shown in green and incorrect captions are shown in red.
