### *3.1. Overall Framework*

As shown in Figure 2, our framework generates a change caption from a pair of observations taken before and after a scene change. Each scene observation consists of RGB and depth images captured from multiple viewpoints, together with point cloud data (PCD).

**Figure 2.** Overall framework. From before- and after-change observations of a scene, the proposed framework generates a text caption describing the scene change via three components: a scene encoder that encodes the input composed of various modalities, a scene fusion component that combines the before- and after-change features *r*bef and *r*aft, and a caption generator that generates a caption from the output of the scene fusion component.

Our framework comprises three components: a scene encoder, which processes each input modality with a dedicated encoder; a scene fusion component, which combines the features of observations taken before and after a scene change; and a caption generator, which generates a text caption from the fused scene representation. The framework can be extended with additional modalities, such as normal maps, and the scene fusion component and caption generator can likewise be improved by adopting newer approaches. We give the details of the three components in the remaining subsections.

### *3.2. Scene Encoder*

The scene encoder transforms a scene observation, consisting of multiview RGB images, multiview depth images, and PCD, into a feature vector that expresses the semantic and geometric information of the scene. As shown in Figure 2, we first extract feature vectors from the multiview RGB images, the multiview depth images, and the PCD separately with respective encoders; the extracted features are then aggregated via a concatenation operation. For encoding the multiview RGB images, we experimented with two encoder structures, MVCNN and GQN; for the depth images, we used MVCNN.

RGB/depth encoder (MVCNN): This network is adapted from Su et al. [38]. In our implementation (Figure 3a), we first extract features from each viewpoint via ResNet101 [55]; depth images are first converted to three-channel color images with the applyColorMap function (mapping parameter COLORMAP\_JET) defined in OpenCV [54]. We then apply a convolutional layer (with separate weights) to the extracted features and compute a per-view weight vector via a fully connected layer followed by a softmax function. Another convolutional layer is applied to the ResNet101 features, and its output is multiplied by the weight vector. Finally, a 4 × 4 × 128 × 8 feature tensor is obtained.
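The following is a minimal PyTorch sketch of this pipeline. The layer sizes, pooling choice, and module names are illustrative assumptions rather than the exact configuration of Figure 3a; `cv2.applyColorMap` with `COLORMAP_JET` is the OpenCV call named above.

```python
import cv2
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

def depth_to_rgb(depth):
    """Map a depth image to a JET-colored image (the OpenCV call named above)."""
    d = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(d, cv2.COLORMAP_JET)

class ViewAggregator(nn.Module):
    """Sketch of the weighted view aggregation; sizes are illustrative."""
    def __init__(self, in_ch=2048, out_ch=128):
        super().__init__()
        self.score_conv = nn.Conv2d(in_ch, out_ch, 1)  # separate weights for scoring
        self.feat_conv = nn.Conv2d(in_ch, out_ch, 1)   # second convolution on features
        self.score_fc = nn.Linear(out_ch, 1)

    def forward(self, view_feats):
        # view_feats: (n_views, in_ch, 4, 4) per-view ResNet101 feature maps
        s = self.score_conv(view_feats).mean(dim=(2, 3))  # pooled score per view
        w = F.softmax(self.score_fc(s), dim=0)            # (n_views, 1) view weights
        f = self.feat_conv(view_feats)                    # (n_views, out_ch, 4, 4)
        return f * w[:, :, None, None]                    # weighted per-view features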

**Figure 3.** Detailed network structure for the encoders used in this study.

RGB encoder (GQN): Eslami et al. [2] proposed GQN, which predicts the image at a given viewpoint through a scene representation network and a generation network. We adapt GQN (tower structure [2], Figure 3b) to extract a scene representation from multiview RGB images in two stages. In the pretraining stage, we train the full GQN using multiview scene images. We then discard the generation network and use the pretrained scene representation network to aggregate information from the multiview images.
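The second stage can be sketched as follows; `repr_net` stands for the pretrained scene representation network, the summation-based aggregation follows the tower variant, and all shapes are illustrative assumptions.

```python
import torch

def encode_scene(repr_net, images, viewpoints):
    # images: (n_views, 3, 64, 64) RGB views; viewpoints: (n_views, pose_dim).
    # Encode each view with the pretrained representation network and sum
    # the per-view representations into a single scene representation.
    reps = [repr_net(img[None], vp[None]) for img, vp in zip(images, viewpoints)]
    return torch.stack(reps).sum(dim=0)
```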

PCD encoder (PointNet): We use PointNet, proposed by Qi et al. [3], for extracting features from PCD. PointNet transforms raw PCD into a feature vector that can be used for a range of downstream tasks, such as classification and segmentation. The detailed structure used in this work is shown in Figure 3c.
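A minimal PointNet-style global feature extractor is sketched below; the shared-MLP widths are illustrative assumptions, and the T-Net alignment modules of the full PointNet are omitted for brevity.

```python
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Minimal PointNet-style global feature extractor (sketch)."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),   # per-point shared MLP
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1),
        )

    def forward(self, pts):
        # pts: (batch, 3, n_points); max-pooling over points yields a
        # global, permutation-invariant feature vector.
        return self.mlp(pts).max(dim=2).values
```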

After processing these modalities separately, we resize the features to 1 × 1 × *k* vectors (*k* differs across the three modalities) and combine them via a concatenation operation. Because concatenation accommodates a variable number of input feature vectors, the framework supports both single- and multimodality inputs.
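This step can be sketched as follows (the function name is illustrative):

```python
import torch

def fuse_modalities(rgb_feat, depth_feat, pcd_feat):
    # Flatten each encoder output to a 1 x 1 x k vector (k differs per
    # modality) and concatenate; passing None for a modality yields a
    # single- or dual-modality input.
    feats = [f.flatten(start_dim=1)
             for f in (rgb_feat, depth_feat, pcd_feat) if f is not None]
    return torch.cat(feats, dim=1)
```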

### *3.3. Before- and After-Change Scene Fusion*

We process the observation pairs taken before and after a scene change with the scene encoder described in the previous subsection and combine the two resulting feature vectors via the fusion method proposed by Park et al. [34].

Specifically, let *r*bef denote the feature vector of the before-change scene and *r*aft that of the after-change scene. We first compute their difference *r*diff as follows:

$$r\_{\rm diff} = r\_{\rm aft} - r\_{\rm bef} \tag{1}$$

Then, we concatenate *r*bef, *r*aft, and *r*diff to create a compact feature vector as the caption generator's input.
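In code, the fusion is a direct transcription of Equation (1) followed by concatenation (the function name is illustrative):

```python
import torch

def fuse_scenes(r_bef, r_aft):
    # Equation (1): feature difference, then concatenation of all three
    # vectors as the caption generator's input.
    r_diff = r_aft - r_bef
    return torch.cat([r_bef, r_aft, r_diff], dim=-1)
```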

### *3.4. Caption Generator*

The caption generator predicts a change caption from the output of the scene fusion component. As shown in Figure 2, we used a two-layer long short-term memory (LSTM) [56] structure for caption generation. Notably, the caption generator can be replaced by other language models, such as a transformer [57].
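A minimal two-layer LSTM decoder of this kind can be sketched as follows; the embedding and hidden sizes, and the way the fused feature initializes the hidden state, are assumptions of this sketch rather than our exact configuration.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Two-layer LSTM caption decoder (sketch; sizes are illustrative)."""
    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_fc = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feat, tokens):
        # Initialize both LSTM layers from the fused scene representation.
        h0 = self.init_fc(fused_feat).unsqueeze(0).repeat(2, 1, 1)
        c0 = torch.zeros_like(h0)
        y, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(y)  # (batch, seq_len, vocab) token logits
```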

The overall network is trained end to end with the following loss function, which minimizes the negative log-likelihood *L* of the ground truth caption **y** given the network input **x**:

$$L = -\log(\mathcal{P}(\mathbf{y}|\mathbf{x})) \tag{2}$$
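In practice, this loss is the sum of per-token cross-entropy terms; a minimal sketch follows, in which the padding token index is an assumption.

```python
import torch.nn as nn

# The padding token index (0 here) is an assumption of this sketch.
criterion = nn.CrossEntropyLoss(ignore_index=0, reduction="sum")

def caption_loss(logits, targets):
    # logits: (batch, seq_len, vocab) from the decoder;
    # targets: (batch, seq_len) ground truth token ids.
    # Summing per-token cross-entropy realizes -log P(y|x) of Equation (2).
    return criterion(logits.transpose(1, 2), targets)
```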

### **4. Indoor Scene Change Captioning Dataset**

Due to the high semantic and geometric complexity of indoor scenes, tasks targeting indoor scene understanding require large amounts of training data. Moreover, constructing indoor scene datasets often requires substantial manual labor. To the best of our knowledge, no large-scale indoor scene change captioning dataset exists.

To address these problems, we propose an automatic dataset generation process for indoor scene change captioning based on the existing large-scale indoor scene dataset Matterport3D [45] and two object model datasets, NEDO [47] and YCB [48]. We create the before- and after-change scenes by arranging object models in indoor scenes. We define four atomic change types: add (add an object to a scene), delete (delete an object), move (change the position of an object), and replace (replace an object with a different one). We also include a distractor type, in which only the camera positions change relative to the original scene. Editing the original 3D data directly would be an alternative way to create such datasets; however, it produces artifacts, such as large holes where an object has been deleted or moved from its original position. We therefore sample object models from existing object model datasets and arrange them in the 3D scenes to create the before- and after-change scenes.

In the following subsections, we give the details of the automatic dataset generation process and the datasets created for experiments.

### *4.1. Automatic Dataset Generation*

#### 4.1.1. Scene and Object Models

We generated before- and after-change scene observation pairs by arranging object models in 3D scenes. We used the Matterport3D dataset (3D mesh models of 90 buildings with 2056 rooms) as our scene source and selected 115 cuboid rooms that contain fewer artifacts (e.g., large holes in the geometry). The object models used in our dataset generation were sampled from the NEDO and YCB datasets. We list the object classes and instances in Table 1.



#### 4.1.2. Virtual Camera Setups

We captured RGB and depth images from multiple camera viewpoints, along with PCD, for each scene observation. To obtain an overall observation of each room, as shown in Figure 4, we used cuboid rooms and placed eight virtual cameras (at the four corners and the four edge centers of the ceiling), each looking at the center of the room. In addition, to enhance robustness to camera position changes, we added a random offset in [−10.0 cm, +10.0 cm] along each of the three axes to each camera during dataset acquisition.

**Figure 4.** Virtual camera set-ups. Eight virtual cameras (four corners and four centers of edges of the ceiling) are set to look at the center of the room.

The data acquisition process can be implemented by using a single RGBD camera to observe a scene multiple times from various camera viewpoints.
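The camera placement can be sketched as follows; the room extent parameters, the meter units, and the function name are assumptions of this sketch.

```python
import numpy as np

def camera_positions(x_len, y_len, z_ceiling, rng=None):
    """Eight ceiling cameras (four corners, four edge centers), each
    perturbed by a random offset in [-10 cm, +10 cm] per axis (meters)."""
    rng = rng or np.random.default_rng()
    xs, ys = [0.0, x_len / 2, x_len], [0.0, y_len / 2, y_len]
    cams = [np.array([x, y, z_ceiling])
            for x in xs for y in ys
            if not (x == x_len / 2 and y == y_len / 2)]  # skip the room center
    return [c + rng.uniform(-0.10, 0.10, size=3) for c in cams]
```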

#### 4.1.3. Generation Process

We use AI Habitat [58] as the simulator for data acquisition; it renders RGB and depth images of given viewpoints from a mesh file. We generated each before- and after-change scene observation pair and the related change captions in four steps. First, we randomly selected a room scene from the scene set and three to five object models from the object set. Second, we arranged the objects at random navigable positions: the AI Habitat simulator provides a function named "get\_random\_navigable\_point()", which computes a random position where an agent can stand, based on the mesh data and the semantic information provided in the Matterport3D dataset (semantic labels, such as "floor" and "wall", for each triangle vertex). We then took eight RGB and depth images and generated PCD as the original scene observation through the simulator. The Matterport3D dataset provides mesh data for every building and a position annotation for each room; we generated PCD by converting the vertices of the triangular mesh into points and extracted the PCD of each room from the building PCD based on the room position annotation (the 3D bounding box of each room provided by the Matterport3D dataset).

Third, we applied the four change types (add, delete, move, and replace) to the original scene, along with a distractor (camera position transformation only), and obtained the corresponding scene observations; the change information, including the change type and object attributes, was recorded. Finally, we generated five change captions for each change type and the distractor based on the recorded change information and predefined sentence structure templates (25 captions in total for each scene). We show an example of our dataset in Figure 5. This process makes it easy to generate datasets with various levels of complexity by adjusting the numbers of scenes and objects, the change types, and the sentence templates.
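The room extraction step, for example, amounts to cropping the building PCD with the room's bounding box; below is a minimal sketch assuming an axis-aligned box and a hypothetical function name.

```python
import numpy as np

def crop_room_pcd(building_pts, bbox_min, bbox_max):
    # building_pts: (n, 3) points converted from the building mesh vertices;
    # bbox_min, bbox_max: (3,) corners of the room's 3D bounding box.
    mask = np.all((building_pts >= bbox_min) & (building_pts <= bbox_max), axis=1)
    return building_pts[mask]
```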

We currently use PCD generated from meshes, which contains fewer artifacts (such as holes) and less occlusion. To further improve the practicality of our method, we plan to use PCD generated from RGBD images and to conduct experiments examining the effects of occlusion and artifacts.

### *4.2. Dataset Statistics*

We generated the dataset s15\_o10 with 9000 scenes for training and 3000 scenes for testing, using 15 room scenes and 10 object models (the 10-object set-up in Table 1). We used the s15\_o10 dataset to evaluate the performance obtained with various input modalities, encoders, and ensembles.

To evaluate model performance under more complex scene settings, we adjusted the numbers of scenes and objects and generated the dataset s15\_o85 with 85 object models (the 85-object set-up in Table 1) and the dataset s100\_o10 with 100 scenes. The other settings of s15\_o85 and s100\_o10 are the same as those for s15\_o10. The detailed dataset statistics are shown in Table 2. Experiments with these three datasets are presented below.

**Figure 5.** Dataset instance example of adding an object. From the top row: before-change RGB images observed from eight virtual cameras; after-change RGB images; before-change depth images; after-change depth images; before- and after-change PCD; five ground truth change captions.


**Table 2.** Statistics for datasets used in this study.

### **5. Experiments**

We used datasets s15\_o10, s15\_o85, and s100\_o10 for training and evaluation. Specifically, we first used s15\_o10 for the comparison of different input modalities, encoders (MVCNN and GQN for RGB images), and ensembles of input modalities. We then used s15\_o85 and s100\_o10 for assessing the models' abilities under more complex scene setups with an increased number of objects and scenes.

We adopted several conventional image captioning evaluation metrics in each experiment. In addition to these metrics, we conducted a caption correctness evaluation to examine the detailed information given by the generated captions (change types and object attributes).

### *5.1. Evaluation Metrics*

We used four conventional evaluation metrics widely adopted in image captioning: BLEU-4 [59], ROUGE [60], SPICE [61], and METEOR [62]. These metrics evaluate the similarity between the generated captions and the ground truth captions. BLEU-4 evaluates the recall of words or phrases (multiple words) of the generated captions in the ground truth captions. ROUGE evaluates the recall of the ground truth captions in the generated captions. SPICE considers the correctness of the sentence structure of the generated captions. METEOR incorporates word-level similarity to encourage the generation of captions with diverse wording.
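For illustration, BLEU-4 can be computed with NLTK as follows; we do not claim this is the toolkit used in our experiments, so it is only one possible implementation.

```python
from nltk.translate.bleu_score import sentence_bleu

# Tokenized ground truth reference(s) and a generated caption.
refs = [["a", "red", "cup", "was", "added", "to", "the", "room"]]
hyp = ["a", "red", "cup", "was", "added", "to", "the", "room"]

# Equal 1- to 4-gram weights give the standard BLEU-4 score.
score = sentence_bleu(refs, hyp, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-4: {score:.3f}")
```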

The correctness of the change type and object attributes is important in the change captioning task. Therefore, in addition to the above metrics, we conducted a caption correctness evaluation: we disregard sentence structure, extract the change type, class, color, and object (class plus color, such as "red cup") from the generated captions, and compute the accuracy against the ground truth captions. This evaluation indicates how well the generated captions reflect the detailed change information.
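A minimal sketch of this extraction is given below; the keyword lists are illustrative assumptions standing in for the vocabulary of our sentence templates.

```python
# Illustrative keyword lists; the actual template vocabulary may differ.
CHANGE_TYPES = ["added", "deleted", "moved", "replaced"]
COLORS = ["red", "blue", "green", "yellow", "white", "black"]

def extract_attributes(caption):
    """Pull the change type and color keywords out of a template caption."""
    words = caption.lower().split()
    return {
        "change_type": next((t for t in CHANGE_TYPES if t in words), None),
        "color": next((c for c in COLORS if c in words), None),
    }

def correctness(generated, ground_truth):
    """Per-attribute agreement between a generated and ground truth caption."""
    g, t = extract_attributes(generated), extract_attributes(ground_truth)
    return {key: g[key] == t[key] for key in g}
```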

### *5.2. Implementation Details*

Here, we give the details of all the implementations. We set the input image size for both MVCNN and GQN to 64 × 64. For PointNet, we set the number of points to 5000 by randomly sampling points from each room's PCD. For GQN pretraining, we set the learning rate to 10<sup>−4</sup> and trained the full GQN network for 10 epochs in all experiments. For training the overall framework (including all single-modality and ensemble variants), we set the learning rate to 10<sup>−3</sup> for PointNet and 10<sup>−4</sup> for MVCNN and the decoder. All ablations were trained for 40 epochs. We used the Adam optimizer in all experiments.
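The per-module learning rates can be realized with Adam parameter groups, as sketched below; the modules here are small placeholders standing in for the actual encoders and decoder.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual PointNet, MVCNN, and decoder.
pointnet, mvcnn, decoder = nn.Linear(3, 8), nn.Linear(3, 8), nn.Linear(8, 8)

# One Adam optimizer with per-module learning rates, as described above.
optimizer = torch.optim.Adam([
    {"params": pointnet.parameters(), "lr": 1e-3},
    {"params": mvcnn.parameters(),    "lr": 1e-4},
    {"params": decoder.parameters(),  "lr": 1e-4},
])
```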
