**Yue Qiu <sup>1,2,</sup>\*, Yutaka Satoh <sup>1,2</sup>, Ryota Suzuki <sup>2</sup>, Kenji Iwata <sup>2</sup> and Hirokatsu Kataoka <sup>2</sup>**


Received: 31 July 2020; Accepted: 20 August 2020; Published: 23 August 2020

**Abstract:** This study proposes a framework for describing a scene change using natural language text based on indoor scene observations conducted before and after the change. The recognition of scene changes plays an essential role in a variety of real-world applications, such as scene anomaly detection. Most scene understanding research has focused on static scenes, and most existing scene change captioning methods detect changes from single-view RGB images, neglecting the underlying three-dimensional structures. Previous three-dimensional scene change captioning methods use simulated scenes consisting of geometric primitives, making them unsuitable for real-world applications. To solve these problems, we automatically generated large-scale indoor scene change caption datasets. We propose an end-to-end framework for describing scene changes from various input modalities, namely, RGB images, depth images, and point cloud data, which are available in most robot applications. We conducted experiments with various input modalities and models and evaluated model performance using datasets with various levels of complexity. Experimental results show that the models that combine RGB images and point cloud data as input achieve high performance in sentence generation and caption correctness and are robust in change type understanding for datasets with high complexity. The developed datasets and models contribute to the study of indoor scene change understanding.

**Keywords:** image captioning; three-dimensional (3D) vision; deep learning; human-robot interaction

### **1. Introduction**

There have been significant improvements in artificial intelligence (AI) technologies for human–robot interaction (HRI) applications. For example, modern intelligent assistants (e.g., Google Assistant [1]) enable the control of household appliances through speech and allow remote home monitoring. HRI experiences can be improved through AI technologies such as the semantic and geometric understanding of 3D surroundings [2–5], the recognition of human gestures [6,7], actions [8,9], and emotions [10,11], speech recognition [12,13], and dialog management [14,15]. A fundamental problem in indoor scene understanding is that scenes often change due to human activities, such as rearranging furniture and cleaning. Therefore, understanding indoor scene changes is essential for many HRI applications.

Developments in graphics processing units and convolutional neural network (CNN)-based methods have led to tremendous progress in 3D recognition-related studies. Various 2D approaches have been adapted for 3D data, such as recognition [3–5], detection [16], and segmentation [17]. Researchers have proposed a series of embodied AI tasks that define an indoor scene and an agent that explores the scene and answers vision-related questions (e.g., embodied question answering [18,19]) or navigates based on a given instruction (e.g., vision-language navigation [20,21]). However, most 3D recognition-related studies have focused on static scenes. Scene change understanding is less often discussed despite its importance in real-world applications.

Vision and language tasks, including visual question answering [22–24], image captioning [25–30], and visual dialog [31,32], have received much attention due to their practicality in HRI applications. These tasks correlate visual information with language. The image captioning task aims to describe image information using text, and thus can be used to report scene states to human operators in HRI applications. Several recent image captioning methods describe a scene change based on two images of the scene [33,34]. However, they use single-view image inputs and neglect the geometric information of a scene, which limits their capability in scenes that contain occlusions. Qiu et al. [35] proposed scene change captioning based on multiview observations made before and after a scene change. However, they considered only simulated table scenes with limited visual complexity.

To solve the above problems, we propose models that use multimodality data of indoor scenes, including RGB (red, green, and blue) images, depth images, and point cloud data (PCD), which can be obtained using RGB-D (RGB-Depth) sensors, such as Microsoft Kinect [36], as shown in Figure 1. We automatically generated large-scale indoor scene change caption datasets that contain observations made before and after scene changes in the form of RGB and depth images taken from multiple viewpoints and PCD along with related change captions. These datasets were generated by sampling scenes from a large-scale indoor scene dataset and objects from two object model datasets. We created scene changes by randomly selecting, placing, and rearranging object models in the scenes. Change captions were automatically generated based on the recorded scene information and a set of predefined grammatical structures.

**Figure 1.** Illustration of indoor scene change captioning from multimodality data. From the input of two observations (consisting of RGB (red, green, and blue) and depth images captured by multiple virtual cameras and point cloud data (PCD)) of a scene observed before and after a change, the proposed approach predicts a text caption describing the change. The multiple RGB and depth images are obtained from multiple viewpoints of the same scene via virtual cameras. Each virtual camera takes an RGB and a depth image from a given viewpoint.
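To make the template-based caption generation concrete, the following is a minimal sketch of how a caption can be rendered from a recorded scene change. The change record fields and sentence templates shown here are illustrative placeholders, not the exact grammatical structures used to build our datasets.

```python
import random

# Hypothetical change record and templates; the field names and sentence
# patterns are illustrative placeholders, not the exact grammar used to
# build the datasets described in this paper.
CHANGE_TEMPLATES = {
    "add":    ["A {color} {name} was added to the scene.",
               "There is a new {color} {name} in the scene."],
    "remove": ["The {color} {name} was removed from the scene.",
               "The {color} {name} is no longer in the scene."],
    "move":   ["The {color} {name} was moved to a different place."],
}

def generate_caption(change: dict) -> str:
    """Render one recorded scene change into a natural-language caption."""
    template = random.choice(CHANGE_TEMPLATES[change["type"]])
    return template.format(color=change["color"], name=change["name"])

# Example: a record logged when an object model is placed into a scene.
print(generate_caption({"type": "add", "color": "red", "name": "mug"}))
```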

We also propose a unified end-to-end framework that generates scene change captions from observations made before and after a scene change in the form of RGB images, depth images, and PCD. We conducted extensive experiments on input modalities, encoders, and ensembles of modalities with datasets of various levels of complexity. Experimental results show that models that combine RGB images and PCD generate change captions with high scores on conventional image captioning evaluation metrics and high correctness in describing detailed change information, including change types and object attributes. The contributions of our work are four-fold:


### **2. Related Work**

### *2.1. 3D Scene Understanding*

CNN-based methods have achieved promising performance in various 3D scene understanding tasks, such as 3D object recognition [3–5], 3D detection [16], 3D semantic segmentation [17], and shape completion [37]. These methods use CNN structures to learn the underlying 3D structures from data in various formats, such as multiview RGB images, RGB-D images, PCD, and meshes.

Su et al. [38] proposed a network for 3D object recognition based on multiview RGB images. Their multiview CNN (MVCNN) structure aggregates information via a view pooling operation (max or average pooling) over the CNN features of the multiview images. Kanezaki et al. [39] proposed a framework for feature extraction from multiview images that predicts object poses and classes simultaneously to improve performance. Esteves et al. [40] suggested that existing MVCNNs discard useful global information and proposed a group convolutional structure to better extract the global information contained in multiview images. Eslami et al. [2] proposed a generative query network (GQN) that learns 3D-aware scene representations from multiview images via an autoencoder structure.

Several studies have focused on 3D understanding based on RGB-D data. Zhang et al. [41] proposed a network for depth completion from a single RGB-D image that predicts pixel-level geometry information. Qi et al. [42] proposed a 3D detection framework that detects objects in RGB images and fuses depth information to compute 3D object regions.

Recent CNNs that utilize PCD have also shown promising results. Qi et al. proposed PointNet [3], which is a structure for extracting features from raw PCD via the aggregation of local information by symmetric functions (e.g., global max pooling). They later proposed PointNet++ [4] for obtaining better local information. Zhang et al. [5] proposed a simple yet effective structure that aggregates local information of PCD via a region-aware max pooling operation.
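As an illustration of the view pooling operation used in MVCNN-style encoders, the following sketch merges per-view CNN features with a symmetric max or average pooling. The backbone and tensor shapes are assumptions for illustration, not the exact architecture of any cited work.

```python
import torch
import torch.nn as nn

class ViewPoolingEncoder(nn.Module):
    """MVCNN-style encoder: per-view CNN features are merged into a single
    descriptor by a symmetric view pooling operation (max or average)."""

    def __init__(self, backbone: nn.Module, pooling: str = "max"):
        super().__init__()
        self.backbone = backbone  # any 2D CNN that maps (B, 3, H, W) -> (B, C)
        self.pooling = pooling

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, 3, H, W), i.e., V RGB images of the same scene
        b, v, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * v, c, h, w))  # (B * V, C)
        feats = feats.view(b, v, -1)                          # (B, V, C)
        if self.pooling == "max":
            return feats.max(dim=1).values                    # view pooling
        return feats.mean(dim=1)
```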

Considering the availability of RGB-D data in HRI applications, we propose models that use multiview RGB and depth images and PCD. We adopt an MVCNN and a GQN for scene understanding based on RGB images, an MVCNN for aggregating multiview depth information, and PointNet for processing PCD.
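The PointNet-style aggregation used for PCD can be sketched as a shared per-point MLP followed by a symmetric global max pooling. The layer sizes below are illustrative and do not reproduce the exact configuration used in our models.

```python
import torch
import torch.nn as nn

class PointFeatureEncoder(nn.Module):
    """PointNet-style encoder: a shared per-point MLP followed by a symmetric
    function (global max pooling), which makes the descriptor invariant to
    the ordering of the input points."""

    def __init__(self, in_dim: int = 3, feat_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) raw point cloud coordinates
        per_point = self.mlp(points)        # (B, N, feat_dim)
        return per_point.max(dim=1).values  # (B, feat_dim) global descriptor
```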

### *2.2. Indoor Scene Datasets*

Due to the high complexity and diversity of visual information, training a CNN-based indoor scene understanding method usually requires a massive amount of data. SUNCG [43] is a widely used dataset that consists of simulated 3D indoor scenes generated using computer graphics technologies. Several indoor scene datasets with scanned models of real scenes have recently been made publicly available [44–46]. The Gibson [44] dataset consists of 572 scenes and a simulator, which allows training for multiple embodied AI tasks, such as visual navigation. Matterport3D [45] contains 90 indoor scenes that are densely annotated with semantic labels. The Replica [46] dataset consists of 18 high-resolution (nearly photorealistic) scenes.

Several datasets for embodied AI tasks have been built based on the above 3D datasets. The Embodied Question Answering (EQA) v1.0 [18] dataset consists of scenes sampled from the SUNCG dataset with additional question–answer pairs. The authors further extended the EQA task to a realistic scene setting by adapting the Matterport3D dataset for their Matterport3D EQA dataset [19]. The Room-to-Room dataset [20] added navigation instruction annotations to the Matterport3D dataset for the vision-language navigation task. In these datasets, the states of the scenes are static. Qiu et al. [35] proposed a simulated dataset for scene change captioning from multiview RGB images. However, they generated scenes with a solid background color, limiting the visual complexity. In contrast, we combine the Matterport3D dataset with two open-source object model datasets, namely, the NEDO item database [47] and the YCB dataset [48], to create scene change datasets, where scene changes are constructed by rearranging objects in 3D scenes. To the best of our knowledge, our dataset is the first large-scale indoor scene change dataset.

### *2.3. Change Detection*

Change detection is a long-standing task in computer vision due to its practicality in real-world applications, such as scene anomaly detection and disaster impact analysis. Change detection from street view images or videos has attracted much attention because it allows algorithms to focus on changed regions, decreasing the cost of image or video recognition [49,50]. Alcantarilla et al. [49] proposed a method that first reconstructs 3D geometry from video input and then feeds coarsely registered image pairs into a deconvolutional network for change detection. Zhao et al. [50] proposed a method with an encoder–decoder structure for pixel-level change detection based on street view images.
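A generic way to realize such encoder–decoder change detection is to encode both images with a shared (Siamese) CNN and decode a pixel-level change mask from the concatenated features. The sketch below illustrates the general idea only; it is not a reproduction of any specific published network.

```python
import torch
import torch.nn as nn

class SiameseChangeDetector(nn.Module):
    """Generic encoder-decoder change detection: a shared encoder processes
    both images, and a decoder predicts a per-pixel change mask from the
    concatenated features."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, img_t0: torch.Tensor, img_t1: torch.Tensor) -> torch.Tensor:
        # img_t0, img_t1: (B, 3, H, W) images of the scene before/after a change
        f0, f1 = self.encoder(img_t0), self.encoder(img_t1)
        return self.decoder(torch.cat([f0, f1], dim=1))  # (B, 1, H, W) change logits
```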

Change detection is also important in robot applications [51–53]. Ambrus et al. [51] proposed a method that distinguishes static and dynamic objects by reconstructing and comparing PCD of a room scene observed at different time steps. Fehr et al. [52] proposed a 3D reconstruction method that reconstructs static 3D scenes based on RGB-D images of scenes with dynamic objects. Jinno et al. [53] proposed a framework for updating a 3D map by comparing the existing 3D map with newly observed 3D data that may contain new or removed objects.

Existing change detection methods that utilize RGB images lack 3D geometry understanding. Several methods have been proposed for detecting changes from 3D data in various formats, such as RGB-D images and PCD, for robot applications. However, most works are limited to relatively small-scale datasets and do not specify detailed changes, such as the attributes of changed objects. In contrast, we consider change detection based on multimodality input, including RGB and depth images and PCD. Our models describe detailed scene changes, including change types and object attributes.

### *2.4. Change Captioning*

The image captioning task has been widely studied. Various image captioning methods have been proposed to achieve high-quality sentence construction by using attention mechanisms [25,26] or exploring relationships between vision and language [27,28]. Generating image captions with high diversity has also been widely discussed [29,30]. However, most existing image captioning methods generate descriptions from single-view images.

Several recent works have discussed captioning based on images that include scene changes [33–35]. Difference Description with Latent Alignment (DDLA) [33] generates change descriptions from two video frames of a given scene observed at different time steps. In DDLA, an image indicating the pixel-level difference between the input frames is computed, and a CNN is used to generate captions from this difference image. The DUal Dynamic Attention model (DUDA) [34] uses a dual attention structure for focusing on regions of the images before and after a change, and a dynamic attention structure that dynamically selects information from the image features before the change, after the change, or the difference between them. DUDA is more robust to camera transformations than DDLA. However, both DDLA and DUDA neglect the 3D geometry information of scenes and are thus less suitable for scenes with occlusions. Qiu et al. [35] proposed a method that generates a compact scene representation from multiview images and then generates captions based on this representation. However, they performed experiments using scenes with solid colored backgrounds and only considered RGB images from a fixed number of cameras.
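The dynamic attention idea in DUDA can be sketched as a learned soft selection among the before, after, and difference features at each decoding step. The code below is a simplified illustration of this mechanism, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicFeatureSelector(nn.Module):
    """Simplified sketch of dynamic attention: at each decoding step, learned
    softmax weights select among features of the scene before the change,
    after the change, and their difference."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 3)  # one score per information source

    def forward(self, f_before, f_after, hidden):
        # f_before, f_after: (B, D) attended image features; hidden: (B, H) decoder state
        f_diff = f_after - f_before
        weights = torch.softmax(self.score(hidden), dim=-1)        # (B, 3)
        stacked = torch.stack([f_before, f_after, f_diff], dim=1)  # (B, 3, D)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)        # (B, D)
```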

In contrast, we explore and evaluate several input modalities, namely, RGB and depth images (with random camera position changes), and PCD. We conducted extensive experiments on various ensembles of these modalities. We also conducted experiments on datasets with complex and diverse visual information.
