**3. Approach**

In robot applications, the ability to recognize scene changes is essential. We propose a framework that generates scene change captions from image pairs taken before and after a scene change. Our framework correlates the after-change scene with the before-change scene and provides detailed change descriptions, including change types and object attributes. Due to the availability of RGB-D data in robot applications, we developed models that use RGB and depth images of scenes observed from multiple viewpoints and PCD. Our framework can be trained end-to-end using raw images, PCD, and related change captions. Moreover, our framework enables inputs from one or more modalities. In the following subsections, we give the details of the proposed framework.
