*3.2. Training Phase*

Figure 2 shows our learning framework, which is similar to the SoundNet framework but adds three components: a pre-trained model with shared features, an improved audio feature extractor, and improved weakly supervised learning. In the pre-training phase, a multi-domain convolutional neural network (CNN) was applied to share features. In the original SoundNet framework, the feature extractor consists of two separate models, one for objects and one for scenes, used in both the pre-training phase and the weakly supervised learning phase. This arrangement does not use the data efficiently: in SoundNet, the object feature extractor is trained on ImageNet, while the scene feature extractor is trained on Places365, which provides less data than ImageNet, so it is difficult to determine whether the scene feature extractor has learned enough. To address this issue, we applied a multi-domain CNN, in which places and objects share features. The feature extractor was a Wide ResNet-101 with double width and two classifier heads, one for objects and one for scenes, so that the ImageNet and Places365 data could be trained in a single model with shared features. This design also provided advantages in the training phase.
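The following is a minimal sketch of this multi-domain pre-training idea, assuming PyTorch and torchvision. The head sizes (1,000 ImageNet classes, 365 Places365 classes) follow the standard datasets; the class and attribute names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import wide_resnet101_2


class MultiDomainCNN(nn.Module):
    """One shared backbone with two classifier heads (object and scene)."""

    def __init__(self, num_objects: int = 1000, num_scenes: int = 365):
        super().__init__()
        backbone = wide_resnet101_2(weights=None)  # Wide ResNet-101 with double width
        # Keep everything up to (and including) global average pooling; drop the fc head.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features
        self.object_head = nn.Linear(feat_dim, num_objects)  # ImageNet classifier
        self.scene_head = nn.Linear(feat_dim, num_scenes)    # Places365 classifier

    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
        h = self.features(x).flatten(1)  # shared features for both domains
        return self.object_head(h) if domain == "object" else self.scene_head(h)
```

Because both heads read the same backbone features, each mini-batch from either dataset updates the shared extractor, which is what lets the smaller scene dataset benefit from the object data.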

**Figure 2.** The learning framework is divided into a pre-training phase and a training phase. It is similar to the SoundNet framework; however, the pre-trained model with features shared via a multi-domain CNN, audio feature extraction via the multi-time-scale transform, and weakly supervised learning via mutual learning are advantages of our framework.

In the training phase, we trained our network on a Flickr video dataset [37] for cross-modal recognition. The video dataset was treated as weakly labeled data in which images are paired with their corresponding audio. Image features were extracted with the frozen pre-trained model; because the model was frozen, it received no gradient updates and served only as a feature extractor. Audio data were converted into RGB images via the multi-time-scale transform, which allowed feature extraction with the same Wide ResNet with double width. The extracted audio features served as the source and the extracted image features as the target, and the deep neural network was trained with a kernel function that maps the source to the target, so that feature extraction for audio would resemble feature extraction for images. Only the audio feature extractor was updated, because our goal was to approximate the target from the source. Unlike SoundNet, the source and target of the KL divergence were configured as feature maps rather than score maps. This resembles the method used in FitNets; however, our framework trains the distributions directly, without an additional regressor, by using the same feature extractor structure for both modalities. Furthermore, because the multi-domain CNN was applied in the pre-trained model, we performed inter-distribution learning on one integrated model purely through feature sharing.
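A minimal sketch of this transfer step is given below, assuming PyTorch. The frozen image network acts as the teacher and the audio network as the student; the softmax normalization that turns feature maps into distributions before the KL divergence is our assumption, since the text states only that the source and target are feature maps rather than score maps.

```python
import torch
import torch.nn.functional as F


def feature_kl_loss(audio_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over flattened feature maps of identical shape."""
    src = F.log_softmax(audio_feat.flatten(1), dim=1)  # student (audio) log-probs
    tgt = F.softmax(image_feat.flatten(1), dim=1)      # teacher (image) probs
    return F.kl_div(src, tgt, reduction="batchmean")


def train_step(audio_net, image_net, optimizer, audio_rgb, frames):
    """One weakly supervised step: only the audio extractor receives gradients."""
    with torch.no_grad():
        target = image_net(frames)  # frozen teacher features (image modality)
    # audio_rgb: audio already converted to RGB images by the multi-time-scale transform
    loss = feature_kl_loss(audio_net(audio_rgb), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both extractors share the same Wide ResNet structure, the feature maps already have matching shapes, which is what allows this direct distribution matching without the additional regressor that FitNets requires.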
