1. Introduction
Background music raises sensitive copyright issues in a variety of settings, such as broadcasting, music stores, and online music streaming services. To collect copyright royalties, it is important to know which title was played as background music and for how long. Manually recording the title and playing time of background music is inaccurate and labor-intensive. To solve this problem, automatic music identification and music section detection techniques are required. In this study, we perform automatic music identification on broadcast content. Due to the nature of broadcast content, background music is usually mixed with speech that is louder than the music, which degrades automatic music-identification performance. Therefore, we apply a background music separation technique before automatic music identification.
Music signal separation was conventionally performed with traditional blind source separation (BSS) methods [1] such as independent component analysis (ICA) [2], non-negative matrix factorization (NMF) [3], and sparse component analysis (SCA) [4]. For monophonic music source separation, which is the task addressed in this paper, NMF showed better performance than ICA and SCA [1]. However, NMF does not yield good separation performance in real environmental conditions because the algorithm is inherently linear [5]. Recently, deep learning-based music source separation algorithms achieved good performance and outperformed NMF [6,7,8,9]. In addition, deep learning-based algorithms do not suffer from the permutation problem [10] because they are trained to directly produce the target signal. Grais et al. [6] used a feed-forward neural network (FNN) for speech source separation and achieved better performance than NMF, which suggests that deep learning is a suitable replacement for the conventional algorithms. Nugraha et al. [7] and Uhlich et al. [8] experimented with music source separation using the FNN architecture, the most basic structure in deep learning; although they used different datasets, both showed better performance than NMF. Since then, many new neural network architectures have been applied to audio source separation: the convolutional neural network (CNN), the recurrent neural network (RNN), the bidirectional long short-term memory (BLSTM) network, and so on [11,12].
In the time-frequency spectrogram domain, the time-axis frames are used as input to the RNN or BLSTM architecture [11,12]. In a previous study [12], music source separation was performed with a BLSTM architecture, and performance improved as the BLSTM was stacked up to three layers. With the same number of layers, the BLSTM outperformed the FNN. Data augmentation and a Wiener filter that removes the stationary noise of the estimated target spectrogram further improved performance, and the results were better in a multi-channel environment than in a single-channel one. The best performance in that experiment was obtained by a weighted summation of the signals estimated by the BLSTM and the FNN.
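For illustration, the frame-wise use of a spectrogram in a recurrent separator can be sketched as follows. This is a minimal PyTorch sketch, not the configuration of [12]; the number of layers, hidden size, and the soft-mask output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BLSTMSeparator(nn.Module):
    """Frame-wise BLSTM mapping a mixture magnitude spectrogram
    (batch, frames, freq_bins) to an estimated target magnitude."""
    def __init__(self, freq_bins=513, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(input_size=freq_bins, hidden_size=hidden,
                             num_layers=layers, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, freq_bins)

    def forward(self, mix_mag):
        h, _ = self.blstm(mix_mag)          # (batch, frames, 2*hidden)
        mask = torch.sigmoid(self.out(h))   # per-bin soft mask in [0, 1]
        return mask * mix_mag               # estimated target magnitude

# Example: one mixture clip of 100 STFT frames with 513 frequency bins
est = BLSTMSeparator()(torch.rand(1, 100, 513))
```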
Recently, the CNN-based U-Net [13], the stacked hourglass network [14], and the dense convolutional network (DenseNet) [15], which showed good performance in the image domain [16,17,18], were also successfully applied to audio source separation. The U-Net [13] and the stacked hourglass network [14] have an encoder-decoder structure in which bottleneck features are generated between the encoding and decoding stages. In U-Net, the feature map of each encoder level is concatenated with the decoder feature map of the same resolution, whereas the stacked hourglass network adds the encoder feature map to the decoder feature map of the same resolution after a convolution. These encoder-decoder connections are advantageous for propagating both information and gradients. A stacked hourglass network is built by stacking small encoder-decoder modules, so the desired signal can be estimated at each module and a loss can be computed at each module output for back-propagation. This intermediate supervision improves both the learning speed and the network performance.
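The two skip-connection styles described above can be contrasted in a few lines of code. The following is a minimal PyTorch sketch with arbitrary channel counts and resolutions; it only illustrates concatenation versus projected addition, not the full U-Net or hourglass modules.

```python
import torch
import torch.nn as nn

# Toy 2-D feature maps standing in for an encoder output and a decoder
# output of the same resolution: (batch, channels, freq_bins, frames).
enc = torch.rand(1, 16, 64, 64)
dec = torch.rand(1, 16, 64, 64)

# U-Net style: concatenate along the channel axis, then convolve.
unet_merge = nn.Conv2d(32, 16, kernel_size=3, padding=1)
unet_out = unet_merge(torch.cat([enc, dec], dim=1))

# Stacked-hourglass style: project the encoder map with a convolution
# and add it element-wise to the decoder map.
hg_proj = nn.Conv2d(16, 16, kernel_size=1)
hg_out = dec + hg_proj(enc)
```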
In the previous methods, the magnitude of the target was estimated in the spectrogram domain and the output signal was reconstructed using the phase of the mixture [6,7,8,9,11,12,13,14,15]. However, the signal reconstructed in the time domain is distorted. To compensate for this distortion, Stoller et al. [19] proposed Wave-U-Net, an end-to-end model that modifies the U-Net architecture to operate in the time domain. Although its shape is similar to U-Net, it works on the time-domain waveform instead of the time-frequency representation and uses one-dimensional (1D) instead of two-dimensional (2D) convolutions. Such end-to-end time-domain models usually require a deeper structure to achieve good performance and have the disadvantage of being difficult to converge. Although Wave-U-Net showed better performance than U-Net, it performs worse than the DenseNet-based architecture and the BLSTM structure in the time-frequency domain [18].
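The spectrogram-domain reconstruction used by these methods, which combines the estimated target magnitude with the mixture phase, can be summarized as follows. This is a minimal sketch assuming a librosa-based STFT pipeline; the FFT size and hop length are arbitrary, and the estimated magnitude is assumed to have the same shape as the mixture STFT.

```python
import numpy as np
import librosa

def reconstruct_with_mixture_phase(mixture, est_magnitude,
                                   n_fft=1024, hop_length=256):
    """Combine an estimated target magnitude with the mixture phase
    and return a time-domain signal via the inverse STFT."""
    mix_stft = librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length)
    mix_phase = np.angle(mix_stft)
    # est_magnitude must match mix_stft in shape (freq_bins, frames)
    est_stft = est_magnitude * np.exp(1j * mix_phase)
    return librosa.istft(est_stft, hop_length=hop_length)
```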
The DenseNet-based architecture has recently been shown to perform well in audio source separation tasks [15,20,21]. The multi-scale multi-band DenseNet (MMDenseNet) [15] and MMDenseNet combined with LSTM (MMDenseLSTM) [20] are music source separation studies using DenseNet. Both architectures are based on a CNN in an encoder-decoder style, like U-Net and the stacked hourglass network. MMDenseNet places parallel MDenseNet branches on the low-, high-, and whole-frequency bands of the spectrogram. MMDenseLSTM closely resembles MMDenseNet but places an LSTM after each dense block. Since DenseNet, being CNN-based, learns image-like patterns while the LSTM learns patterns in time-series data, the two components learn different patterns and complement each other. MMDenseLSTM outperforms MMDenseNet, which itself showed state-of-the-art performance in audio source separation. Because all of these architectures are CNNs structured in an encoder-decoder style through down-sampling and up-sampling, bottleneck features are obtained through encoding, and this design expands the receptive field, especially in the spectrogram domain. In neuroscience, the receptive field is the local area of the previous layer's output to which a neuron is connected. Neurons in the visual cortex respond to local features in the early visual layers and to more complex patterns in the deeper layers [22]. The receptive field is an important element of the CNN, whose design was inspired by this property of the visual cortex [23].
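The dense connectivity pattern underlying these architectures can be sketched as follows. This is a minimal PyTorch illustration with an arbitrary growth rate and layer count, not the block configuration of MMDenseNet or MMDenseLSTM.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connectivity: each layer receives the concatenation of all
    preceding feature maps and appends `growth` new channels."""
    def __init__(self, in_ch=16, growth=8, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1),
            )
            for i in range(n_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# A spectrogram patch: (batch, channels, freq_bins, frames)
out = DenseBlock()(torch.rand(1, 16, 128, 64))  # -> (1, 48, 128, 64)
```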
In our study, we propose a dilated time-frequency DenseNet architecture to expand the receptive field effectively. We add to the DenseNet a time-dilated convolution [24] with a dilation rate of 2 along the frame axis and a frequency-dilated convolution with a dilation rate of 2 along the frequency axis. Previous CNN-based architectures expanded their receptive field through an encoder-decoder design; we expand the receptive field more effectively by adding dilated convolutions. The time- and frequency-dilated convolutions systematically aggregate multi-scale contextual information along the time and frequency axes of the spectrogram. MMDenseNet places the MDenseNet model structure in parallel on each half of the spectrogram's frequency range. In that structure, information is exchanged between the band-wise models only in the last few layers, which makes it difficult to share information between bands and results in distortion in the output. Therefore, the proposed architecture instead applies a different convolution to each frequency band. The proposed architecture is shown to have the best performance in both the separation and identification tasks compared with the previous architectures: U-Net, Wave-U-Net, MDenseNet, and MMDenseNet.
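The axis-specific dilation can be illustrated with two-dimensional convolutions whose dilation is applied along only one axis. The following is a minimal PyTorch sketch; the channel counts and kernel size are illustrative assumptions and do not reflect the exact configuration of the proposed blocks.

```python
import torch
import torch.nn as nn

# Input spectrogram features: (batch, channels, freq_bins, frames)
x = torch.rand(1, 16, 128, 64)

# Time-dilated convolution: dilation of 2 along the frame (time) axis,
# which doubles the temporal receptive field without extra parameters.
time_dilated = nn.Conv2d(16, 16, kernel_size=3, padding=(1, 2), dilation=(1, 2))

# Frequency-dilated convolution: dilation of 2 along the frequency axis.
freq_dilated = nn.Conv2d(16, 16, kernel_size=3, padding=(2, 1), dilation=(2, 1))

y = freq_dilated(time_dilated(x))  # shape (1, 16, 128, 64) is preserved
```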
In our previous work [25], we studied music detection from broadcast content using a convolutional neural network with a Mel-scale kernel. The difference between that work and the present one lies in the type of task: the previous work addressed a classification task for music detection, whereas this work addresses a regression task that estimates the music signal itself. To combine the two works for the purpose of music identification, the integrated system should be configured in the order of music source separation, music detection, and music identification. Building such an integrated system also requires jointly optimizing the deep learning architectures for source separation and music detection on the same training data. This issue will be studied in future work.
In Section 2, DenseNet and the baseline architecture that applies DenseNet to audio source separation are introduced. In Section 3, the overall proposed architecture is described. Two experiments and their results are presented in Section 4: the first is singing voice separation on an open dataset, and the second is music identification after source separation on our own dataset. Finally, the summary and conclusions are given in Section 5.
5. Conclusions
In this study, we proposed source separation using a dilated time-frequency DenseNet for music identification in broadcast content. The background music of broadcast content is frequently mixed with speech, and in most cases the volume of the music signal is lower than that of the speech signal. In this situation, music identification is not easy, and hence background music separation is required before music identification.
In previous studies, source separation using deep learning was studied extensively and showed good performance. We added time- and frequency-dilated convolutions and applied different convolutions to each frequency band of the spectrogram to effectively increase the receptive field of the CNN-based DenseNet architecture. We then conducted a music-identification experiment by separating the music signal from mixture signals with the proposed architecture and with the previous architectures.
The music-identification results did not correlate with the separation performance: for Wave-U-Net, MDenseNet, and MMDenseNet, the identification results contrasted with the separation results. The separation performance measured by SDR was strongly affected by the low-frequency region of the spectrogram, whereas the music identification module extracted its fingerprinting feature from the peak points of the spectrogram. Accordingly, as long as the peak points of the separated signal are well preserved, identification is likely to succeed. Despite these differing performance characteristics, the proposed architecture showed the best performance in identification as well as in separation.
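As a rough illustration of the peak-based fingerprinting idea (a minimal sketch, not the identification module used in our experiments; the neighborhood size and threshold are arbitrary assumptions), local spectral peaks can be extracted as follows.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def spectral_peaks(magnitude, neighborhood=20, threshold=1e-3):
    """Return (freq_bin, frame) coordinates of local maxima in a
    magnitude spectrogram; such peaks form the basis of
    landmark-style audio fingerprints."""
    local_max = maximum_filter(magnitude, size=neighborhood) == magnitude
    peaks = local_max & (magnitude > threshold)
    return np.argwhere(peaks)

# Example with a random "spectrogram" of 513 bins x 200 frames
coords = spectral_peaks(np.random.rand(513, 200))
```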