Article

A Robust Cover Song Identification System with Two-Level Similarity Fusion and Post-Processing

School of Information Science and Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2018, 8(8), 1383; https://doi.org/10.3390/app8081383
Submission received: 23 July 2018 / Revised: 13 August 2018 / Accepted: 14 August 2018 / Published: 16 August 2018
(This article belongs to the Special Issue Digital Audio and Image Processing with Focus on Music Research)

Abstract

Similarity measurement plays an important role in various information retrieval tasks. In this paper, a music information retrieval scheme based on two-level similarity fusion and post-processing is proposed. At the similarity fusion level, to take full advantage of the common and complementary properties among different descriptors and different similarity functions, first, the track-by-track similarity graphs generated from the same descriptor but different similarity functions are fused with the similarity network fusion (SNF) technique. Then, the obtained first-level fused similarities based on different descriptors are further fused with the mixture Markov model (MMM) technique. At the post-processing level, diffusion is first performed on the two-level fused similarity graph to utilize the underlying track manifold contained within it. Then, a mutual proximity (MP) algorithm is adopted to refine the diffused similarity scores, which helps to reduce the bad influence caused by the “hubness” phenomenon contained in the scores. The performance of the proposed scheme is tested in the cover song identification (CSI) task on three cover song datasets (Covers80, Covers40, and Second Hand Songs (SHS)). The experimental results demonstrate that the proposed scheme outperforms state-of-the-art CSI schemes based on single similarity or similarity fusion.

1. Introduction

A huge increase in the number of digital music tracks promotes the development of content-based music information retrieval (MIR) technology. As a part of MIR, cover song identification (CSI, also called cover version identification) has received increasing attention due to its potential real-world applications in copyright protection and the management of online music products. Additionally, the study of CSI techniques helps to understand how the human auditory system measures and models the similarity between music.
As one of the most fundamental components of MIR applications, how to measure and model similarity between music items is an important yet challenging research question [1]. Various similarity functions have been proposed in recent years [2,3,4,5]. Considering that the similarity between two tracks can be calculated based on different descriptors and similarity functions, the complementary properties are neglected while using a single similarity function. It has been verified [6,7,8] that different descriptors and similarity functions are complementary to each other in the CSI task. To fully take advantage of the common as well as complementary information contained in different descriptors and similarity functions in describing the similarity between tracks, some researchers began to study similarity fusion algorithms for CSI. In [9], the main melody and accompaniment of the music were extracted first. Then, the maximum value of the similarities obtained based on main melody, accompaniment, and mixture signal, separately, was taken as the final similarity. In [6], the standard classification-based fusion strategy [10] was adopted to fuse the similarities of three related yet different descriptors (harmony, melody, and bass line). In [11], the fusion of different similarities was achieved by projecting different similarities in a multi-dimensional space, where the dimensionality of the space was the number of similarities considered. However, this scheme was easily disturbed by bad descriptors because of the diluted signal-to-noise ratio. In [12], the similarity graphs obtained based on different descriptors and corresponding similarity functions were fused by the similarity network fusion (SNF) technique [13]. Then, the track-by-track similarities in the fused similarity graph were adopted for version identification. 
Due to the merits of the SNF technique, this fusion scheme could reduce the noise existing in each similarity graph and take advantage of the common as well as complementary information across each similarity graph. A similar strategy was adopted in [8] to fuse the similarities obtained based on the same descriptor and different similarity functions (Qmax [4] and Dmax [5]). This achieved the highest identification accuracy in the CSI task of MIREX 2016 (http://www.music-ir.org/mirex/wiki/2016:Audio_Cover_Song_Identification_Results). Some researchers proposed multi-stage similarity fusion schemes to take advantage of the common and complementary information provided by different musical descriptors and different similarity functions at the same time [7,14]. In [14], the SNF technique was applied to both the descriptor-level fusion and the similarity-level fusion. It achieved the highest identification accuracy on the Covers80 dataset. In [7], in the early fusion, the similarities obtained by the same descriptor and different similarity functions were integrated by SNF. In the late fusion, the learning method selected by the sparse group LASSO algorithm was applied to the early fused similarity to obtain the probability that the input track pair belonged to the reference/cover pair. Finally, the final similarity was obtained by averaging the probability-based similarities obtained based on each descriptor.
However, some important factors that may seriously influence the identification accuracy are not considered in the available fusion schemes: (i) The complementarity among different descriptors and that among different similarity functions is not considered simultaneously [6,8] or not fused efficiently [7]. (ii) The track manifold of the fused similarity graph, which will affect retrieval accuracy greatly, is not taken into consideration [15] (refer to Section 2.4.1 for specific examples). (iii) The bad influence caused by the “hubness” phenomenon contained in the fused similarity graph is seldom considered, which may increase the false positive rate [16].
To solve the possible shortcomings existing in the available similarity fusion algorithms and enhance the CSI performances further, a new CSI scheme based on two-level similarity fusion and post-processing is put forward in this paper. At the fusion level, a nonlinear graph fusion technique [13] is first adopted to fuse the similarity graphs constructed based on the same descriptor and different similarity functions. Then, a mixture Markov model (MMM) [17] is introduced to integrate the first-level fused similarity graphs generated based on two complementary descriptors. At the post-processing level, diffusion [16] is first applied on the obtained two-level fused similarity graph to take full advantage of the underlying structure of the tracks contained within it to reduce the noise and enhance the identification further. Then, the mutual proximity (MP) technique [15] is performed on the diffused similarity scores to reduce the bad influence caused by the “hubness” phenomenon existing in the diffused track community. It should be noted that the proposed scheme is different from our previously proposed scheme [7] in the following respects: (i) Unlike the scheme in [7], the proposed scheme is fully unsupervised. (ii) The track manifold contained in the two-level fused similarity graph is not considered in [7]. (iii) The negative influence of the “hubness” phenomenon, which is not considered in [7], is eliminated by the MP technique in the proposed scheme. 
Extensive experiments conducted on three cover song datasets (Covers80 (https://labrosa.ee.columbia.edu/projects/coversongs/covers80/), Covers40, and SHS (https://labrosa.ee.columbia.edu/millionsong/secondhand)) demonstrate the necessity and effectiveness of each step included in the proposed model (Section 3.3.1) and the superiority of the proposed scheme over state-of-the-art CSI schemes in terms of identification accuracy (Section 3.3.2) and computational complexity, especially as the size of the dataset increases (Section 3.3.3). The rest of this paper is organized as follows. The proposed model is presented in Section 2. Section 3 reports the experimental results. Finally, conclusions are drawn and future work is discussed in Section 4.

2. Proposed Model

A block diagram of the proposed model, which is illustrated by an example of results obtained on Covers40 (see Section 3.1), is shown in Figure 1.
Let $V = \{v_q \mid q = 1, \ldots, N\}$ denote a music collection. Two function lists are defined as follows:
  • Function list $f = \{f_i \mid i = 1, \ldots, M\}$: where $f_i(v_q)$ extracts the i-th kind of descriptor from the track $v_q$.
  • Function list $s = \{s_j \mid j = 1, \ldots, R\}$: where $s_j(f_i(v_q), f_i(v_p))$ computes the j-th similarity score between the i-th descriptors of the input tracks $v_q$ and $v_p$.

2.1. Descriptor Extraction

For each track $v_q$, $q = 1, \ldots, N$, in the music collection, $M$ kinds of descriptors (denoted as $f_i(v_q)$, $i = 1, \ldots, M$) are extracted. In the proposed scheme, the harmonic pitch class profile (HPCP) [18] and main melody (MLD) [19] descriptors are extracted from each track.

2.2. First-Level Fusion

For each pair of tracks ($v_q$ and $v_p$), the j-th kind of similarity function is performed on their i-th descriptors to obtain the similarity score $s_j^{(i)}(q, p)$:
$$s_j^{(i)}(q, p) = s_j\left(f_i(v_q), f_i(v_p)\right), \quad i = 1, \ldots, M, \; j = 1, \ldots, R. \tag{1}$$
Thus, the track-by-track similarity matrix obtained based on the i-th descriptor and j-th similarity function can be represented as a graph, denoted as $G_j^{(i)}\{V, E, s_j^{(i)}\}$, where the vertices $V$ correspond to the tracks in the collection, and the edges $E$ are weighted by the corresponding similarity scores $s_j^{(i)}$.
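The construction of the $M \times R$ similarity matrices in Equation (1) can be sketched as follows. This is only an illustration under stated assumptions: the descriptor functions `toy_mean` and `toy_spread` and the similarity functions `toy_gauss` and `toy_inverse` are hypothetical stand-ins and are not HPCP, MLD, Qmax, or Dmax; the helper name `build_similarity_graphs` is likewise introduced here for illustration.

```python
import numpy as np

def build_similarity_graphs(tracks, descriptors, similarities):
    """Return sims[i][j], an N x N matrix holding s_j^(i)(q, p) of Equation (1)."""
    # Extract every descriptor once per track: feats[i][q] = f_i(v_q).
    feats = [[f(v) for v in tracks] for f in descriptors]
    n = len(tracks)
    sims = []
    for feat_i in feats:                 # loop over descriptors i = 1..M
        per_descriptor = []
        for s in similarities:           # loop over similarity functions j = 1..R
            mat = np.empty((n, n))
            for q in range(n):
                for p in range(n):
                    mat[q, p] = s(feat_i[q], feat_i[p])
            per_descriptor.append(mat)
        sims.append(per_descriptor)
    return sims

# Hypothetical toy descriptors (placeholders, not HPCP or MLD).
def toy_mean(track):
    return np.array([np.mean(track)])

def toy_spread(track):
    return np.array([np.std(track)])

# Hypothetical toy similarity functions (placeholders, not Qmax or Dmax).
def toy_gauss(a, b):
    return float(np.exp(-np.sum((a - b) ** 2)))

def toy_inverse(a, b):
    return float(1.0 / (1.0 + np.sum(np.abs(a - b))))
```

With $M = 2$ descriptors and $R = 2$ similarity functions, the call returns four $N \times N$ matrices, one per graph $G_j^{(i)}$.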
To take advantage of the complementarity between the Qmax and Dmax similarity functions in representing the similarity between cover versions, the similarity graphs based on the same descriptor (HPCP or MLD) and two different similarity functions (Qmax [4] and Dmax [5]) are fused with the SNF technique [13]. The specific details of the SNF technique can be found in [13] and [7]. The first-level fused similarity graph for the i-th descriptor can be denoted as $G^{(i)}(V, E, A^{(i)})$, $i = 1, 2$, which is obtained with Equation (2):
$$G^{(i)}\left(V, E, A^{(i)}\right) = \mathrm{SNF}\left(s_j^{(i)}\right), \quad i = 1, \ldots, M, \; j = 1, \ldots, R. \tag{2}$$
To test the validity of the first-level fusion, three cover sets shown in Table 1 are studied here. The six tracks were used both as the queries and the targets. The corresponding 6 × 6 similarity matrices obtained by MLD-Qmax, MLD-Dmax, and the first-level fused version of them (denoted as SNF-MLD-QD), are shown in Figure 2a–c, respectively. The cells corresponding to the query/cover pairs are marked with white boxes. It can be seen that MLD-Qmax and MLD-Dmax did not work on the No. 1 and No. 3 cover sets, respectively. However, after first-level fusion, this problem was solved.
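To make the first-level fusion step more concrete, a simplified sketch of SNF for two similarity matrices is given below. It follows the general cross-diffusion idea of [13] (a row-normalized full kernel, a sparse K-nearest-neighbor local kernel, and alternating updates), but the function names, the default values of `k` and `iters`, and the normalization applied after each iteration are assumptions made for illustration; it is not the exact implementation used in the experiments. The inputs are assumed to be symmetric with strictly positive off-diagonal entries.

```python
import numpy as np

def full_kernel(W):
    """Row-normalize off-diagonal entries to sum to 1/2 and set the diagonal to 1/2."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Keep the k largest off-diagonal entries of each row, then row-normalize."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))
    idx = np.argsort(-off, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(off)
    S[rows, idx] = off[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf_two(W1, W2, k=3, iters=20):
    """Fuse two similarity matrices by cross-diffusion (simplified sketch)."""
    P1, P2 = full_kernel(W1), full_kernel(W2)
    S1, S2 = local_kernel(W1, k), local_kernel(W2, k)
    for _ in range(iters):
        # Each graph is diffused through the local structure of the other one.
        P1_new = S1 @ P2 @ S1.T
        P2_new = S2 @ P1 @ S2.T
        # Re-normalization after each step is an assumption of this sketch.
        P1, P2 = full_kernel(P1_new), full_kernel(P2_new)
    fused = (P1 + P2) / 2.0
    return (fused + fused.T) / 2.0  # enforce symmetry of the fused matrix
```

In the setting of this section, `W1` and `W2` would be the Qmax-based and Dmax-based similarity matrices of the same descriptor, and the returned matrix would play the role of $A^{(i)}$.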

2.3. Second-Level Fusion

To make full use of the common and complementary properties of different descriptors (HPCP and MLD), the first-level fused similarity graphs for each descriptor are further fused with the MMM technique [17] as follows.
For a walker sitting at vertex $v_q \in V$ in graph $G^{(i)}(V, E, A^{(i)})$, she first decides which graph to land in, jumps to that graph, then decides which neighboring vertex to go to according to the graph’s similarity matrix. The procedure of walking from $v_q$ to $v_p$ across all graphs can be represented with Equation (3):
$$\xi(v_p \mid v_q) = \sum_i \xi^{(i)}(v_p \mid v_q)\, \xi^{(i)}(v_q), \tag{3}$$
where $\xi(v_p \mid v_q)$ is the transition probability of walking from $v_q$ to $v_p$ in the second-level fused similarity graph. $\xi^{(i)}(v_q)$ is the probability of switching to (or staying in) graph $G^{(i)}$ when the walker is at vertex $v_q$.
The degree of $v_q$ in $G^{(i)}$, denoted as $d^{(i)}(v_q)$, is defined as the sum of the edge strength of all vertices connected to $v_q$ (i.e., $d^{(i)}(v_q) = \sum_p A^{(i)}(q, p)$). The volume of graph $G^{(i)}$, denoted as $\theta^{(i)}$, is defined as the sum of all edge strengths in it, which can be calculated as $\theta^{(i)} = \sum_{v_q, v_p \in V} A^{(i)}(q, p) = \sum_{v_q \in V} d^{(i)}(v_q)$. Then, $\xi^{(i)}(v_p \mid v_q)$ can be rewritten as
$$\xi^{(i)}(v_p \mid v_q) = A^{(i)}(q, p) / d^{(i)}(v_q). \tag{4}$$
When the random walk model reaches a stationary state, the stationary probability at vertex $v_q$ is defined as
$$\Pi^{(i)}(v_q) = d^{(i)}(v_q) / \theta^{(i)}. \tag{5}$$
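As a brief check of Equation (5), and assuming that the similarity matrix $A^{(i)}$ is symmetric (an assumption not stated explicitly above), one step of the random walk of Equation (4) leaves $\Pi^{(i)}$ unchanged:

```latex
\begin{aligned}
\sum_{v_q \in V} \Pi^{(i)}(v_q)\, \xi^{(i)}(v_p \mid v_q)
  &= \sum_{v_q \in V} \frac{d^{(i)}(v_q)}{\theta^{(i)}} \cdot \frac{A^{(i)}(q, p)}{d^{(i)}(v_q)} \\
  &= \frac{1}{\theta^{(i)}} \sum_{v_q \in V} A^{(i)}(q, p)
   = \frac{1}{\theta^{(i)}} \sum_{v_q \in V} A^{(i)}(p, q)
   = \frac{d^{(i)}(v_p)}{\theta^{(i)}}
   = \Pi^{(i)}(v_p).
\end{aligned}
```

The third equality is where symmetry is used; the probabilities also sum to one because $\sum_{v_q \in V} d^{(i)}(v_q) = \theta^{(i)}$.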
Suppose the stationary probability of the second-level fused graph, denoted as $\Pi(v_q)$, can be represented by a linear combination of the stationary probabilities of all first-level fused graphs as follows:
$$\Pi(v_q) = \sum_i w_i(v_q) \cdot \Pi^{(i)}(v_q), \tag{6}$$
where $w_i(v_q)$ is the weight for vertex $v_q \in V$ in graph $G^{(i)}$, $w_i(v_q) \leq 1$ and $\sum_i w_i(v_q) = 1$.
Then, $\xi^{(i)}(v_q)$ in Equation (3) can be calculated as follows:
$$\xi^{(i)}(v_q) = \frac{w_i(v_q)\, \Pi^{(i)}(v_q)}{\Pi(v_q)}. \tag{7}$$
By plugging (4), (5), (7) into (3), we obtain
$$\xi(v_p \mid v_q) = \frac{1}{\Pi(v_q)} \sum_i w_i(v_q)\, \frac{A^{(i)}(q, p)}{\theta^{(i)}}. \tag{8}$$
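The substitution behind Equation (8) can be written out step by step, using only Equations (3), (4), (5), and (7); the degree $d^{(i)}(v_q)$ cancels between the within-graph transition probability and the stationary probability:

```latex
\begin{aligned}
\xi(v_p \mid v_q)
  &= \sum_i \xi^{(i)}(v_p \mid v_q)\, \xi^{(i)}(v_q)
   && \text{by (3)} \\
  &= \sum_i \frac{A^{(i)}(q, p)}{d^{(i)}(v_q)} \cdot \frac{w_i(v_q)\, \Pi^{(i)}(v_q)}{\Pi(v_q)}
   && \text{by (4) and (7)} \\
  &= \sum_i \frac{A^{(i)}(q, p)}{d^{(i)}(v_q)} \cdot \frac{w_i(v_q)}{\Pi(v_q)} \cdot \frac{d^{(i)}(v_q)}{\theta^{(i)}}
   && \text{by (5)} \\
  &= \frac{1}{\Pi(v_q)} \sum_i w_i(v_q)\, \frac{A^{(i)}(q, p)}{\theta^{(i)}}.
\end{aligned}
```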
Then $A(q, p) = \sum_i w_i(v_q)\, A^{(i)}(q, p) / \theta^{(i)}$ is adopted as the second-level fused similarity score. The corresponding similarity graph is denoted as $G(V, E, A)$, where $A = \{A(q, p), \; q, p = 1, \ldots, N\}$.
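A minimal numerical sketch of the second-level fused score is given below. The function names `mmm_fuse` and `transition_matrix` are introduced here for illustration, and the default of equal weights $w_i(v_q) = 1/M$ is an assumption, since the choice of weights is not specified above.

```python
import numpy as np

def mmm_fuse(graphs, weights=None):
    """Compute A(q, p) = sum_i w_i(v_q) * A_i(q, p) / theta_i.

    graphs  : list of M first-level fused similarity matrices (N x N each)
    weights : array of shape (M, N) with weights[i, q] = w_i(v_q);
              defaults to equal weights 1/M (an assumption of this sketch)
    """
    graphs = [np.asarray(A, dtype=float) for A in graphs]
    m, n = len(graphs), graphs[0].shape[0]
    if weights is None:
        weights = np.full((m, n), 1.0 / m)
    weights = np.asarray(weights, dtype=float)
    fused = np.zeros((n, n))
    for A_i, w_i in zip(graphs, weights):
        theta_i = A_i.sum()                  # volume of graph i
        fused += w_i[:, None] * A_i / theta_i  # weight depends on the row vertex v_q
    return fused

def transition_matrix(fused):
    """Transition probabilities of Equation (8): each row of A divided by Pi(v_q)."""
    pi = fused.sum(axis=1, keepdims=True)    # Pi(v_q) = sum_p A(q, p)
    return fused / pi
```

Since $\sum_p A(q, p) = \sum_i w_i(v_q)\, d^{(i)}(v_q) / \theta^{(i)} = \Pi(v_q)$, each row of the matrix returned by `transition_matrix` sums to one.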

2.4. Post-Processing

At the post-processing level, first, the locally constrained diffusion process (LCDP) [16] is performed on the second-level fused similarity graph to make full use of the underlying track manifold structure contained within it to enhance the retrieval performance. Then, the MP technique is applied on the obtained diffused similarity to eliminate the negative influence caused by the “hubness” phenomenon contained in the diffused track community.

2.4.1. Diffusion Processing

For diffusion processing, we adopt the LCDP technique proposed in [16]. The central concept of LCDP is to restrict a random walk to the K nearest neighbors of the data points by replacing the original graph $G$ in traditional diffusion process with a K nearest neighbor (K-NN) graph $G_K$, which can effectively reduce the influence of the noisy data points. Figure 3 shows the classification results of double moon data before and after applying diffusion on the distance values. It can be seen that diffusion can utilize the structure of the underlying data manifold to enhance classification performance.
Given the second-level fused similarity matrix $A$, the transition matrix, denoted as $U = \{U(q, p) \mid q, p = 1, \ldots, N\}$, can be calculated as follows:
$$U = D^{-1} A, \tag{9}$$
where $D$ is a diagonal matrix and the q-th diagonal element $D(q, q)$ is the degree of $v_q$ in graph $G$.
Assume that the K-NN graph of $G$ is $G_K$, which is generated by only keeping the similarity scores of each node and its K nearest neighbors in $G$. The transition matrix corresponding to $G_K$ is $U_K$. We generate a diffused similarity matrix, denoted as $F = (f_1^t, f_2^t, \ldots, f_N^t)^T$, where $f_q^t$ is a column vector indicating the probability of being at a vertex starting from vertex $v_q$ after $t$ steps. Then, LCDP [16] is employed to iteratively update $F$ as follows:
$$F_{t+1} = U_K F_t U_K^T, \tag{10}$$
where $F_0 = U_K$, and the diffusion terminates after a pre-defined number of iterations or if $F$ does not change. Then, the obtained diffused similarity graph can be denoted as $G^{(d)}(V, E, F)$.
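The update rule of Equation (10) can be sketched in a few lines. The function names, the default values of `k`, `max_iter`, and `tol`, and the choice to keep the `k` largest entries of each row (which may include the diagonal entry) are assumptions of this sketch rather than settings reported in the paper.

```python
import numpy as np

def knn_transition(A, k):
    """Build U_K: keep the k largest similarities in each row, then row-normalize."""
    A = np.asarray(A, dtype=float)
    keep = np.argsort(-A, axis=1)[:, :k]
    rows = np.arange(A.shape[0])[:, None]
    A_k = np.zeros_like(A)
    A_k[rows, keep] = A[rows, keep]
    # Dividing each row by its sum is the K-NN analogue of U = D^{-1} A.
    return A_k / A_k.sum(axis=1, keepdims=True)

def lcdp(A, k=5, max_iter=10, tol=1e-9):
    """Iterate F_{t+1} = U_K F_t U_K^T starting from F_0 = U_K."""
    U_k = knn_transition(A, k)
    F = U_k.copy()
    for _ in range(max_iter):
        F_next = U_k @ F @ U_k.T
        converged = np.abs(F_next - F).max() < tol
        F = F_next
        if converged:        # stop early if F no longer changes
            break
    return F
```

Here `A` would be the second-level fused similarity matrix, and the returned matrix corresponds to the diffused similarity $F$.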

2.4.2. Hubness Reduction

To reduce the negative influence caused by the “hubness” phenomenon existing in the track community, we adopt the MP algorithm [15] to transform the obtained arbitrary similarity scores into probability-based similarity scores. MP is a global scaling method, and its general idea is to reinterpret the original distance space so that two objects sharing similar nearest neighbors are more closely tied to each other. Under the assumption that all distances in a data set follow a certain distribution, any similarity $s_{x,y}$ can be reinterpreted as the probability of $v_y$ being the nearest neighbor of $v_x$, given the distribution $P(X)$ defined by the similarities of $v_x$ to all other objects in the collection:
$$P(X < s_{x,y}) = F_x(s_{x,y}). \tag{11}$$
$F_x$ denotes the cumulative distribution function (CDF) assumed for the distribution of the similarity scores $s_{x,i}$, $i = 1, \ldots, n$. Then, the MP-based similarity between $v_x$ and $v_y$, denoted as $MP(x, y)$, is defined as the probability that $v_y$ is the nearest neighbor of $v_x$ given $P(X)$ and $v_x$ is the nearest neighbor of $v_y$ given $P(Y)$ as follows:
$$MP(x, y) = P(X < s_{x,y} \cap Y < s_{y,x}). \tag{12}$$
By visualizing the joint similarity score distribution of $X$ and $Y$, computing MP for a given similarity score $s_{x,y}$ in a collection of $N$ objects reduces to counting the number of objects $j$ that have a smaller similarity score to both $v_x$ and $v_y$ than $s_{x,y}$:
$$MP(x, y) = \frac{\left| \{ j : s_{x,j} < s_{x,y} \} \cap \{ j : s_{y,j} < s_{y,x} \} \right|}{N}. \tag{13}$$
Figure 4 shows the probability distribution of the diffused similarities on Covers40 before and after applying the MP algorithm to them. It can be seen that the MP algorithm helps to enlarge the difference between inter tracks (unrelated tracks), which helps to reduce the false positive rate.
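The counting form of Equation (13) translates directly into code. The sketch below is an unoptimized illustration with a hypothetical function name; it loops over all pairs and therefore costs $O(N^3)$, which is acceptable only for small collections.

```python
import numpy as np

def mutual_proximity(S):
    """Empirical MP of Equation (13) for a similarity matrix S (N x N)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    mp = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            below_x = S[x] < S[x, y]   # objects j with s_{x,j} < s_{x,y}
            below_y = S[y] < S[y, x]   # objects j with s_{y,j} < s_{y,x}
            mp[x, y] = np.count_nonzero(below_x & below_y) / n
    return mp

# Small worked example with three tracks.
S_example = np.array([[1.0, 0.8, 0.2],
                      [0.8, 1.0, 0.5],
                      [0.2, 0.5, 1.0]])
MP_example = mutual_proximity(S_example)
```

For the pair (0, 1) in this example, only track 2 is less similar to both tracks than they are to each other, so `MP_example[0, 1]` equals 1/3, whereas `MP_example[0, 2]` is 0 because no track is less similar to track 0 than track 2 is.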

3. Experiments

In this section, we evaluate the performance of the proposed scheme. The cover song data sets used in the experiment and the experimental settings are described in Section 3.1 and Section 3.2, respectively. The experimental results, which include the necessity and importance of each step in the proposed scheme, the performance comparison with state-of-the-art CSI schemes, and the computational complexity comparison with other fusion-based CSI schemes, are discussed in Section 3.3.

3.1. Datasets

To evaluate the performance of the proposed model, we used three different cover song datasets (see Table 2) in the experiments.
Covers80, denoted as DB160 in this paper, is provided by Ellis from LabROSA. It contains 80 cover sets with 2 tracks in each set. Most of the tracks in this database have significant differences in rhythm.
Covers40, denoted as DB400 here, is composed of 400 tracks and 40 cover sets collected by us. There are 9 cover versions, which include both popular songs and classical music, for each original track. A complete list of this collection can be obtained by contacting us by email.
SHS is a part of the Second Hand Songs cover song dataset and consists of 12,730 tracks. There are 4235 original tracks and 8495 covers in this collection. The average number of covers in each cover set is 3.01, ranging from 2 to 42. This collection spans a variety of genres, including pop, rock, electronic, jazz, blues, and classical music. As shown in Table 2, we split it into four non-overlapping subsets sequentially, denoted as DB3172, DB3183, DB3187, and DB3188, respectively.

3.2. Experiment Settings

To reduce the computation time and the memory requirements, the track was converted into a mono, 22.5 kHz, and 16 bits per sample version. Then the pre-processed signal was segmented into frames of 464 ms by Hamming window without overlapping. For each frame, the HPCP and MLD descriptors were extracted. Qmax and Dmax were adopted to measure the similarity between HPCP or MLD descriptors. As for the evaluation measures, the mean of average precision (MAP) [4], the mean averaged reciprocal rank (MaRR) [20], and the total number of covers identified in TOP 10 (TOP-10) were adopted to evaluate the performance of the CSI schemes. The larger the value of MAP, MaRR, or TOP-10, the better the performance achieved.
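For reference, the three evaluation measures can be computed from a track-by-track similarity matrix and cover-set labels roughly as follows. This is a sketch under stated assumptions: the function name `evaluate` is hypothetical, each track is used as a query against all other tracks, and MaRR is taken here as the mean reciprocal rank of the first correctly identified cover, which is one common reading of the measure and may differ in detail from the definition in [20].

```python
import numpy as np

def evaluate(S, labels, top=10):
    """Return (MAP, MaRR, TOP-N hits) for similarity matrix S and cover-set labels."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    ap_list, rr_list, hits = [], [], 0
    for q in range(len(labels)):
        order = np.argsort(-S[q], kind="stable")  # most similar first
        order = order[order != q]                 # the query itself is not a candidate
        relevant = labels[order] == labels[q]
        if not relevant.any():
            continue                              # query without any cover in the set
        ranks = np.flatnonzero(relevant) + 1      # 1-based ranks of the true covers
        precisions = np.arange(1, len(ranks) + 1) / ranks
        ap_list.append(precisions.mean())         # average precision of this query
        rr_list.append(1.0 / ranks[0])            # reciprocal rank of the first cover
        hits += int(relevant[:top].sum())         # covers found in the top positions
    return float(np.mean(ap_list)), float(np.mean(rr_list)), hits
```

With a similarity matrix that ranks every cover above every unrelated track, the function returns a MAP and MaRR of 1.0.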

3.3. Experimental Results

First, we prove the necessity and importance of each step included in the proposed model by comparing the identification accuracy obtained in each step. Second, we compare the performance of the proposed model with those of state-of-the-art CSI schemes, in terms of MAP, MaRR, and TOP-10, on all three datasets. Finally, we compare the computational complexity of the proposed model with those of other similarity-fusion based CSI schemes.

3.3.1. Necessity and Importance of Each Step Included in the Proposed Model

To verify the necessity and validity of each step in the proposed model (see Figure 1), the identification accuracy in terms of MAP, MaRR, and TOP-10 achieved in each step are compared in Figure 5, where baseline (BL) is the fusion object (HPCP-Qmax, HPCP-Dmax, MLD-Qmax, MLD-Dmax) that achieved the best performance, and SNF denotes the first-level fused similarity for the HPCP descriptor. In Figure 5, only the results on DB3172 are included. Similar results could be obtained on the other three SHS subsets.
The experimental results shown in Figure 5 demonstrate that: (i) Each step in the proposed model helped to enhance the identification accuracy. (ii) SNF-based first-level fusion could enhance the MAP and TOP-10 performances to a large extent. (iii) MMM-based second-level fusion helped to improve the MAP and TOP-10 further. (iv) Diffusion could greatly enhance the performance of the proposed model in terms of TOP-10, which may benefit from making use of the track manifold of the fused similarity graph. (v) The MP step helped to greatly enhance the MaRR performance of the proposed model, indicating a lower false positive rate.

3.3.2. Comparison with State-Of-The-Art CSI Schemes

To verify the efficiency of the proposed scheme in comparison with other CSI schemes that are based on single similarity function or similarity fusion, the MAP, MaRR, and TOP-10 achieved by each scheme are listed in Table 3. The CSI schemes included in this experiment were the proposed model (denoted as TLSFP—two-level similarity fusion and post-processing); HPCP-Qmax [4]; HPCP-Dmax [5]; a particle swarm optimization (PSO)-based scheme [21]; a high space (HS) mapping-based scheme [11]; the scheme proposed in [8] (denoted as SNF-2); the scheme proposed in [12] (denoted as SNF-3); SNF-4, which fuses the similarities based on HPCP-Qmax, HPCP-Dmax, MLD-Qmax, and MLD-Dmax with SNF; and a two-layer fusion based scheme [7]. For the HS and PSO schemes, the same similarity types as those in SNF-4 were adopted.
The experimental results shown in Table 3 demonstrate that the proposed TLSFP scheme outperformed the other CSI schemes (based on single similarity function or similarity fusion) included in terms of MAP, MaRR, and TOP-10, on all six datasets except for the MAP value on DB3187. The gap was 0.0069, which is very small and can be neglected.

3.3.3. Computational Complexity Comparison

In this experiment, the computational complexity of the proposed model in terms of average computing time is compared with those obtained by PSO-, HS-, and SNF-4-based fusion schemes.
All the experiments were carried out on a desktop machine with an Intel(R) Core(TM) i7 CPU (4.0 GHz) and 32 GB memory. Given the total fusion computing time T, we obtained the average computing time with A v g T = T / ( N 2 ) 2 , where N is the total number of tracks in the dataset.
The experimental results shown in Figure 6 demonstrate that: (i) The PSO scheme cost much more time than the other three. (ii) HS achieved the lowest computational complexity among the four schemes. However, its identification performance may be unsatisfactory (see Table 3). (iii) The proposed TLSFP scheme needed slightly more time than SNF-4 when the dataset was small. However, as the dataset size increased, the difference became smaller and smaller. When the SHS was considered, the computational complexity of TLSFP was lower than that of SNF-4. The proposed model is therefore well suited to large music collections.

4. Conclusions and Future Work

In this paper, we propose a music information retrieval scheme based on two-level similarity fusion and post-processing. It adopts different strategies (SNF and MMM) to combine the merits of different similarity functions and those of different descriptors in two fusion levels. In addition, it introduces diffusion and MP techniques to refine the fused similarity scores to enhance cover version identification accuracy. Extensive experiments on three cover song datasets (including Covers80 and SHS) demonstrated the effectiveness and efficiency of the proposed model in comparison with state-of-the-art CSI schemes.
TLSFP can be modified and applied to other important tasks in different fields, such as image classification, visual object tracking, cancer subtype identification, and drug taxonomy. We leave all these problems for future work.

Author Contributions

M.L. conceived of the study, participated in the design of the work, data collection, data analysis, interpretation, and coordination, and drafted the manuscript. N.C. helped to revise the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [grant number 61771196].

Conflicts of Interest

The authors declare that they have no competing interests.

Abbreviations

The following abbreviations are used in this manuscript:
BL: Baseline
CSI: Cover Song Identification
HPCP: Harmonic Pitch Class Profile
HS: High Space
K-NN: K Nearest Neighbor
LCDP: Locally Constrained Diffusion Process
MAP: Mean of Average Precision
MaRR: Mean Averaged Reciprocal Rank
MMM: Mixture Markov Model
MLD: Melody
MIR: Music Information Retrieval
MIREX: Music Information Retrieval Evaluation eXchange
MP: Mutual Proximity
PSO: Particle Swarm Optimization
SHS: Second Hand Songs
SNF: Similarity Network Fusion
TLSFP: Two-Level Similarity Fusion and Post-Processing
TOP-10: Total Number of Covers Identified in TOP 10

References

  1. Berenzweig, A.; Logan, B.; Ellis, D.P.; Whitman, B. A large-scale evaluation of acoustic and subjective music-similarity measures. Comput. Music J. 2004, 28, 63–76.
  2. Dannenberg, R.B.; Goto, M. Music structure analysis from acoustic signals. In Handbook of Signal Processing in Acoustics; Springer: New York, NY, USA, 2008; pp. 305–331.
  3. Ellis, D.P. Identifying ‘cover songs’ with beat-synchronous chroma features. MIREX 2006, 1–4.
  4. Serra, J.; Serra, X.; Andrzejak, R.G. Cross recurrence quantification for cover song identification. New J. Phys. 2009, 11, 093017.
  5. Yang, F.; Chen, N. Cover Song Identification Based on Cross Recurrence Plot and Local Alignment. J. East China Univ. Sci. Technol. 2016, 42, 247–253.
  6. Salamon, J.; Serrà, J.; Gómez, E. Melody, bass line, and harmony representations for music version identification. In Proceedings of the 21st International Conference Companion on World Wide Web, Lyon, France, 16–20 April 2012; pp. 887–894.
  7. Chen, N.; Li, M.; Xiao, H. Two-layer similarity fusion model for cover song identification. EURASIP J. Audio Speech Music Process. 2017, 2017, 12.
  8. Chen, N.; Li, W.; Xiao, H. Fusing similarity functions for cover song identification. Multimed. Tools Appl. 2018, 77, 2629–2652.
  9. Foucard, R.; Durrieu, J.L.; Lagrange, M.; Richard, G. Multimodal similarity between musical streams for cover version detection. In Proceedings of the 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, TX, USA, 14–19 March 2010; pp. 5514–5517.
  10. Ravuri, S.; Ellis, D.P. Cover song detection: From high scores to general classification. In Proceedings of the 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, TX, USA, 14–19 March 2010; pp. 65–68.
  11. Degani, A.; Dalai, M.; Leonardi, R.; Migliorati, P. A heuristic for distance fusion in cover song identification. In Proceedings of the 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Paris, France, 3–5 July 2013; pp. 1–4.
  12. Chen, N.; Xiao, H.D. Similarity fusion scheme for cover song identification. Electron. Lett. 2016, 52, 1173–1175.
  13. Wang, B.; Mezlini, A.M.; Demir, F.; Fiume, M.; Tu, Z.; Brudno, M.; Haibe-Kains, B.; Goldenberg, A. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 2014, 11, 333–337.
  14. Tralie, C.J. Early MFCC And HPCP Fusion for Robust Cover Song Identification. arXiv, 2017; arXiv:1707.04680.
  15. Schnitzer, D.; Flexer, A.; Schedl, M.; Widmer, G. Local and global scaling reduce hubs in space. J. Mach. Learn. Res. 2012, 13, 2871–2902.
  16. Yang, X.; Koknar-Tezel, S.; Latecki, L.J. Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 357–364.
  17. Zhou, D.; Burges, C.J. Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 1159–1166.
  18. Gómez, E. Tonal Description of Music Audio Signals. Ph.D. Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2006.
  19. Tsai, W.H.; Yu, H.M.; Wang, H.M. Using the Similarity of Main Melodies to Identify Cover Versions of Popular Songs for Music Document Retrieval. J. Inf. Sci. Eng. 2008, 24, 1669–1687.
  20. Salamon, J. Melody Extraction from Polyphonic Music Signals. Ph.D. Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
  21. Shi, Y. Particle swarm optimization: Developments, applications and resources. In Proceedings of the 2001 Congress on Evolutionary Computation, Seoul, Korea, 27–30 May 2001; Volume 1, pp. 81–86.
Figure 1. Block diagram and illustrative example of the proposed model, using partial results on Covers40 as an example. (a) Extract the harmonic pitch class profile (HPCP) descriptor and main melody (MLD) descriptor from each track in the music collection. (b) A track-by-track similarity graph is constructed based on each descriptor and corresponding similarity function. The similarity graphs based on the same descriptor and different similarity functions are fused with similarity network fusion (SNF). (c) The first-level fused similarity graphs for each descriptor are integrated with the mixture Markov model (MMM) technique to obtain a second-level fused similarity graph. (d) Post-processing. First, diffusion is performed on the second-level fused similarity graph to take advantage of the structure of the underlying track manifold contained within it to reduce noise and enhance retrieval accuracy. Then, mutual proximity (MP) is adopted to modify the diffused similarity to reduce the “hubness” phenomenon.
Figure 1. Block diagram and illustrative example of the proposed model, taking part results on Covers40 as an example. (a) Extract the harmonic pitch class profile (HPCP) descriptor and main melody (MLD) descriptor from each track in the music collection. (b) A track-by-track similarity graph is constructed based on each descriptor and corresponding similarity function. The similarity graphs based on the same descriptor and different similarity functions are fused with similarity network fusion (SNF). (c) The first-level fused similarity graphs for each descriptor are integrated with the mixture Markov model (MMM) technique to obtain a second-level fused similarity graph. (d) Post-processing. First, diffusion is performed on the second-level fused similarity graph to take advantage of the structure of the underlying track manifold contained within it to reduce noise and enhance retrieval accuracy, then mutual proximity (MP) is adopted to modify the diffused similarity to reduce the “hubness” phenomenon.
Applsci 08 01383 g001
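Step (b) of Figure 1 fuses the graphs built from one descriptor with SNF. The following is a minimal NumPy sketch of the commonly described SNF iteration, in which each graph's normalised kernel is propagated through its own k-nearest-neighbour kernel using the average of the other graphs. It is not the authors' implementation; the function names, the parameters `k` and `iters`, and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def full_kernel(W):
    """Normalise each row so the off-diagonal mass is 1/2 and the diagonal is 1/2."""
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, 0.0)
    P = W / (2.0 * W.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Keep the k largest off-diagonal similarities per row, renormalised to sum to 1."""
    W = np.asarray(W, dtype=float).copy()
    np.fill_diagonal(W, -np.inf)          # never pick a track as its own neighbour
    idx = np.argsort(-W, axis=1)[:, :k]   # requires k < number of tracks
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W)
    S[rows, idx] = W[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(graphs, k=3, iters=20):
    """Fuse two or more positive, symmetric similarity graphs of equal size."""
    P = [full_kernel(W) for W in graphs]
    S = [local_kernel(W, k) for W in graphs]
    m = len(graphs)
    for _ in range(iters):
        P_next = []
        for v in range(m):
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            P_next.append(full_kernel(S[v] @ others @ S[v].T))
        P = P_next
    fused = sum(P) / m
    return (fused + fused.T) / 2.0        # enforce symmetry

def toy_graph(within, across):
    """Hypothetical 6-track graph: tracks (0,1), (2,3), (4,5) form cover sets."""
    W = np.full((6, 6), across)
    for a in range(0, 6, 2):
        W[a, a + 1] = W[a + 1, a] = within
    np.fill_diagonal(W, 1.0)
    return W

fused = snf([toy_graph(0.9, 0.1), toy_graph(0.8, 0.3)], k=1)
```

In this toy case both input graphs already agree on the cover pairs, so the fused graph keeps each track's cover partner as its strongest off-diagonal neighbour. Real similarity graphs would be noisier, and `k` and `iters` would need tuning.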
Figure 2. Similarity matrices obtained by (a) MLD-Qmax, (b) MLD-Dmax, and (c) SNF-MLD-QD.
Figure 3. Illustration of the effectiveness of the diffusion process in classification. Pentagrams represent two queries from different groups. Each element is assigned to one of the two queries according to its distance to the query samples: (a) without diffusion; (b) with diffusion.
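The idea behind Figure 3 can be sketched with a generic diffusion on a k-nearest-neighbour graph: similarities are repeatedly propagated through each track's strongest neighbours, so that tracks connected along the manifold move closer together. The update rule below is one common choice and is an assumption here; it is not necessarily the rule used in the article. The names `diffuse`, `k`, and `iters` are likewise illustrative.

```python
import numpy as np

def diffuse(W, k=3, iters=5):
    """Propagate similarities through a row-normalised kNN transition matrix."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    # Keep the k largest entries of each row (self-similarity included).
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(n)[:, None]
    P = np.zeros_like(W)
    P[rows, idx] = W[rows, idx]
    P /= P.sum(axis=1, keepdims=True)
    # Start from the original graph and apply A <- P A P^T repeatedly.
    A = W.copy()
    for _ in range(iters):
        A = P @ A @ P.T
    return A

# Hypothetical 4-track similarity graph.
W = np.array([[1.0, 0.8, 0.3, 0.1],
              [0.8, 1.0, 0.4, 0.2],
              [0.3, 0.4, 1.0, 0.7],
              [0.1, 0.2, 0.7, 1.0]])
A = diffuse(W, k=2, iters=3)
```

Because the iteration starts from a symmetric graph and applies the same transition matrix on both sides, the diffused matrix stays symmetric and non-negative.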
Figure 4. Probability distribution of the diffused similarities on DB400 (a) before and (b) after applying MP to them.
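As a rough illustration of what MP does to the scores in Figure 4, the sketch below uses an empirical form of mutual proximity: the rescaled similarity of tracks x and y is the fraction of the remaining tracks that are less similar to both x and y than they are to each other. Excluding x and y from the count and the toy matrix are assumptions of this sketch, not details taken from the article.

```python
import numpy as np

def mutual_proximity(S):
    """Empirical mutual proximity for a symmetric similarity matrix (n >= 3)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    mp = np.zeros_like(S)                 # diagonal is left at zero
    for x in range(n):
        for y in range(x + 1, n):
            others = [j for j in range(n) if j != x and j != y]
            # Tracks that both x and y rate below their mutual similarity.
            both = np.sum((S[x, others] < S[x, y]) & (S[y, others] < S[x, y]))
            mp[x, y] = mp[y, x] = both / len(others)
    return mp

# Hypothetical example: track 3 is a "hub" that looks similar to every track.
S = np.array([[1.0, 0.6, 0.2, 0.7],
              [0.6, 1.0, 0.2, 0.7],
              [0.2, 0.2, 1.0, 0.7],
              [0.7, 0.7, 0.7, 1.0]])
mp = mutual_proximity(S)
```

Before rescaling, the nearest neighbour of track 0 is the hub (0.7 against 0.6 for track 1). After rescaling, `mp[0, 1]` is 0.5 and `mp[0, 3]` is 0, because the hub is not closer to track 0 than to anything else.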
Figure 5. Identification accuracy achieved in each step of the proposed model on (first column) DB160, (second column) DB400, and (last column) DB3172. BL: baseline; DIFF: diffusion processing; MAP: mean of average precision; MaRR: mean averaged reciprocal rank; TOP-10: total number of covers identified in TOP 10.
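Two of the measures named in Figure 5 can be computed from a similarity matrix and cover-set labels as sketched below. Each track is used as a query, the remaining tracks are ranked by decreasing similarity, and the average precision and the number of covers in the top positions are accumulated. This is a minimal sketch assuming the standard definition of MAP; MaRR is left out because its exact formula is not given here. The function name, the `top_k` parameter, and the toy data are illustrative.

```python
import numpy as np

def evaluate(S, labels, top_k=10):
    """Return (MAP, total number of covers found in the top_k ranks)."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    ap_values, top_hits = [], 0
    for q in range(len(labels)):
        # Rank all other tracks by decreasing similarity to the query.
        order = [j for j in np.argsort(-S[q], kind="stable") if j != q]
        relevant = np.array([labels[j] == labels[q] for j in order])
        if not relevant.any():
            continue                      # query has no cover in the collection
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(order) + 1)
        ap_values.append(float(np.mean(hits[relevant] / ranks[relevant])))
        top_hits += int(relevant[:top_k].sum())
    return float(np.mean(ap_values)), top_hits

# Hypothetical collection: tracks 0 and 1 are covers, tracks 2 and 3 are covers.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.25],
              [0.2, 0.1, 0.25, 1.0]])
labels = [0, 0, 1, 1]
mean_ap, top1 = evaluate(S, labels, top_k=1)
```

In this toy collection three queries rank their cover first and query 2 ranks it second, so `mean_ap` is 0.875 and `top1` is 3.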
Figure 6. Comparison of the average computing time of different similarity fusion schemes on four datasets.
Table 1. The tracks in the selected cover sets.

| Cover Set | Title of the Track | Artist | Track ID |
|---|---|---|---|
| No. 1 | Wish You Were Here | Wyclef Jean | 1 |
| | | Pink Floyd | 2 |
| No. 2 | White Room | Sheryl Crow | 3 |
| | | Cream | 4 |
| No. 3 | Yesterday | En Vogue | 5 |
| | | Beatles | 6 |
Table 2. Cover song datasets used.

| Dataset Name | Num. of Tracks | Num. of Cover Sets | Avg. Num. of Tracks in Each Cover Set |
|---|---|---|---|
| DB160 | 160 | 80 | 2 |
| DB400 | 400 | 40 | 10 |
| DB3172 | 3172 | 1119 | 2.83 |
| DB3183 | 3183 | 985 | 3.23 |
| DB3187 | 3187 | 1030 | 3.09 |
| DB3188 | 3188 | 1101 | 2.90 |
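The last column of Table 2 is the number of tracks divided by the number of cover sets. A short check of that arithmetic, using the counts listed in the table (the helper name is illustrative):

```python
def average_set_size(num_tracks, num_cover_sets):
    """Average number of tracks per cover set, rounded to two decimals."""
    return round(num_tracks / num_cover_sets, 2)

# (number of tracks, number of cover sets) as listed in Table 2.
datasets = {
    "DB160": (160, 80),
    "DB400": (400, 40),
    "DB3172": (3172, 1119),
    "DB3183": (3183, 985),
    "DB3187": (3187, 1030),
    "DB3188": (3188, 1101),
}

averages = {name: average_set_size(*counts) for name, counts in datasets.items()}
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```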
Table 3. Identification accuracy comparison among different cover song identification (CSI) schemes. HS: high space; PSO: particle swarm optimization; TLSFP: two-level similarity fusion and post-processing.

| Dataset | Algorithm | MAP | MaRR | TOP-10 |
|---|---|---|---|---|
| DB160 | HPCP-Qmax [4] | 0.5435 | 0.2831 | 98 |
| | HPCP-Dmax [5] | 0.5709 | 0.2979 | 104 |
| | PSO (HPCP-Qmax) [21] | 0.5758 | 0.2993 | 101 |
| | HS [11] | 0.5868 | 0.3086 | 107 |
| | SNF-2 [8] | 0.6247 | 0.3269 | 114 |
| | SNF-3 [12] | 0.6413 | 0.3346 | 113 |
| | SNF-4 | 0.6479 | 0.3369 | 114 |
| | Two-layer-fusion [7] | 0.6680 | 0.6680 | 119 |
| | TLSFP | 0.6817 | 0.6817 | 125 |
| DB400 | HPCP-Qmax [4] | 0.8227 | 0.1907 | 2852 |
| | HPCP-Dmax [5] | 0.7945 | 0.1907 | 2717 |
| | PSO [21] | 0.7933 | 0.2445 | 2571 |
| | HS [11] | 0.7564 | 0.1883 | 2651 |
| | SNF-2 [8] | 0.9359 | 0.2040 | 3286 |
| | SNF-3 [12] | 0.9611 | 0.2080 | 3408 |
| | SNF-4 | 0.9848 | 0.2118 | 3529 |
| | Two-layer-fusion [7] | 0.9754 | 0.3094 | 3482 |
| | TLSFP | 0.9866 | 0.3107 | 3545 |
| DB3172 | HPCP-Qmax [4] | 0.4448 | 0.2831 | 3538 |
| | HPCP-Dmax [5] | 0.4412 | 0.2059 | 3501 |
| | PSO [21] | 0.4593 | 0.2101 | 3634 |
| | HS [11] | 0.3536 | 0.1691 | 2832 |
| | SNF-2 [8] | 0.5399 | 0.2379 | 4556 |
| | SNF-3 [12] | 0.5004 | 0.2238 | 3962 |
| | SNF-4 | 0.5602 | 0.2468 | 4602 |
| | Two-layer-fusion [7] | 0.5622 | 0.4579 | 4734 |
| | TLSFP | 0.5673 | 0.4590 | 4787 |
| DB3183 | HPCP-Qmax [4] | 0.4296 | 0.1877 | 4647 |
| | HPCP-Dmax [5] | 0.4321 | 0.1921 | 4567 |
| | PSO [21] | 0.4442 | 0.1938 | 4768 |
| | HS [11] | 0.2947 | 0.1366 | 3103 |
| | SNF-2 [8] | 0.5512 | 0.2285 | 6015 |
| | SNF-3 [12] | 0.4893 | 0.2064 | 5177 |
| | SNF-4 | 0.5508 | 0.2285 | 6015 |
| | Two-layer-fusion [7] | 0.5546 | 0.4147 | 6309 |
| | TLSFP | 0.5693 | 0.4221 | 6461 |
| DB3187 | HPCP-Qmax [4] | 0.4270 | 0.1909 | 4025 |
| | HPCP-Dmax [5] | 0.4189 | 0.1918 | 3862 |
| | PSO [21] | 0.4398 | 0.1967 | 4128 |
| | HS [11] | 0.3127 | 0.1472 | 2981 |
| | SNF-2 [8] | 0.5325 | 0.2270 | 5198 |
| | SNF-3 [12] | 0.4865 | 0.2119 | 4487 |
| | SNF-4 | 0.5513 | 0.2339 | 5410 |
| | Two-layer-fusion [7] | 0.5358 | 0.4168 | 5421 |
| | TLSFP | 0.5444 | 0.4207 | 5502 |
| DB3188 | HPCP-Qmax [4] | 0.4485 | 0.2031 | 3835 |
| | HPCP-Dmax [5] | 0.4502 | 0.2084 | 3815 |
| | PSO [21] | 0.4609 | 0.2098 | 3938 |
| | HS [11] | 0.3630 | 0.1711 | 3173 |
| | SNF-2 [8] | 0.5391 | 0.2361 | 4792 |
| | SNF-3 [12] | 0.4951 | 0.2199 | 4193 |
| | SNF-4 | 0.5429 | 0.2383 | 4755 |
| | Two-layer-fusion [7] | 0.5484 | 0.4456 | 4946 |
| | TLSFP | 0.5571 | 0.4504 | 5029 |

Share and Cite

MDPI and ACS Style

Li, M.; Chen, N. A Robust Cover Song Identification System with Two-Level Similarity Fusion and Post-Processing. Appl. Sci. 2018, 8, 1383. https://doi.org/10.3390/app8081383

AMA Style

Li M, Chen N. A Robust Cover Song Identification System with Two-Level Similarity Fusion and Post-Processing. Applied Sciences. 2018; 8(8):1383. https://doi.org/10.3390/app8081383

Chicago/Turabian Style

Li, Mingyu, and Ning Chen. 2018. "A Robust Cover Song Identification System with Two-Level Similarity Fusion and Post-Processing" Applied Sciences 8, no. 8: 1383. https://doi.org/10.3390/app8081383

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
