Article

Adaptive Filtering for Multi-Track Audio Based on Time–Frequency-Based Masking Detection

Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Nottingham, Nottingham NG7 2RD, UK
* Author to whom correspondence should be addressed.
Signals 2024, 5(4), 633-641; https://doi.org/10.3390/signals5040035
Submission received: 7 June 2024 / Revised: 30 August 2024 / Accepted: 24 September 2024 / Published: 2 October 2024

Abstract

There is a growing need to facilitate the production of recorded music as independent musicians are now key in preserving the broader cultural roles of music. A critical component of the production of music is multitrack mixing, a time-consuming task aimed at, among other things, reducing spectral masking and enhancing clarity. Traditionally, this is achieved by skilled mixing engineers relying on their judgment. In this work, we present an adaptive filtering method based on a novel masking detection scheme capable of identifying masking contributions, including temporal interchangeability between the masker and maskee. This information is then systematically used to design and apply filters. We implement our methods on multitrack music to improve the quality of the raw mix.

1. Introduction

With the advent of the internet and advances in electronic and computing technologies, creating music is becoming more accessible than ever before. In short, music production is now available to the masses [1]. It has become much easier for independent artists to access recording tools, software, and learning resources, and therefore also easier for independent musicians to distribute music and gain recognition without a record label [2]. However, independent musicians are drawn into the mainstream less often, leaving mainstream music formulaic and overly commercialised. Consequently, music is in danger of losing some of the cultural aspects that have previously made it significant [3].
Despite advances in technology, independent musicians still face challenges in achieving a professional-quality product. One of these challenges is multitrack mixing, which involves applying multiple processes (panning, equalisation, compression, gain, etc.) to blend multiple audio tracks in a digital audio workstation (DAW) [4]. DAWs contain a very large library of tools for performing a mix: for instance, compressors [5], equalisers [6,7,8], reverberators [9], dereverberators [10], pitch correction [11], and various other effects [12]. Achieving a professional-quality mix still relies on highly experienced mix engineers, and while DAWs have become more affordable and sophisticated, they are by no means easy to use for someone without training [13]. Multiple works address the challenges of mixing for the untrained user, for instance, by using machine learning to predict the values of predefined features that will generate a desired outcome [14,15], or by studying the methodologies of accomplished mixing engineers [16] or their practices in equalisation [17].
One of the key problems in audio multitrack mixing is spectral masking [18], the inability to discern two sounds when they overlap in time and frequency. Removing masking in modern DAWs requires highly developed listening skills, as DAWs still rely on the user to hear and identify which frequencies need adjustment at which times. Several solutions to the challenge of detecting and removing masking have been proposed. These solutions often use cross-adaptive mechanisms to analyse the influence of one track over the others in order to apply equalisation filters [19]. Features for the analysis, such as intensity in sub-bands [20] or loudness loss [21], are often predefined by the user and might not be suitable for every piece. Panning has also been used as a means to reduce masking under a similar analysis scheme [22]. These methods have some drawbacks: for instance, they may rely on the user to predetermine spectral zones or use fixed predefined spectral bands, and they may over- or underestimate masking because they omit the interchangeability of masker and maskee. Equalisation and panning are probably the most effective mechanisms to reduce masking; however, new methods based on artificial intelligence promise not only to reduce masking but to apply all forms of processing in one package [5]. These artificial intelligence methods are still in their infancy and might remove the creative aspects of performing a mix.
There are also a number of commercial software tools that aid in achieving a high-quality mix without conventional training. For instance, Soundtheory's Gullfoss [23] uses equalisation to improve quality, Oeksound's Soothe [24] suppresses prominent frequencies, and Mastering the Mix's Reference [25] compares characteristics of the working mix with those of a reference track and provides guidance on how to achieve the same characteristics. The developers of such tools might not reveal how their technology works, and without this understanding, it can be challenging for users to apply them correctly. At times, this is exacerbated by the use of artificial intelligence approaches [26,27], where the underlying mechanisms of the algorithms are not clear even to the developer. Mixing remains one of the most challenging aspects of music production. Algorithms dedicated to supporting this process are still developing [28], leaving ample room for advancements that fully leverage the accessibility of computing power.
In this article, we propose a simple masking detection method that systematically compares each track at high temporal and spectral resolution, incorporates the interchangeability between masker and maskee, and uses simple user-defined parameters to produce an intricate equalisation filter that improves the clarity of a mix.

2. Evaluation of Masking and Filtering

Our method first calculates the magnitude T of the short-time Fourier transform of each track:
T(t, f) = \left| \mathcal{F}\{ T_r(t_0)\, w(t_0 - t) \} \right|   (1)
where T_r is the raw track, t_0 the temporal dimension, w the window function, and t the time offset of the window function. The purpose of the window function is to suppress the signal outside the region of interest in order to provide resolution in the time domain. In this case, we have used a Hann window [29] of 180 ms in length, spaced by 90 ms. This is a compromise between temporal and spectral resolution, which enables resolving individual notes (see Figure 1a,b).
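As an illustration, the following is a minimal sketch of this step in Python using NumPy/SciPy. The 180 ms window and 90 ms hop follow the values given above; the function and file names are placeholders and not part of the original work.

import numpy as np
from scipy.signal import stft

def stft_magnitude(track, fs, win_ms=180, hop_ms=90):
    """Magnitude of the short-time Fourier transform (Equation (1)),
    computed with a Hann window of win_ms length spaced by hop_ms."""
    nperseg = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    f, t, Z = stft(track, fs=fs, window="hann",
                   nperseg=nperseg, noverlap=nperseg - hop)
    return f, t, np.abs(Z)  # |Z| has shape (frequency, time)

# Example usage (file name is hypothetical):
# from scipy.io import wavfile
# fs, x = wavfile.read("bass.wav")
# f, t, T1 = stft_magnitude(x.astype(float), fs)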
We propose Equation (2) to obtain an array of values that identifies where in time and frequency masking occurs, with what severity, and in which polarity (i.e., differentiating masker and maskee). We define the masking matrix M_12(t, f) by:
M_{12}(t, f) = \dfrac{\left( T_1(t, f)\, T_2(t, f) \right)\left( T_1(t, f) - T_2(t, f) \right)}{\left( \mathrm{mean}(T_1(t, f)) + \mathrm{mean}(T_2(t, f)) \right)^{3}}   (2)
where T_1 and T_2 are the magnitudes of the short-time Fourier transforms of track 1 and track 2, respectively, and mean( ) indicates a 2D average (a scalar). This process creates a masking matrix M_12(t, f) between two specific tracks, 1 and 2 (see Figure 1c). The equation consists of three parts (a code sketch follows the list):
  • T_1(t, f) · T_2(t, f) — this product term evaluates how relevant it is to calculate masking at this specific time and frequency. For instance, if the amplitude of one track is close to zero, then masking would be small.
  • T_1(t, f) − T_2(t, f) — this difference term evaluates the relative magnitude and polarity of masking. Positive values indicate that track 1 is the masker, while negative values indicate it is the maskee. This term is important for differentiating the masker from the maskee at different times.
  • (mean(T_1(t, f)) + mean(T_2(t, f)))^3 — this is a normalisation term that ensures scale invariance, so that the same result is obtained if the same non-zero gain is applied to both tracks.
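Below is a minimal sketch of Equation (2), assuming T1 and T2 are STFT magnitude arrays of equal shape (for example, those returned by the sketch above); the function name is illustrative only.

import numpy as np

def masking_matrix(T1, T2):
    """Masking matrix M12(t, f) of Equation (2): the product term weights
    relevance, the difference term carries magnitude and polarity, and the
    cubed sum of the 2D means makes the result invariant to a common gain."""
    norm = (T1.mean() + T2.mean()) ** 3
    return (T1 * T2) * (T1 - T2) / norm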
The two-dimensional masking matrix M_12(t, f) is then reduced to a one-dimensional form in which the temporal information is summed while the frequency dimension is preserved:
M_{12}(f) = \sum_{t=0}^{N_t} M_{12}(t, f)   (3)
where t is the time position of the window function and N_t is the total number of time samples (see Figure 2a). Even though the temporal information is compressed, this only happens after the masking has been calculated. Therefore, the frequencies that contribute to M_12(f) are only those that overlap in both time and frequency. This is key to preventing the filtering of signal components that do not coincide in time across tracks. After the data are reduced to a single dimension, the positive and negative parts, which correspond to the masking of T1 over T2 (T1 is the masker) and vice versa (T1 is the maskee), are separated. Therefore, M_12(f) becomes:
M_{12}(f) = \begin{cases} M_{12}(f), & \text{if } M_{12}(f) > 0 \\ 0, & \text{otherwise} \end{cases}   (4)
M_{21}(f) = \begin{cases} \left| M_{12}(f) \right|, & \text{if } M_{12}(f) < 0 \\ 0, & \text{otherwise} \end{cases}   (5)
where M_12(f) and M_21(f) are the contributions to masking from the masker and the maskee, respectively (see Figure 2b,c). After calculating all contributions to masking from each track to the others (for an arbitrary number of tracks), these are summed and normalised:
M_{T1}(f) = \dfrac{M_{1,2}(f) + M_{1,3}(f) + \dots + M_{1,N}(f)}{M_{0T1}}
M_{T2}(f) = \dfrac{M_{2,1}(f) + M_{2,3}(f) + \dots + M_{2,N}(f)}{M_{0T2}}
\vdots
M_{TN}(f) = \dfrac{M_{N,1}(f) + M_{N,2}(f) + \dots + M_{N,(N-1)}(f)}{M_{0TN}}   (6)
where N is the number of tracks and M_0T1, M_0T2, ..., M_0TN are the normalise-to-unity factors, calculated as the maximum value of the sum of the numerators for each track. For example,
M_{0T1} = \max\left\{ M_{1,2}(f) + M_{1,3}(f) + \dots + M_{1,N}(f) \right\}.
Equation (6) effectively normalises to unity all the masking contributions of each track to the others in order to generate the final masking vectors (M_T1(f), ..., M_TN(f); see Figure 2d). For an arbitrary number of tracks N, Equation (2) is applied to each pair (see Section 3). The total number of possible combinations between tracks is N(N − 1), because tracks are only grouped in ordered pairs and not with themselves (i.e., no M_1,1(f)). For stereo tracks, the process is similar, treating the left and right channels as independent tracks but not comparing opposite sides of the panning. Mono tracks can then be treated as stereo tracks with identical sides.
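The reduction and normalisation steps of Equations (3)–(6) can be sketched as follows, reusing masking_matrix from the sketch above and assuming each STFT magnitude array has shape (frequency, time); stereo handling is omitted for brevity and the function names are illustrative only.

import numpy as np
from itertools import permutations

def masking_vector(M12, time_axis=1):
    """Equations (3) and (4): sum the masking matrix over time and keep the
    positive part, i.e., the masking of track 1 over track 2. The negative
    part (Equation (5)) is recovered from the reversed pair (2, 1)."""
    Mf = M12.sum(axis=time_axis)
    return np.where(Mf > 0, Mf, 0.0)

def total_masking_vectors(stfts):
    """Equation (6): accumulate, for every ordered pair of tracks (N(N - 1)
    pairs in total), the masking each track exerts on the others, and
    normalise each accumulated vector to unity."""
    N = len(stfts)
    totals = [np.zeros(stfts[0].shape[0]) for _ in range(N)]
    for i, j in permutations(range(N), 2):
        totals[i] += masking_vector(masking_matrix(stfts[i], stfts[j]))
    return [v / v.max() if v.max() > 0 else v for v in totals]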

Filter Design

The masking vectors M_T1(f), M_T2(f), ..., M_TN(f) obtained from Equation (6) represent the masking contributions of each track to the others (see Figure 2d) and thus provide the basis for the reduction of masking by filtering. To design the filters, the most prominent peaks of each masking vector are found by comparing every sample to its two immediate neighbours to detect local maxima (see blue stars in Figure 3a). Then, an adaptive threshold [30] is applied with a fixed lower limit (see Figure 3b). Applying the threshold reduces the number of peaks originally found by comparing neighbours; this is particularly noticeable at higher frequencies (see Figure 3c). The central frequency and width of each remaining peak are then used to generate a parametric equalisation filter [31], with the gain of each band proportional to the peak height so that the relative ratios of masking at different frequencies are preserved (see Figure 3d). The filter is normalised to have a maximum gain of 0 dB and a user-defined minimum gain (for instance, −3 dB in Figure 3d). This process is performed for all masking vectors M_T(f), and the generated filters are applied to their corresponding tracks in the Fourier space. Finally, the tracks are summed and the resulting raw mix is normalised to unity (rough master) for evaluation.
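As a rough illustration of this stage, the sketch below uses SciPy's find_peaks with a prominence criterion as a simplified stand-in for the adaptive threshold of [30], and builds the attenuating response from Gaussian-shaped dips rather than the parametric equaliser sections of [31]; parameter names and values are illustrative assumptions, not the published implementation.

import numpy as np
from scipy.signal import find_peaks

def design_filter_response(M_T, freqs, min_gain_db=-3.0, prominence=0.02):
    """Build an attenuating magnitude response (in dB) from the peaks of a
    masking vector M_T (normalised to unity): dip depths are proportional to
    the peak heights, the maximum gain is 0 dB, and the deepest dip equals
    the user-defined minimum gain."""
    peaks, props = find_peaks(M_T, prominence=prominence, width=1)
    df = freqs[1] - freqs[0]                 # uniform STFT frequency spacing
    gain_db = np.zeros_like(freqs)
    for p, w in zip(peaks, props["widths"]):
        depth = min_gain_db * M_T[p]         # proportional to masking level
        gain_db += depth * np.exp(-0.5 * ((freqs - freqs[p]) / (w * df)) ** 2)
    return np.clip(gain_db, min_gain_db, 0.0)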

3. Results

To demonstrate our methodology, we employ a multitrack from the Mixing Secrets library [13]. The song is Flesh and Bone by Wesley Morgan. The mix contains four tracks (bass, accordion, vocals, and guitar). Tracks with multiple microphone sources were premixed to ensure one track per instrument. The masking matrices for Flesh and Bone obtained using Equation (2) are shown in Figure 4. The most prominent masking comes from the bass to all other tracks (a–c, red); nevertheless, there is masking across all tracks to different extents. Based on these matrices, and using Equations (3)–(6), an attenuating filter was generated for each track (see Figure 5). The filters attenuate multiple peaks at different bands, except for guitar and vocals, which share an attenuation peak at approximately 165 Hz (E3; yellow and dashed purple). Each filter was applied to its corresponding track, and then the tracks were averaged with their original levels (rough mix) and normalised to unity (rough master). The gain of each filter is user-defined and was set to −8, −6, −3, and −3 dB for bass, accordion, vocals, and guitar, respectively.
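A sketch of this application step is given below, assuming equal-length time-domain tracks and the frequency grid and per-track responses produced by the design sketch above; the zero-phase full-length FFT filtering shown here is one possible reading of "applied in the Fourier space" and is an assumption for the example.

import numpy as np

def apply_filters_and_mix(tracks, freqs_design, responses_db, fs):
    """Apply each track's attenuating response in the Fourier domain and
    normalise the resulting mix to unity (rough master)."""
    n = len(tracks[0])
    full_freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mix = np.zeros(n)
    for x, r_db in zip(tracks, responses_db):
        # interpolate the designed response onto the full-length FFT grid
        gain = 10.0 ** (np.interp(full_freqs, freqs_design, r_db) / 20.0)
        mix += np.fft.irfft(np.fft.rfft(x) * gain, n)
    return mix / np.max(np.abs(mix))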
For clarity, Figure 6 shows a schematic of how the four tracks were combined, separated, and recombined to obtain the masking vectors required for filter design. In essence, all possible pairs of tracks are used to generate the initial masking vectors M(f) using Equations (2) and (3); then masker and maskee are separated (Equations (4) and (5)), summed, and normalised (Equation (6)), resulting in the vectors M_T(f) used as the basis for designing a filter. The number of tracks in this example is four, but the process can easily be expanded to any number.
Figure 7 shows the individual tracks before and after filtering. The changes in the spectrum match the response of the filters presented in Figure 5. Tracks 1 and 2 (bass and accordion, see Figure 7a–d) are modified significantly, as the selected gains were aggressive (−8 and −6 dB, respectively). Tracks 3 and 4 (vocals and guitar, see Figure 7e–h) are modified subtly, yet changes in the frequency and time domains are still visible.

4. Discussion

In performing a subjective comparison of the original and filtered mixes for Flesh and Bone (see Supplementary Data), we observe a clear change in the characteristics of the mix. Compared to the original, the filtered mix exhibits less muddiness, particularly on the accordion and vocals, and the levels appear more balanced; the dominance of the accordion in the mix is reduced while the guitar becomes more audible. In our opinion, this is a positive demonstration of our method, considering that the gain of each track was not adjusted and no prior information was used. We have included a second track, Nearly There by Jesse Joy, from the same repository, where the clarity of the vocals and guitar is also improved (see Supplementary Data). Besides the gain for each track, further user-defined parameters can be included, such as the peak prominence of the adaptive threshold and the noise floor adjustment. Modifying these parameters will increase the contribution of the resulting filter, increasing the number of peaks at higher frequencies.
Finding these intricate filter responses by ear requires a significant amount of practice and training and is by no means an easy task. This is particularly relevant for acoustic instruments or electric instruments recorded with a microphone at a speaker cabinet (e.g., electric guitar or electric organ). Therefore, there is a benefit in utilising our method, which simplifies the process by reducing the number of parameters in play during mixing.
Our method could be improved by preserving the two-dimensional information and filtering in the STFT space; however, this approach might require significantly more computational resources. The method might not be well suited to mixing different sources of the same instrument, as these will have very similar spectral content.
We envision that our method could be turned into a digital plug-in that can be used either as a pre-processing step or as a method to visualise masking and generate editable filters, as a means for musicians untrained in multitrack mixing to improve the quality of their self-produced music.

5. Conclusions

We have presented a novel methodology for the characterisation and reduction of masking in multitrack audio recordings. Our method is an adaptive filtering scheme that systematically compares all tracks of a multitrack recording to find the magnitude and polarity of masking in time and frequency. This reduces over-filtering by excluding the contribution of components that overlap only in frequency and not in time. This information is then used to design filters for each track whose gain can be adjusted. In doing so, we have demonstrated that the resulting mix exhibits a subjective improvement in quality (see Supplementary Data). We believe this method can be useful to musicians not trained in audio engineering to achieve better-quality self-produced music.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/signals5040035/s1. Audio S1: Flesh and Bone unfiltered (FleshAndBone/Mix_original.wav); Audio S2: Flesh and Bone filtered (FleshAndBone/Mix_Filtered.wav); Audio S3: Nearly There unfiltered (NearlyThere/Mix_original.wav); Audio S4: Nearly There filtered (NearlyThere/Mix_Filtered.wav).

Author Contributions

W.Z. and F.P.-C. contributed equally to the conceptualisation, development, and testing of the methodologies. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article and Supplementary Materials.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hagen, A.N. Datafication, Literacy, and Democratization in the Music Industry. Pop. Music. Soc. 2022, 45, 184–201. [Google Scholar] [CrossRef]
  2. Eiriz, V.; Leite, F.P. The digital distribution of music and its impact on the business models of independent musicians. Serv. Ind. J. 2017, 37, 875–895. [Google Scholar] [CrossRef]
  3. Guo, X. The Evolution of the Music Industry in the Digital Age: From Records to Streaming. J. Sociol. Ethnol. 2023, 5, 7–12. [Google Scholar] [CrossRef]
  4. Owsinski, B. The Mixing Engineer’s Handbook; Bobby Owsinski Media Group: Burbank, CA, USA, 2022; p. 391. [Google Scholar]
  5. Steinmetz, C.J.; Pons, J.; Pascual, S.; Serrà, J. Automatic multitrack mixing with a differentiable mixing console of neural audio effects. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, Toronto, ON, Canada, 6–11 June 2021; pp. 71–75. [Google Scholar] [CrossRef]
  6. Zaknich, A.; Lee, G.E. An audio equalisation linear phase FIR filter design method using RBF based smoothing and interpolation. In Proceedings of the 4th International Conference on Intelligent Sensing and Information Processing, ICISIP 2006, Bangalore, India, 15 October–18 December 2006; pp. 109–114. [Google Scholar] [CrossRef]
  7. Välimäki, V.; Reiss, J.D. All About Audio Equalization: Solutions and Frontiers. Appl. Sci. 2016, 6, 129. [Google Scholar] [CrossRef]
  8. Perez-Gonzalez, E.; Reiss, J.D. Automatic Mixing. In DAFX: Digital Audio Effects: Second Edition; John Wiley & Sons: Hoboken, NJ, USA, 2011; pp. 523–549. [Google Scholar] [CrossRef]
  9. Välimäki, V.; Parker, J.D.; Savioja, L.; Smith, J.O.; Abel, J.S. Fifty years of artificial reverberation. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 1421–1448. [Google Scholar] [CrossRef]
  10. Tan, K.; Xu, Y.; Zhang, S.X.; Yu, M.; Yu, D. Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network. IEEE J. Sel. Top. Signal Process. 2020, 14, 542–553. [Google Scholar] [CrossRef]
  11. Schörkhuber, C.; Klapuri, A.; Sontacchi, A. Audio pitch shifting using the constant-q transform. J. Audio Eng. Soc. 2013, 61, 562–572. [Google Scholar]
  12. Wilmering, T.; Moffat, D.; Milo, A.; Sandler, M.B. A History of Audio Effects. Appl. Sci. 2020, 10, 791. [Google Scholar] [CrossRef]
  13. Senior, M. Mixing Secrets for the Small Studio; Routledge: New York, NY, USA, 2018; p. 432. [Google Scholar]
  14. Scott, J.J.; Kim, Y.E. Analysis of Acoustic Features for Automated Multi-Track Mixing. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), Miami, FL, USA, 24–28 October 2011. [Google Scholar]
  15. Scott, J.J.; Prockup, M.; Schmidt, E.M.; Kim, Y.E. Automatic Multi-Track Mixing Using Linear Dynamical Systems. In Proceedings of the 8th Sound and Music Computing Conference, Padova, Italy, 6–9 July 2011. [Google Scholar]
  16. Wakefield, J.; Dewey, C. An Investigation into the Efficacy of Methods Commonly Employed by Mix Engineers to Reduce Frequency Masking in the Mixing of Multitrack Musical Recordings. In Proceedings of the Audio Engineering Society 138th European Convention, Warsaw, Poland, 7–10 May 2015. [Google Scholar]
  17. Reed, D. Perceptual assistant to do sound equalization. In Proceedings of the 5th International Conference on Intelligent User Interfaces, Proceedings IUI, New Orleans, LA, USA, 9–12 January 2000; pp. 212–218. [Google Scholar] [CrossRef]
  18. Greenwood, D.D. Auditory Masking and the Critical Band. J. Acoust. Soc. Am. 1961, 33, 484–502. [Google Scholar] [CrossRef]
  19. Gonzalez, E.P.; Reiss, J.D. Improved control for selective minimization of masking using inter-channel dependancy effects. In Proceedings of the 11th International Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 1–4 September 2008. [Google Scholar]
  20. Hafezi, S.; Reiss, J.D. Autonomous Multitrack Equalization Based on Masking Reduction. J. Audio Eng. Soc. 2015, 63, 312–323. [Google Scholar] [CrossRef]
  21. Wichern, G.; Robertson, H.; Wishnick, A. Quantitative analysis of masking in multitrack mixes using loudness loss. In Audio Engineering Society Convention 141; Audio Engineering Society: Cambridge, MA, USA, 2016; p. 9646. [Google Scholar]
  22. Tom, A.; Reiss, J.D.; Depalle, P. An Automatic Mixing System for Multitrack Spatialization for Stereo Based on Unmasking and Best Panning Practices. In Proceedings of the 146th AES Convention, Dublin, Ireland, 20–23 March 2019. [Google Scholar]
  23. Gullfoss DAW Plugin. Gullfoss Information Webpage. Soundtheory LTD. Available online: https://www.soundtheory.com/gullfoss (accessed on 8 May 2024).
  24. Soothe 2 DAW Plugin. Soothe 2 Information Webpage. Oeksound Ltd. Available online: https://oeksound.com/plugins/soothe2 (accessed on 6 May 2024).
  25. Reference DAW Plugin. Reference Information Webpage. Mastering the Mix LTD. Available online: https://www.masteringthemix.com/products/reference (accessed on 22 May 2024).
  26. Liu, W. Literature survey of multi-track music generation model based on generative confrontation network in intelligent composition. J. Supercomput. 2023, 79, 6560–6582. [Google Scholar] [CrossRef]
  27. Liu, X.; Mourgela, A.; Ai, H.; Reiss, J.D. An automatic mixing speech enhancement system for multi-track audio. arXiv 2024, arXiv:2404.17821. [Google Scholar] [CrossRef]
  28. Man, B.D.; Reiss, J.D.; Stables, R. Ten Years of Automatic Mixing. In Proceedings of the 3rd Workshop on Intelligent Music Production, Salford, UK, 15 September 2017. [Google Scholar]
  29. Xu, H.; Zhou, S.; Qin, W.; Litak, G. An Improved Interpolation Algorithm for the Damped Signal Based on Hann Window. In Proceedings of the 2022 14th International Conference on Signal Processing Systems, ICSPS 2022, Zhenjiang, China, 18–20 November 2022; pp. 269–277. [Google Scholar] [CrossRef]
  30. Huang, H.; Luo, J.; Moaveni, M.; Tutumluer, E.; Hart, J.M.; Beshears, S.; Stolba, A.J. Field Imaging and Volumetric Reconstruction of Riprap Rock and Large-Sized Aggregates: Algorithms and Application. Transp. Res. Rec. 2019, 2673, 575–589. [Google Scholar] [CrossRef]
  31. Orfanidis, S.J. High-Order Digital Parametric Equalizer Design. J. Audio Eng. Soc. 2005, 53, 1026–1046. [Google Scholar]
Figure 1. Calculation of masking in the temporal and spectral dimensions using short-time Fourier transforms. The time–frequency spectrum of tracks 1 ((a), bass) and 2 ((b), guitar) was obtained using short-time Fourier transforms. (c) Masking between track 1 and track 2 as given by Equation (2). The positive values in (c) indicate the masking of track 1 over track 2 while the negative values in (c) indicate the masking of track 2 over track 1.
Figure 2. Calculation of masking in the frequency domain. (a) The masking matrix M_12(t, f) is compressed to a single dimension, M_12(f), using Equation (3). (b,c) Masking of track 1 over track 2 (M_1,2(f), red) and of track 2 over track 1 (M_2,1(f), blue) are separated using Equations (4) and (5), respectively. (d) The final masking of Tr1, Tr2, ..., TrN over the others (M_T1(f), M_T2(f), ..., M_TN(f)), which is used for filter design (green), was calculated using Equation (6). In this equation, the contributions to masking from each track over the others (i.e., M_1,2, M_1,3, ..., M_1,N) are summed and normalised. The same process can be expanded to any number of tracks (see the Results section).
Figure 3. Filter design. (a) Masking vector (orange) with peaks identified (blue stars). (b) M_T1(f) with the adaptive threshold (blue) and noise floor (dashed green). (c) Only the peaks above the adaptive threshold remain and are used for filter design. (d) Frequency response of the final filter for track 1 with a user-defined gain of −3 dB (yellow), showing the peaks (orange stars) used for the design.
Figure 4. Masking detected for S1 (Flesh and Bone). Positive masking (red) represents track 1 masking track 2, while negative masking (yellow) represents track 2 masking track 1. (a) Bass (1) over accordion (2). (b) Bass (1) over vocals (2). (c) Bass (1) over guitar (2). (d) Accordion (1) over guitar (2). (e) Accordion (1) over vocals (2). (f) Vocals (1) over guitar (2).
Figure 5. Filters generated and applied to each track. Each track had its own filter applied, which was generated through the analysis of the relationships between the tracks. The filters attenuate each track at different bands, except for guitar and vocals which share an attenuation peak at 165 Hz.
Figure 6. Relationships between source tracks and final masking vectors for the four tracks of the multitrack from Flesh and Bone. The same process can be expanded for any number of tracks.
Figure 7. Multitrack spectrum and excerpt before and after filtering. (a,b) bass, (c,d) accordion, (e,f) vocals, and (g,h) guitar. Bass and accordion show the most prominent differences as gains for the filters were more aggressive. The spectral differences match those seen in Figure 5.