Article

A Machine Learning-Assisted Automation System for Optimizing Session Preparation Time in Digital Audio Workstations

by Bogdan Moroșanu, Marian Negru, Georgian Nicolae, Horia Sebastian Ioniță and Constantin Paleologu *
Faculty of Electronics, Telecommunications and Information Technology, National University of Science and Technology POLITEHNICA Bucharest, 060042 Bucharest, Romania
* Author to whom correspondence should be addressed.
Information 2025, 16(6), 494; https://doi.org/10.3390/info16060494
Submission received: 13 May 2025 / Revised: 5 June 2025 / Accepted: 12 June 2025 / Published: 13 June 2025
(This article belongs to the Special Issue Optimization Algorithms and Their Applications)

Abstract:
Modern audio production workflows often require significant manual effort during the initial session preparation phase, including track labeling, format standardization, and gain staging. This paper presents a rule-based and Machine Learning-assisted automation system designed to minimize the time required for these tasks in Digital Audio Workstations (DAWs). The system automatically detects and labels audio tracks, identifies and eliminates redundant fake stereo channels, merges double-tracked instruments into stereo pairs, standardizes sample rate and bit rate across all tracks, and applies initial gain staging using target loudness values derived from a Genetic Algorithm (GA)-based system, which optimizes gain levels for individual track types based on engineer preferences and instrument characteristics. By replacing manual setup processes with automated decision-making methods informed by Machine Learning (ML) and rule-based heuristics, the system reduces session preparation time by up to 70% in typical multitrack audio projects. The proposed approach highlights how practical automation, combined with lightweight Neural Network (NN) models, can optimize workflow efficiency in real-world music production environments.

Graphical Abstract

1. Introduction

Audio production has undergone significant transformations with the widespread adoption of Digital Audio Workstations (DAWs) [1]. These platforms have democratized music production, empowering millions of musicians worldwide, ranging from professional recording artists to independent home-based producers, to create and release their music easily. In recent years, this has resulted in an unprecedented growth in music releases, with approximately 34.1 million new tracks uploaded to streaming services globally in 2022 alone, and daily uploads reaching upwards of 120,000 tracks [2].
Despite such technological advances, the initial session preparation phase in professional audio production environments remains time-consuming and predominantly manual [3]. This inefficiency significantly impacts productivity, especially considering that there are likely tens of thousands to hundreds of thousands of sound engineers globally involved in recording, mixing, and mastering who regularly engage in these repetitive tasks [4]. That amounts to roughly one professional engineer for every several hundred artists releasing music each year. Engineers often allocate considerable time to essential preparatory tasks before the mixing process can begin, including the following:
  • Track identification and labeling;
  • Format normalization (sample rate and bit depth);
  • Redundant track identification and cleanup;
  • Organization of related tracks (e.g., grouping similar tracks);
  • Initial gain staging and level setting.
While these tasks ensure an organized and efficient mixing workflow, their repetitive nature makes them ideal candidates for automation [5]. The time burden of preparation tasks becomes particularly pronounced given contemporary production complexities, where a recording session often exceeds 50 tracks [6]. Coupled with tight deadlines and budget constraints, this amplifies the critical need for optimized workflows [7].
Recent advances in Machine Learning have shown substantial potential for audio content analysis and classification [8], opening new avenues for audio workflow optimization. Previous research has explored automation of individual audio production tasks, including automatic mixing [9], intelligent effects processing [10], and assistive mastering tools [11]. However, comprehensive solutions addressing the initial session preparation phase holistically remain scarce, highlighting a clear gap in current research and technological application.
This paper addresses this gap by presenting an automation system that integrates rule-based heuristics with Deep Learning techniques. Our approach optimizes the entire preparation phase, from track classification and labeling to gain staging, focusing on practical implementation and significant workflow improvements. By streamlining these tasks, the proposed system substantially reduces preparation time, allowing engineers to focus more effectively on the creative and artistic aspects of production.
Building upon our previous work in gain optimization using Genetic Algorithms [12], we extend our automation framework to encompass the complete session preparation process. Given the vast number of artists and music releases annually, compared to the extensive global community of sound engineers, the demand for such automation tools is substantial and immediate. Our contribution lies in demonstrating the integration of multiple automation techniques into a unified system that enhances efficiency, preserves artistic integrity, and meets professional audio production standards.
The remainder of this paper is organized as follows. Section 2 reviews related work in audio production automation and Machine Learning for audio. Section 3 presents the problem formulation. Section 4 details our proposed system architecture and implementation. Section 5 describes the experimental setup and evaluation methodology. Section 6 presents the performance results. Section 7 analyzes limitations and practical implications. Finally, Section 8 concludes the paper and outlines future work.

2. Related Work

2.1. Automation in Audio Production Workflows

The audio production workflow has been the subject of various automation efforts aimed at enhancing efficiency and consistency. Recent research by Jillings et al. [6] analyzed time allocation in professional studios, finding that a significant share of project time is dedicated to non-creative technical tasks. This finding underscores the potential value of automation solutions in the production chain.
Commercial DAW platforms have integrated limited automation features, primarily targeting repetitive mixing operations such as automatic gain riding [13] and dynamic equalization [14]. Automatic gain riders dynamically adjust vocal levels to maintain consistent prominence, aligning with industry practices in which vocals are kept approximately 3–8 loudness units (LU) above the backing track [14]. Similarly, dynamic equalizers continuously adapt frequency content to minimize inter-track masking and enhance clarity in dense mixes. However, these tools generally address isolated aspects of the mixing process, such as vocal leveling or resonance suppression, rather than offering holistic workflow optimization for the session preparation phase. Critical preparatory tasks, including fine-grained instrument classification, multitrack labeling, and initial gain staging, remain largely manual and are not comprehensively addressed by current commercial automation solutions.
De Man et al. [15] introduced the Open Multitrack Testbed, a large open-access repository of multitrack recordings annotated with detailed metadata to support research in areas such as automatic mixing, source separation, and audio feature analysis. By providing access to diverse multitrack data under Creative Commons licenses, their work addressed a critical barrier to data-driven research in music production. In parallel, Moffat and Sandler [16] proposed conceptual frameworks for Intelligent Music Production (IMP) systems, outlining levels of automation ranging from insightful to fully automatic control. Their work emphasized approaches to knowledge representation, including rule-based and ML techniques, for automating tasks such as gain adjustment, equalization, and masking reduction. These studies highlight the enabling role of shared multitrack datasets and the potential of knowledge-based and data-driven methods in audio production. However, their focus remains primarily on mixing and creative decision support, rather than on automating the preparatory stages essential for efficient DAW session organization.

2.2. Machine Learning in Audio Content Analysis

Audio content analysis using Deep Learning has advanced significantly in recent years. Convolutional Neural Networks (CNNs) have demonstrated particular efficacy in audio classification tasks [17], achieving high accuracy in instrument recognition [18] and content categorization [19].
Transfer learning approaches, where pre-trained models on large datasets are fine-tuned for specific audio classification tasks, have shown promising results even with limited training data [20]. This approach is particularly relevant for specialized audio production applications where large annotated datasets may not be available.
Pons et al. [21] investigated lightweight architectures for music audio tagging, comparing assumption-free models based on raw waveforms with domain-informed models utilizing spectrogram inputs. Their results showed that while domain knowledge benefits models trained on limited datasets, minimalistic sample-level CNNs can outperform spectrogram-based architectures when trained on large-scale data. Similarly, Ghosal and Kolekar [22] developed efficient CNN-LSTM (Convolutional Neural Network combined with Long Short-Term Memory) models for genre classification, combining traditional spectral and rhythmic features with a transfer learning approach to achieve near state-of-the-art accuracy on the GTZAN dataset [23]. These studies highlight that lightweight, scalable models, either assumption-free or domain-assisted, remain central to advancing audio classification, particularly in resource-constrained environments.
More recently, Guinot et al. [24] introduced the Leave-One-EquiVariant (LOEV) framework to address information loss in contrastive self-supervised learning for musical representations. By selectively preserving task-relevant equivariances to augmentations such as pitch shifting and time stretching, LOEV achieved improvements of up to 30% in key and tempo estimation tasks compared to conventional contrastive methods, while maintaining a high Area Under the ROC Curve (AUROC) of 90.6%. Their method also demonstrated effective disentanglement of latent spaces, enabling targeted retrieval of music attributes. Complementarily, Hasumi et al. [25] proposed a classifier group chain model for music tagging, modeling conditional dependencies between categories such as genre, instrument, and mood. Using the MTG-Jamendo dataset [26], their approach achieved an overall AUROC of 82.2% on the Top50 tag subset, outperforming baseline methods by approximately 0.4–0.6 percentage points and confirming the benefits of context-aware sequential decoding in multi-label music tagging tasks.
Despite these advances, a significant gap exists in current audio classification research with respect to the fine-grained instrument categorization required for practical DAW session preparation. While existing models effectively distinguish broad primary categories like “drums” or “strings,” they typically lack the specificity to differentiate between closely related instruments within the same family, such as distinguishing a kick drum from a snare drum, or recognizing toms and hi-hat recordings. This limitation is particularly problematic in multitrack recording contexts, where accurate track identification at the individual instrument level is essential for proper session organization and initial gain staging. Furthermore, most available models are trained on isolated, high-quality recordings that do not reflect the varied signal characteristics encountered in raw multitrack recording sessions, which often include crosstalk, varied microphone placements, and room acoustics that significantly alter spectral fingerprints. These practical considerations highlight the need for specialized classifiers designed specifically for the multitrack production environment rather than general audio content analysis.

2.3. Gain Staging and Level Optimization

Gain staging remains a critical aspect of audio production that directly impacts signal quality and mixing headroom [27]. Traditional approaches rely on engineer experience and heuristics [5], while more recent work has explored data-driven methods for optimizing initial levels [28].
Genetic Algorithms (GAs) have been applied to various audio optimization problems, including equalizer settings [29] and compressor parameters [30]. Our previous work [12] utilized GA as well as NN to derive optimal gain values for 66 different instrument categories based on subjective quality assessments by professional engineers. De Man and Reiss [31] investigated the relationship between perceived loudness and technical measurements, providing insights into objective metrics that can guide automated gain staging decisions. Building on this work, Ma et al. [28] developed an intelligent gain control system that balances technical measurements with perceptual factors.
While these studies provide valuable foundations for individual components of our proposed system, there remains a gap in the literature regarding integrated solutions that address the complete session preparation workflow. Our work aims to bridge this gap by combining multiple automation techniques into a cohesive system specifically designed for optimizing preparation time in professional audio production environments.

3. Problem Formulation

While the creative aspects of mixing such as balancing, EQ (equalization), compression, and effects processing are often emphasized in both pedagogy and industry discourse, less attention has been paid to the substantial portion of time engineers spend preparing a session before these tasks can begin. This preparatory phase includes identifying and labeling tracks, grouping related instruments, removing redundant audio files, format standardization, and applying initial gain staging. Each of these steps, while essential for an efficient and professional mix, is highly repetitive and prone to human error when performed manually.
Survey data from recent academic studies illustrates the scale of this challenge. For example, Tot et al. [32] surveyed 29 mix engineers and found that a typical full mix session involving approximately 24 tracks takes an average of 7 h to complete. Of this time, nearly 40–50% is often spent on session preparation tasks, amounting to 2.5 to 3.5 h per mix, before any creative mixing decisions are made. Similarly, Jillings [6] analyzed session logs and confirmed that track grouping, labeling, and routing collectively consumed a major portion of early-stage time investments.
A separate analysis by Stickland et al. [33] observed that in collaborative mixing environments, engineers consistently dedicated the initial 30–40% of total session time to preparatory tasks before engaging in tonal or spatial processing. These tasks included organizing the session structure, adjusting initial levels, and confirming file compatibility (e.g., sample rate or stereo/mono formats).
Given that typical multitrack projects now exceed 40–60 individual audio stems [34], the manual workload increases proportionally. Engineers often repeat the same routine for each session: cleaning mislabeled files, adjusting levels to prevent clipping, and aligning stereo pairs, activities that do not directly contribute to the creative sonic identity of the track.

Identified Inefficiencies in Current Workflows

Analysis of current DAW workflows reveals several key inefficiencies:
1. Manual Track Identification and Labeling: Engineers spend considerable time manually identifying instruments from file naming and waveform appearances and applying appropriate labels, a task that scales linearly with track count.
2. Inefficient Track Organization: Sorting tracks by instrument families (drums, bass, guitars, etc.) is typically performed manually despite following consistent patterns across projects.
3. Redundant Channel Configuration: Significant time is spent identifying and eliminating fake stereo channels (identical mono signals panned hard left/right) and properly configuring genuine stereo content.
4. Initial Gain Staging: Setting appropriate initial gain levels for different instrument types relies heavily on audio engineering experience and often requires multiple readjustments during the mixing process.
These inefficiencies represent compelling automation opportunities, particularly as the average track count in modern music production continues to increase. Our proposed system aims to address these specific challenges through a combination of rule-based processing and ML assistance.

4. Methodology

4.1. System Overview

Our system streamlines the production workflow through four integrated modules: track content classification, format standardization, track organization, and intelligent gain staging. Each module is designed to operate independently while contributing to a coherent, optimized workflow. The classification module identifies instruments using both spectral features and file metadata. Standardization ensures consistent technical settings across tracks. The organization module detects inter-track relationships, like stereo pairs or grouped microphones, and arranges them logically. Lastly, the gain staging module applies appropriate starting levels based on instrument type, ensuring clean input signals and headroom for processing. This modular architecture enables partial or complete implementation based on specific workflow requirements.
While the time spent on any individual preparation task may seem negligible, the cumulative burden across multiple projects becomes substantial. Even a modest reduction, such as a 10% improvement in efficiency, can translate into dozens or hundreds of hours saved over the course of a year. By automating the most redundant elements of session setup, our system allows engineers to focus their attention on creative decision-making, improving both productivity and artistic outcomes.

4.2. Track Identification and Labeling

At the core of our automation system is a classification module that identifies instrument types and characteristics. We implemented two complementary approaches: a Convolutional Neural Network based on ResNet [35] architecture for spectral audio analysis, and a multimodal approach that incorporates both audio and textual information.
An extensive overview of the class structure of the dataset is presented in Section 4.2.2 and Table 1. This includes further justification for both audio track classification and multimodal track identification, along with our approach to managing complex or ambiguous cases such as drum components and sound effects.

4.2.1. Audio Spectral Classification

For audio classification, we used three models, based on the ResNet-18 architecture [35], which has demonstrated exceptional performance in various audio classification tasks [19]. The advantage of a ResNet architecture is that it employs residual connections for mitigating the vanishing gradient problem, allowing effective training of deeper networks.
Our implementation modifies the standard ResNet-18 structure by
  • Converting the input layer to accept a single-channel magnitude spectrogram computed from 3 s of audio, using $N_{FFT} = 512$ and a hop length of 512;
  • Adjusting the final fully connected layer to output probabilities for the distinct instrument classes;
  • Employing class weighting to address the imbalanced distribution of instruments typically found in multitrack recordings.
Our system, presented in Figure 1, employs a hierarchical classification approach using specialized ResNet models at different levels. This architecture provides both broad categorization and fine-grained instrument identification. More specifically, we divide this into Primary and Secondary classification problems. Inspired by the Music Source Separation task [36], we found it helpful to first classify the audio into four main classes: Vocals, Bass, Drums, and Other (anything else). The second step consists of subclass identification. For Bass and Vocals, the Primary classification already provides sufficient specificity, as these categories typically represent distinct instruments that do not require further differentiation in the mixing context. Furthermore, separating Vocals into Backing or Lead Vocals, or Bass into multiple subclasses, directly from audio can be quite difficult. Therefore, we focused our Secondary classification efforts only on the more diverse Drums and Other categories by applying specialized models trained for each subcategory. We divided drum sounds into five distinct classes: Kick, Snare, Tom, Overhead, and Percussion. Similarly, the Other category was further classified into more specific instrument families: Guitar (electric and acoustic), Keyboard (piano, synthesizers), and Orchestral (strings, brass, woodwinds).
The hierarchical design delivers multiple benefits to our automation system. By focusing each specialized model on distinguishing between similar instruments within a category, we achieve more nuanced feature learning and improved accuracy. The approach also reduces model complexity by dividing the classification problem into manageable subproblems. Furthermore, our pipeline processes tracks efficiently by applying Secondary models only when needed, and remains extensible, allowing new instrument subcategories to be added to each branch without restructuring the entire system.
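For illustration, the sketch below shows one way such a hierarchy can be wired up in PyTorch, adapting torchvision's ResNet-18 to single-channel spectrograms and routing Primary predictions to the Secondary models. It is a minimal sketch under our own assumptions (helper names, routing logic, and class lists are illustrative), not the exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_spectrogram_resnet(num_classes: int) -> nn.Module:
    """ResNet-18 adapted for single-channel magnitude spectrograms."""
    model = resnet18(weights=None)
    # Accept a single-channel spectrogram instead of 3-channel RGB images.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Output scores for the instrument classes of this hierarchy level.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Hierarchical setup: one Primary model and two Secondary models.
primary = make_spectrogram_resnet(num_classes=4)          # Vocals, Bass, Drums, Other
drums_secondary = make_spectrogram_resnet(num_classes=5)  # Kick, Snare, Tom, Overhead, Percussion
other_secondary = make_spectrogram_resnet(num_classes=3)  # Guitar, Keyboard, Orchestral

def classify(spec: torch.Tensor) -> str:
    """Route a (1, 1, F, T) spectrogram through the Primary/Secondary hierarchy."""
    primary_classes = ["Vocals", "Bass", "Drums", "Other"]
    with torch.no_grad():
        label = primary_classes[int(torch.softmax(primary(spec), dim=1).argmax())]
        if label == "Drums":
            sub = ["Kick", "Snare", "Tom", "Overhead", "Percussion"]
            label = sub[int(torch.softmax(drums_secondary(spec), dim=1).argmax())]
        elif label == "Other":
            sub = ["Guitar", "Keyboard", "Orchestral"]
            label = sub[int(torch.softmax(other_secondary(spec), dim=1).argmax())]
    return label
```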

4.2.2. Multimodal Classification

While spectral features provide substantial information for audio content classification, incorporating additional modalities can significantly enhance classification accuracy, particularly for similar-sounding instruments within the same family [37]. Building on recent advances in multimodal learning [38], we developed an approach that leverages both audio content and filename metadata to improve instrument identification in DAW sessions.
Our multimodal classifier integrates two parallel processing streams: a CNN-based audio analysis path and a text embedding path that employs transformer-based language models. This design was informed by research showing that professional engineers frequently encode instrument information in track naming conventions, providing valuable complementary information to the audio signal itself.
The architecture of our multimodal classifier, illustrated in Figure 2, consists of the following:
  • A Convolutional Neural Network for processing audio spectrograms with six convolutional layers, inspired by architectures that have shown success in music information retrieval tasks.
  • A text embedding component utilizing the Sentence-BERT MiniLM-L6-v2 model [39] to generate dense vector representations (384 dimensions) of track filename text after preprocessing.
  • A fusion module that concatenates features from both modalities and processes them through fully connected layers, following established practices in multimodal fusion [40].
For text preprocessing, we apply several transformations to standardize filename text, using common techniques from Natural Language Processing (NLP). We first convert the filename to lowercase, eliminate common non-descriptive elements such as file extensions, replace underscores and hyphens with spaces, and then remove all non-alphabetic characters except spaces. During training, we employ data augmentation techniques for text before extracting embeddings, which has been shown to improve generalization in classification tasks [41]. We generate synthetic text with the same length as the filtered filename text and add it either before or after the track name.
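A minimal sketch of this filename pipeline is given below, assuming the sentence-transformers package for the Sentence-BERT MiniLM-L6-v2 embeddings; the regular expressions and the example filename are illustrative rather than the exact preprocessing code.

```python
import os
import re

from sentence_transformers import SentenceTransformer

def preprocess_filename(path: str) -> str:
    """Lowercase, drop the extension, map separators to spaces, keep letters and spaces only."""
    name = os.path.splitext(os.path.basename(path))[0].lower()
    name = name.replace("_", " ").replace("-", " ")
    name = re.sub(r"[^a-z ]", "", name)       # remove digits and other non-alphabetic characters
    return re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # Sentence-BERT MiniLM-L6-v2
text = preprocess_filename("Kick_In_02.wav")         # -> "kick in"
embedding = embedder.encode(text)                    # 384-dimensional dense vector
```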
Figure 2. Multimodal system for instrument classification. We used 6 s of audio (3 measures at a standard tempo of 120 Beats Per Minute, BPM), converted into a spectrogram, as the audio input, and BERT embeddings as the text input. All Conv2D layers have a 3 × 3 kernel, each followed by 2 × 2 MaxPooling. For the fully connected layers, we used a Dropout of 0.2.
Following practices established in imbalanced classification problems, we implement a weighted cross-entropy loss function:
$$\mathcal{L} = - \sum_{i=1}^{C} w_i \, y_i \log ( \hat{y}_i ),$$
where $y_i$ is the class ground truth, $\hat{y}_i$ is the model prediction, and $w_i$ is the weight for class $i$, calculated as
$$w_i = \frac{N}{C \times n_i},$$
with $N$ being the total number of samples, $C$ the number of classes, and $n_i$ the number of samples in class $i$.
To address class imbalance during training, we utilize a weighted random sampler that oversamples minority classes according to their corresponding weights, a technique that has shown success in various classification domains [42].
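The weighting and oversampling described above can be set up in PyTorch roughly as follows; the toy label list stands in for the real per-sample class indices, so treat this as a sketch rather than the actual training configuration.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

labels = [0, 0, 0, 1, 2, 2]                  # per-sample class indices (toy example)
counts = Counter(labels)
N, C = len(labels), len(counts)

# w_i = N / (C * n_i), matching the equation above.
class_weights = torch.tensor([N / (C * counts[c]) for c in range(C)], dtype=torch.float)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# Weighted random sampler: each sample is drawn with its class weight,
# so minority classes are oversampled within an epoch.
sample_weights = [float(class_weights[y]) for y in labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
```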
The integration of textual information with spectral features creates a complex classification system that outperforms audio-only approaches by leveraging complementary information sources, consistent with findings in multimodal learning literature [43]. This approach is particularly effective for distinguishing between similar-sounding instruments where spectral features alone may be insufficient for reliable classification.
Table 1 provides a comprehensive mapping between audio and multimodal classification schemes used for track identification and labeling. The hierarchical structure distinguishes between Primary (e.g., Drums, Bass, Others, Vocals) and Secondary classes (Drums and Others subcategories) in the spectral domain, and aligns them with their corresponding multimodal labels used in downstream processing or supervised tasks. This structure supports more robust classification by integrating low-level audio cues with higher-level multimodal semantics.
Modern drum recordings often include tracks such as hi-hats and ride, which play a crucial role in defining the rhythmic and dynamic character of a mix. However, these components exhibit strong acoustic and spectral similarity to overhead tracks (OH). This similarity makes it significantly more challenging for spectral classifiers, such as those based on ResNet architectures, to reliably distinguish them as separate classes. We therefore added these classes to the multimodal classification, as distinguishing hi-hat and ride tracks from OH recordings requires additional information beyond spectral features. Leveraging this extra information enables more accurate and reliable classification without introducing ambiguity or overfitting in the spectral domain.
Special effects tracks (Sfx), such as impacts, risers, and various noise-based sounds, are highly diverse and difficult to classify using spectral features alone. These sounds are not explicitly labeled in the audio spectral classification but are implicitly grouped under the broader Others category. In the multimodal classification, however, they are identified as Sfx. This separation allows us to maintain spectral model generality while enabling more accurate identification of diverse Sfx content through multimodal refinement.

4.3. Track Organization and Pattern Recognition

The module streamlines the otherwise manual process of arranging large-scale audio projects by automatically identifying relationships between audio tracks. This is achieved using pattern recognition algorithms based on both temporal and spectral similarity. The goal is to reduce preparation time and improve mix session organization.
To detect alignment between audio tracks, we use a normalized cross-correlation function that measures the similarity between two signals across different temporal delays. Given two audio signals $x_i(n)$ and $x_j(n)$, the normalized cross-correlation $R_{ij}(d)$ at a given delay $d$ (measured in samples) is defined as
$$R_{ij}(d) = \sum_{n=0}^{N-1} \frac{[x_i(n) - \mu_i][x_j(n+d) - \mu_j]}{\sigma_i \sigma_j},$$
where $\mu_i$ and $\mu_j$ are the mean values of the signals over the analysis window of length $N$, and $\sigma_i$, $\sigma_j$ are their corresponding standard deviations. This method identifies shifted versions of similar content and is robust to typical offsets of up to ±20 ms, common in multi-microphone recordings.
Rather than processing entire tracks, we extract high-energy, temporally aligned segments $\tilde{x}_i$ and $\tilde{x}_j$ from each audio file, typically of 5 s duration. This segment length was chosen empirically to ensure that, at a common tempo of 120 BPM, each segment captures at least two full musical measures, providing enough temporal context for meaningful classification. Features are then computed over these segments. A combined similarity score $S_{ij}$ is calculated as
$$S_{ij} = w_1 \cdot \rho\big(\mathrm{RMS}(\tilde{x}_i), \mathrm{RMS}(\tilde{x}_j)\big) + w_2 \cdot \rho\big(\mathrm{MFCC}(\tilde{x}_i), \mathrm{MFCC}(\tilde{x}_j)\big) + w_3 \cdot \rho(\tilde{x}_i, \tilde{x}_j),$$
where $\rho$ denotes the Pearson correlation coefficient, $\mathrm{RMS}(\cdot)$ is the Root Mean Square energy envelope, and $\mathrm{MFCC}(\cdot)$ refers to the Mel-Frequency Cepstral Coefficients.
The weights $w_1 = 0.4$, $w_2 = 0.4$, and $w_3 = 0.2$ were empirically selected after testing various configurations on a validation set across multiple sessions involving diverse instrument types. These values yielded a robust balance between temporal envelope similarity (captured via RMS), timbral similarity (via MFCC), and raw waveform correlation. Increasing the contribution of waveform correlation $w_3$ often resulted in false positives due to coincidental phase alignment, particularly in percussive or ambient tracks. Conversely, reducing the influence of RMS or MFCC diminished the ability of the system to distinguish energy-driven or timbrally distinct sources. The chosen weights thus reflect a general-purpose trade-off that performs well across typical mix scenarios. However, these parameters remain tunable and can be adjusted based on content type or engineer preference in specialized workflows.
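A sketch of the combined score is given below, using librosa for the RMS envelope and MFCCs; segment extraction and any feature smoothing are omitted, and the inputs are assumed to be equal-length mono arrays. Pairs whose score exceeds the 0.85 threshold described below would then be grouped.

```python
import numpy as np
import librosa

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two 1-D feature vectors, truncated to equal length."""
    n = min(len(a), len(b))
    return float(np.corrcoef(a[:n], b[:n])[0, 1])

def similarity_score(x_i: np.ndarray, x_j: np.ndarray, sr: int = 44100,
                     w=(0.4, 0.4, 0.2)) -> float:
    """Combined similarity S_ij between two high-energy segments."""
    rms_i, rms_j = librosa.feature.rms(y=x_i)[0], librosa.feature.rms(y=x_j)[0]
    mfcc_i = librosa.feature.mfcc(y=x_i, sr=sr).flatten()
    mfcc_j = librosa.feature.mfcc(y=x_j, sr=sr).flatten()
    return (w[0] * pearson(rms_i, rms_j)       # temporal envelope similarity
            + w[1] * pearson(mfcc_i, mfcc_j)   # timbral similarity
            + w[2] * pearson(x_i, x_j))        # raw waveform correlation
```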
Using the computed similarity scores and delay estimations, the system identifies track relationships: stereo pairs are recognized via high correlation and minimal optimal delay, with the optimal sample delay defined as
$$d_{\mathrm{opt}} = \arg\max_{d} \; R_{ij}(d),$$
where $d_{\mathrm{opt}}$ is the delay value yielding the highest correlation. Tracks with $S_{ij} > 0.85$ are grouped as likely multi-microphone recordings of the same source.
With this method, duplicated mono signals, identified by near-perfect similarity and zero delay, are consolidated. Conversely, double-tracked performances are retained as distinct but logically grouped based on high similarity and subtle variation.
To ensure accurate matching even in asynchronous or noisy recordings, we use a two-stage delay estimation strategy: a coarse delay estimate is first computed via correlation of smoothed RMS envelopes, followed by fine alignment using waveform-level cross-correlation. This layered approach improves resilience against silence, background noise, or fluctuations in energy profile, while preserving timing precision for downstream processing.
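A compact sketch of this coarse-to-fine delay estimation is shown below (NumPy and SciPy assumed); the frame size, refinement window, and sign convention are illustrative choices, not the exact implementation.

```python
import numpy as np
from scipy.signal import correlate

def best_lag(a: np.ndarray, b: np.ndarray) -> int:
    """Lag of b relative to a that maximizes the normalized cross-correlation."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    corr = correlate(a, b, mode="full", method="fft")
    return int(np.argmax(corr) - (len(b) - 1))

def estimate_delay(x_i: np.ndarray, x_j: np.ndarray, sr: int = 44100, frame: int = 512) -> int:
    # Stage 1: coarse estimate from smoothed RMS envelopes (one value per frame).
    def envelope(x):
        return np.sqrt([np.mean(x[k:k + frame] ** 2) for k in range(0, len(x) - frame, frame)])
    coarse = best_lag(envelope(x_i), envelope(x_j)) * frame
    # Stage 2: refine with waveform-level cross-correlation after rough alignment.
    shifted_j = np.roll(x_j, coarse)          # circular shift as a simple approximation
    fine = best_lag(x_i[:2 * sr], shifted_j[:2 * sr])
    return coarse + fine                      # estimated d_opt in samples
```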

4.4. Gain Staging Module

The gain staging module represents a critical component of our automation system, addressing one of the most time-consuming and technically demanding aspects of session preparation. Proper gain staging establishes the foundation for subsequent mixing decisions and directly impacts the overall signal quality, dynamic range preservation, and processing headroom throughout the production chain [27]. As demonstrated in our time efficiency analysis, professional engineers spend approximately 17 min on initial gain staging for large sessions, while less experienced engineers may require up to 24 min for the same task (as presented in Section 6.3). This inefficiency presents a significant opportunity for automation.
Our approach leverages target loudness values derived from our previous research using GA to optimize perceived quality based on professional engineer evaluations [12]. Although our original research demonstrated comparable performance between GA and NN approaches for loudness control, we specifically selected the GA-based solution for this automation system due to its superior computational efficiency in deployment scenarios. While Neural Networks require significant processing overhead for forward passes through multiple layers and must predict the coefficients for each song, the GA solution provides pre-optimized and pre-determined coefficient vectors that can be directly applied with minimal computational cost. This implementation decision prioritizes rapid processing, reducing gain staging time from minutes to seconds, which aligns with our primary objective of optimizing session preparation workflows.
The implementation draws specifically on the optimization coefficients from one of our highest-performing Romanian sound engineers, whose preference model demonstrated superior performance in both objective and subjective tests from our original research [12]. This engineer’s refined approach to gain balancing, which prioritizes both clarity and appropriate spectral distribution, forms the foundation of our automated gain staging system.
For each track category $c$ identified by our classification system, a target loudness level $L_c$ is established based on a pre-trained GA that can mimic the preferences of a sound engineer. In order to apply the coefficient value generated by the model to the current track, we must first scale the audio stem to a standard reference loudness $L_r$ of −45 LUFS, using the ITU-R BS.1770-4 standard [44], which provides perceptually relevant loudness measurements that correspond closely to human hearing characteristics. The required gain adjustment $G$ is then applied in LUFS by simply adding it to the scaled referenced audio, where $G$ is the corresponding pre-computed track coefficient from the pre-trained GA. Therefore, the gain staging procedure is applied as
$$L_c = G + L_a - L_r = G + L_a - (-45),$$
where $L_c$ represents the target loudness for the instrument category and $L_a$ represents the initial measured loudness of the track.
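The following sketch follows the procedure described above (measure the stem's loudness, scale it toward the −45 LUFS reference, then add the pre-computed GA coefficient $G$). It assumes the pyloudnorm and soundfile packages for BS.1770-style loudness measurement and file I/O; the function and variable names are illustrative.

```python
import numpy as np
import pyloudnorm as pyln
import soundfile as sf

L_REF = -45.0  # reference loudness (LUFS) used before applying the GA coefficient

def apply_gain_staging(path: str, ga_coefficient: float) -> np.ndarray:
    """Return the gain-staged audio for one track, given its GA coefficient G."""
    audio, rate = sf.read(path)
    meter = pyln.Meter(rate)                    # ITU-R BS.1770 loudness meter
    l_a = meter.integrated_loudness(audio)      # initial measured loudness L_a
    # Total gain in dB: scale the stem to the reference, then add the GA coefficient.
    gain_db = (L_REF - l_a) + ga_coefficient
    return audio * (10.0 ** (gain_db / 20.0))
```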
Our implementation also incorporates psychoacoustic principles identified in previous research on loudness perception and masking [45]. By considering both the technical measurements and perceptual factors that influence perceived loudness relationships, the system produces results that more closely align with human listening preferences compared to simplistic peak or RMS-based normalization approaches that ignore instrument-specific characteristics and context.
The gain staging module builds directly upon our findings in [12], where we demonstrated that a custom trained GA could effectively model the mixing preferences of individual engineers. While our NN approach offered enhanced adaptability to different musical contexts, the consistent performance of the GA solution, combined with its significantly lower computational requirements, made it the optimal choice for a system where processing speed directly impacts workflow efficiency. By applying these optimized coefficients in our automated system, we bridge the gap between research findings and practical workflow applications, offering significant time savings without sacrificing the artistic quality that characterizes professional mixing.

5. Experimental Setup

5.1. Train Dataset for Audio Spectral Classification

For training our hierarchical classification system, we utilized multiple complementary datasets. The Primary classification model was trained on MUSDB18HQ [46], which provides high-quality multitrack recordings with isolated stems for drums, bass, vocals, and other instruments, precisely matching our Primary classification categories. We trained the network by randomly sampling 3 s segments from each track, converting them into spectrograms using the Short-Time Fourier Transform (STFT) with parameters $N_{FFT} = 512$ and a hop length of 512. The 3 s duration was chosen to ensure a reliable estimation of perceptual loudness, which requires a minimum temporal context to be meaningful, as well as including at least one musical measure for a tempo of 120 BPM.
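The segment-to-spectrogram preprocessing can be reproduced roughly as below, assuming librosa; the 44.1 kHz sample rate is our assumption, while the segment duration and STFT parameters follow the values above.

```python
import numpy as np
import librosa

def random_segment_spectrogram(path: str, duration: float = 3.0,
                               n_fft: int = 512, hop_length: int = 512) -> np.ndarray:
    """Sample a random segment from a track and return its magnitude spectrogram."""
    audio, sr = librosa.load(path, sr=44100, mono=True)
    seg_len = int(duration * sr)
    start = np.random.randint(0, max(len(audio) - seg_len, 1))
    segment = audio[start:start + seg_len]
    stft = librosa.stft(segment, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)   # magnitude spectrogram fed to the ResNet models
```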
Training was conducted using a batch size of 256 with CrossEntropy loss and the Adam optimizer, starting from a learning rate of $1 \times 10^{-4}$ with an Exponential Scheduler ($\gamma = 0.90$). We considered 100 steps as one epoch. For the Secondary classification models (Drums and Other), we used the same hyperparameter settings while reducing the batch size to 4 to accommodate the more detailed classification requirements.
For the “Other” instrument subclassification, we curated a dataset of 800 high-quality isolated tracks from the Cambridge Multitrack Dataset [15], carefully selecting recordings without significant bleed from other instruments. The selected tracks encompass guitar (both electric and acoustic), keyboard instruments (piano, synthesizer), and orchestral instruments (strings, brass, woodwinds). For the drum subclassification model, we constructed a specialized dataset containing 200 isolated drum tracks including kick, snare, toms and overheads, collected from professional virtual instruments.
For each subclass dataset, we focused on quality over quantity by selecting the most representative 3-s segment with the highest RMS value from each track. This approach ensures that the input data for the models contains the clearest sonic signature of the desired instrument, minimizing the impact of silence, background noise, or ambiguous segments that might confuse the classifier.
The multimodal classification model was trained on a 6 s representation of audio tracks from the Cambridge Multitrack Dataset [15], along with the corresponding track names as text input. For this model, we also used the window with the highest RMS value to maximize the cleanliness of the audio input. The text was processed through a BERT model to extract meaningful embeddings. Our final classification schema encompasses 27 distinct instrument classes (illustrated in the last column of Table 1), providing fine-grained categorization across the full spectrum of common multitrack recording instruments as detailed in our Supplementary Materials.
With the exception of MUSDB18HQ, which already includes defined validation subsets, we split our custom datasets using an 80:20 ratio for training and validation, respectively. This split was stratified to maintain class distribution consistency between the training and validation sets.

5.2. Evaluation Dataset for the Entire System

To evaluate the effectiveness of our proposed automation system under real-world conditions, we assembled a diverse test dataset comprising five complete multitrack sessions spanning multiple musical genres. The dataset includes recordings from both local artists and licensed materials from professional mixing courses, with track counts ranging from 10 to 60 per session. The complete track listing with ground truth instrument classifications is available in our Supplementary Materials.
This evaluation dataset was specifically designed to represent the diversity encountered in professional environments, featuring considerable variation in track naming conventions, channel configurations, and organizational structures, precisely the variables our automation system aims to standardize. The projects are categorized as follows:
  • Small project (10 tracks): “Song 1”—A pop production with minimal instrumentation;
  • Small project (22 tracks): “Song 2”—A folk-rock arrangement with acoustic and electric elements;
  • Medium project (34 tracks): “Song 3”—An indie rock production with multiple guitar layers;
  • Medium project (44 tracks): “Song 4”—A complex rock production with numerous orchestral tracks;
  • Large project (60 tracks): “Song 5”—A full band recording with extensive drum microphones and layered instruments.
The “Song 5” project presents a particularly valuable test case, as it was recorded across three separate studio facilities, each employing distinct naming conventions and organizational practices. This scenario mirrors common challenges in professional environments where collaborative projects involve multiple recording venues or engineers with individualized workflow preferences. For instance, drum tracks from Studio A follow a “[Instrument][Microphone Position/Type]_[Take/Version Number].wav” convention, while Studio B uses labels such as “[Track Name][Instrument][Microphone]_[Take/Version Number].wav”, creating precisely the type of labeling inconsistency our classification system aims to resolve.
The dataset also incorporates common real-world complications such as typographical errors (e.g., “KicCond_02.wav” instead of “KickCond_02.wav”) and contextually ambiguous naming where critical instrument information is entirely omitted. A notable example is “Septembrie-MAIN MANLEY_01_03.wav”, which contains no textual indication that it represents the main vocal track, despite being recorded through a Manley microphone, forcing our system to rely exclusively on spectral characteristics for proper classification.
Our test dataset intentionally incorporates technical variations that typically necessitate manual intervention during session preparation. Approximately 13% of stereo content appears as “fake stereo” (identical mono signals duplicated across left/right channels), and 13% consists of split stereo tracks requiring identification and proper pairing. These technical characteristics align with the challenges identified in our problem formulation, ensuring the evaluation provides meaningful insights into real-world performance.
Each session was manually annotated with ground truth information including the following:
  • Correct instrument labels for all 170 tracks across the dataset;
  • Identification of related tracks (stereo pairs, multi-microphone setups);
  • Classification of redundant tracks and fake stereo channels.

6. Results

6.1. Classification Performance

Our inference pipeline (Figure 3) selects the top three RMS windows from each audio track and processes both audio and track name in parallel. The Primary classifier categorizes audio into four main classes with probability threshold $p_{1,\max} > 0.90$. For Drums or Other categories, a Secondary classifier provides detailed identification with probability threshold $p_{2,\max} > 0.80$. The multimodal model combines audio spectrograms with track name text, using probability threshold $p_{3,\max} > 0.95$. The final class is determined by comparing confidence scores from valid predictions, with preference given to the highest confidence result. The values $p_{1,\max}$, $p_{2,\max}$, and $p_{3,\max}$ represent the maximum probability values output by the Primary, Secondary, and multimodal classifiers, respectively.
As an example, consider the track “Snare Down.cm_1_03.wav”, which contains the microphone signal recorded from the underside of the snare drum. The system begins by extracting the top three RMS windows from the audio, capturing the segments with the highest energy. Each window is converted into a magnitude spectrogram using STFT. The resulting spectrograms are passed through the Primary classifier, which predicts the class Drums with a maximum softmax probability of $p_{1,\max} > 0.94$, exceeding the required threshold of 0.90. Because Drums is a trigger category, the Secondary classifier is also applied. It processes the same spectrogram inputs and produces a more specific classification result: Snare, with $p_{2,\max} = 0.90$, also passing its respective threshold of 0.80. In parallel, the multimodal model analyzes both the spectrogram and the track title and outputs $p_{3,\max} = 0.96$ for the class Snare Down, which surpasses its threshold of 0.95. In this case, among the valid predictions, $p_{3,\max} = 0.96$ from the multimodal model is the highest confidence score. Therefore, the final class assigned to the track is Snare Down.
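The decision rule can be summarized as in the sketch below, where each classifier output is represented as a (label, maximum probability) pair; the thresholds are those given above, the helper name is ours, and the final call reproduces the worked example.

```python
def final_label(primary, secondary, multimodal):
    """Each argument is a (label, max_probability) pair from one classifier, or None."""
    candidates = []
    if primary and primary[1] > 0.90:       # Primary threshold
        candidates.append(primary)
    if secondary and secondary[1] > 0.80:   # Secondary threshold
        candidates.append(secondary)
    if multimodal and multimodal[1] > 0.95: # multimodal threshold
        candidates.append(multimodal)
    if not candidates:
        return "Unclassified"
    # Prefer the valid prediction with the highest confidence score.
    return max(candidates, key=lambda c: c[1])[0]

# Worked example ("Snare Down.cm_1_03.wav"): the multimodal prediction wins.
print(final_label(("Drums", 0.94), ("Snare", 0.90), ("Snare Down", 0.96)))  # -> "Snare Down"
```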
The performance of our audio spectral classification system was evaluated across the five test projects of varying complexity. As shown in Table 2, our models demonstrated promising results during training and validation. The Primary classification model achieved 83.89% accuracy on the MUSDB18HQ validation set, effectively distinguishing between the four main instrument categories. The specialized Drums classifier demonstrated excellent performance with 95.67% accuracy, while the Other instruments classifier reached 89.10% accuracy. The multimodal classifier, despite having to distinguish between 27 different instrument classes, still achieved a respectable 82.19% accuracy on the created Cambridge validation dataset.
When applied to real-world multitrack projects, performance varied with project complexity (Table 3). Song 1 (10 tracks) and Song 2 (22 tracks) achieved 90.00% and 100.00% final accuracy, respectively. For larger projects, Song 3 (34 tracks) reached 70.58%, while Song 4 (44 tracks) and Song 5 (60 tracks) showed 47.72% and 63.33% accuracy, respectively. However, many of the errors stem from very closely related classes, such as “LeadVox” and “Backing”, or from cases where the Primary class is identified correctly but the multimodal or Secondary module is uncertain; in such cases the per-instrument metric is penalized even though our system still partially places the track in the correct group of instruments.
The Primary classification model has the overall highest performance of all the models (78.33–100.00%), a result consistent with our initial assumption that it is easier to identify tracks by their main groups. The multimodal approach showed considerable variation (45.45–100.00%), performing well on projects with consistent naming conventions but struggling with ambiguous or misleading track names. The Secondary classification models showed lower performance (47.05–61.66%) compared to the Primary model, highlighting the difficulty of distinguishing between closely related instruments.

6.2. Pattern Recognition Results

The pattern recognition system’s performance, as shown in Table 4, demonstrated consistent excellence in identifying fake stereo tracks, achieving 100.00% accuracy across all projects regardless of size or complexity. This perfect detection of fake stereo content (identical mono signals duplicated across left/right channels) indicates the robustness of our cross-correlation approach in identifying signal duplication. For split stereo track identification and pairing, performance ranged from 62.00% to 100.00%, with a clear trend showing higher accuracy in smaller projects. Song 1 (10 tracks) achieved perfect stereo pair identification, while the largest project, Song 5 (60 tracks), showed the lowest accuracy at 62.00%. This decline in performance correlates with increasing project complexity and the presence of more varied instrument timbres within the same project.
The observed decline in stereo pairing accuracy can be attributed to several factors. First, larger projects typically feature more overlapping frequency content across multiple instruments, increasing the challenge of matching related tracks based on spectral similarity. Second, in complex arrangements like Song 4 and Song 5, stereo pairs often feature intentional processing differences between left and right channels (such as different EQ or compression settings), which reduces correlation values below our matching thresholds. Third, the presence of ambient microphones in larger sessions creates significant cross-talk between tracks, making clean track-to-track matching more difficult. Complex arrangements such as quadruple-tracked guitars or drum room recordings present inherent challenges for automated stereo processing. Quadruple-tracked guitars typically exhibit accurate musical performance with tonal differences resulting from varied microphone positions or amplifier settings. Similarly, room microphones capture similar source material with variations in delay (due to distance) and spectral characteristics from room acoustics. When correlation coefficients exceed our threshold, these tracks may be incorrectly grouped for stereo pairing. The system cannot distinguish between signals that should remain separate, such as two slightly different microphones on a kick drum, potentially resulting in false positive merging. These limitations highlight the challenge of automatically balancing efficiency with the nuanced requirements of complex multi-microphone recording scenarios. When the system incorrectly merges two unrelated mono tracks into a stereo pair, engineers must manually separate them, which adds to the correction workload.
The perfect fake stereo detection across all projects has significant practical implications, as identifying and collapsing fake stereo tracks into mono is a time-consuming task when performed manually. This capability alone eliminates a substantial portion of the organization workload, particularly in projects with limited recording resources where producers often create fake stereo tracks as placeholders during the arrangement process.

6.3. Time Efficiency

The time efficiency results (Table 5) show that professional engineers spent between 9 min (Song 1) and 37 min (Song 5) on manual session preparation, while amateur engineers required 15–50 min. The automated system times (Auto) in Table 5 represent the sum of two components: the system’s automated processing time (approximately 10 s per track) and the manual error correction time (calculated as the professional engineer’s time multiplied by the error rate of our classification systems). This approach provides a realistic assessment of the total time required when implementing our automation system in real-world production environments.
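The Auto-column model described above amounts to the simple calculation sketched below; the per-track processing time is the approximate value quoted in the text, and all other arguments are placeholders rather than values from Table 5.

```python
def auto_time_minutes(n_tracks: int, pro_time_min: float, error_rate: float,
                      sec_per_track: float = 10.0) -> float:
    """Theoretical automated preparation time: processing plus manual error correction."""
    processing = n_tracks * sec_per_track / 60.0   # automated processing (~10 s per track)
    correction = pro_time_min * error_rate         # engineer time spent fixing misclassifications
    return processing + correction
```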
Overall, our automation system reduced total preparation time by approximately 70% compared to the average professional and amateur engineer times, averaged across all test projects. The largest efficiency gains were observed in smaller projects (Song 1 and Song 2), where the classification accuracy was highest and minimal error correction was needed.
To validate our theoretical time calculations, we conducted actual user testing with professional engineers using our automation system. Video recordings of the complete session preparation process for both manual and automated workflows are available in our Supplementary Materials. For the 60-track project (Song 5), the recorded automated workflow completion time was approximately 7 min (compared to an initial time of approximately 37 min for manual work), which falls below our theoretical error-adjusted calculation of 16:49 min shown in Table 5. This discrepancy suggests our theoretical model may be conservative, as it assumes engineers spend equal time correcting each classification error, while in practice, experienced engineers can often identify and correct multiple related errors simultaneously. The video documentation available in our Supplementary Materials confirms the substantial practical time savings achieved by our automation system in real-world usage scenarios.
These results demonstrate that even with imperfect accuracy requiring some manual intervention, the automation system provides substantial workflow efficiency improvements across various project sizes and complexity levels. The time saved could be redirected toward creative aspects of the mixing process, potentially enhancing both productivity and artistic outcomes in professional audio production environments.

7. Discussion

7.1. System Performance and Limitations

Our results demonstrate that Machine Learning-assisted automation can significantly reduce session preparation time while maintaining acceptable accuracy. However, several limitations emerged:
  • Classification Accuracy for Complex Projects: Accuracy declined for larger projects, particularly Song 4 (47.72% accuracy), suggesting that nearly half of its tracks required manual correction. Future work should include training with more diverse datasets representing real-world production environments.
  • Handling of Ambiguous Track Names: The multimodal classifier struggled with inconsistent or ambiguous naming conventions, especially with non-English or technical abbreviations. Multilingual text processing could improve text-based classification.
  • Orchestral Track Classification: Song 4 presented unique challenges with its orchestral content using abbreviated labels (“v1”, “v2” for violin sections) and significant microphone bleed from the live recording environment. Our models, trained primarily on isolated recordings, struggled with this cross-instrument contamination. Training with orchestral recordings containing controlled amounts of bleed could address this limitation.
  • Stereo Pair Detection: The system showed decreased accuracy in identifying stereo pairs within dense arrangements. Additional spectral and phase relationship features could enhance performance in these scenarios.
  • Error Propagation: Classification errors affected subsequent processing steps, particularly gain staging. Incorporating confidence scores and adaptive processing could minimize these impacts.
Despite these limitations, the substantial time savings (70% reduction) suggest that even with imperfect accuracy, the automation provides significant value as an assistive tool that accelerates initial setup while allowing engineers to make necessary manual corrections.

7.2. Workflow Integration and User Experience

Currently, our implementation processes WAV files directly rather than integrating with commercial DAW platforms. This limitation stems from restricted access to DAW development kits and proprietary session formats. Future iterations could integrate directly within popular DAW environments as plugins or extensions, eliminating the current two-step process.
Even without direct DAW integration, our evaluation revealed key insights about user experience:
  • Engineers preferred a two-stage approach with automated initial processing followed by manual correction opportunities.
  • Visual confidence indicators for classification decisions would enhance user trust and facilitate efficient corrections.
  • Engineers initially showed resistance to workflow changes but reported increased acceptance after experiencing the time savings.
  • A system that learns from manual corrections could significantly enhance accuracy and user satisfaction.
Alternative integration strategies include creating standalone applications that generate DAW-specific import templates, middleware solutions interfacing between our engine and DAW session files, or exploring open-source DAW platforms.

7.3. Practical Implications

The demonstrated efficiency improvements have several practical implications:
  • Enhanced Productivity: The 70% time reduction allows engineers to handle more projects or allocate more time to creative decisions.
  • Educational Applications: The system could serve as a teaching tool demonstrating professional organization standards and optimal gain staging practices.
  • Remote Collaboration: In collaborative environments, our system could establish standardized configurations, reducing compatibility issues.
  • Scalability: The most substantial time savings occurred in large projects (60+ tracks), making our system particularly valuable for complex productions.
Our findings suggest that targeted automation of mechanical tasks can significantly enhance workflow efficiency without compromising creative control, addressing a critical pain point in contemporary audio production environments.

7.4. Future Work

Several directions for future work emerge from our findings:
  • Expanding training datasets to include more diverse recording conditions, particularly live multitrack recordings with controlled microphone bleed;
  • Developing direct integration with commercial DAW platforms through plugins or extensions;
  • Implementing adaptive learning capabilities to incorporate engineer feedback and corrections;
  • Extending the classification system to handle more specialized instrument categories and ensemble recordings;
  • Adding perceptually informed metrics for evaluating gain staging effectiveness (see the loudness-deviation sketch after this list).
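For the last direction, one simple perceptually informed metric is the average deviation of each track's integrated loudness (ITU-R BS.1770 [44]) from the target value for its class. The sketch below assumes the soundfile and pyloudnorm libraries and a placeholder target table standing in for the GA-optimized values; it is not the evaluation metric used in this study.

```python
import numpy as np
import soundfile as sf
import pyloudnorm as pyln  # ITU-R BS.1770 integrated loudness

def mean_lufs_deviation(track_paths, track_classes, class_targets):
    """Mean absolute deviation (in LU) between each track's integrated
    loudness and the target LUFS for its class."""
    deviations = []
    for path in track_paths:
        data, rate = sf.read(path)
        loudness = pyln.Meter(rate).integrated_loudness(data)
        deviations.append(abs(loudness - class_targets[track_classes[path]]))
    return float(np.mean(deviations))

# Hypothetical usage:
# mean_lufs_deviation(["Kick.wav"], {"Kick.wav": "Kick"}, {"Kick": -18.0})
```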
By addressing these opportunities, future versions could achieve higher accuracy for complex projects while maintaining the substantial efficiency benefits already demonstrated. Additional research into transfer learning approaches could further improve classification performance on limited datasets, while enhanced multimodal analysis techniques might better handle ambiguous naming conventions encountered in collaborative production environments.

8. Conclusions

This paper presented a Machine Learning-assisted automation system for optimizing session preparation time in Digital Audio Workstations. By combining audio content classification, pattern recognition, and intelligent gain staging, the system addresses the time-consuming mechanical aspects of session setup while preserving creative control for engineers. The experimental evaluation demonstrated significant time savings with minimal impact on session quality as assessed by professional engineers. These results suggest that targeted automation, informed by both Neural Networks and rule-based heuristics, can effectively optimize workflow efficiency in professional audio production environments. The key contributions of this work include the following:
  • A comprehensive automation framework addressing multiple aspects of session preparation;
  • An effective audio classification model optimized for music production;
  • A rule-based pattern recognition system for identifying track relationships;
  • A gain staging approach based on optimized target values and content characteristics;
  • Empirical validation of the system’s performance in professional contexts.
Our results highlight how practical automation, even with imperfect accuracy, can substantially enhance workflow efficiency in real-world music production environments. For example, preparation tasks that typically require 37 min for a 60-track project can be completed in under 17 min. While classification performance remains robust for smaller projects (90–100% accuracy), larger and more complex projects still present challenges, particularly with live orchestral recordings containing significant microphone bleed.

Supplementary Materials

The following supporting information can be downloaded at: https://drive.google.com/drive/folders/1xRrmXBOT_hJomdu_4J9zM6xjaIOz3nYQ?usp=sharing, accessed on 13 April 2025.

Author Contributions

Conceptualization, B.M. and M.N.; methodology, B.M., M.N. and G.N.; software, M.N., G.N. and H.S.I.; validation, M.N., G.N. and C.P.; data curation, B.M. and M.N.; writing—original draft preparation, B.M., M.N. and G.N.; writing—review and editing, C.P.; formal analysis, C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUROC: Area Under the Receiver Operating Characteristic
CNN: Convolutional Neural Network
DAW: Digital Audio Workstation
EQ: Equalization
FFT: Fast Fourier Transform
GA: Genetic Algorithm
IMP: Intelligent Music Production
LOEV: Leave-One-EquiVariant
LSTM: Long Short-Term Memory
LU: Loudness Units
LUFS: Loudness Units Full Scale
MFCC: Mel-Frequency Cepstral Coefficients
MLP: Multi-Layer Perceptron
NLP: Natural Language Processing
NN: Neural Network
ResNet: Residual Network
RMS: Root Mean Square
STFT: Short-Time Fourier Transform

References

  1. Kirby, P. The Evolution and Decline of the Traditional Recording Studio; The University of Liverpool: Liverpool, UK, 2015. [Google Scholar]
  2. Music Business Worldwide. Over 100,000 Tracks Are Now Being Uploaded to Streaming Services like Spotify Each Day. Available online: https://www.musicbusinessworldwide.com/its-happened-100000-tracks-are-now-being-uploaded/ (accessed on 13 April 2025).
  3. McGarry, G.; Tolmie, P.; Benford, S.; Greenhalgh, C.; Chamberlain, A. “They’re all going out to something weird” Workflow, Legacy and Metadata in the Music Production Process. In Proceedings of the ACM Conference on Computer Supported Cooperative Work and Social Computing, Portland, OR, USA, 25 February–1 March 2017; pp. 995–1008. [Google Scholar]
  4. U.S. Bureau of Labor Statistics. Occupational Employment and Wage Statistics: Sound Engineering Technicians. Available online: https://www.bls.gov/oes/current/oes274014.htm (accessed on 13 April 2025).
  5. Izhaki, R. Mixing Audio: Concepts, Practices, and Tools; Routledge: London, UK, 2017. [Google Scholar]
  6. Jillings, N. Automating the Production of the Balance Mix in Music Production; Birmingham City University: Birmingham, UK, 2023. [Google Scholar]
  7. Pras, A.; Guastavino, C.; Lavoie, M. The impact of technological advances on recording studio practices. J. Am. Soc. Inf. Sci. Technol. 2013, 64, 612–626. [Google Scholar] [CrossRef]
  8. Vanka, S.; Safi, M.; Roll, J.-B.; Fazekas, G. Adoption of AI technology in the music mixing workflow: An investigation. arXiv 2023, arXiv:2304.03407. [Google Scholar]
  9. De Man, B.; Reiss, J.D.; Stables, R. Ten years of automatic mixing. In Proceedings of the 3rd Workshop on Intelligent Music Production, Salford, UK, 15 September 2017. [Google Scholar]
  10. De Man, B.; Reiss, J. A semantic approach to autonomous mixing. J. Art Rec. Prod. (JARP) 2013, 8, 1–23. [Google Scholar]
  11. Birtchnell, T. Listening without ears: Artificial intelligence in audio mastering. Big Data Soc. 2018, 5, 2053951718808553. [Google Scholar] [CrossRef]
  12. Moroșanu, B.; Negru, M.; Paleologu, C. Automated Personalized Loudness Control for Multi-Track Recordings. Algorithms 2024, 17, 228. [Google Scholar] [CrossRef]
  13. Harding, P. Top-Down Mixing—A 12-Step Mixing Program. In Mixing Music; Routledge: London, UK, 2016; pp. 82–96. [Google Scholar]
  14. Pestana, P.D.; Reiss, J.D. Intelligent audio production strategies informed by best practices. In Proceedings of the AES 53rd International Conference, London, UK, 27–29 January 2014. [Google Scholar]
  15. De Man, B.; Mora-Mcginity, M.; Fazekas, G.; Reiss, J.D. The open multitrack testbed. In Proceedings of the 2nd AES Workshop on Intelligent Music Production, London, UK, 13 September 2016. [Google Scholar]
  16. Moffat, D.; Sandler, M.B. Approaches in intelligent music production. Arts 2019, 8, 125. [Google Scholar] [CrossRef]
  17. Humphrey, E.J.; Bello, J.P. From music audio to chord tablature: Teaching deep convolutional networks to play guitar. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014. [Google Scholar]
  18. Han, Y.; Lee, K. Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation. IEEE Signal Process. Lett. 2016, 23, 1649–1653. [Google Scholar]
  19. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
  20. Choi, K.; Fazekas, G.; Sandler, M.; Cho, K. Transfer learning for music classification and regression tasks. In Proceedings of the 18th ISMIR Conference, Suzhou, China, 23–27 October 2017. [Google Scholar]
  21. Pons, J.; Nieto, O.; Prockup, M.; Schmidt, E.M.; Ehmann, A.F.; Serra, X. End-to-end learning for music audio tagging at scale. In Proceedings of the 19th ISMIR Conference, Paris, France, 23–27 September 2018. [Google Scholar]
  22. Ghosal, D.; Kolekar, M.H. Music genre recognition using deep neural networks and transfer learning. Interspeech 2018, 2087–2091. [Google Scholar] [CrossRef]
  23. Tzanetakis, G.; Cook, P. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 2002, 10, 293–302. [Google Scholar] [CrossRef]
  24. Guinot, J.; Quinton, E.; Fazekas, G. Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations. arXiv 2024, arXiv:2412.18955. [Google Scholar]
  25. Hasumi, T.; Komatsu, T.; Fujita, Y. Music Tagging with Classifier Group Chains. arXiv 2025, arXiv:2501.05050. [Google Scholar]
  26. Bogdanov, D.; Won, M.; Tovstogan, P.; Porter, A.; Serra, X. The MTG-Jamendo dataset for automatic music tagging. In Proceedings of the Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019. [Google Scholar]
  27. Katz, B. Mastering Audio: The Art and the Science, 3rd ed.; Focal Press: Oxford, UK, 2020; pp. 106–131. [Google Scholar]
  28. Ma, Z.; De Man, B.; Pestana, P.D.; Black, D.A.; Reiss, J.D. Intelligent multitrack dynamic range compression. J. Audio Eng. Soc. 2015, 63, 412–426. [Google Scholar] [CrossRef]
  29. Hafezi, S.; Reiss, J.D. Autonomous multitrack equalization based on masking reduction. J. Audio Eng. Soc. 2019, 67, 96–107. [Google Scholar] [CrossRef]
  30. Moffat, D.; Sandler, M.B. Machine learning multitrack gain mixing of drums. In Proceedings of the 147th Audio Engineering Society Convention, New York, NY, USA, 16–19 October 2019. [Google Scholar]
  31. De Man, B.; Reiss, J.D. A knowledge-engineered autonomous mixing system. In Proceedings of the 135th Audio Engineering Society Convention, New York, NY, USA, 17–20 October 2013. [Google Scholar]
  32. Tot, J. Multitrack Mixing: An Investigation into Music Mixing Practices. Master’s Thesis, University of York, York, UK, 2018. Available online: https://www.york.ac.uk (accessed on 13 April 2025).
  33. Stickland, M.; De Man, B.; Fazekas, G. A New Audio Mixing Paradigm: Creative Interaction in Real-Time Remote Collaboration. Creat. Ind. J. 2022, 15, 17–34. [Google Scholar]
  34. Bell, A. Beyond Skeuomorphism: The Evolution of DAW Interfaces. J. Art Rec. Prod. 2018, 13, 1–12. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Negru, M.; Moroşanu, B.; Neacşu, A.; Drăghicescu, D.; Negrescu, C. Automatic Audio Upmixing Based on Source Separation and Ambient Extraction Algorithms. In Proceedings of the 2023 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, Romania, 25–27 October 2023; pp. 12–17. [Google Scholar]
  37. Slizovskaia, O.; Kim, L.; Haro, G.; Gomez, E. End-to-end sound source separation conditioned on instrument labels. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 306–310. [Google Scholar]
  38. Baltrusaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  39. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  40. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 689–696. [Google Scholar]
  41. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar]
  42. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259. [Google Scholar] [CrossRef] [PubMed]
  43. Oramas, S.; Nieto, O.; Barbieri, F.; Serra, X. Multi-label music genre classification from audio, text, and images using deep features. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 23–30. [Google Scholar]
  44. ITU-R BS.1770-4; Algorithms to Measure Audio Programme Loudness and True-Peak Audio Level. ITU: Geneva, Switzerland, 2011.
  45. Ward, D.; Reiss, J.D.; Athwal, C. Multitrack Mixing Using a Model of Loudness and Partial Loudness. In Proceedings of the AES 133rd Convention, San Francisco, CA, USA, 26–29 October 2012. [Google Scholar]
  46. Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014. [Google Scholar]
Figure 1. Audio spectral classification system. We used three ResNet-18 models: one for Primary classification and one each for the Drums and Others Secondary classifications. The input to each model is a magnitude STFT with shape [Channel × Frequency × Time]; only the number of output classes differs per network (4 classes for Primary, 5 for Secondary Drums, and 3 for Secondary Others).
Figure 3. Inference pipeline for track classification. The system processes both audio content and track names, selecting the top 3 RMS windows from each audio track for analysis. The Primary classifier (top) determines the broad instrument category with a probability threshold of p1_max > 0.90. When Drums or Other is detected, a Secondary classifier provides detailed identification with a probability threshold of p2_max > 0.80. The multimodal model (bottom) combines audio and text data with a probability threshold of p3_max > 0.95. The final class is determined by comparing confidence scores from valid predictions.
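As a reading aid for the caption above, the following minimal sketch expresses the thresholded cascade and confidence comparison as a simple decision rule. It is a simplified interpretation of the pipeline, not the exact logic used in the system.

```python
def fuse_predictions(primary, secondary, multimodal,
                     t1=0.90, t2=0.80, t3=0.95):
    """Combine (label, probability) pairs from the Primary, Secondary, and
    multimodal classifiers. A prediction counts as valid only if it exceeds
    its threshold; the most confident valid prediction wins."""
    candidates = []
    if primary and primary[1] > t1:
        candidates.append(primary)
    if secondary and secondary[1] > t2:   # only set when Drums/Others is detected
        candidates.append(secondary)
    if multimodal and multimodal[1] > t3:
        candidates.append(multimodal)
    if not candidates:
        return ("Unclassified", 0.0)
    return max(candidates, key=lambda c: c[1])

# Example: fuse_predictions(("Drums", 0.93), ("Kick", 0.88), ("KickSample", 0.97))
# -> ("KickSample", 0.97)
```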
Table 1. Classes for audio and multimodal classification.

Primary (Audio) | Secondary (Audio) | Instrument (Multimodal)
Drums | Kick | Kick, KickOut, KickSample
Drums | Snare | Snare, SnareDown
Drums | Tom | Tom, FloorTom
Drums | OH | HiHat, Ride, OH, Room
Drums | Percussion | PercLow, PercHigh
Bass | Bass | ElectricBass, AccBass, SynthBass
Others | Guitars | AccGuitar, ElGuitar
Others | Keyboards | Keyboards, Piano, Electric Piano
Others | Orchestral | Woodwinds, Brass, Strings, Sfx
Vocals | Vocals | LeadVox, Backing
Table 2. Validation accuracy for spectral and multimodal classification models.

Model | Dataset | Input Type | Classes | Accuracy [%]
Audio Spectral Classification Models
ResNet-18 Primary | MUSDB18HQ [46] | STFT | 4 | 83.89
ResNet-18 Drums | Custom Drums | STFT | 5 | 95.67
ResNet-18 Others | Custom Other | STFT | 3 | 89.10
Multimodal Classification Model
Multimodal | Cambridge [15] | Text + STFT | 27 | 82.19
Table 3. Classification accuracy for test projects. Sub Acc refers to Secondary classification accuracy (Drums or Others, if present), Main Acc to Primary classification accuracy, Multi Acc to multimodal classification accuracy, and Final Acc to the combined system accuracy.

Song | Tracks | Sub Acc [%] | Main Acc [%] | Multi Acc [%] | Final Acc [%]
Song 1 | 10 | 60.00 | 100.00 | 90.00 | 90.00
Song 2 | 22 | 59.09 | 81.81 | 100.00 | 100.00
Song 3 | 34 | 47.05 | 85.29 | 61.76 | 70.58
Song 4 | 44 | 59.09 | 79.54 | 45.45 | 47.72
Song 5 | 60 | 61.66 | 78.33 | 78.33 | 68.33
Table 4. Stereo processing accuracy for each song.

Song | Tracks | StereoSplit [%] | FakeStereo [%]
Song 1 | 10 | 100.00 | 100.00
Song 2 | 22 | 66.66 | 100.00
Song 3 | 34 | 80.00 | 100.00
Song 4 | 44 | 66.66 | 100.00
Song 5 | 60 | 62.00 | 100.00
Table 5. Average session preparation time (in minutes:seconds) across different proficiency levels and project sizes.

Level | Task | Song 1 | Song 2 | Song 3 | Song 4 | Song 5
Pro | Rename | 2:00 | 4:00 | 7:00 | 10:00 | 11:00
Pro | Fake Mono | 1:00 | 1:00 | 2:00 | 3:00 | 3:00
Pro | Stereo Join | 2:00 | 2:00 | 2:00 | 6:00 | 6:00
Pro | Gain | 5:00 | 9:00 | 14:00 | 18:00 | 17:00
Pro | Total | 9:00 | 15:00 | 24:00 | 36:00 | 37:00
Amateur | Rename | 3:00 | 7:00 | 11:00 | 12:00 | 16:00
Amateur | Fake Mono | 1:00 | 2:00 | 4:00 | 4:00 | 4:00
Amateur | Stereo Join | 3:00 | 4:00 | 6:00 | 6:00 | 7:00
Amateur | Gain | 8:00 | 10:00 | 19:00 | 24:00 | 24:00
Amateur | Total | 15:00 | 22:00 | 39:00 | 45:00 | 50:00
Auto | Rename | 0:20 | 0:06 | 2:47 | 5:58 | 4:33
Auto | Fake Mono | 0:10 | 0:22 | 0:34 | 0:44 | 1:00
Auto | Stereo Join | 1:00 | 3:12 | 4:14 | 6:24 | 8:28
Auto | Gain | 0:26 | 0:40 | 0:52 | 1:22 | 2:47
Auto | Total | 1:41 | 4:20 | 8:27 | 14:28 | 16:49
Time Reduction | | 84.53% | 77.78% | 74.01% | 65.14% | 61.79%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
