A Machine Learning-Assisted Automation System for Optimizing Session Preparation Time in Digital Audio Workstations
Abstract
1. Introduction
- Track identification and labeling;
- Format normalization (sample rate and bit depth);
- Redundant track identification and cleanup;
- Organization of related tracks (e.g., grouping similar tracks);
- Initial gain staging and level setting.
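Taken together, these tasks form a linear preparation pipeline. As a shape-only illustration of that ordering (every function and field below is a hypothetical stub, not the implementation described in this paper):

```python
from dataclasses import dataclass

@dataclass
class Track:
    name: str                 # raw filename from the session
    label: str | None = None  # instrument label assigned in Section 4.2
    group: str | None = None  # instrument family assigned in Section 4.3
    gain_db: float = 0.0      # initial gain set in Section 4.4

# Each stage is a placeholder; the real system replaces these bodies.
def classify_and_label(tracks):   # Section 4.2: audio + filename classification
    return tracks

def normalize_formats(tracks):    # resample / re-quantize to session settings
    return tracks

def remove_redundant(tracks):     # drop fake-stereo duplicates and empty tracks
    return tracks

def organize_groups(tracks):      # Section 4.3: group related tracks
    return tracks

def stage_gains(tracks):          # Section 4.4: set initial gain levels
    return tracks

def prepare_session(tracks: list[Track]) -> list[Track]:
    """Run the five preparation stages in the order listed above."""
    for stage in (classify_and_label, normalize_formats, remove_redundant,
                  organize_groups, stage_gains):
        tracks = stage(tracks)
    return tracks
```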
2. Related Work
2.1. Automation in Audio Production Workflows
2.2. Machine Learning in Audio Content Analysis
2.3. Gain Staging and Level Optimization
3. Problem Formulation
Identified Inefficiencies in Current Workflows
1. Manual Track Identification and Labeling: Engineers spend considerable time identifying instruments from file names and waveform appearance and applying appropriate labels, a task that scales linearly with track count.
2. Inefficient Track Organization: Sorting tracks into instrument families (drums, bass, guitars, etc.) is typically performed manually despite following consistent patterns across projects.
3. Redundant Channel Configuration: Significant time is spent identifying and eliminating fake stereo channels (identical mono signals panned hard left/right) and properly configuring genuine stereo content (a detection sketch follows this list).
4. Initial Gain Staging: Setting appropriate initial gain levels for different instrument types relies heavily on audio engineering experience and often requires multiple readjustments during the mixing process.
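The fake stereo problem in item 3 admits a simple signal-level test. The following is a minimal sketch under our own assumptions (soundfile for audio I/O and a hypothetical −60 dB difference threshold); the system's actual rules may differ:

```python
import numpy as np
import soundfile as sf  # assumed I/O library; any PCM reader works

def is_fake_stereo(path: str, tolerance_db: float = -60.0) -> bool:
    """Heuristic check for a 'fake stereo' file: two channels carrying the
    same mono signal. Returns True when the inter-channel difference is
    negligible relative to the channel energy."""
    data, _ = sf.read(path, always_2d=True)
    if data.shape[1] != 2:
        return False  # mono or multichannel files cannot be fake stereo
    left, right = data[:, 0], data[:, 1]
    # RMS of the inter-channel difference, relative to the mean channel RMS
    diff_rms = np.sqrt(np.mean((left - right) ** 2))
    ref_rms = np.sqrt((np.mean(left ** 2) + np.mean(right ** 2)) / 2) + 1e-12
    diff_db = 20.0 * np.log10(diff_rms / ref_rms + 1e-12)
    return diff_db < tolerance_db
```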
4. Methodology
4.1. System Overview
4.2. Track Identification and Labeling
4.2.1. Audio Spectral Classification
- Converting the input layer to accept a single-channel magnitude spectrogram computed from 3 s of audio, using 512 samples for both the FFT size and the hop length;
- Adjusting the final fully connected layer to output probabilities for the distinct instrument classes;
- Employing class weighting to address the imbalanced distribution of instruments typically found in multitrack recordings (a configuration sketch follows this list).
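A minimal PyTorch sketch of these three adaptations, assuming torchvision's ResNet-18 implementation; the class counts used for the inverse-frequency weighting are illustrative placeholders, not our dataset's statistics:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_spectrogram_resnet(num_classes: int = 4) -> nn.Module:
    """ResNet-18 adapted for single-channel magnitude spectrograms."""
    model = resnet18(weights=None)
    # Replace the 3-channel RGB stem with a 1-channel input convolution.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                            padding=3, bias=False)
    # Replace the classifier head; it outputs per-class logits
    # (softmax is applied inside the loss).
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Class weighting for the imbalanced instrument distribution:
# inverse-frequency weights fed to the loss (counts are placeholders).
class_counts = torch.tensor([120.0, 45.0, 30.0, 15.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```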
4.2.2. Multimodal Classification
- A Convolutional Neural Network with six convolutional layers for processing audio spectrograms, inspired by architectures that have shown success in music information retrieval tasks.
- A text embedding component utilizing the Sentence-BERT MiniLM-L6-v2 model [39] to generate dense vector representations (384 dimensions) of track filename text after preprocessing.
- A fusion module that concatenates features from both modalities and processes them through fully connected layers, following established practices in multimodal fusion [40].
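A condensed sketch of this architecture, assuming the sentence-transformers implementation of the MiniLM model; the fusion layer sizes and dropout rate shown here are illustrative assumptions rather than the trained configuration:

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class MultimodalTrackClassifier(nn.Module):
    """Late fusion: CNN spectrogram features concatenated with 384-dim
    Sentence-BERT embeddings of the preprocessed track filename."""

    def __init__(self, audio_encoder: nn.Module, audio_dim: int,
                 num_classes: int = 27, text_dim: int = 384):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g., a 6-layer CNN backbone
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, spectrogram: torch.Tensor, text_emb: torch.Tensor):
        audio_feat = self.audio_encoder(spectrogram)
        fused = torch.cat([audio_feat, text_emb], dim=-1)
        return self.fusion(fused)

# Filename embedding with the MiniLM model named in the paper [39].
text_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = torch.from_numpy(text_model.encode(["kick in"]))  # shape: (1, 384)
```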
4.3. Track Organization and Pattern Recognition
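A rule-based pass over track names can be illustrated with a short sketch; the suffix conventions and prefix-grouping heuristic below are assumptions for illustration, not the exact rule set used by the system:

```python
import re
from collections import defaultdict

# Hypothetical left/right naming conventions for stereo pairs.
PAIR_SUFFIXES = [(" l", " r"), ("_l", "_r"), (".l", ".r"), (" left", " right")]

def find_stereo_pairs(track_names: list[str]) -> list[tuple[str, str]]:
    """Pair tracks whose names differ only by a left/right suffix."""
    lowered = {name.lower(): name for name in track_names}
    pairs = []
    for low, original in lowered.items():
        for left_sfx, right_sfx in PAIR_SUFFIXES:
            if low.endswith(left_sfx):
                partner = low[: -len(left_sfx)] + right_sfx
                if partner in lowered:
                    pairs.append((original, lowered[partner]))
    return pairs

def group_by_prefix(track_names: list[str]) -> dict[str, list[str]]:
    """Group multi-microphone tracks (e.g., 'Kick In', 'Kick Out') by the
    leading word of the track name."""
    groups = defaultdict(list)
    for name in track_names:
        key = re.split(r"[\s_\-]+", name.strip())[0].lower()
        groups[key].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}
```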
4.4. Gain Staging Module
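The gain staging step can be illustrated by measuring each track's integrated loudness per ITU-R BS.1770 and computing the offset to a per-class target. The targets below are hypothetical placeholders; the system derives its optimized values from content characteristics. A sketch using the pyloudnorm library:

```python
import numpy as np
import pyloudnorm as pyln  # ITU-R BS.1770 loudness implementation

# Hypothetical per-class LUFS targets, for illustration only.
TARGET_LUFS = {"Drums": -18.0, "Bass": -16.0, "Vocals": -14.0, "Others": -18.0}

def initial_gain_db(audio: np.ndarray, rate: int, instrument: str) -> float:
    """Gain offset (dB) that moves a track's integrated loudness to the
    target for its instrument class."""
    meter = pyln.Meter(rate)                     # K-weighted BS.1770 meter
    loudness = meter.integrated_loudness(audio)  # LUFS (-inf for silence)
    return TARGET_LUFS.get(instrument, -18.0) - loudness
```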
5. Experimental Setup
5.1. Training Dataset for Audio Spectral Classification
5.2. Evaluation Dataset for the Entire System
- Small project (10 tracks): “Song 1”—A pop production with minimal instrumentation;
- Small project (22 tracks): “Song 2”—A folk-rock arrangement with acoustic and electric elements;
- Medium project (34 tracks): “Song 3”—An indie rock production with multiple guitar layers;
- Medium project (44 tracks): “Song 4”—A complex rock production with numerous orchestral tracks;
- Large project (60 tracks): “Song 5”—A full band recording with extensive drum microphones and layered instruments.
- Correct instrument labels for all 170 tracks across the dataset;
- Identification of related tracks (stereo pairs, multi-microphone setups);
- Classification of redundant tracks and fake stereo channels.
6. Results
6.1. Classification Performance
6.2. Pattern Recognition Results
6.3. Time Efficiency
7. Discussion
7.1. System Performance and Limitations
- Classification Accuracy for Complex Projects: Accuracy declined for larger projects, particularly Song 4 (47.72% accuracy), meaning that more than half of its tracks required manual correction. Future work should include training with more diverse datasets representing real-world production environments.
- Handling of Ambiguous Track Names: The multimodal classifier struggled with inconsistent or ambiguous naming conventions, especially with non-English or technical abbreviations. Multilingual text processing could improve text-based classification.
- Orchestral Track Classification: Song 4 presented unique challenges with its orchestral content using abbreviated labels (“v1”, “v2” for violin sections) and significant microphone bleed from the live recording environment. Our models, trained primarily on isolated recordings, struggled with this cross-instrument contamination. Training with orchestral recordings containing controlled amounts of bleed could address this limitation.
- Stereo Pair Detection: The system showed decreased accuracy in identifying stereo pairs within dense arrangements. Additional spectral and phase relationship features could enhance performance in these scenarios.
- Error Propagation: Classification errors affected subsequent processing steps, particularly gain staging. Incorporating confidence scores and adaptive processing could minimize these impacts.
7.2. Workflow Integration and User Experience
- Engineers preferred a two-stage approach with automated initial processing followed by manual correction opportunities.
- Visual confidence indicators for classification decisions would enhance user trust and facilitate efficient corrections.
- Engineers initially showed resistance to workflow changes but reported increased acceptance after experiencing the time savings.
- A system that learns from manual corrections could significantly enhance accuracy and user satisfaction.
7.3. Practical Implications
- Enhanced Productivity: The 70% time reduction allows engineers to handle more projects or allocate more time to creative decisions.
- Educational Applications: The system could serve as a teaching tool demonstrating professional organization standards and optimal gain staging practices.
- Remote Collaboration: In collaborative environments, our system could establish standardized configurations, reducing compatibility issues.
- Scalability: The most substantial time savings occurred in large projects (60+ tracks), making our system particularly valuable for complex productions.
7.4. Future Work
- Expanding training datasets to include more diverse recording conditions, particularly live multitrack recordings with controlled microphone bleed;
- Developing direct integration with commercial DAW platforms through plugins or extensions;
- Implementing adaptive learning capabilities to incorporate engineer feedback and corrections;
- Extending the classification system to handle more specialized instrument categories and ensemble recordings;
- Adding perceptually informed metrics for evaluating gain staging effectiveness.
8. Conclusions
- A comprehensive automation framework addressing multiple aspects of session preparation;
- An effective audio classification model optimized for music production;
- A rule-based pattern recognition system for identifying track relationships;
- A gain staging approach based on optimized target values and content characteristics;
- Empirical validation of the system’s performance in professional contexts.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
AUROC | Area Under the Receiver Operating Characteristic
CNN | Convolutional Neural Network
DAW | Digital Audio Workstation
EQ | Equalization
FFT | Fast Fourier Transform
GA | Genetic Algorithm
IMP | Intelligent Music Production
LOEV | Leave-One-EquiVariant
LSTM | Long Short-Term Memory
LU | Loudness Units
LUFS | Loudness Units relative to Full Scale
MFCC | Mel-Frequency Cepstral Coefficients
MLP | Multi-Layer Perceptron
NLP | Natural Language Processing
NN | Neural Network
ResNet | Residual Network
RMS | Root Mean Square
STFT | Short-Time Fourier Transform
References
- Kirby, P. The Evolution and Decline of the Traditional Recording Studio; The University of Liverpool: Liverpool, UK, 2015.
- Music Business Worldwide. Over 100,000 Tracks Are Now Being Uploaded to Streaming Services like Spotify Each Day. Available online: https://www.musicbusinessworldwide.com/its-happened-100000-tracks-are-now-being-uploaded/ (accessed on 13 April 2025).
- McGarry, G.; Tolmie, P.; Benford, S.; Greenhalgh, C.; Chamberlain, A. “They’re all going out to something weird” Workflow, Legacy and Metadata in the Music Production Process. In Proceedings of the ACM Conference on Computer Supported Cooperative Work and Social Computing, Portland, OR, USA, 25 February–1 March 2017; pp. 995–1008.
- U.S. Bureau of Labor Statistics. Occupational Employment and Wage Statistics: Sound Engineering Technicians. Available online: https://www.bls.gov/oes/current/oes274014.htm (accessed on 13 April 2025).
- Izhaki, R. Mixing Audio: Concepts, Practices, and Tools; Routledge: London, UK, 2017.
- Jillings, N. Automating the Production of the Balance Mix in Music Production; Birmingham City University: Birmingham, UK, 2023.
- Pras, A.; Guastavino, C.; Lavoie, M. The impact of technological advances on recording studio practices. J. Am. Soc. Inf. Sci. Technol. 2013, 64, 612–626.
- Vanka, S.; Safi, M.; Roll, J.-B.; Fazekas, G. Adoption of AI technology in the music mixing workflow: An investigation. arXiv 2023, arXiv:2304.03407.
- De Man, B.; Reiss, J.D.; Stables, R. Ten years of automatic mixing. In Proceedings of the 3rd Workshop on Intelligent Music Production, Salford, UK, 15 September 2017.
- De Man, B.; Reiss, J. A semantic approach to autonomous mixing. J. Art Rec. Prod. (JARP) 2013, 8, 1–23.
- Birtchnell, T. Listening without ears: Artificial intelligence in audio mastering. Big Data Soc. 2018, 5, 2053951718808553.
- Moroșanu, B.; Negru, M.; Paleologu, C. Automated Personalized Loudness Control for Multi-Track Recordings. Algorithms 2024, 17, 228.
- Harding, P. Top-Down Mixing—A 12-Step Mixing Program. In Mixing Music; Routledge: London, UK, 2016; pp. 82–96.
- Pestana, P.D.; Reiss, J.D. Intelligent audio production strategies informed by best practices. In Proceedings of the AES 53rd International Conference, London, UK, 27–29 January 2014.
- De Man, B.; Mora-Mcginity, M.; Fazekas, G.; Reiss, J.D. The open multitrack testbed. In Proceedings of the 2nd AES Workshop on Intelligent Music Production, London, UK, 13 September 2016.
- Moffat, D.; Sandler, M.B. Approaches in intelligent music production. Arts 2019, 8, 125.
- Humphrey, E.J.; Bello, J.P. From music audio to chord tablature: Teaching deep convolutional networks to play guitar. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014.
- Han, Y.; Lee, K. Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation. IEEE Signal Process. Lett. 2016, 23, 1649–1653.
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
- Choi, K.; Fazekas, G.; Sandler, M.; Cho, K. Transfer learning for music classification and regression tasks. In Proceedings of the 18th ISMIR Conference, Suzhou, China, 23–27 October 2017.
- Pons, J.; Nieto, O.; Prockup, M.; Schmidt, E.M.; Ehmann, A.F.; Serra, X. End-to-end learning for music audio tagging at scale. In Proceedings of the 19th ISMIR Conference, Paris, France, 23–27 September 2018.
- Ghosal, D.; Kolekar, M.H. Music genre recognition using deep neural networks and transfer learning. Interspeech 2018, 2087–2091.
- Tzanetakis, G.; Cook, P. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 2002, 10, 293–302.
- Guinot, J.; Quinton, E.; Fazekas, G. Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations. arXiv 2024, arXiv:2412.18955.
- Hasumi, T.; Komatsu, T.; Fujita, Y. Music Tagging with Classifier Group Chains. arXiv 2025, arXiv:2501.05050.
- Bogdanov, D.; Won, M.; Tovstogan, P.; Porter, A.; Serra, X. The MTG-Jamendo dataset for automatic music tagging. In Proceedings of the Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019.
- Katz, B. Mastering Audio: The Art and the Science, 3rd ed.; Focal Press: Oxford, UK, 2020; pp. 106–131.
- Ma, Z.; De Man, B.; Pestana, P.D.; Black, D.A.; Reiss, J.D. Intelligent multitrack dynamic range compression. J. Audio Eng. Soc. 2015, 63, 412–426.
- Hafezi, S.; Reiss, J.D. Autonomous multitrack equalization based on masking reduction. J. Audio Eng. Soc. 2019, 67, 96–107.
- Moffat, D.; Sandler, M.B. Machine learning multitrack gain mixing of drums. In Proceedings of the 147th Audio Engineering Society Convention, New York, NY, USA, 16–19 October 2019.
- De Man, B.; Reiss, J.D. A knowledge-engineered autonomous mixing system. In Proceedings of the 135th Audio Engineering Society Convention, New York, NY, USA, 17–20 October 2013.
- Tot, J. Multitrack Mixing: An Investigation into Music Mixing Practices. Master’s Thesis, University of York, York, UK, 2018. Available online: https://www.york.ac.uk (accessed on 13 April 2025).
- Stickland, M.; De Man, B.; Fazekas, G. A New Audio Mixing Paradigm: Creative Interaction in Real-Time Remote Collaboration. Creat. Ind. J. 2022, 15, 17–34.
- Bell, A. Beyond Skeuomorphism: The Evolution of DAW Interfaces. J. Art Rec. Prod. 2018, 13, 1–12.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Negru, M.; Moroşanu, B.; Neacşu, A.; Drăghicescu, D.; Negrescu, C. Automatic Audio Upmixing Based on Source Separation and Ambient Extraction Algorithms. In Proceedings of the 2023 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, Romania, 25–27 October 2023; pp. 12–17.
- Slizovskaia, O.; Kim, L.; Haro, G.; Gomez, E. End-to-end sound source separation conditioned on instrument labels. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 306–310.
- Baltrusaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
- Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 689–696.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
- Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259.
- Oramas, S.; Nieto, O.; Barbieri, F.; Serra, X. Multi-label music genre classification from audio, text, and images using deep features. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 23–30.
- ITU-R BS.1770-4; Algorithms to Measure Audio Programme Loudness and True-Peak Audio Level. ITU: Geneva, Switzerland, 2015.
- Ward, D.; Reiss, J.D.; Athwal, C. Multitrack Mixing Using a Model of Loudness and Partial Loudness. In Proceedings of the AES 133rd Convention, San Francisco, CA, USA, 26–29 October 2012.
- Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014.
Audio Classification (Primary) | Audio Classification (Secondary) | Multimodal Classification (Instrument)
---|---|---
Drums | Kick | Kick, KickOut, KickSample
Drums | Snare | Snare, SnareDown
Drums | Tom | Tom, FloorTom
Drums | OH | HiHat, Ride, OH, Room
Drums | Percussion | PercLow, PercHigh
Bass | Bass | ElectricBass, AccBass, SynthBass
Others | Guitars | AccGuitar, ElGuitar
Others | Keyboards | Keyboards, Piano, Electric Piano
Others | Orchestral | Woodwinds, Brass, Strings, Sfx
Vocals | Vocals | LeadVox, Backing
Model | Dataset | Input Type | Classes | Accuracy [%]
---|---|---|---|---
Audio Spectral Classification Models | | | |
ResNet-18 Primary | MUSDB18HQ [46] | STFT | 4 | 83.89
ResNet-18 Drums | Custom Drums | STFT | 5 | 95.67
ResNet-18 Others | Custom Other | STFT | 3 | 89.10
Multimodal Classification Model | | | |
Multimodal | Cambridge [15] | Text + STFT | 27 | 82.19
Song | Tracks | Sub Acc [%] | Main Acc [%] | Multi Acc [%] | Final Acc [%]
---|---|---|---|---|---
Song 1 | 10 | 60.00 | 100.00 | 90.00 | 90.00
Song 2 | 22 | 59.09 | 81.81 | 100.00 | 100.00
Song 3 | 34 | 47.05 | 85.29 | 61.76 | 70.58
Song 4 | 44 | 59.09 | 79.54 | 45.45 | 47.72
Song 5 | 60 | 61.66 | 78.33 | 78.33 | 68.33
Song | Tracks | StereoSplit [%] | FakeStereo [%]
---|---|---|---
Song 1 | 10 | 100.00 | 100.00
Song 2 | 22 | 66.66 | 100.00
Song 3 | 34 | 80.00 | 100.00
Song 4 | 44 | 66.66 | 100.00
Song 5 | 60 | 62.00 | 100.00
Level | Task (times in mm:ss) | Song 1 | Song 2 | Song 3 | Song 4 | Song 5
---|---|---|---|---|---|---
Pro | Rename | 2:00 | 4:00 | 7:00 | 10:00 | 11:00
Pro | Fake Mono | 1:00 | 1:00 | 2:00 | 3:00 | 3:00
Pro | Stereo Join | 2:00 | 2:00 | 2:00 | 6:00 | 6:00
Pro | Gain | 5:00 | 9:00 | 14:00 | 18:00 | 17:00
Pro | Total | 9:00 | 15:00 | 24:00 | 36:00 | 37:00
Amateur | Rename | 3:00 | 7:00 | 11:00 | 12:00 | 16:00
Amateur | Fake Mono | 1:00 | 2:00 | 4:00 | 4:00 | 4:00
Amateur | Stereo Join | 3:00 | 4:00 | 6:00 | 6:00 | 7:00
Amateur | Gain | 8:00 | 10:00 | 19:00 | 24:00 | 24:00
Amateur | Total | 15:00 | 22:00 | 39:00 | 45:00 | 50:00
Auto | Rename | 0:20 | 0:06 | 2:47 | 5:58 | 4:33
Auto | Fake Mono | 0:10 | 0:22 | 0:34 | 0:44 | 1:00
Auto | Stereo Join | 1:00 | 3:12 | 4:14 | 6:24 | 8:28
Auto | Gain | 0:26 | 0:40 | 0:52 | 1:22 | 2:47
Auto | Total | 1:41 | 4:20 | 8:27 | 14:28 | 16:49
Time Reduction | | 84.53% | 77.78% | 74.01% | 65.14% | 61.79%