1. Introduction
Percussion instruments are musical instruments that produce sound when struck by a percussion mallet, beater, or hand, or when scraped, rubbed, or struck against another similar instrument [1]. The percussion section of an orchestra usually consists of instruments such as the timpani, snare drum, bass drum, tambourine (membranophones), cymbals, and triangle (idiophones). Beyond the orchestral setting, percussion instruments play a crucial role in various musical genres, from the pulsating beats of jazz and rock to the complex rhythms of world music, and the percussive family also includes a vast array of exotic and culturally specific instruments like djembes, congas, tabla, and marimbas. Although they are the most widely played instruments, they have not been as extensively studied as wind or string instruments [2,3].
For the study of percussion musical instruments, vibroacoustic analysis is of major importance [4,5,6]. The literature contains research studies on the vibroacoustic behavior of percussion instruments. For example, Skrodzka et al. [7] performed a modal analysis of a batter head of a snare drum and measurements of the instrument sound spectrum. Modal analysis and acoustic radiation measurements of the kettledrum were performed by Tronchin [8]. Sunohara et al. [9] experimentally measured the sound spectrum, the vibration of the body, the drumstick driving force, the directivity, and the sound intensity vector of Japanese wooden drums (mokugyo). More recently, detailed numerical simulations were performed to study the vibroacoustic behavior of crash and splash cymbals [10,11], while nonlinear sound syntheses of cymbals were studied in [12,13]. Moreover, ongoing research focuses on robotic-based experimental measurements [14,15,16,17,18].
The existing research on the automated generation of databases (DBs) of percussion sounds and on their classification and recognition via machine learning (ML) is limited. Recently, Boratto et al. [19,20,21] recorded 276 audio samples corresponding to four drum cymbals made of three different bronze alloys; the recording procedure included environmental and microphone variations. Chhabra et al. [22] implemented drum instrument classification using machine learning, and Li et al. [23] performed audio recognition of Chinese traditional instruments, including percussion instruments, based on machine learning.
In this research work, we introduce an integrated method for the development of large percussion sound DBs, generated under known spatiotemporal impact force conditions, which are mapped and classified by audio recognition via machine learning. A novel 3D Auto-Drum Machine (3D-ADM) system capable of generating and collecting cymbal impact drum sounds is developed and presented. The capabilities of the 3D-ADM, along with the initialization, calibration, sound production, and recording protocol required for a reliable and repeatable measurement system, are demonstrated. The 3D-ADM excites the percussion instruments with a wooden drumstick sequentially at various points along a radial path, at programmable spatial intervals and with known impact forces. The audio signal data produced at each excitation point of the vibrating object under study are recorded, together with metadata detailing the spatial coordinates of the excitation point and the impact force value. The data collected during the calibration and initialization of the 3D-ADM are fed into a transformer-based audio neural network (ML model) that has been pretrained on speech signals. The internal representations of the ML model are visualized in 2D by a dimensionality reduction technique, which verifies the assumption that the collected data are separable (almost linearly) by material and geometry. Additionally, based on the visualization, it is expected that when enough data become available through the described process, models that perform geometry and material classification, as well as excitation-point estimation through regression, can be trained.
On one hand, the ML-based analysis offers confirmation that the collection process is robust and consistent; on the other hand, it acts as a proof of principle, indicating that further development of large databases of 3D-ADM data will allow for the implementation of innovative research pathways relating to vibroacoustic characteristics, playing (excitation) positions, and generated sounds.
The generated and recorded percussion sounds are accompanied by the spatial excitation coordinates and the corresponding impact forces, allowing for the development of the large and detailed DBs required for machine learning models. Thus, a high repetition rate for data production may be achieved, and the data may be gathered in the future within sound DBs capable of training ML models. These DBs may serve as a reference for investigating alternative and cost-effective materials and geometries with relevant sound characteristics [24]. Using an ML classification algorithm, it becomes possible to determine how much an alternative or cheaper material and/or manufacturing process may alter the resulting sound. In future studies, this could be complemented by hearing tests. Moreover, beyond percussion instruments, this approach could be applied to other musical instruments that also involve impulse testing [2,3,25]. Furthermore, the generated sound DBs will constitute a foundation for properly training ML models, aiding the development of accurate vibroacoustic numerical models and the exploration of percussion instrument sound synthesis [10,11,12,13].
The contributions of this paper can be summarized as follows: (i) a system that automates the collection of percussive audio data, including the excitation force and spatial position, is presented; (ii) a qualitative analysis based on a pretrained ML model is developed, which further evaluates the consistency of the data collection process; and (iii) the presented results demonstrate the feasibility of developing large sound databases. Such databases can properly train ML models for identifying alternative and cost-effective materials and geometries with similar sound characteristics, and for developing accurate vibroacoustic numerical models used in percussion instrument sound synthesis.
Two 8-inch splash cymbals, a classic medium-thin and a bell-shaped model, made of B8 bronze and MS63 alloy, respectively, along with an 8-inch circular aluminum sheet used as a reference, are chosen for this research study. In Section 2, the methodology is presented: the experimental setup, the 3D-ADM system initialization and calibration, and the ML model are described in detail. In Section 3, the results of the data capturing, as well as the results of the exploratory analysis using ML, are presented and discussed. The conclusions are drawn in Section 4.
2. Methodology
2.1. Experimental Setup
The 3D-ADM system developed for the generation and recording of the impact sounds is presented in Figure 1a. The 3D-ADM system is developed to excite the object under study with a known impact force. The circular aluminum sheet under study in Figure 1 is used as a reference and has an 8-inch diameter and a 3 mm thickness. As shown in Figure 1c, the excitation mechanism can excite (strike) the sheet at any point along the radial path over the X-axis. The generated vibration can be detected and recorded by the 3D-ADM system, which is fully automated and Computer Numerical Controlled (CNC). The process is repeatable and accurate, since the excitation conditions remain constant until the predefined and programmed number of points has been excited. The corresponding CAD model, shown in Figure 1b, presents the 3D-ADM in detail. A permanent metal stud is welded at the center of the front edge of the stage on the X-axis to host and hold the object under study. The reference metal sheet, with a radius R1, is held at height H1 by bolts, whose fastening torque is measured with a gauge and kept the same for all the samples studied. The stepper motor SM1-X translates the stage along the X-axis, and the stepper motor SM2-Z translates the excitation mechanism along the Z-axis.
The excitation mechanism is rigidly and permanently attached to the center of the horizontal beam of the 3D-ADM, as shown in Figure 1c. Any type of drumstick can be used for the excitation. Herein, a modified drumstick (STK) incorporating an impact hammer (IH, Model 086E80, PCB, Depew, NY, USA) on one side of its tip is used. The STK is held by a double ring holder (HLD), which allows for movement along a circular path via a central shaft and two pillow block bearings. The movement of the STK is induced by an electromagnetic piston (EP), driven by a circuit control system, denoted as the current controller (CCR) in Figure 1a, and produces the applied impact force. The STK is restored to its equilibrium position by a spring (SP), which is attached to the STK by a ring holder. The G-code synchronizes the spatial motion of the CNC 3D-ADM on the XZ plane with the motion of the STK.
The emitted sound is recorded by a measurement microphone (MIC, Model MiniSPL, NTI, Schaan, Liechtenstein), which is connected to an audio interface (AI, Model QUAD Capture, Roland, Los Angeles, CA, USA). Additionally, a miniature accelerometer (ACC, Model TLD352A56, PCB, Depew, NY, USA) can be used for the detection of the vibration. Thus, the system allows for the simultaneous recording of two signals, one by ACC (converted to wav format), and one by MIC, providing additional capabilities for vibroacoustic measurements to the 3D-ADM system.
The measurements are performed in the recording studio of the Department of Music Technology and Acoustics of the Hellenic Mediterranean University to eliminate the influence of background noise and reverberation on the recorded signals. The emitted sounds are recorded within sequential time windows, from excitation point to excitation point, so any noise produced by the machine and the motors is avoided. The development of a large training sound database will follow, implementing directional recording studio microphones.
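For illustration, the per-point windowing described above can be sketched as follows; this is not the authors' code, and the sample rate, window length, and dummy silent signal are assumptions standing in for the real recordings.

```python
# Illustrative sketch: slice a continuous studio recording into one window
# per excitation point, given the 10 s spacing between strikes programmed
# into the G-code (Section 2.2). FS, WIN_S, and the dummy signal are assumed.

FS = 16_000        # sample rate in Hz (assumed)
TD_S = 10          # delay between successive excitations (Section 2.2)
WIN_S = 3          # keep ~3 s per strike; the ring-down lasts ~2 s

def split_recording(samples, n_points, fs=FS, td=TD_S, win=WIN_S):
    """Return one win-second clip starting at each excitation instant."""
    clips = []
    for i in range(n_points):
        start = i * td * fs
        clips.append(samples[start:start + win * fs])
    return clips

audio = [0.0] * (16 * TD_S * FS)   # dummy recording covering 16 strikes
clips = split_recording(audio, 16)
```

Because each clip is cut well inside its own 10 s slot, motor and carriage noise between strikes never enters the stored windows.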
2.2. System Initialization & Calibration
The initialization and calibration of the 3D-ADM system are performed using the circular 8-inch Al sheet set at a height H1, as presented in Figure 1b. The modified drumstick with the hammer tip is used for the spatial zero-set of the X- and Z-axes, with reference to the excitation point ExP-1, the first ExP to be measured. The angle θ is set to 0° by adjusting the spring accordingly, thus securing a perpendicular hit of the hammer tip on the flat metal sheet.
Figure 2a shows the coordinate system to be used for the G-code, with its origin on ExP-1, and the subsequent equidistant points along the radial path of the metal sheet, to be excited from X = 0 to R1; the Z coordinates are set to zero since the object under study is flat. The coordinates of the 16 ExPs are imported into the G-code, setting a sequence of 16 equidistant points along the radial path of the metal sheet to be struck. The total number of excitation points is defined by the user without any limitation, requiring only the corresponding changes to the G-code. The spring is adjusted to its initial position to set the angle θ between the axis of the drumstick and the plane of the Al sheet to θ = 3°. The microphone is fixed at a distance of 20 cm, pointed at the center of the circular Al sheet under study, and the accelerometer is attached 30 mm from the center of the circular sheet, without affecting its vibroacoustic behavior.
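The layout of the equidistant ExPs and their translation into machine moves can be sketched as below; the radius value, the units, and the G-code dialect are assumptions for illustration, not taken from the 3D-ADM firmware.

```python
# Hypothetical sketch: generate the 16 equidistant excitation-point (ExP)
# coordinates along the radial path and emit simple G-code moves.
# R1_MM and the exact G-code commands are assumptions.

R1_MM = 101.6      # 8-inch sheet -> radius ~101.6 mm (assumed units)
N_POINTS = 16      # user-defined number of ExPs
TD_S = 10          # relaxation delay between excitations (from the text)

def exp_coordinates(radius_mm, n_points):
    """Equidistant X coordinates from ExP-1 (X = 0) to the rim (X = R1).
    Z is zero everywhere because the reference sheet is flat."""
    step = radius_mm / (n_points - 1)
    return [(round(i * step, 2), 0.0) for i in range(n_points)]

def to_gcode(points, dwell_s):
    """One move plus one dwell per ExP (illustrative dialect)."""
    lines = ["G21 ; millimetres", "G90 ; absolute positioning"]
    for x, z in points:
        lines.append(f"G0 X{x} Z{z}")
        lines.append(f"G4 P{dwell_s} ; wait for vibration relaxation")
    return "\n".join(lines)

points = exp_coordinates(R1_MM, N_POINTS)
program = to_gcode(points, TD_S)
```

Changing `N_POINTS` regenerates the whole sequence, matching the statement that the number of ExPs is user-defined and only the G-code changes.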
The calibration of the 3D-ADM system is based on the CCR, which is equipped with a potentiometer that enables the control of the impact force via the variation of the current intensity that drives the EP. Given the initialization parameters, the angle θ is set to 3° with the help of the spring, and the CCR is set to deliver an impact force of 40 N, which is validated by the IH sensor. To automate the measuring and recording procedure, the time delay (td) for vibration relaxation before the excitation of the next ExP is determined. A trial excitation is performed with a nominal force of 40 N, and the accelerometer shows the duration of the vibrational signal to be ~2 s. Therefore, a td of 10 s is introduced in the G-code to synchronize the EP movement with the CCR between the excitations of successive ExPs, and to provide sufficient relaxation time for the vibrating Al sheet.
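A minimal sketch of how the relaxation time could be read off an accelerometer trace is given below; the synthetic decaying tone, the sample rate, and the 1% threshold are assumptions standing in for the trial recording.

```python
# Illustrative sketch (not the authors' code): estimate the ring-down time
# from an accelerometer trace by finding where the short-window peak
# envelope falls below 1% of the overall peak. A synthetic exponentially
# decaying 440 Hz tone stands in for the real trial signal.

import math

FS = 16_000                      # sample rate (assumed)
TAU = 0.4                        # decay constant chosen so ring-down ~2 s

signal = [math.exp(-t / (TAU * FS)) * math.sin(2 * math.pi * 440 * t / FS)
          for t in range(4 * FS)]

def ringdown_seconds(x, fs, threshold=0.01, win=256):
    """Last time the windowed peak envelope exceeds threshold * max."""
    peak = max(abs(s) for s in x)
    last = 0
    for start in range(0, len(x) - win, win):
        env = max(abs(s) for s in x[start:start + win])
        if env > threshold * peak:
            last = start + win
    return last / fs

t_relax = ringdown_seconds(signal, FS)   # close to ~2 s for this decay
```

A td several times larger than the measured ring-down (10 s vs. ~2 s here) then guarantees a quiescent sheet before every strike.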
After initialization and calibration, the efficiency of the 3D-ADM system in terms of repeatability and accuracy is determined. The G-code developed for the sequential excitation and the recording of the 16 ExPs is executed seven more times. All the signals of the accelerometer, the force measured by the hammer, and the sound captured by the microphone are recorded and stored.
Figure 2b shows the mean force value for each excitation point, which reveals a maximum deviation of 1.3 N from the 40 N impact force, at ExP-16. The deviation of the mean value is expressed as a percentage in Figure 3. The mean absolute deviation percentage over all 16 ExPs is 1.53%. The results reveal that the smallest deviations correspond to the points from ExP-5 to ExP-12. Such force deviations result only in small changes in the sound energy recorded by the microphone and may thus be neglected. The same process was also repeated for impact forces of 20 N, 30 N, and 50 N, resulting in similar deviation values and confirming that the measurement accuracy is independent of the excitation parameters. It should be noted that the contact duration between the drumstick and the metal sheet was measured to be ~15 ms in all cases.
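The repeatability statistics above reduce to a per-point mean and an absolute percentage deviation from the 40 N target; a minimal sketch follows, with a dummy force matrix in place of the measured values.

```python
# Minimal sketch of the repeatability statistics: mean impact force per ExP
# over the repeated runs and the absolute deviation from the 40 N target as
# a percentage. The force matrix is illustrative dummy data near 40 N, not
# the measured values from the paper.

TARGET_N = 40.0

# rows = 8 runs, columns = 16 ExPs (dummy values near 40 N)
runs = [[TARGET_N + 0.05 * ((r + c) % 5 - 2) for c in range(16)]
        for r in range(8)]

def mean_per_point(data):
    n_runs = len(data)
    return [sum(run[c] for run in data) / n_runs for c in range(len(data[0]))]

def abs_deviation_percent(means, target):
    return [abs(m - target) / target * 100.0 for m in means]

means = mean_per_point(runs)
devs = abs_deviation_percent(means, TARGET_N)
overall = sum(devs) / len(devs)   # analogous to the reported 1.53% figure
```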
The described protocol, with the help of the modified drumstick, secures the measurement process and its accuracy. The drumstick can be rotated by 180° about its axis within the HLD without any other system alteration, ensuring the same predefined impact force on the same ExPs. The proposed measuring methodology can be directly applied to any type of sample using impact excitations under different force values.
2.3. Machine Learning Model
The data collected by the 3D-ADM system can form a dataset large enough to study many relations between the sound and the physical attributes of objects (e.g., geometry, material, force, and position of impact, among others). To verify the efficiency and usability of such a database, the consistency of this newly developed data collection process must be investigated (i.e., whether pairs of the same intended ExPs and forces produce similar waveforms). Additionally, a proof-of-concept overview of the prediction capabilities on the currently small dataset, collected from two splash cymbals and the Al sheet of different materials and geometries, must be explored.
An ML model pretrained on speech data through self-supervised learning is employed for the exploratory analysis. Such models are pretrained on large datasets of speech audio, and they can be readily employed, with or even without fine-tuning, on music-related tasks [26]. The specific model used in this study is DistilHuBERT [27], a lightweight version of the HuBERT model [28], which in turn is a “discretized” version of the wav2vec 2.0 model [29]. The model takes an audio waveform with a 16 kHz sample rate as input and converts it into a contextualized representation of 20 ms frames. This representation encodes each 20 ms-long frame of audio into a 768-dimensional vector that captures the context of the entire frame sequence through the transformer attention mechanism [30]; vectors that belong to the same context are more similar. Since we are not fine-tuning the system, we use the “frozen” version as given in [27], where the details of the system hyper-parameters can be found.
The motivation behind using DistilHuBERT, rather than another network trained with self-supervision (e.g., MERT [31]), is two-fold. First, we wanted a reliable system that has been tested on several tasks not limited to music. Second, we wanted to test our results on a system built to identify speech (not music), since speech models are sensitive to noise-related spectra, as they must identify fine details in the spectra of fricatives. Although other models should be tested in the future, in the context of this work it suffices to determine the prospects of such pretrained models, which are robust and “noise-sensitive” enough to process the collected acoustic data.
To account for different waveform lengths, average pooling is applied, i.e., all 768-dimensional vectors in the sequence are averaged element-wise, leading to a single 768-dimensional representation for each audio file. The process of extracting the 768-dimensional representation of a recorded waveform is depicted in Figure 3. The model is not fine-tuned to any downstream task; rather, it is used in its readily available pretrained state.
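The average-pooling step can be sketched as below, assuming the frame-level DistilHuBERT outputs are already available as a (T, 768) array per waveform; the random arrays merely stand in for real model outputs.

```python
# Sketch of the average-pooling step: collapse a variable-length sequence
# of 20 ms frame embeddings into one clip-level vector per audio file.
# The random (T, 768) arrays below are placeholders, not DistilHuBERT output.

import numpy as np

HIDDEN = 768

def average_pool(frames):
    """Mean over time of a (T, HIDDEN) frame-embedding sequence."""
    frames = np.asarray(frames)
    assert frames.ndim == 2 and frames.shape[1] == HIDDEN
    return frames.mean(axis=0)

rng = np.random.default_rng(0)
clips = [rng.normal(size=(t, HIDDEN)) for t in (98, 152, 121)]  # unequal T
pooled = np.stack([average_pool(c) for c in clips])             # (3, 768)
```

Pooling discards the ordering of frames but yields fixed-size vectors, which is what the dimensionality reduction and any downstream classifier require.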
4. Conclusions
A novel automated process for the generation, collection, classification, and recognition of audio files corresponding to percussion sounds is presented. The sounds are produced by known impact force excitations from the developed 3D-ADM, free from human interference. The proposed excitation and measurement system includes a microphone and a miniature accelerometer, used to record the sound and the vibration resulting from the excitation. The machine is CNC programmable, allowing for variations of the excitation points and the impact force. The repeatability and the efficiency of the measurement procedure are explored and validated.
After initialization and calibration, the 3D-ADM system is used to excite two cymbals and a flat circular plate, which differ in material and geometry. The recorded sounds are processed by an ML model pretrained on speech signals. Visualizations of the internal representations of the ML model validate the consistency of the process followed for data measurement and collection. The presented results demonstrate the capability of generating large sound databases and provide pointers for future ML-based modeling work relating materials, geometries, playing positions, and generated sounds. Such work might include fine-tuning ML models to classify the material, geometry, force, and position of impact either separately (i.e., giving a probability for each attribute independently), conditionally (e.g., given a material and force/position of impact, estimating the geometry), or jointly (i.e., giving a single probability for a combination of attributes).