*2.1. Environmental Acoustic Databases*

The environmental acoustic databases employed by the machine-hearing research community have generally been created either directly from live recordings or synthetically, by artificially mixing isolated sound events with a given acoustic environment (i.e., background noise). The latter allows control of the SNR of the mixture and mitigates the data scarcity of specific audio events in real-life contexts, which is one of the key problems when trying to gather representative data from live environments [25]. The former, in contrast, entails a considerable effort for data collection and subsequent manual annotation to generate the labeled database or ground truth [23,26,27].
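To illustrate the synthetic approach, the core operation is to scale an isolated event so that its power sits at a target SNR above the background before summing the two signals. The following minimal Python sketch assumes mono float signals at a common sample rate; the helper name `mix_at_snr` is ours, for illustration only:

```python
import numpy as np

def mix_at_snr(event, background, snr_db):
    """Mix an isolated sound event into a background recording at a target SNR.

    Both inputs are 1-D float arrays at the same sample rate; the event is
    scaled so that its mean power is `snr_db` decibels above the background.
    """
    p_event = np.mean(event ** 2)
    p_background = np.mean(background ** 2)
    # Solve gain^2 * p_event / p_background = 10^(snr_db / 10) for the gain.
    gain = np.sqrt((p_background / p_event) * 10.0 ** (snr_db / 10.0))
    n = min(len(event), len(background))
    return background[:n] + gain * event[:n]
```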

Regarding real-life environmental acoustic databases, in [28] a 1133-min audio database covering 10 different acoustic environments, both indoor and outdoor, was introduced. The MIVIA audio events dataset, in turn, was designed for surveillance applications focused on the identification of glass breaking, gunshots, and screams (https://mivia.unisa.it/datasets/audio-analysis/miviaaudio-events/) [21]; its training set comprises about 20 h of audio, and its test set about 9 h. Moreover, the same research laboratory developed a smaller dataset of about 1 h, also for surveillance purposes, focused on road audio events such as tire skidding and car crashes (https://mivia.unisa.it/datasets/audio-analysis/mivia-road-audio-events-data-set/) [29].

In [23], a real-life acoustic database collected from the urban and suburban environments of the pilot areas of the Life DYNAMAP project [13] is described. The database, comprising 9 h and 8 min of audio, was obtained through an in situ recording campaign, and was developed to discriminate road-traffic noise from ANEs by means of the ANED [18]. The ANEs, which represented only 7.5% of the annotated data, were subsequently classified into 19 different subcategories after manual inspection, with SNR levels with respect to the background noise ranging from −10 dB to +15 dB and a high heterogeneity of intermediate values. When comparing both environments, the number of ANEs in the urban area is approximately four times higher than in the suburban area, which also includes events with larger acoustic salience. Nevertheless, it is worth mentioning that the recordings in the urban area were conducted at street level at the preselected locations [30] within District 9 of Milan, while the recordings in the suburban area were conducted on the A90 ring-road portals surrounding Rome (see [31] for further details).
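For clarity, the SNR reported for an annotated event with respect to the background is essentially a ratio of mean powers expressed in dB. A minimal sketch of such an estimate follows; the function name and the simplifying assumption that everything outside the annotated segment is background are ours:

```python
import numpy as np

def event_snr_db(signal, event_start, event_end, sr):
    """Estimate the SNR (in dB) of an event relative to the surrounding audio.

    `event_start`/`event_end` are annotated boundaries in seconds; background
    power is taken from the rest of the excerpt (a simplifying assumption).
    """
    s, e = int(event_start * sr), int(event_end * sr)
    event = signal[s:e]
    background = np.concatenate([signal[:s], signal[e:]])
    p_event = np.mean(event ** 2)
    p_background = np.mean(background ** 2)
    return 10.0 * np.log10(p_event / p_background)
```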

One of the main sources employed to build acoustic databases is the well-known Freesound online repository (https://www.freesound.org). For instance, *freefield1010* is a database composed of 7690 audio clips tagged as "field recording" in the metadata of the original recordings uploaded to this online repository, totaling over 21 h of audio [32]. In [33], 60 h of real field recordings from urban environments uploaded to Freesound were used to build the UrbanSound database. The database is composed of 27 h of audio, including 18.5 h of verified and annotated sound event occurrences classified into 10 sound categories (i.e., air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music). Moreover, the authors also provide an 8.75 h subset, denoted as UrbanSound8k, designed to train sound classification algorithms and obtained after arbitrarily capping the number of items at 1000 slices per class (see the sketch after this paragraph). In [34], sound sources from Freesound mixed with real-life traffic noise were considered to train and evaluate an AED algorithm (considering two SNR levels: +6 dB and +12 dB). Finally, "ESC: Dataset for Environmental Sound Classification" [35] is composed of three subsets: *(i)* ESC-50, a strictly balanced set of 50 classes of various environmental sounds obtained through manual annotation, *(ii)* ESC-10, a reduced 10-class subset of the former intended as a proof-of-concept dataset, and *(iii)* ESC-US, which contains 250,000 recordings directly extracted from the "field recording"-tagged category of Freesound. Nevertheless, due to the uncontrolled origin of the sound sources uploaded to Freesound and similar online repositories (see [23] for further examples), which involves a wide variation in recording conditions and quality, the derived environmental acoustic databases may be unsuitable for the reliable evaluation of CASA- and AED-based systems [27,32].

In [27], the described acoustic dataset covers both indoor and outdoor environments, including real-life recordings of predefined acoustic event sequences as well as individual acoustic events synthetically mixed with background recordings at specific SNR levels, namely −6 dB, 0 dB, and +6 dB. In [25], the authors developed a mixed acoustic database composed of acoustic data from real-life recordings, which was subsequently extended with synthetic mixtures of extra events of interest to increase database diversity. In [21], an acoustic database for surveillance purposes that includes sound events such as screams, glass breaking, and gunshots was also artificially generated from indoor and outdoor environments considering different SNR levels (from +5 dB to +30 dB). Moreover, a small environmental acoustic database containing 20 scenes that mix background noise with car, bird, and car horn samples synthesized with the SimScene software [36] was described in [37]. Following a similar approach, in [38] the TUT Sound Events Synthetic 2016 dataset (TUT-SED-2016 for short) was introduced. This 566-min dataset is composed of synthetic mixtures created by combining isolated sound events from 16 sound event classes of the original TUT database (the reader is referred to http://www.cs.tut.fi/sgn/arg/taslp2017-crnn-sed/tut-sed-synthetic-2016 for a detailed explanation).
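Producing such multi-SNR evaluation material reduces to looping over the chosen levels. Reusing the `mix_at_snr` helper sketched earlier, and with `event` and `background` as hypothetical input arrays, one mixture per target SNR could be generated as follows:

```python
# `event` and `background` are 1-D float arrays at a common sample rate
# (hypothetical inputs); one mixture is built per target SNR, mirroring
# the -6/0/+6 dB setup described above.
snr_levels_db = [-6.0, 0.0, 6.0]
mixtures = {snr: mix_at_snr(event, background, snr) for snr in snr_levels_db}
```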

Finally, it is worth mentioning the recent development of Scaper, an open-source library for the synthesis of soundscapes mainly focused on SED-related applications [39]. This library provides an audio sequencer that generates synthetic soundscapes from isolated sound events following a probabilistic approach. The proposal allows control of the characteristics of the sound mixtures, such as the number of events and their type, timing, duration, and SNR with respect to the background noise (a usage sketch is given below). The authors validate their proposal through the development of the URBAN-SED database from UrbanSound8k as an example of this kind of data augmentation. The synthetic generation of acoustic databases is a potential solution to data scarcity when training Deep Neural Networks (DNNs) for CASA-related problems (e.g., see [38,40]). Nevertheless, although the artificial generation of sound mixtures allows the creation of controlled training and evaluation environments, it has also been stated that such artificially generated databases may not represent the variability encountered in real-life environments accurately enough [18,26].
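To make Scaper's controls concrete, the sketch below generates a single 10 s soundscape. The folder paths are placeholders, the parameter values are illustrative, and the `(distribution, args)` tuples follow Scaper's documented convention; exact signatures may vary across library versions:

```python
import scaper

# Build a 10 s soundscape from isolated events plus a background track.
# Folder paths are placeholders; each subfolder name acts as an event label.
sc = scaper.Scaper(duration=10.0,
                   fg_path="foreground/",
                   bg_path="background/")
sc.ref_db = -20  # reference loudness for the background

# Background: pick any available file, starting from its beginning.
sc.add_background(label=("choose", []),
                  source_file=("choose", []),
                  source_time=("const", 0))

# One event with randomized type, timing, duration, and SNR re background.
sc.add_event(label=("choose", []),
             source_file=("choose", []),
             source_time=("const", 0),
             event_time=("uniform", 0, 6),
             event_duration=("truncnorm", 2.0, 0.5, 0.5, 4.0),
             snr=("uniform", -6, 12),
             pitch_shift=None,
             time_stretch=None)

# Render the audio and its JAMS annotation (ground truth) to disk.
sc.generate("soundscape.wav", "soundscape.jams")
```

Sampling every event attribute from a distribution is what makes this approach suited to data augmentation: repeated calls to `generate` yield arbitrarily many distinct, fully annotated soundscapes from the same source material.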
