*2.2. Challenge-Oriented Acoustic Datasets*

Over the last decade, the CASA research community has produced publicly available datasets and standard metrics to evaluate progress in the field. One of the seminal international efforts to evaluate systems modelling the perception of people, their activities, and their interactions was the CLassification of Events, Activities and Relationships (CLEAR) competition, whose 2006 and 2007 editions included specific data for acoustic event detection and classification, mainly collected from indoor environments, specifically meeting rooms (see e.g., [41–43]). Although other attempts emerged to provide evaluation material for CASA-focused systems, such as the DARESounds.org initiative [44] or the TRECVID Multimedia Event Detection competition, focused on audiovisual and multi-modal event detection (https://www.nist.gov/itl/iad/mig/multimedia-eventdetection), neither these nor CLEAR led to the establishment of a reference evaluation challenge for the CASA research community [20].

Later, the IEEE AASP supported a new competition named Detection and Classification of Acoustic Scenes and Events (DCASE), which started in 2013 at the WASPAA conference (http://www.waspaa.com/waspaa13/d-case-challenge/index.html). The database provided for that challenge contained both live and synthetic recordings [45], and the results of the competition can be found in [27]. Since 2016, DCASE has become an annual competition comprising different tasks that cover acoustic scene classification, sound event detection in synthetic and real-life audio, domestic audio tagging, and the detection of bird or rare sound events, to name a few (see [22,46,47] for further details) (http://dcase.community/events), mainly thanks to the contribution of several researchers who have made their datasets available for public evaluation.

Among these datasets, it is worth mentioning the "TUT Database for Acoustic Scene Classification and Sound Event Detection", built from real-life recordings [26]. This database allows the evaluation of automatic sound event or acoustic scene detection systems in 15 different real-life acoustic environments: lakeside beach, bus, cafe/restaurant, car, city center, forest path, grocery store, home, library, metro station, office, urban park, residential area, train, and tram. The sound events subset, which covers both indoor and outdoor environments, is mainly focused on surveillance and human activity monitoring. Finally, the recent announcement of the DCASE2019 competition introduced an Urban Sound Tagging (UST) challenge. The goal of this task is to predict whether each of 23 sources of noise pollution is present or absent in a 10-s scene. The audio in the challenge dataset was acquired with the acoustic sensor network of the SONYC (SOunds of New York City) project [4], and it is annotated with a simplified two-level taxonomy of city sounds comprising 8 coarse categories and 23 fine labels. The dataset includes 2351 recordings in the training split and 443 in the validation split, for a total of 2794 10-s audio clips. The full taxonomy and details of the SONYC dataset can be found in [33].
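As a minimal sketch of how such a two-level taxonomy and a multi-label presence/absence target could be represented, consider the following Python fragment. The category names and the `to_multilabel_vector` helper are hypothetical illustrations, not the actual SONYC taxonomy or official challenge code; the real taxonomy has 8 coarse categories and 23 fine labels.

```python
# Illustrative sketch (not the official SONYC/DCASE code): a two-level
# taxonomy mapping coarse categories to fine labels, and a binary
# presence/absence vector for one 10-s clip. Names are hypothetical.

taxonomy = {
    "engine": ["small-engine", "medium-engine", "large-engine"],
    "alert-signal": ["car-horn", "siren", "reverse-beeper"],
    # ... remaining coarse categories (8 coarse / 23 fine in the real taxonomy)
}

# Flatten the taxonomy into an ordered list of fine labels.
fine_labels = [fine for coarse in taxonomy.values() for fine in coarse]

def to_multilabel_vector(present, labels):
    """Return a 0/1 vector marking which fine labels occur in the clip."""
    return [1 if lab in present else 0 for lab in labels]

# Sounds annotated as present in one hypothetical 10-s recording.
clip_annotations = {"car-horn", "siren"}
vec = to_multilabel_vector(clip_annotations, fine_labels)
print(vec)  # one binary entry per fine label
```

A tagging system for this task would then output one such binary decision (or a score) per fine label for every 10-s clip, rather than a single class as in scene classification.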
