*3.2. Field Observation of Soundscapes of Bird Vocalizations*

Figure 7 shows the song distributions in polar coordinates of azimuth (angle) and elevation (radius) for each time slot (A–E) in Figure 5. Each plot represents the direction of a localized bird song (annotated with species names). We mainly observed songs of the Blue-and-white Flycatcher (*Cyanoptila cyanomelana*), Varied Tit (*Sittiparus varius*) and Coal Tit (*Periparus ater*). Focusing on the Blue-and-white Flycatcher, this individual tended to sing at much higher positions than the other species, repeatedly moving to other high positions and singing a few times during B and C before returning in D to the starting position of A. This reflects the fact that this species tends to sing in tall trees along streams within its territory.

In contrast, the songs of the other two species tended to be localized at lower elevation angles, suggesting that they tend to sing at lower positions around the microphone. We also see that the localized positions formed multiple clusters, indicating that these individuals moved slightly between songs. Thus, we could quantitatively observe the spatial structure of the soundscape, in which one species tended to occupy the high elevation range while the other species occupied the lower range.

Table 1 shows several indices of the localized sounds in each time slot. We observe several changes in the soundscape structure of bird songs. The number of localized songs gradually increased over the time slots, indicating that actively singing individuals (i.e., Varied Tits) entered this acoustic scene. The elevation angle variation became smallest at C, indicating that the Blue-and-white Flycatcher was probably in a relatively distant tree, considering that this species tends to sing at the top of a tree. The high values of azimuth variation in B and C reflect the fact that the Blue-and-white Flycatcher moved during this period and that the other species sang in the opposite direction. Thus, we can grasp the dynamics of the soundscape structures around the microphone array by tracking the changes in these types of indices.
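Variation indices for azimuth must respect its circular nature (355° and 5° are nearby directions). One standard choice is the circular variance; the sketch below is illustrative only (the function name is ours, and Table 1 may use a different statistic):

```python
import math

def circular_variance(angles_deg):
    """Circular variance in [0, 1]: near 0 when all songs come from one
    direction, near 1 when directions spread evenly around the array."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return 1.0 - math.hypot(c, s)   # 1 - mean resultant length

# Tightly clustered azimuths -> variance near 0
print(round(circular_variance([10, 12, 8, 11]), 3))
# Azimuths wrapping around 0/360 degrees are still recognized as clustered
print(round(circular_variance([355, 5, 0, 350]), 3))
```

Unlike an ordinary standard deviation, this statistic does not inflate when a singer happens to sit near the 0°/360° wrap-around.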

**Figure 7.** Song distributions in polar coordinates of azimuth (angle) and elevation (radius) for each time slot (A–E) in Figure 5.



#### **4. Discussion and Conclusions**

We discussed the applicability of robot audition techniques to understanding the soundscape dynamics of bird vocalizations in forests, focusing on the elevation information of localized bird songs in addition to azimuth information. A speaker test for DOA estimation of replayed vocalizations showed that the observed DOA of a distant target sound matched the expected values well in both azimuth and elevation when no other bird vocalizations were present near the target. A field observation of several individuals captured ecologically plausible structures of the soundscape of bird species in the experimental forest, revealing the vertical structure of bird vocalizations by species. Several statistical indices of localized songs can also summarize the detailed changes in the structure of the soundscape.

The localization of bird vocalizations depends on various components, including both hardware (i.e., microphone arrays) and software (i.e., HARK). A finer resolution of the estimated DOA would be an important factor for this purpose, because many sound sources other than those of the target species or individuals always coexist in the field. Improving the resolution of the MUSIC spectrum by increasing the number of steering vectors (i.e., candidate directions for DOA estimation) would be useful, but incurs a substantial computational cost, especially for azimuth–elevation estimation. The interpolation of the MUSIC spectrum used for finer 2D localization of bird vocalizations with two microphone arrays [22] would be efficient in this case. Balancing the DOA resolution against such interpolation would be beneficial for the long-term analyses required by biodiversity surveys. Further consideration of the effects of microphone channel geometry on the localization accuracy of bird vocalizations is part of our future work.
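One cheap refinement, different from the spectrum interpolation of [22] but in the same spirit, is to keep a coarse steering-vector grid and fit a parabola through the MUSIC peak and its two neighbors. The sketch below (the function name and the test spectrum are hypothetical) recovers sub-grid accuracy without adding steering vectors:

```python
import math

def refine_peak(grid_deg, power):
    """Refine a coarse DOA estimate: fit a parabola through the spectrum
    peak and its two neighbors and return the vertex direction (degrees)."""
    i = max(range(len(power)), key=power.__getitem__)
    if i == 0 or i == len(power) - 1:
        return float(grid_deg[i])            # peak at grid edge: cannot fit
    step = grid_deg[1] - grid_deg[0]         # assumes a uniform grid
    y0, y1, y2 = power[i - 1], power[i], power[i + 1]
    offset = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)   # vertex, in grid units
    return grid_deg[i] + offset * step

# A smooth peak truly centered at 47 degrees, sampled on a 5-degree grid
grid = list(range(0, 90, 5))
spec = [math.exp(-((g - 47) / 10) ** 2) for g in grid]
print(round(refine_peak(grid, spec), 1))     # close to 47, despite the grid
```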

A systematic comparison with other sound source localization and separation techniques, including adaptive filtering, is important for more practical applications of robot audition techniques to bird behavioral observations. In this study, we employed the simplest and most standard methods provided in HARK (SEVD-MUSIC and GHDSS), expecting them to provide a baseline result, because these methods have been shown to be applicable to field observations of birds in previous studies, as introduced in the Introduction. We also expected that such simple methods would be appropriate for examining the basic effects of acoustic noise in natural environments, and that advanced methods could improve the results (e.g., MUSIC based on generalized singular value decomposition (GSVD-MUSIC) for better speech recognition [23], and MUSIC based on incremental generalized eigenvalue decomposition (iGEVD-MUSIC) for drone audition [24]).
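For readers unfamiliar with the baseline localizer, the following is a minimal narrowband SEVD-MUSIC sketch on a simulated uniform line array. HARK instead applies MUSIC to multichannel STFT frames with transfer functions measured for the actual array, so the geometry, SNR and names here are purely illustrative:

```python
import numpy as np

def music_spectrum(R, steering, n_src):
    """SEVD-MUSIC: eigendecompose the spatial covariance R and score each
    candidate direction by its distance from the noise subspace."""
    _, V = np.linalg.eigh(R)                 # eigenvalues in ascending order
    En = V[:, : R.shape[0] - n_src]          # noise-subspace eigenvectors
    proj = np.abs(En.conj().T @ steering) ** 2
    return 1.0 / proj.sum(axis=0)            # peaks at source directions

# One source at 40 degrees on an 8-mic, half-wavelength-spaced line array
rng = np.random.default_rng(0)
m, true_deg = 8, 40
a = lambda deg: np.exp(1j * np.pi * np.arange(m) * np.sin(np.radians(deg)))
snaps = np.outer(a(true_deg), rng.standard_normal(200)) \
    + 0.1 * (rng.standard_normal((m, 200)) + 1j * rng.standard_normal((m, 200)))
R = snaps @ snaps.conj().T / 200             # sample spatial covariance
grid = np.arange(-90, 91)                    # 1-degree candidate directions
S = np.stack([a(g) for g in grid], axis=1)
est = grid[np.argmax(music_spectrum(R, S, n_src=1))]
print(est)                                   # recovers the source direction
```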

Also, this research is experimental rather than theoretical in nature. Still, we believe it is important for considering the trends and challenges in robotic applications to show an example of applying robotics to field observations of natural sounds. At the same time, we believe that a report on sound source localization in both elevation and azimuth is particularly important for birds, which can fly. Such reports will contribute to the practical application of related techniques in ecoacoustics, as microphone arrays are expected to be used more frequently in this field.

The spatial localization of bird songs using multiple microphone arrays (i.e., an array of arrays) is a promising approach to determining the precise location of vocalizations. A system with three microphone array units estimated the locations of two color-banded Great Reed Warblers' song posts in a reed marsh with a mean error distance of 5.5 m from the observed song posts [18]. Various systems for localizing animal vocalizations based on many microphone units deployed over a field have also been proposed recently [25]. Gayk et al. successfully triangulated warbler calls in 3D using a time-difference-of-arrival (TDOA) approach with a large microphone array system whose channels were far apart from each other [26]. However, it can be costly to deploy and calibrate multiple units in field observations. Our approach, based on a single but multi-channel array unit and showing good accuracy in the azimuth and elevation angles of bird vocalizations, suggests another possibility for better capturing bird vocalizations while keeping deployment costs low.
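In its simplest 2D form, an array of arrays reduces to cross-bearing triangulation: each unit contributes an azimuth bearing, and the song post is estimated at the bearings' intersection. A minimal sketch follows (compass-style azimuths measured clockwise from north are assumed; real systems such as [18] and TDOA approaches fuse richer information):

```python
import math

def intersect_bearings(p1, az1_deg, p2, az2_deg):
    """Intersect two azimuth bearings (degrees clockwise from north) from
    array units at p1 and p2; returns (x, y) in metres, or None if the
    bearings are (nearly) parallel and no reliable fix exists."""
    d1 = (math.sin(math.radians(az1_deg)), math.cos(math.radians(az1_deg)))
    d2 = (math.sin(math.radians(az2_deg)), math.cos(math.radians(az2_deg)))
    denom = d1[0] * d2[1] - d1[1] * d2[0]    # cross product of directions
    if abs(denom) < 1e-9:
        return None
    t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])

# A singer at roughly (50, 100), heard by units at the origin and 100 m east
print(intersect_bearings((0, 0), 26.565, (100, 0), -26.565))
```

Bearing-intersection error grows as the two bearings approach parallel, which is why unit placement around the survey area matters.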

Our results show how our observation method could be used to noninvasively monitor rare birds in the field. For example, Matsubayashi et al. evaluated the practical effectiveness of localization technology for the auditory monitoring of the endangered Eurasian bittern (*Botaurus stellaris*), which inhabits wetlands in remote areas with thick vegetation, using an 8-channel microphone array unit [27]. They successfully localized the booming calls of at least two males in a reverberant wetland surrounded by thick vegetation and riparian trees. In addition to being non-invasive to the ecosystem inhabited by the target birds, our recording system has a lower deployment cost for field observers. We believe that our monitoring system, given the advantages and limitations presented in this study, offers a practical tool for field ecologists, e.g., for estimating the abundance and distribution of rare species.

However, estimating the distance of sound sources from the microphones and their two-dimensional (spatial) localization are important, or even essential, for more detailed ecological surveys. We think that extracting complementary information about source distance from the separated sounds (e.g., relative amplitude [28]) would be a novel direction for better capturing the structure of the soundscape with a single microphone array unit.
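As a toy illustration of the relative-amplitude cue, free-field spherical spreading attenuates amplitude in proportion to 1/r, so a calibrated reference amplitude yields a crude range estimate. The function name and numbers below are hypothetical, and the model ignores excess attenuation by vegetation and variation in song source level:

```python
def distance_from_amplitude(a_obs, a_ref, r_ref=1.0):
    """Crude range estimate assuming free-field 1/r amplitude decay and a
    known reference amplitude a_ref measured at distance r_ref (metres)."""
    return r_ref * a_ref / a_obs

# A song received at 1/20 of its amplitude at 1 m -> roughly 20 m away
print(distance_from_amplitude(0.05, 1.0))  # 20.0
```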

From another perspective, there is increasing interest in, and development of, sound event localization and detection for various environmental sounds using microphone arrays. For example, the workshop on the detection and classification of acoustic scenes and events (DCASE) provided a dataset (STARSS22) for the sound source localization and classification of domestic sounds in indoor environments [29]. Competitions on sound source localization and classification have been conducted, and participants have discussed issues arising from the task (e.g., [11]). Experimental reports on the sound source localization of distant and elevated calls in a forest environment where many bird species coexist, as investigated in this study, could contribute to further progress in these fields by providing insights into sound localization under the harsher conditions unique to natural acoustic environments.

Although camera-trap-based animal monitoring combined with object detection algorithms is widely used [30], it is challenging to capture small animals such as songbirds, because they are usually far from the device and backlighting poses a problem. This experiment shows that it is possible to quantitatively extract the dynamics of niche use among species, which previously could only be described verbally or roughly, even when the method is based only on azimuth and elevation angle information.

The increasing public interest in and popularity of 3D audio has made portable 3D recording equipment more accessible (e.g., Zoom H3-VR, Zoom Inc.; GoPro MAX, GoPro Inc.). It is worth mentioning that these microphone units (or cameras with multiple microphones) are inexpensive and portable, despite the significant disadvantage of poorer sound source localization performance due to their small size. This study also suggests the possibility of using this type of portable and easily available microphone array in ecoacoustics, which can contribute to citizen science in ornithology [31], in addition to the recent development of bird song identification apps based on deep learning techniques (e.g., BirdNET [32]). One problem in the application of these approaches is the low accuracy in detecting vocalizations that overlap with each other or with other environmental sound sources. Robot audition techniques can mitigate this problem by separating sound sources using spectrogram information from multiple channels, as discussed in this paper.
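The separation step exploits inter-channel phase information. GHDSS, used in this study, is adaptive; the underlying idea can be seen in an even simpler frequency-domain delay-and-sum beamformer, sketched below with synthetic steering phases (all shapes, values and names are illustrative, not HARK's implementation):

```python
import numpy as np

def delay_and_sum(spec, steer):
    """Frequency-domain delay-and-sum: undo each channel's steering phase
    for one direction and average, so that direction adds coherently while
    other directions partly cancel.
    spec: (mics, freqs, frames) multichannel STFT; steer: (mics, freqs)."""
    return np.mean(spec * steer.conj()[:, :, None], axis=0)

# Two overlapped sources on a 4-mic array: one frequency bin, three frames
m = 4
phase = lambda k: np.exp(1j * k * np.arange(m))   # per-channel phase ramp
s_a = np.array([1.0, 1.0, 1.0])                   # target song frames
s_b = np.array([1.0, -1.0, 1.0])                  # overlapping interferer
spec = phase(0.4)[:, None, None] * s_a + phase(2.0)[:, None, None] * s_b
out = delay_and_sum(spec, phase(0.4)[:, None])    # steer toward the target
print(np.round(np.abs(out), 2))                   # close to |s_a| = [1 1 1]
```

Steering the same mixture toward the interferer's phases instead recovers its frames, which is the essence of separating overlapped singers from one multichannel recording.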

Future work includes practical comparisons of the efficiency of bird song localization and separation between the microphone arrays adopted in this study and such commercially available devices, in order to further explore the applicability of robot audition techniques to ecoacoustic research.

**Author Contributions:** Conceptualization, K.H., H.O. and R.S.; field experiment, K.H. and H.O.; formal analysis and writing, K.H. and R.S.; supervision, S.M., T.A., K.N. and H.G.O.; funding acquisition, R.S. and K.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by JSPS/MEXT KAKENHI: JP21K12058, JP20H00475, JP19KK0260.

**Institutional Review Board Statement:** The experimental procedures were approved by the Planning and Evaluation Committee of the Graduate School of Information Science, Nagoya University (GSI-H30-1).

**Data Availability Statement:** The experimental data are available upon request.

**Acknowledgments:** We thank Noriyoshi Kaneko and Shiro Murahama for designing and conducting a speaker test. We also thank Naoki Takabe (Nagoya Univ.) for conducting a field recording in the experimental forest.

**Conflicts of Interest:** The authors declare no conflict of interest.
