## **1. Introduction**

Currently, most of Taiwan's raw materials for energy production, including coking coal, fuel coal, crude oil, and liquefied natural gas [1], are imported, and their use has a large and immediate impact on the environment. The government has therefore actively developed green energy, including offshore wind farms [2], but most sites overlap with Indo-Pacific humpback dolphin reservation zones. The noise from pile driving during construction may affect marine mammals and cause auditory injury, ranging from temporary threshold shift (TTS) to permanent threshold shift (PTS) in hearing [3]. To minimize the noise-induced impact on cetaceans caused by the construction and operation of wind turbines, establishing a marine mammal detection mechanism is a priority. The traditional detection method is visual: marine mammal observers (MMOs) work from vehicles and search for cetaceans with the naked eye, an operation that is expensive, offers only a low probability of success, and is limited to daylight hours. Underwater acoustics provides an alternative technique, in which the cetacean call serves as the characteristic used for detection. We used passive acoustic monitoring (PAM) to develop NTU\_PAM, an algorithm that detects cetacean calls and then tracks their source. In addition to overcoming the weaknesses of the visual method, NTU\_PAM can show the correlation between the results of the visual method and PAM.

**Citation:** Hung, C.-T.; Chu, W.-Y.; Li, W.-L.; Huang, Y.-H.; Hu, W.-C.; Chen, C.-F. A Case Study of Whistle Detection and Localization for Humpback Dolphins in Taiwan. *J. Mar. Sci. Eng.* **2021**, *9*, 725. https://doi.org/10.3390/jmse9070725

Academic Editors: Michel André, Christine Erbe and Giuseppa Buscaino

Received: 30 April 2021 Accepted: 28 June 2021 Published: 30 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Cetaceans produce two major types of cetacean calls [4,5]: (1) the "whistle" is a continuous, narrow-band, and frequency-modulated signal that is thought to be a form of social communication; (2) the "click" is considered a bio-sonar and is a short, broadband, and directional impulse signal used to navigate, detect, and identify objects. In marine mammal research, PAM has proved a useful tool. For example, (1) Spaulding et al. [6] built a near-real-time buoy system to automatically detect North Atlantic right whale calls in Cape Cod Bay and near the Boston Harbor. When the buoy system detects a whale call, an alarm signal is transmitted and the call is recorded. (2) Linnenschmidt et al. [7] equipped an acoustic data logger on a porpoise to record clicks and determine the relationship among the click, movement, and diving behavior. (3) Akamatsu et al. [8] used an underwater pulse event recorder (A-TAG) to record clicks and analyze critical parameters such as interclick interval (ICI).

Cetacean whistle detection has been studied extensively. Gannier et al. [9] developed Seafox software to extract whistle characteristics (length, beginning frequency, ending frequency, maximum frequency, minimum frequency, etc.) from a time-frequency spectrogram and used a regression tree to classify five dolphin species. Lai [10] used mel-frequency cepstral coefficients, which model features of human hearing such as the critical band and auditory masking, to extract the characteristics of the whistle. The whistle characteristics were then used in a support vector machine (SVM) to identify the cetacean species. Caldwell and Caldwell [11–13] hypothesized that dolphins emit signature whistles that carry information and vary between individuals, so that distinctive whistles can be used to identify individual dolphins. Datta and Sturtivant [14] considered two whistle features on a spectrogram (overall contour shape and detailed contour structure difference) as parameters of the signature whistle and grouped whistles using the hidden Markov model (HMM) method. Bahoura and Simard [15] used an artificial neural network to classify blue whale calls. The above research is based on supervised machine learning methods, which require large sets of clean training data, manual labeling of the calls, and model building. Such models are suitable only for specific or regional species.

To avoid the disadvantages of labeling, training, and specific targeting, Gillespie et al. [16] developed a whistle detector based on image processing on a spectrogram which is implemented as the Whistle and Moan Detector module in PAMGuard. PAMGuard software includes a user-friendly, human–machine interface and modules for data processing and marine mammal detection [17] and has been widely used for real-time marine mammal monitoring. Lin [18] devised a non-targeted algorithm on the MATLAB platform that helps users grasp the position of whistles across many audio files, making further processing convenient. Lin et al. [19] first denoised the spectrogram and then detected the whistle characteristics. Gillespie's and Lin's methods include four main steps: (1) spectrogram, (2) image processing, (3) whistle feature extraction, and (4) combination of the whistle data points. We applied the same pattern to develop the whistle detection algorithm. A similar concept is applied in steps 2–4, but the detailed methods are different. We also compared NTU\_PAM and PAMGuard, which is regarded as a standard of whistle detection.

Tracking cetaceans is another recent primary research subject. Janik et al. [20] deployed three hydrophones to form a two-dimensional, triangular array in Beauly Firth, Northern Scotland, U.K. The inter-hydrophone distances were 208, 513, and 506 m. An artificial sound was then projected at a depth of 1 m, and the time difference of signal arrival for each pair of hydrophones was used to localize the sound source. Wang et al. [21] deployed a two-dimensional, cross-shaped array consisting of five hydrophones from the side of a boat at a depth of 1 m in the Pearl River Estuary, China, and the Beibu Gulf of Guangxi, China. The inter-hydrophone distances were 1.47, 1.54, 2.08, and 2.18 m. The boat followed the dolphin group at a close distance to receive the dolphin calls, and the difference in arrival time of a sound at each hydrophone pair was used to localize the targets. Wiggins et al. [22] deployed a tracking high-frequency acoustic recording package (HARP) [23] consisting of four hydrophones at 3 m above the seafloor offshore of Southern California to track beaked whales and dolphins. Wiggins et al. [24] also deployed four HARPs offshore of Southern California to track whistling dolphins. Both of Wiggins's studies used the time difference of arrival (TDOA) method. Building on the demonstrated effectiveness of TDOA for tracking and localization, we used four hydrophone stations to form a kilometer-scale array for tracking the source based on TDOA.

We designed an experiment that simulated different whistle types in the real field and developed four PAM stations to track the artificial source. The four stations were deployed near Taichung Harbor to record the simulated calls. After running the detection algorithm, finding the whistle times, and localizing the source, we compared the results from the algorithm with the moving path of the boat carrying the sound source. In this study, we developed an algorithm that does not require a trained model for the automatic detection of whistles. The algorithm is based on the time length and frequency band of the whistle feature. Furthermore, the automatic detection algorithm and the localization method were combined as NTU\_PAM. NTU\_PAM can work as an auxiliary tool for MMOs during the daytime, and it can function as the main monitoring tool at night.

## **2. Whistle Detector Algorithm**

Passive acoustic monitoring has been used widely in marine monitoring to amass longitudinal data, and it requires high-efficiency algorithms to help researchers find the required file segments. We developed a whistle detector algorithm by improving on Li's prototype algorithm [25]. The algorithm can detect the whistles of any whistle-producing animal; the detected frequency range depends on the species. There are six main processes in the algorithm:

1. Spectrogram (Section 2.1);
2. Denoising on the time axis of the spectrogram (Section 2.2);
3. Removing salt-and-pepper noise (Section 2.3);
4. Satisfying PSD and SNR conditions (Section 2.4);
5. Extracting the whistle (Section 2.5);
6. Clustering (Section 2.6).

A flow chart of the algorithm is shown in Figure 1. To present whistles clearly on the spectrogram, some processes are based on image processing. Each process is described in detail below, and Figure 2 shows the result of each step.

**Figure 1.** Whistle detector algorithm flow chart.

**Figure 2.** Each step of the results.

### *2.1. Spectrogram*

We used the short-time Fourier transform (STFT) [26] to obtain the frequency content of the signal as a function of time. A frame slides along the time-domain signal; the samples in each frame are multiplied by a window function and then Fourier transformed, and the results are assembled into the spectrogram. The window function is the Hamming window [27], the frame length is 0.01 s, and the overlap is 90%. The STFT formula is shown in Equation (1), where $w$ is the window function, $x$ is the raw data, $t$ is the frame time, and $f$ is the frequency in Hz.

$$S_{t,f} = \int_{-\infty}^{+\infty} w(\tau - t)\,x(\tau)\,e^{-i 2\pi f \tau}\,d\tau \tag{1}$$
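
The settings above can be sketched in Python as follows. This is a minimal illustration rather than the NTU\_PAM implementation: it assumes SciPy's `scipy.signal.stft`, a 96 kHz sampling rate (the rate used later in the experiment), and an uncalibrated power output; the function name `whistle_spectrogram` is hypothetical.

```python
import numpy as np
from scipy.signal import stft

def whistle_spectrogram(x, fs=96_000, frame_s=0.01, overlap=0.90):
    """Return (freqs, times, power) from a Hamming-window STFT of x."""
    nperseg = int(round(frame_s * fs))        # 960 samples per frame at 96 kHz
    noverlap = int(round(overlap * nperseg))  # 864 samples, i.e. a hop of 96 samples
    freqs, times, z = stft(x, fs=fs, window="hamming",
                           nperseg=nperseg, noverlap=noverlap,
                           boundary=None, padded=False)
    # Squared magnitude; not calibrated to dB re 1 uPa^2/Hz.
    return freqs, times, np.abs(z) ** 2

# Usage: a 7 kHz test tone lasting one second.
fs = 96_000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 7_000 * t)
freqs, times, power = whistle_spectrogram(tone, fs)
peak_hz = freqs[power.mean(axis=1).argmax()]  # frequency bin with the most energy
```

With a 0.01 s frame the frequency resolution is 100 Hz, so the output rows are `freqs` in 100 Hz steps and the columns are frame times.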

### *2.2. Denoising on the Time Axis of the Spectrogram*

Whistles are long compared to impulse noise; therefore, we use the moving average method to remove impulse noise from the spectrogram. In each frequency band, every 20 consecutive points on the time axis are averaged to build a new spectrogram, as shown in Equation (2), where $S_{t,f}$ is the original spectrogram and $S'_{t,f}$ is the new spectrogram after denoising.

$$S'_{t,f} = \frac{1}{20} \sum_{n=0}^{19} S_{t-n,f} \tag{2}$$
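
A minimal NumPy sketch of Equation (2) is given below. It assumes the spectrogram is stored as a `(frequency, time)` array and, as a simplification not stated in the text, drops the first 19 frames, for which a full 20-point history is unavailable. The function name `moving_average_time` is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_average_time(spec, n=20):
    """Mean of each frame and its n-1 predecessors along the time axis (axis 1)."""
    windows = sliding_window_view(spec, n, axis=1)  # shape (freq, time - n + 1, n)
    return windows.mean(axis=-1)

# Usage: a one-frame impulse is attenuated by a factor of 20,
# while a steady tone keeps its level.
spec = np.zeros((2, 40))
spec[0, 25] = 20.0   # impulse noise in band 0
spec[1, :] = 5.0     # continuous signal in band 1
smoothed = moving_average_time(spec)
```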

### *2.3. Removing Salt and Pepper Noise*

A median filter, a nonlinear signal processing technique often used in image processing, was used to remove salt-and-pepper noise [28]. The median of every 3-by-3 neighborhood of the spectrogram is calculated, as shown in Equation (3), where $S'_{t,f}$ is the spectrogram after the time-axis denoising and $S''_{t,f}$ is the spectrogram after the median filter.

$$S''_{t,f} = \operatorname{median}\left( S'_{t+i,f+j} \right), \quad i, j \in \{-1, 0, 1\} \tag{3}$$
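
Equation (3) corresponds to a standard 3-by-3 median filter. The sketch below uses `scipy.ndimage.median_filter`; replicating the edge values at the spectrogram border (`mode="nearest"`) is an assumption, since the text does not describe the border handling, and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_salt_pepper(spec):
    """Replace each point by the median of its 3-by-3 neighbourhood."""
    return median_filter(spec, size=3, mode="nearest")

# Usage: a single outlier on a flat background is removed.
spec = np.ones((5, 5))
spec[2, 2] = 100.0
clean = remove_salt_pepper(spec)
```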

### *2.4. Satisfying PSD and SNR Conditions*

Since a whistle is a narrow-band signal, when a whistle occurs its power spectral density (PSD) is much larger than that of the points at immediately adjacent frequencies. The signal-to-noise ratio (SNR) used in this study is defined in Equation (4). If, at a data point, the PSD reaches the PSD threshold and the SNR reaches the SNR threshold simultaneously, the data point is replaced by one; otherwise, the data point is replaced by zero, as shown in Equation (5). The new spectrogram $B_{t,f}$ is a binary image. The default values of the SNR threshold and the PSD threshold are 6 dB and 40 dB (re 1 μPa²/Hz), respectively.

$$\text{SNR}_{t,f} = \frac{2S''_{t,f}}{S'_{t,f+1} + S'_{t,f-1}} \tag{4}$$

$$B_{t,f} = \begin{cases} 1, & \text{SNR}_{t,f} \ge \text{SNR}_{threshold} \ \text{and} \ S''_{t,f} \ge \text{PSD}_{threshold} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
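
The thresholding step can be sketched as below, under several assumptions that the text does not settle: the SNR in Equation (4) compares each point with its two adjacent frequency bins, the spectrograms are in linear power units, and both thresholds are applied after conversion to decibels with $10\log_{10}$. The function `binarize` and its arguments are hypothetical; if the neighbours are meant to come from the median-filtered spectrogram, the same array can be passed twice.

```python
import numpy as np

def binarize(s_med, s_ref, snr_db=6.0, psd_db=40.0):
    """Binary image: 1 where both the SNR and PSD conditions hold, else 0.

    s_med: median-filtered spectrogram, linear power, shape (freq, time).
    s_ref: spectrogram supplying the neighbouring-frequency values.
    """
    neighbours = np.full_like(s_ref, np.nan, dtype=float)
    neighbours[1:-1] = s_ref[2:] + s_ref[:-2]   # value at f+1 plus value at f-1
    with np.errstate(divide="ignore", invalid="ignore"):
        snr = 10 * np.log10(2 * s_med / neighbours)
        level = 10 * np.log10(s_med)
        # NaN rows (first and last frequency bin) compare as False, giving 0.
        mask = (snr >= snr_db) & (level >= psd_db)
    return mask.astype(np.uint8)

# Usage: a 60 dB narrow-band line on a 30 dB background.
s = np.full((5, 4), 1e3)
s[2, :] = 1e6
b = binarize(s, s)
```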

### *2.5. Extracting the Whistle*

As mentioned in Section 2.4, the whistle is a narrow-band, continuous signal. In this step, adjacent data points whose value is one are connected and labeled as a segment. Two conditions are then applied, a frequency bandwidth threshold and a time length threshold: only the segments whose frequency bandwidth is smaller than the frequency bandwidth threshold and whose time length is longer than the time length threshold are retained. The binary image $B_{t,f}$ is updated to a new image $B'_{t,f}$. The default values of the frequency bandwidth threshold and the time length threshold are 300 Hz and 0.06 s, respectively.
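
One way to sketch this step is with connected-component labeling from `scipy.ndimage`. Two interpretations here are assumptions rather than details from the text: points are joined with 8-connectivity, and the "frequency bandwidth" of a segment is taken as its widest extent within a single time frame (so that a frequency-modulated sweep is not rejected for its total frequency span). The function name and arguments are hypothetical.

```python
import numpy as np
from scipy.ndimage import label, find_objects

def extract_whistle_segments(binary, df_hz, dt_s, max_bw_hz=300.0, min_len_s=0.06):
    """Keep connected segments that are narrow in frequency and long in time.

    binary: (freq, time) array of 0/1; df_hz, dt_s: bin spacing in Hz and s.
    """
    structure = np.ones((3, 3), dtype=int)            # 8-connectivity (assumed)
    labels, _ = label(binary, structure=structure)
    keep = np.zeros_like(binary, dtype=np.uint8)
    for idx, (f_sl, t_sl) in enumerate(find_objects(labels), start=1):
        mask = labels[f_sl, t_sl] == idx
        bandwidth = mask.sum(axis=0).max() * df_hz    # widest single frame
        duration = (t_sl.stop - t_sl.start) * dt_s
        if bandwidth < max_bw_hz and duration > min_len_s:
            keep[labels == idx] = 1
    return keep

# Usage: a 0.1 s sweep is kept; a 0.02 s blip and a broadband patch are dropped.
binary = np.zeros((50, 200), dtype=np.uint8)
for t in range(100):
    binary[10 + t // 5, 20 + t] = 1                   # narrow upward sweep
binary[40, 150:170] = 1                               # too short
binary[0:8, 100:200] = 1                              # too wide (800 Hz)
kept = extract_whistle_segments(binary, df_hz=100.0, dt_s=0.001)
```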

### *2.6. Clustering*

The k-means method [29] is used to cluster the data points in $B'_{t,f}$. First, whistle segments from Section 2.5 are merged according to their differences in frequency and time: if the time interval between two segments is smaller than 0.3 s and, simultaneously, the frequency difference between them is smaller than 1 kHz, the two segments are considered one whistle segment. After merging, k (the number of clusters) is set to the new number of segments. Each data point is then assigned to one of the k whistles according to the Euclidean distance of its frequency and time indices in $B'_{t,f}$. Each whistle's start time, end time, start frequency, and end frequency are recorded after k-means.
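
The merging rule that fixes k can be illustrated with the short sketch below. It covers only the merge, not the k-means assignment, and it makes two assumptions: segments are compared in time order with the most recently merged segment only, and the "difference of frequency" is measured between the end frequency of the earlier segment and the start frequency of the later one. The `Segment` class and `merge_segments` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    t_start: float  # s
    t_end: float    # s
    f_start: float  # Hz
    f_end: float    # Hz

def merge_segments(segments, max_gap_s=0.3, max_df_hz=1000.0):
    """Greedily merge time-ordered segments that are close in time and frequency."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.t_start):
        if merged:
            last = merged[-1]
            gap = seg.t_start - last.t_end
            df = abs(seg.f_start - last.f_end)
            if gap < max_gap_s and df < max_df_hz:
                # Extend the previous segment; keep the later end frequency.
                later = seg if seg.t_end >= last.t_end else last
                merged[-1] = Segment(last.t_start, later.t_end,
                                     last.f_start, later.f_end)
                continue
        merged.append(seg)
    return merged

# Usage: the first two pieces belong to one whistle; the third is separate.
pieces = [Segment(0.0, 0.4, 5000.0, 6500.0),
          Segment(0.5, 1.0, 6800.0, 9000.0),
          Segment(3.0, 3.5, 5000.0, 9000.0)]
whistles = merge_segments(pieces)
k = len(whistles)   # number of clusters passed to k-means
```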

## **3. Localization Method**

TDOA was used to localize the whistle source. We devised an experiment to track the moving path of an artificial source using the whistle detector algorithm and TDOA.

### *3.1. Time Difference of Arrival (TDOA)*

TDOA is often used in signal source positioning [30]. It requires only the times at which the signal is received and the speed at which the signal travels. Once the signal is received at two receiving stations, the difference in arrival time defines a hyperbola of possible source locations, as given by Equations (6) and (7). With three receiving stations, at least two hyperbolas are produced, as shown in Figure 3, and their intersection is the signal source location. For this to work, the receiving stations must be time-synchronized.

$$\sqrt{(x - x_1)^2 + (y - y_1)^2} - \sqrt{(x - x_3)^2 + (y - y_3)^2} = c(t_1 - t_3) \tag{6}$$

$$\sqrt{(x - x_2)^2 + (y - y_2)^2} - \sqrt{(x - x_3)^2 + (y - y_3)^2} = c(t_2 - t_3) \tag{7}$$

where $t_1$, $t_2$, and $t_3$ are the times at which the same signal arrives at the three hydrophones; $(x_i, y_i)$ is the position of hydrophone $i$; $(x, y)$ is the position of the unknown signal source; and $c$ is the sound speed from the local sound speed profile.
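
Equations (6) and (7) can be solved numerically. The sketch below is one possible approach, not necessarily the one used in NTU\_PAM: it minimizes the residuals of the hyperbola equations with `scipy.optimize.least_squares`, starting from the centroid of the stations. It assumes station coordinates already projected to a local plane in metres and a constant sound speed of 1500 m/s; both values and the function name `locate_tdoa` are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def locate_tdoa(stations, arrival_times, c=1500.0, ref=None):
    """Estimate the source (x, y) from arrival times at three or more stations."""
    stations = np.asarray(stations, dtype=float)
    times = np.asarray(arrival_times, dtype=float)
    if ref is None:
        ref = len(stations) - 1              # last station as reference, as in Eqs. (6)-(7)
    others = [i for i in range(len(stations)) if i != ref]

    def residuals(p):
        d = np.linalg.norm(stations - p, axis=1)     # range to every station
        return (d[others] - d[ref]) - c * (times[others] - times[ref])

    x0 = stations.mean(axis=0)               # start at the array centroid
    return least_squares(residuals, x0).x

# Usage: synthetic arrival times for a source inside a 1 km square array.
stations = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0)]
source = np.array([400.0, 650.0])
times = np.linalg.norm(np.asarray(stations) - source, axis=1) / 1500.0
estimate = locate_tdoa(stations, times)
```

With only three stations the two hyperbolas can intersect at more than one point, so the result then depends on the starting guess.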

**Figure 3.** TDOA schematic.

### *3.2. Taichung Harbor TDOA Experimental Configuration*

We deployed four hydrophone stations near Taichung Harbor, an area where Indo-Pacific humpback dolphins are extremely active [31,32]. The locations of the hydrophones are shown in Figure 4, and the exact latitude and longitude are shown in Table 1. The Beaufort sea state was below 3, and the ambient noise is shown in Figure 5 as percentile levels. The highest PSD was around 95 dB (re 1 μPa²/Hz) at 60–70 Hz on L50, possibly produced by shipping, and the PSD from 3 kHz to 10 kHz was around 65 dB (re 1 μPa²/Hz).

**Figure 4.** Hydrophone station locations.


**Table 1.** Latitude and longitude of hydrophone stations.

**Figure 5.** Ambient noise percentile level: Ln is the noise level exceeded for *n*% of the measurement time, e.g., L50 is the noise level exceeded for 50% of the measurement time.

A SoundTrap ST500 hydrophone recorder was used at point J3, and three Wildlife Acoustics SM3M hydrophone recorders were used at points J1, J2, and J4. They were bottom-mounted, with the sampling frequency set to 96 kHz. To achieve time synchronization for all recorders, we produced an impulse signal before deployment as a benchmark for correcting the time. To simulate the whistle of an actual Indo-Pacific humpback dolphin, which features a frequency range of 3–9 kHz, three kinds of artificial sound signals were employed: (a) rising frequency (5–9 kHz), (b) U-type (9–5–9 kHz), and (c) decreasing frequency (9–5 kHz), each with a time length of one second, as shown in Figure 6. The source level (SL) was 160 dB (re 1 μPa at 1 m). The underwater acoustic projector SQS-23 was placed at a water depth of 5 m (Figure 7), since Indo-Pacific humpback dolphins often stay about 5 m below sea level [33]. Figure 8 shows the 15 spots (T1–T15) outside Taichung Harbor where the artificial sound signals were played every 10 s for 10 min.
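
For illustration, test signals of the three kinds described above can be synthesized as follows. The exact frequency laws are assumptions: linear sweeps for the rising and decreasing signals and a parabolic law for the U-type signal. The function name `synth_whistle` is hypothetical, and amplitudes are normalized rather than scaled to the stated source level.

```python
import numpy as np

def synth_whistle(freq_hz, fs=96_000):
    """Sine wave whose instantaneous frequency follows the array freq_hz."""
    phase = 2 * np.pi * np.cumsum(freq_hz) / fs   # integrate frequency to phase
    return np.sin(phase)

fs = 96_000
t = np.arange(fs) / fs                             # one second of samples
rising = synth_whistle(np.linspace(5_000, 9_000, fs), fs)      # (a) 5 -> 9 kHz
u_type = synth_whistle(5_000 + 4_000 * (2 * t - 1) ** 2, fs)   # (b) 9 -> 5 -> 9 kHz
falling = synth_whistle(np.linspace(9_000, 5_000, fs), fs)     # (c) 9 -> 5 kHz
```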

**Figure 6.** (**a**) Rising frequency; (**b**) U-type; (**c**) decreasing frequency.

**Figure 7.** Schematic of the installation position of the projector.

### *3.3. Experimental Data Analysis Method*

In this experiment, the SNR of the received signal was larger than 10 dB, exceeding the NTU\_PAM-recommended SNR threshold of 6 dB. The signals recorded by the hydrophones at the four stations when the source was at point T10 are shown in Figure 9. To find the artificial whistle within the sound file, NTU\_PAM was used to extract the start and end times of the whistle from the raw data of the four hydrophones. However, the extracted time information was not precise enough for TDOA. For increased accuracy, the raw data between the start and end times of the whistle were analyzed directly, without being processed by the algorithm. The time of the J2 station was taken as the central time, and the full-frequency-band raw data of the central station were cross-correlated with those of each of the three other stations to determine the time difference, as shown in Equations (8) and (9), where $X_2$ is the J2 station's whistle raw data; $X_o$ is the whistle raw data of one of the three other stations; $R$ is the result of the cross-correlation; and $t_d$ is the time difference, i.e., the lag at which $R$ reaches its maximum, which was used to obtain the location of the signal source by the TDOA method.

$$R(\tau) = \int X_2(t)\,X_o(t-\tau)\,dt \tag{8}$$

$$t_d = \underset{\tau}{\arg\max}\; R(\tau) \tag{9}$$
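
A minimal sketch of this cross-correlation step is given below, assuming SciPy's `scipy.signal.correlate` and `scipy.signal.correlation_lags`; the function name `time_difference` is hypothetical. With this sign convention, a positive result means the signal arrived later at the reference station than at the other station.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def time_difference(x_ref, x_other, fs):
    """Lag (in seconds) at which the cross-correlation of x_ref and x_other peaks."""
    r = correlate(x_ref, x_other, mode="full", method="fft")
    lags = correlation_lags(len(x_ref), len(x_other), mode="full")
    return lags[np.argmax(r)] / fs

# Usage: the same burst placed 300 samples later in the reference record.
fs = 96_000
rng = np.random.default_rng(0)
burst = rng.standard_normal(960)
x_ref = np.zeros(4000)
x_ref[1000:1960] = burst
x_other = np.zeros(4000)
x_other[700:1660] = burst
td = time_difference(x_ref, x_other, fs)
```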

**Figure 9.** Received signal of each station when the source was at point T10. (**a**) J1 station; (**b**) J2 station; (**c**) J3 station; (**d**) J4 station.

## **4. Results**

### *4.1. Comparison with PAMGuard*

As mentioned, PAMGuard software is widely used in the field of marine mammal observation. In this research, the performance of NTU\_PAM and the Whistle and Moan Detector module of PAMGuard were compared using the same hardware (an i9-9900 CPU from Intel Corporation with 64 GB of memory). The test audio is a two-minute sound file, rich in whistles and with a sampling frequency of 96 kHz, recorded near the sea area of Yunlin, Taiwan [34]. We manually confirmed that the file contained a total of 33 whistles.

The PAMGuard Whistle and Moan Detector was run with window lengths of 2048 data points (0.02 s) and 1024 data points (0.01 s) and with overlap ratios of 50% and 90%. NTU\_PAM was run with its recommended settings: a window length of 0.01 s, an overlap ratio of 90%, and an SNR threshold of 6 dB. As shown in Table 2, PAMGuard with a window length of 1024 data points, a 90% overlap ratio, and a 6 dB SNR threshold gave the PAMGuard result closest to the 33 manually confirmed whistles, with 47 detected whistles. NTU\_PAM detected a total of 30 whistles.




**Table 2.** *Cont.*

### *4.2. Experimental Results*

At least three signal receiving stations were used for each TDOA calculation. When the hyperbolic curves intersect at more than one point, the center of the intersection points is taken as the estimated location. To verify localization accuracy, GPS data from the experimental ship bearing the sound source were compared to the TDOA results.

In the series of graphs in Figure 10, the blue dots are the hydrophone station positions (J1, J2, and J4), the red dots are the signal source positions from the experimental ship's GPS record, and the yellow stars are the TDOA positioning results. The results from the first experiment, testing the rising frequency (5–9 kHz) signal, are shown in Figure 10a. The positioning accuracy was higher when the sound source was nearer to stations J1 and J2, at the center of the group of hydrophone stations. The nearest positioning points, T4 to T11, showed an average positioning error of 24.7 m; the overall positioning error was 143.5 m, which was affected by the lower accuracy at the outer points.

**Figure 10.** (**a**) Result of rising frequency signal; (**b**) result of decreasing frequency signal; (**c**) result of U-shaped signal.

The second experiment used the decreasing frequency (9–5 kHz) signal, and its positioning trend was similar to that of the rising frequency signal (Figure 10b). It also showed higher positioning accuracy when the signal source was close to the J1 and J2 stations. The average positioning error of T4 to T11 was 44.8 m, larger than that of the rising frequency signal, and the overall positioning error was 145.9 m. Finally, the U-shaped (9–5–9 kHz) signal displayed a trend similar to that of the other two signals (Figure 10c). The average positioning error of T4 to T11 was 39.6 m, but the overall positioning error was the smallest of the three signals at 116.1 m.

## **5. Discussion**

In the comparison between PAMGuard and NTU\_PAM, both results were close to the manually confirmed number of whistles, showing that both performed well on whistle detection. The difference in the numbers detected may arise because PAMGuard is a real-time auxiliary tool, mainly provided to visual-method researchers for detecting the occurrence of a call; as such, it needs only a few window lengths of data to detect a whistle. NTU\_PAM, in contrast, needs one second or more of data to build a spectrogram and to initiate processing. However, PAMGuard may, at times, break one call into several calls, as shown in Figure 11. According to the results, NTU\_PAM is suitable for processing measurements captured over a longer duration, and it proves as robust as PAMGuard.

**Figure 11.** Output results of rising frequency signals: (**a**) NTU\_PAM; (**b**) PAMGuard with 1024 data points window length, 90% overlap ratio and 6 dB SNR.

In the localization experiment, the TDOA method proved useful for localizing the whistle source. Figure 12 plots the errors of the three different types of signals at each spot and indicates that the error is small when the source is inside the region enclosed by the four hydrophone recorders (points T4–T11); when the source is outside the region (points T1–T3 and T12–T15), the location is only approximate (Figure 13). The results of this experiment indicate the strengths of NTU\_PAM for tracking cetaceans.

**Figure 12.** Distribution of localization errors.

**Figure 13.** Detection region formed by the hydrophones.

## **6. Conclusions**

In this research, we devised and developed the NTU\_PAM algorithm, which performs whistle detection and whistle localization based on the TDOA method. The results showed NTU\_PAM is able to localize and track the whistle sound source with high accuracy. In the future, MMOs can monitor the moving path of marine mammals via the visual method combined with NTU\_PAM, making it possible to monitor cetaceans without being limited by daylight hours.

**Author Contributions:** Conceptualization, C.-T.H., W.-L.L., Y.-H.H., W.-Y.C. and C.-F.C.; methodology, C.-T.H., W.-L.L., Y.-H.H., W.-Y.C. and C.-F.C.; software, C.-T.H., W.-L.L. and W.-Y.C.; formal analysis, W.-Y.C. and W.-C.H.; writing—original draft preparation, C.-T.H. and W.-Y.C.; writing review and editing, C.-T.H. and C.-F.C.; supervision, C.-F.C.; project administration, C.-F.C.; funding acquisition, C.-F.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Science and Technology, Taiwan (MOST 109-2221-E-002-198-MY3).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Request from the corresponding author of this article.

**Acknowledgments:** The authors would like to thank the Formosa Plastics Group and the Formosa Petrochemical Corporation, 1141946-00, for the 2 minutes of data. The authors would like to thank the Ministry of Science and Technology, Taiwan, for the funding; and Professor Lien-Siang Chou for help with marine mammal knowledge.

**Conflicts of Interest:** The authors declare no conflict of interest.
