**Hearing in Noise: The Importance of Coding Strategies—Normal-Hearing Subjects and Cochlear Implant Users**

### **Pierre-Antoine Cucis 1,2,\* , Christian Berger-Vachon 3,4, Ruben Hermann 1,2, Fabien Millioz 5, Eric Truy 1,2 and Stéphane Gallego 6,7**


Received: 7 December 2018; Accepted: 11 February 2019; Published: 20 February 2019

**Abstract:** Two schemes are mainly used for coding sounds in cochlear implants: Fixed-Channel and Channel-Picking. This study aims to determine the speech audiometry scores in noise of people using either type of sound coding scheme. Twenty normal-hearing subjects and 45 cochlear implant users participated in this experiment. Both populations were tested using disyllabic words mixed with cocktail-party noise. A cochlear implant simulator was used to test the normal-hearing subjects. This simulator separated the sound into 20 spectral channels; the eight most energetic were selected to simulate the Channel-Picking strategy. For normal-hearing subjects, we observed higher scores with the Fixed-Channel strategy than with the Channel-Picking strategy at mid-range signal-to-noise ratios (0 to +6 dB). For cochlear implant users, no significant differences were found between the two coding schemes, although a slight advantage appeared for the Fixed-Channel strategies over the Channel-Picking strategies. For both populations, a difference was observed for the signal-to-noise ratio at 50% of the maximum recognition plateau, in favour of the Fixed-Channel strategy. To conclude, in the most common signal-to-noise ratio conditions, a Fixed-Channel coding strategy may lead to better recognition percentages than a Channel-Picking strategy. Further studies are needed to confirm this.

**Keywords:** cochlear implant; coding strategy; Fixed-Channel; Channel-Picking; vocoder simulation; normal-hearing

#### **1. Introduction**

In 2012, cochlear implants (CIs) had successfully restored partial hearing to over 324,200 deaf people worldwide [1]. In most cases, users of modern CIs perform well in quiet listening conditions. Four CI manufacturers are presently on the French market: Cochlear® and Neurelec®/Oticon Medical® for Channel-Picking (CP) strategies and Advanced Bionics® and Med-El® for Fixed-Channel (FC) strategies. For most CI users, however, speech-perception performance decreases significantly in noisy environments [2].

All modern sound coding strategies are based on the analysis of acoustic information by a bank of band-pass filters and each strategy has its own philosophy [3].

Two coding schemes are mainly in use. FC strategies transmit all available channels to the electrodes, usually stimulating at a high rate. CP strategies (sometimes called n-of-m strategies) estimate the outputs of all m available channels and select the subset of n channels with the largest amplitudes; their stimulation rates vary (high, medium or low).
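The n-of-m selection step can be sketched in a few lines. The following toy function (the name and per-frame granularity are our assumptions, not any manufacturer's implementation) keeps only the n most energetic channel envelopes of one analysis frame:

```python
import numpy as np

def pick_n_of_m(channel_envelopes, n=8):
    """Illustrative n-of-m selection: keep the n channels with the
    largest envelope amplitudes in this analysis frame, zero the rest."""
    env = np.asarray(channel_envelopes, dtype=float)
    keep = np.argsort(env)[-n:]            # indices of the n largest outputs
    selected = np.zeros_like(env)
    selected[keep] = env[keep]
    return selected

# One 20-channel analysis frame; 8 channels are retained (8-of-20)
frame = np.random.rand(20)
out = pick_n_of_m(frame, n=8)
```

An FC strategy, by contrast, would simply pass all m envelopes through unchanged.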

The present study focuses on the relative contribution of FC strategies and CP strategies on syllable recognition in noise. We wish to compare the efficiency of the FC and CP coding strategies, first in simulation and secondly with CI users.

#### *1.1. Sound Coding Strategies*

In practice, a wide variation of outcomes is observed amongst implanted patients, which is probably linked to the duration of deafness, the age at implantation, the age at onset of deafness, the duration of implant use and the patient's social environment [4].

Some studies showed a superiority of an FC strategy over a CP strategy, particularly in noise [5]. Others, like Skinner et al. and Kiefer et al. [6,7], showed better speech recognition with the Advanced Combination Encoder (ACE) (a CP strategy) than with Continuous Interleaved Sampling (CIS) (an FC strategy) and Spectral Peak Picking (SPEAK) (a CP strategy) [6]. Brockmeier et al. [8] compared the musical activities and perception of cochlear implant users and concluded that CIS, SPEAK and ACE did not differ significantly. No clear advantage for a particular coding scheme has been identified yet.

The number of spectral channels required for speech recognition depends on the difficulty of the listening situation for both FC and CP strategies [9,10]. For FC strategies, all channels are transmitted to the corresponding electrodes, usually between 12 and 16, leading to a relatively large amount of spectral information that may be blurred by the current spread in the cochlea. When the stimulation rate is high, some results suggest that this rate may be beneficial to speech perception [11]. Another feature of the strategies lies in the non-overlapping (interleaved) pulse delivery; pulses are brief with a minimum delay between them and rapid variations in speech can be tracked [12].

#### *1.2. Influence of Noise*

The assessment of the performance of CI users in noise has become of great interest as it is considered to be representative of daily listening conditions.

In noise, the natural gaps in speech are filled and speech envelopes are distorted making speech recognition more difficult. The CP coding strategies may select noise-dominated channels, instead of the dominant speech channels, at low signal-to-noise ratios (SNRs) [13]. Unlike the CP strategies, FC strategies transmit the information of all available channels leaving the task of selecting the informational cues to the auditory system.

The presence of noise reduces the effective dynamic range for CI users by compressing the region of audibility into the upper position of the dynamic range [14]. Good speech perception in noise is a target in the management of deafness [15–18] and this aspect is also of great importance when CI coding strategies are concerned. Thus, tests in noise are more sensitive to changes in the fitting parameters and more ecological than tests in quiet conditions.

#### *1.3. Simulation with Normal-Hearing Subjects*

Considering the heterogeneity in a group of CI users, it is usually difficult to draw strong conclusions. Additionally, as the FC and CP strategies are fitted to each group of CI users, the heterogeneity of the populations is increased.

By contrast, a simulation study, which can be done with NH subjects, allows greater homogeneity among the participants. In this case, the same subject can face different situations [19], such as coding schemes and SNRs, allowing one to focus on different controllable features such as time and amplitude cues and to ensure efficient paired comparisons. However, the results observed with NH listeners cannot be directly extrapolated to CI users and many studies have been conducted on this subject. Dorman and colleagues extensively studied this matter [20,21] and stated that "performance of the NH listeners established a benchmark for how well implant recipients could perform if electrode arrays were able to reproduce, by means of the cochlea, the stimulation produced by auditory stimulation of the cochlea and if patients possessed neural structures capable of responding to the electrical stimulation" [22]. They also indicated that the best CI users achieved scores that were within the ranges of scores observed with NH subjects. In contrast, other authors point out the limitations of using vocoders to simulate electric hearing and the importance of conducting experiments with CI users [23].

Consequently, both approaches (with CI and NH subjects) seem necessary: with NH subjects, we can isolate the consequences of the coding strategies and, with CI users, we can evaluate the actual impact from a clinical point of view. In practice, for a given strategy, several fitting procedures are recommended by the manufacturers and each CI is fitted to the patient.

#### **2. Material & Methods**

#### *2.1. Participants*

The work presented in this paper follows a previous pilot study [24] and was approved by the French Ethics Committee "Sud-Est 2" (27 August 2014, ID-RCB: 2014-A00888-39), under the supervision of the HCL (Civil Hospitals of Lyon). The participants were recruited between November 2014 and April 2016. They were all informed, verbally and in writing, at least a week before entering the study and they filled out a consent form.

#### 2.1.1. Normal-Hearing Subjects

Twenty NH subjects participated in this experiment. Their age ranged from 18 to 33 years old, with an average of 25 years. They were recruited among the students of the Claude Bernard Lyon 1 University, through a recruitment notice sent via email. An otologic examination was performed before entering the study in order to exclude subjects with previous pathologies or deafness. All these subjects were considered to have normal hearing according to recommendations of the International Bureau for Audio-Phonology, as their auditory thresholds were below 20 dB HL for all frequencies between 250 and 8000 Hz.

#### 2.1.2. Cochlear Implant Subjects

Forty-five CI users were included in this study. Their ages ranged from 18 to 60 years old, with an average of 37 years. They were recruited from the general population of CI users who have their regular follow-up examination in our tertiary referral centre. Nineteen subjects were fitted with an FC strategy (Advanced Bionics® and Med-El®) and twenty-six with a CP strategy (Cochlear® and Neurelec®/Oticon Medical®); the CI population thus comprised two groups (one per coding scheme). Both unilaterally and bilaterally implanted users were included. For bilaterally implanted subjects, only one implant was tested: the one giving the best outcomes according to the patient. Demographic details are given in Appendix A.

#### *2.2. Stimuli*

The acoustic material consisted of Fournier's lists mixed with cocktail-party noise.

#### 2.2.1. Fournier's Disyllabic Lists

These lists are well suited to testing the participants. They were created by Jean-Etienne Fournier in 1951 and are approved by the French National College of Audiology (C.N.A.). Forty lists recorded with a male voice are available and each list consists of 10 two-syllable common French words (e.g., le bouchon = the cork), that is, 20 syllables per list. They are a French equivalent of the American spondee lists (e.g., baseball). The scoring step was one syllable (5%).

#### 2.2.2. Noise

In this study, we used cocktail-party noise: a mix of eight French-speaking voices, four male and four female. This kind of noise was sufficiently heterogeneous for the task and the masking remained fairly constant throughout a session.

#### *2.3. Hardware*

Stimuli were recorded on a CD (44.1 kHz sampling frequency, 16-bit quantization) and presented using a PHILIPS CD723 CD player connected to a Madsen orbiter 922® Clinical audiometer to control the general volume and the SNR. The sound was delivered in free field with two JBSYSTEMS ISX5 loudspeakers for CI users and with a TDH 39 headset for NH subjects. Devices used in our experiment are regularly calibrated and checked according to the NF EN ISO 8253 standard.

#### *2.4. Experimental Conditions and Procedures*

For the two groups of subjects, the experiment consisted of speech audiometry in noise with one syllable as the error unit. For a fixed speech level of 60 dB SPL, the maximum level delivered was below 65 dB SPL. In accordance with the conditions requested by the ethics committee, it did not exceed the 80 dB SPL limit recommended for occupational noise exposure.

#### 2.4.1. Normal-Hearing Subjects

Processed stimuli were delivered to only one ear, as in the experiment conducted with CI users. Furthermore, we chose to test the right ear, given the lateralization of sound processing and, in particular, that speech understanding appears to be associated with left-hemisphere activity [25,26].

For a fixed speech level of 60 dB SPL, five SNRs were tested for each sound-coding scheme [FC and CP (8 out of 20)]. The lowest SNR was −3 dB and the highest +9 dB, in 3 dB steps. At +9 dB SNR, the recognition percentage reached 100%. Each combination (coding scheme + SNR) was assigned to a Fournier's list so that no list was repeated. Each session started with a short training period to help the listener understand the instructions. Then the 10 noise and coding scheme conditions were presented to each subject in random order (1 list per condition: 5 SNRs × 2 coding schemes). The sessions lasted about 15 min (plus half an hour for the auditory check).

#### 2.4.2. Cochlear Implant Users

The procedure was slightly different for the CI users, as the task was more difficult for them than for the NH subjects. The speech level was fixed at 60 dB SPL. Most of the CI users did not reach a 100% recognition score; the percentage increased steadily with the SNR. The SNRs were presented from +18 dB to −3 dB in 3 dB steps. Only one strategy (that of the patient's own CI) could be tested per patient. Lists were presented in increasing order of difficulty (from +18 dB to −3 dB SNR) to avoid discouragement; this procedure was the same for both coding schemes.

CI users were tested at the beginning of their periodical clinical check-up and device setting, which occurs at the "CRIC" (Cochlear Implant Setting Centre) located in the ORL department of the Edouard-Herriot hospital. The patient follow-up consists of an appointment with a speech therapist, a setting of the implant parameters by an audiologist and a clinical examination by a physician.

The following tasks were carried out in our work:


### *2.5. Implant Simulation*

For the simulation of "CI like" speech processing, we used a vocoder implemented in Matlab® (MathWorks, Natick, MA) to simulate an FC and a CP coding strategy. We did not simulate channel interaction in this study.

A diagram representing the signal processing performed by the vocoder is shown in Figure 1. The different steps of the signal processing are as follows:


**Figure 1.** Block diagram representing the signal processing performed by the n-of-m simulator.


**Table 1.** Centre and cut-off frequencies of the vocoder coding.
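The processing chain described above can be illustrated with a minimal noise vocoder. The sketch below is written in Python rather than the authors' Matlab® implementation, and its parameters are assumptions for illustration only: logarithmically spaced channel edges (instead of the exact Table 1 frequencies), 4th-order band-pass filters, rectification plus a 160 Hz low-pass for envelope extraction and noise-band resynthesis.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def vocoder(signal, fs, n_channels=20, n_selected=None,
            f_lo=250.0, f_hi=8000.0, env_cutoff=160.0):
    """Minimal noise vocoder: band-pass analysis, envelope extraction,
    optional per-sample n-of-m channel selection, noise-band resynthesis."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)   # assumed log spacing
    env_sos = butter(2, env_cutoff, btype="low", fs=fs, output="sos")
    noise = np.random.randn(len(signal))

    envelopes, carriers = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(band_sos, signal)
        env = sosfiltfilt(env_sos, np.abs(band))       # rectify + low-pass
        envelopes.append(np.clip(env, 0.0, None))
        carriers.append(sosfiltfilt(band_sos, noise))  # band-limited noise carrier
    envelopes, carriers = np.array(envelopes), np.array(carriers)

    if n_selected is not None:
        # Channel-picking: at each instant keep only the n largest envelopes
        order = np.argsort(envelopes, axis=0)
        mask = np.zeros_like(envelopes)
        np.put_along_axis(mask, order[-n_selected:], 1.0, axis=0)
        envelopes = envelopes * mask

    out = np.sum(envelopes * carriers, axis=0)         # resynthesis
    return out / (np.max(np.abs(out)) + 1e-12)
```

Calling `vocoder(sig, fs)` would mimic an FC condition (all 20 channels transmitted), while `vocoder(sig, fs, n_selected=8)` would mimic the 8-of-20 CP condition.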

#### *2.6. Mathematical Analysis of the Data*

#### 2.6.1. Comparison of the Percentages

The score for each test was the number of correctly repeated syllables (20 syllables per condition) expressed as a percentage.

In the case of NH subjects, we used a two-way repeated-measures ANOVA (coding scheme × SNR). For CI users, we used a two-way mixed-model ANOVA (coding scheme × SNR) on the intelligibility scores. Because the groups were relatively small and the data were not normally distributed, all post-hoc analyses were performed with non-parametric tests: Mann–Whitney's test for unpaired data and Wilcoxon's test for paired data.
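Both post-hoc tests are standard calls in `scipy.stats`; the sketch below uses invented scores for illustration (the group sizes match the study, the values do not):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical paired scores: 20 NH subjects, FC vs CP at one SNR
fc_scores = rng.normal(70.0, 10.0, 20)
cp_scores = fc_scores - rng.normal(5.0, 8.0, 20)

# Paired data (same subjects, two schemes): Wilcoxon signed-rank test
w_stat, w_p = stats.wilcoxon(fc_scores, cp_scores)

# Unpaired data (two independent CI groups, 19 FC vs 26 CP): Mann-Whitney U
fc_group = rng.normal(60.0, 15.0, 19)
cp_group = rng.normal(55.0, 15.0, 26)
u_stat, u_p = stats.mannwhitneyu(fc_group, cp_group, alternative="two-sided")
print(f"Wilcoxon p = {w_p:.3f}; Mann-Whitney p = {u_p:.3f}")
```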

We also calculated Cohen's *d* as an effect size index for each average score tested [29]. Cohen's *d* is a quantitative measure of the magnitude of a phenomenon: a large absolute value indicates a strong effect. It is defined as the difference between two means divided by a pooled standard deviation of the data.
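This definition translates directly into code; the pooled-standard-deviation variant below is one common formulation (the exact variant used by the authors is not specified):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with the pooled standard deviation (one common variant)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Conventional reading: |d| around 0.2 is small, 0.5 medium, 0.8 large
```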

#### 2.6.2. Curve Fitting with a Sigmoid Function

The recognition percentages versus the SNR can be classically represented by a sigmoid curve regression (Figure 2).

Three parameters were considered on this curve:

- the maximum recognition plateau (*y*max);
- the slope of the curve;
- the *x*50%, that is, the SNR at which 50% of the plateau is reached.
These analytical values are represented on the sigmoid curve. The minimum recognition is 0% (measured for SNR = −3 dB). Thus, the sigmoid equation is

$$y = \frac{a}{1 + e^{-b(x-c)}}$$

where

- *a* is the maximum recognition plateau (*y*max);
- *b* determines the slope of the curve;
- *c* is the *x*50%, the SNR at which the score reaches 50% of the plateau.
**Figure 2.** Fitting of the recognition percentages by a sigmoid curve.
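Such a fit can be reproduced with a standard least-squares routine; the recognition percentages below are invented for illustration, not the study's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, a, b, c):
    """y = a / (1 + exp(-b (x - c))): a = plateau, b = slope factor, c = x50%."""
    return a / (1.0 + np.exp(-b * (x - c)))

# Hypothetical recognition percentages versus SNR (dB)
snr = np.array([-3.0, 0.0, 3.0, 6.0, 9.0])
scores = np.array([5.0, 35.0, 70.0, 90.0, 100.0])

(a, b, c), _ = curve_fit(sigmoid, snr, scores, p0=[100.0, 1.0, 2.0])
print(f"plateau = {a:.1f}%, slope factor = {b:.2f}, x50% = {c:.1f} dB")
```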

#### 2.6.3. Bonferroni Correction

We considered the Bonferroni correction as an indicator but did not adjust our probability (*p*) thresholds, because of the small number of comparisons and the exploratory orientation of this work [30]. The main objective was to look for clues to be investigated further in the future. Streiner et al. [31] "advise against correcting in these circumstances but with the warning that any positive results should be seen as hypothesis generating, not as definitive findings." Consequently, to avoid overcorrection, we used the Holm–Bonferroni method, which adjusts the rejection criterion of each individual comparison. The lowest *p*-value is evaluated first with a Bonferroni correction involving all tests; the second with a Bonferroni correction involving one less test, and so on for the remaining tests. Holm's approach is more powerful than the Bonferroni approach while still controlling the Type 1 error.
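The step-down procedure described above can be written directly (a generic sketch, not tied to any statistics package):

```python
import numpy as np

def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure: sort the p-values and compare them to
    alpha/m, alpha/(m-1), ...; stop at the first non-rejection."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break              # every remaining (larger) p-value is retained
    return reject

# Five comparisons, smallest p = 0.019 > 0.05/5 = 0.01: nothing survives
print(holm_bonferroni([0.019, 0.03, 0.04, 0.2, 0.5]))  # all False
```

The example mirrors the situation reported for the NH subjects, where the lowest *p*-value (0.019) is tested against a first corrected threshold of 1%.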

#### **3. Results**

#### *3.1. Normal-Hearing Subjects*

#### 3.1.1. Recognition Percentages

The results of syllable recognition versus the SNR are shown in Figure 3. Significant differences are indicated by an asterisk.

**Figure 3.** Syllable recognition function of the signal-to-noise ratio by NH subjects with the CI simulator using both strategies. Bars indicate the standard deviation. The asterisks indicate the significant differences (5% threshold).

#### 3.1.2. Statistical Analysis

The ANOVA showed a significant effect of the SNR [F(4,95) = 519; *p* < 10⁻⁴] and of the coding scheme [F(1,95) = 16; *p* < 10⁻⁴]; there was no significant interaction between them [F(4,95) = 1.95; *p* = 0.11]. Consequently, a post-hoc analysis was performed for the coding scheme.

For each SNR, comparisons were made with paired Wilcoxon's tests (on the 20 subjects who participated in the experiment). In the whole experiment, we had five paired series (one per SNR); each paired series comprised 20 pairs of values (one per subject). For the extreme SNR values (−3 dB and +9 dB), the recognition percentages were not significantly different between FC and CP (Table 2). *p*-values were below 5% for the SNRs 0 dB, +3 dB and +6 dB.

Using the Holm–Bonferroni correction, the first corrected decision threshold was 1% and the differences were no longer significant, since the lowest *p*-value was 0.019. For SNRs +3 and 0 dB, the differences were close to significance and worth further investigation; additionally, the Cohen's *d* effect sizes were strong (0.89) and medium (0.68), respectively. This is consistent with the overall ANOVA results.


**Table 2.** Percentage comparisons for normal-hearing subjects between the simulation strategies.

#### 3.1.3. Sigmoid Parameters

The comparison of the sigmoid parameters (Table 3) showed that the *x*50% values were different between FC and CP (*p* = 0.038). No differences were found for the slope and the plateau. Considering the Holm–Bonferroni correction, the first adjusted decision threshold was 1.7%. The effect size for *x*50% was strong (0.85).


**Table 3.** Comparison of the sigmoid parameters, for normal-hearing listeners.

#### *3.2. Cochlear Implant Users*

#### 3.2.1. Recognition Percentages

CI users were gathered into two groups according to their coding scheme (FC and CP); the percentages are shown in Figure 4.

**Figure 4.** Syllable recognition function of the signal-to-noise ratio by CI users. Bars indicate the standard deviation. The asterisks indicate the significant differences (5% threshold).

#### 3.2.2. Statistical Analysis

The ANOVA indicated a significant effect of the SNR [F(1,301) = 146; *p* < 10⁻⁴] but not of the coding scheme [F(1,43) = 0.66; *p* = 0.42]. A significant interaction was seen between them [F(1,301) = 2.23; *p* = 0.032], which may need further investigation.

The recognition percentages are indicated in Figure 4 and Table 4. We can see that the plateau was not reached for high SNRs and an inversion of the performances may be noticed between CP and FC at +15 dB.


**Table 4.** Percentage comparisons for cochlear implant users between the coding strategies.

#### 3.2.3. Sigmoid Parameters

Gathering the four implant brands according to their coding schemes (FC and CP), Mann–Whitney's tests indicated a significant difference only for *x*50% (*p* = 0.042) (Table 5). After the Holm–Bonferroni correction, this difference needs to be discussed. The effect size was medium (0.73).


**Table 5.** Comparison of the analytical values, for cochlear implant users.

We also looked for a possible link between *x*50% and *y*max (Figure 5). The scatter plot indicates that all the situations can be observed with every implant. No correlation was seen for any manufacturer (pMed-El = 0.62, pAdvanced Bionics = 0.47, pCochlear = 0.055, pNeurelec = 0.55).

**Figure 5.** Speech recognition plateau versus the *x*50% parameter for cochlear implant users.

#### **4. Discussion**

Several items have been considered in this study: the coding scheme, the influence of noise and the simulation of CI coding in NH subjects.

#### *4.1. On the Coding Strategy*

The choice of a coding strategy is a delicate matter and some studies have shown that CI users have a subjective preference for a particular strategy that is not always the one that yields the best performances [32].

Additionally, many technical parameters concerning the coding scheme, such as stimulation rate, gain control, update rate and filter settings, influence the final results and have an effect on the performances related to the coding strategy [28,33].

It is interesting to note that the four manufacturers have taken different stimulation strategies and the results are dispersed; it is therefore difficult to draw definitive conclusions. For any manufacturer, all coding strategies can be implemented within the processor.

From our results with NH listeners, the Bonferroni correction lowered the significance limit but the effect-size analysis indicated that the differences were meaningful. The FC strategy yielded better recognition percentages than CP, particularly in the 0 to +6 dB SNR range, with medium effect sizes at 0 and +6 dB and a strong one at +3 dB. Moreover, the comparison of the *x*50% values highlighted a strong effect size in favour of the FC strategy.

Of course, our results only stand for a CP strategy extracting 8 channels out of 20 and an FC strategy with 20 channels. However, this seems to be an interesting hint: the conditions were identical for both strategies (same random approach, same SNRs, same signal processing in terms of window length, sampling rate, channel band-pass, quantization, etc.), the subjects were of the same type (range of age, education, etc.) and we were able to use paired comparisons. Nevertheless, this simulates an "ideal case" with no channel interaction, no pitch shift due to the insertion depth of the electrode array and all channels functional. This is why these results should not be taken on their own, without taking into account the experiment conducted with the CI subjects.

With the CI subjects, considering the non-saturation of the percentages, we raised the SNR up to +18 dB; FC led to higher, albeit not significantly higher, scores than CP in the 0–12 dB range. A small inversion of the results was observed above +12 dB SNR, which can be linked to the significant interaction between SNR and coding scheme. Nevertheless, these results are to be taken with caution, considering the wide dispersion of the data and the comparison of relatively small, unbalanced groups. Because of the different numbers of patients in each group, we used non-parametric tests to compare them; they are well adapted to this kind of comparison: while generally less powerful than parametric tests, they are more robust. However, as in the experiment with NH subjects, the comparison of the *x*50% values showed an interesting difference in favour of the FC strategies. With a medium effect size, this result would be worth investigating further in another study.

Our results are consistent with studies that showed a superiority of the FC strategy over the CP strategy, particularly in noise [5]. This was less true once CP strategies with a high stimulation rate, such as ACE, were introduced [6,7]. Taking the literature as a whole, no clear advantage for a particular coding scheme can be identified.

In many studies, when the FC strategy was used, the stimulation rate was an important factor as the possibility to follow the quick changes in the signal helps the recognition performances mostly for consonants [34,35].

#### *4.2. Cochlear Implant Users and Normal-Hearing Subjects*

Despite the fact that the groups of CI users were heterogeneous, the general recognition behaviour was the same for CI users and NH subjects, whatever implant was used. With NH listeners, for a SNR of +9 dB, the 100% recognition level was reached.

With CI users, the plateau was not always reached with a SNR of +18 dB; additionally, it was below the 100% measured with NH subjects. For a +9 dB SNR (maximum tested in simulation), the CI users' performances were below the scores observed with NH subjects; the mean scores with CI users ranged from 50 to 75%; this is consistent with previous studies [36].

The same pattern was seen with the *x*50% (sigmoid fitting), which was better in NH subjects than in CI users.

With the CI users, an inversion occurred between +12 and +15 dB SNR and it was also seen for +18 dB SNR; performances observed with a CP strategy were higher than the performances with an FC strategy.

The reliability of the data obtained from CI users is a real issue. Is there a link between the plateau and the *x*50%? The scatter point diagram of the four CI user populations is shown in Figure 5. It shows that, for each manufacturer, all possibilities exist, either with a good plateau and a poor *x*50% or vice versa. All intermediate situations were found and the correlation coefficients were not significantly different from zero.

As the ultimate goal is to provide every CI user with the opportunity to hear in everyday life [28], the work ahead is important. The efficiency of a CI is affected by many factors, such as recognition and linguistic skills, the degree of nerve survival and the technical choices made when fitting the device, and these vary widely from one subject to another.

#### *4.3. Listening in Noise*

Listening in noise is a clear challenge, which is not handled in the same way by CI users and NH people. Noise flattens the spectrum and the subsequent structures in the auditory system do not react identically [37]. The study of speech recognition in noise has become of great interest as it is present in daily listening conditions. Additionally, we can see the coding behaviour for different SNRs (floor and ceiling effect and intermediate situation).

In this study, the CI user group was older on average than the NH group. In general, older people have lower speech perception scores in noise, even with normal or age-related hearing, compared to young people. However, the purpose of the study was to test both groups and see if a similar trend (between FC and CP coding schemes) could emerge and not to compare CI users with NH subjects.

Another finding was the effect of noise on performances for each strategy, which makes this study interesting in that each manufacturer can set any coding scheme on their devices. Consequently, it is worthwhile to evaluate the results through different approaches. Our work may suggest that the strategy is noise-dependent.

The number of channels needed to understand speech in noise or in quiet is an important issue. Studies have indicated that, when speech is processed in the same manner as in CIs and presented in quiet to NH listeners, sentence recognition scores are higher than 90% with as few as 4 channels [38,39]. In the literature, results show that more channels are needed to understand speech in noise than in quiet [10] but selecting more than 12 channels may not yield significant improvements on the recognition performances [21]. These considerations orientated the choice of our parameters.

In noise, the performances of CI users reach a plateau as the number of channels increases, whereas for NH subjects performances continue to increase (up to 100%), suggesting that CI subjects cannot fully utilize the spectral information provided by the number of electrodes, possibly because of channel interaction [38]. As indicated above, trends are similar for NH and CI listeners but results are not interchangeable. It is sensible to say that more channels imply more information but they also imply more overlap between the electrodes. This conflict needs to be studied in the future; channel interaction can be simulated with NH subjects.

The acoustical material (in our case the Fournier's lists and the cocktail party noise) seemed to be well adapted to the situation.

#### **5. Conclusions**

A simulation study with NH listeners measured syllable recognition in a noisy environment, using both Fixed-Channel and Channel-Picking coding schemes. The results were also compared with CI users' performances. CI users were divided into two groups corresponding to the available coding schemes. Twenty NH subjects and 45 CI users participated in this experiment. The acoustic material was the Fournier French disyllabic lists mixed with cocktail-party noise.

The results obtained in the simulation with the NH subjects indicated an advantage of the fixed-channel strategy over the channel-picking coding in a middle SNR range (from 0 to +6 dB); parameters (patients, technology and protocol) were well controlled in this approach. This trend was confirmed using the sigmoid curve regression. The results seemed similar with the CI users.

Nevertheless, results were less reliable with CI users, probably owing to the wide dispersion of the patients' results. Additionally, an inversion between the coding strategies was seen with CI users at high SNRs. This aspect should be examined in the future, considering its practical application, and the physiological and electrical phenomena involved in multichannel stimulation, such as channel interaction, need to be considered. Simulation and tests with CI users are useful as they give two complementary insights into the difficult task of determining an "optimal" sound coding strategy to enhance the auditory performance of CI users.

**Author Contributions:** Conceptualization, C.B.-V., E.T. and S.G.; formal analysis, P.A.C. and S.G.; investigation, P.A.C.; methodology, P.A.C., C.B.-V. and S.G.; resources, E.T.; software, P.A.C. and F.M.; supervision, C.B.-V., E.T. and S.G.; visualization, P.A.C. and C.B.-V.; writing—original draft preparation, P.A.C. and C.B.-V.; writing—review and editing, P.A.C., C.B.-V., F.M., R.H., E.T. and S.G.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to thank the people and institutions who participated in this study: Kevin Perreault, who initiated the work; Charles-Alexandre Joly and Fabien Seldran for their scientific contribution; Evelyne Veuillet for contacts with the ethics committee; the members of the CRIC team of the Edouard Herriot University hospital of Lyon for their collaboration; the normal-hearing subjects and the cochlear implant users who entered the study; and the Hospitals of Lyon and the Polytechnic School of Lyon for their administrative support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

| **Characteristic** | **N** |
|---|---|
| **Gender** | |
| Male | 23 |
| Female | 22 |
| **Ear** | |
| Right | 32 |
| Left | 13 |
| **Origin of deafness** | |
| Congenital | 17 |
| Acquired | 18 |
| Unknown | 10 |
| **Age in years at implantation** | |
| 1–5 years | 9 |
| 6–10 years | 3 |
| 11–20 years | 6 |
| >20 years | 27 |
| **Duration in years of implant use** | |
| 1–5 years | 14 |
| 6–10 years | 14 |
| 11–15 years | 7 |
| 16–20 years | 9 |
| >20 years | 1 |
| **Duration of deafness in years** | |
| 1–10 years | 4 |
| 11–20 years | 18 |
| 21–30 years | 4 |
| 31–40 years | 8 |
| >40 years | 5 |
| Unknown | 7 |
| **Cochlear implant** | |
| Cochlear | 13 |
| Med-El | 12 |
| Advanced Bionics | 7 |
| Neurelec/Oticon Medical | 13 |
| **Coding strategy** | |
| Channel-picking (SPEAK, ACE, …) | 26 |
| Fixed-channel (FS4, HiRes, …) | 19 |


#### **References**

1. NIDCD. Available online: https://www.nidcd.nih.gov/ (accessed on 14 June 2017).


2. Fetterman, B.L.; Domico, E.H. Speech recognition in background noise of cochlear implant patients. *Otolaryngol. Head Neck Surg.* **2002**, *126*, 257–263. [CrossRef] [PubMed]

Channel-picking (SPEAK, ACE . . . ) 26 Fixed-channel (FS4, HiRes . . . ) 19


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **In-Depth Exploration of Signal Self-Cancellation Phenomenon to Achieve DOA Estimation of Underwater Acoustic Sources**

#### **Fang Wang 1,\*, Yong Chen <sup>1</sup> and Jianwei Wan <sup>2</sup>**


Received: 9 December 2018; Accepted: 6 February 2019; Published: 8 February 2019

**Abstract:** In the ocean environment, the minimum variance distortionless response beamformer usually suffers from signal self-cancellation, that is, the acoustic signal of interest is erroneously suppressed as interference. By exploring the useful information behind the signal self-cancellation phenomenon, a high-precision direction estimation method for underwater acoustic sources is proposed. First, a pseudo spatial power spectrum is obtained by performing unit circle mapping on the beam response over the direction interval. Second, an online calculation process is given to reduce the computational complexity. Computer simulation results show that the proposed algorithm can obtain satisfactory direction estimation accuracy under conditions of low acoustic source intensity, strong interference and noise, and few array snapshots.

**Keywords:** minimum variance distortionless response; signal self-cancellation; direction estimation; underwater acoustic source; spatial power spectrum

#### **1. Introduction**

Underwater acoustic source localization determines the altitude or depth, range, and bearing angle of the underwater target, that is, the three coordinates of the underwater target in the elliptical coordinate system [1]. The estimation of bearing angle (or direction of arrival (DOA)) of underwater acoustic source is an important and indispensable step in underwater acoustic source localization. In fact, in some underwater acoustic source localization methods, it is the target direction information that is used to estimate the target distance [2]. Using a vector hydrophone (or vector hydrophone array) is perhaps one of the simplest and most straightforward methods of underwater acoustic source DOA estimation. The vector hydrophone is capable of simultaneously measuring sound pressure and particle velocity along one to three orthogonal directions. Therefore, only a single vector hydrophone can generate a directional beam pattern. A Directional Autonomous Seafloor Acoustic Recorder (DASAR) system consisting of several vector hydrophones has been reported [3]. By using a vector hydrophone, directional industrial noise is effectively suppressed, and weak marine mammal sounds can be successfully detected.

Conventional beamforming (or delay-and-sum beamforming) is also one of the common DOA estimation methods [4–6]. Based on the ray-path approximation for the sound channel's impulse response, the frequency-difference beamforming method for the sparse hydrophone array is proposed in [7,8], which can estimate the signal phase difference by using the conventional delay-and-sum beamforming of the field product at the difference frequency. In contrast, by determining the array weight coefficients in a nonlinear manner, the minimum variance distortionless response (MVDR) method can achieve higher angular resolution than conventional beamforming [4]. In order to cope with model mismatch problems in the uncertain ocean environment, such as signal look direction mismatch and signal spatial signature mismatch due to local scattering or wavefront distortion [9,10], many robust MVDR-based adaptive beamforming algorithms have been proposed [10–12].

Similar to the MVDR-based methods, the subspace-based high-resolution DOA estimation techniques also use information carried by the covariance matrix. The most representative subspace-based DOA estimation methods may be the Multiple Signal Classification (MUSIC) algorithm [13], the Estimation of Signal Parameters via Rotational Invariance Technique (ESPRIT) [14], and the Propagator Method (PM) [15]. The key to the subspace-based DOA estimation methods is the estimation of the signal subspace (or noise subspace). To achieve this purpose, one can first perform eigendecomposition on the sample covariance matrix, then construct the signal subspace with the eigenvectors corresponding to the larger eigenvalues, and form the noise subspace with the eigenvectors corresponding to the smaller eigenvalues. Another more sophisticated approach is to reconstruct the sample covariance matrix according to the Toeplitz structure of the covariance matrix [16], and then obtain the signal subspace in a similar way as above. In contrast to the above eigenvalue-based methods, the eigenvector pruning algorithm implements the estimation of the signal subspace by using the statistical properties of eigenvectors of signal-free sample covariance matrix [17].

In many underwater acoustic source DOA estimation scenarios, the number of acoustic sources distributed in the underwater far field is much smaller than the number of hydrophones in the observation array. This is the intrinsic basis of the popular sparse-based DOA estimation methods. In the least absolute shrinkage and selection operator (LASSO) method [18], the signal amplitude vector is obtained by solving an *l*1-norm regularized least-squares problem. The LASSO method contains the *l*1-norm constraint on the solution vector, thus making the result of the solution vector sparse [19]. By linear transformation of the solution vector, the weighted LASSO method [20] imposes a certain structural constraint on the solution vector to achieve efficient processing of spatially extended sources (e.g., underwater embedded objects in acoustic imaging [21]). The total variation norm regularization method [22] for DOA estimation of spatially extended sources can be seen as a special case of the weighted LASSO method, which uses the band matrix to realize the linear transformation of the solution vector, so that the solution vector has block sparsity [19,23]. Besides, using the information contained in the covariance matrix, the sparse spectrum fitting (SpSF) algorithm [24] first performs a vectorization operation on the covariance matrix, and then fits the estimated covariance matrix and the ideal covariance matrix under the *l*2-norm. At the same time, considering the sparsity of the source, *l*1-norm penalization is imposed on the source strength vector. However, the SpSF algorithm is based on the assumption that the ambient noise is white Gaussian noise. Therefore, Yang L., Yang Y. X., and Wang Y. proposed the directional noise field sparse spectrum fitting (DN-SpSF) algorithm [25], which uses the slowly varying characteristics of the noise spectral density function to derive the general expression of the covariance matrix of underwater directional ambient noise, and takes an optimization process similar to the SpSF algorithm.

The naturally occurring ambient noise in the ocean is generally considered to be a nuisance [1,26]. Therefore, one of the purposes of sonar signal processing algorithms is to distinguish the desired signal from the ambient noise and suppress the ambient noise as much as possible. However, recent studies have shown that ocean ambient noise actually contains a lot of useful information [27,28], and can be used for underwater imaging [26], geoacoustic inversion [29,30], and the determination of seabed sub-bottom layer profiles [31–33]. A similar situation is that when the presumed steering vector is mismatched with the actual steering vector, MVDR-based beamformers exhibit a so-called signal self-cancellation phenomenon, that is, the signal of interest is erroneously treated as interference and thereby greatly suppressed. Therefore, signal self-cancellation is commonly regarded as a nuisance, and the existing related algorithms are intended to reduce its effect. To the best of our knowledge, there is currently no research on how to use the information contained in the signal self-cancellation phenomenon. Therefore, this paper does not treat signal self-cancellation as a troublesome thing, but explores the potential information behind the phenomenon and uses it to achieve high-precision DOA estimation of underwater acoustic sources.

The main contributions of this paper are: (1) The signal self-cancellation problem of MVDR-based beamformers is treated from a new perspective: although signal self-cancellation is a nuisance for MVDR-based beamformers, it also contains favorable information that can be used for DOA estimation of underwater acoustic sources. (2) A novel unit circle mapping method is proposed, which effectively correlates the signal self-cancellation and the beam response curves by uniformly mapping all beam response sample values in the direction interval onto a unit circle. (3) The DOA estimation performance of the proposed method is analyzed in an underwater acoustic propagation simulation environment, and performance comparisons with existing DOA estimation methods are also presented.

#### **2. Signal Self-Cancellation of MVDR Beamformer**

Assume that the sensor array is a horizontal linear array composed of *M* omnidirectional hydrophones. In addition, suppose that there are *N* far-field narrowband underwater acoustic signals impinging on the hydrophone array, and their directions of arrival are equal to *θ*1, *θ*2, ..., *θN*. Let **x**(*k*) be the array snapshot vector at time *k*, then it can be expressed as

$$\mathbf{x}(k) = [\mathbf{a}(\theta\_1), \mathbf{a}(\theta\_2), \dots, \mathbf{a}(\theta\_N)] \begin{bmatrix} s\_1(k) \\ s\_2(k) \\ \vdots \\ s\_N(k) \end{bmatrix} + \mathbf{n}(k), \tag{1}$$

where *s<sub>i</sub>*(*k*), *i* = 1, 2, ··· , *N* represents the amplitude of the *i*th received underwater acoustic signal at time *k*, and **n**(*k*) denotes the array noise vector at time *k*, whose *j*th element corresponds to the recorded noise of the *j*th hydrophone, *j* = 1, 2, ··· , *M*. Besides, **a**(*θ*) is the array manifold (or steering vector) towards direction *θ*, which can be formulated as

$$\mathbf{a}(\theta) = \left[1, e^{j(2\pi/\lambda)d\sin(\theta)}, \dots, e^{j(M-1)(2\pi/\lambda)d\sin(\theta)}\right]^{T}, \tag{2}$$

where the superscript T represents the transposition operation, *λ* denotes the signal wavelength, and *d* is the distance between adjacent hydrophones.
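As a concrete illustration, the ideal steering vector of Equation (2) can be sketched in a few lines of NumPy (the function name and the half-wavelength default spacing are our own illustrative choices, not from the paper):

```python
import numpy as np

def steering_vector(theta_deg, M, d_over_lambda=0.5):
    """Ideal steering vector of Equation (2): element m carries the phase
    m * (2*pi/lambda) * d * sin(theta)."""
    m = np.arange(M)
    return np.exp(1j * 2 * np.pi * d_over_lambda * m * np.sin(np.deg2rad(theta_deg)))
```

At broadside (*θ* = 0) every element has zero phase, and for any direction all elements have unit modulus, as the model requires.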

A beamformer can preserve the signal impinging from a specific direction while suppressing signals from other directions, that is, it has a spatial filtering capability. The MVDR beamformer achieves this objective by solving the following convex optimization problem,

$$\min\_{\mathbf{w}} \mathbf{w}^{H} \mathbf{R}\_{\mathbf{i}+\mathbf{n}} \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{H} \mathbf{a}(\theta\_{d}) = 1,\tag{3}$$

where *θ<sup>d</sup>* is the specified direction (i.e., the direction of the desired signal), **w** denotes the weight vector of MVDR beamformer, and **Ri**+**<sup>n</sup>** represents the interference-plus-noise covariance matrix. In Equation (3), the objective function is equal to the power of the interference and noise passing through the beamformer, and the constraint guarantees that the gain of the signal in the specified direction is 1.

It should be pointed out that in some practical applications, such as passive sonar detection, the received array snapshot data contain interference, noise, and the desired signal. Therefore, in this case, it is difficult to directly obtain a covariance matrix for the interference and noise only. A simple solution is to replace the interference-plus-noise covariance matrix directly with the sample covariance matrix. The optimization problem of the MVDR beamformer then needs to be re-expressed as

$$\min\_{\mathbf{w}} \mathbf{w}^H \hat{\mathbf{R}} \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^H \mathbf{a}(\theta\_d) = 1,\tag{4}$$

where **R̂** represents the sample covariance matrix, which can be calculated directly from several array snapshots,

$$\hat{\mathbf{R}} = \frac{1}{K} \sum\_{k=1}^{K} \mathbf{x}(k)\mathbf{x}^{H}(k),\tag{5}$$

where *K* is the number of array snapshots that can be used to calculate the sample covariance matrix. In Equation (4), the objective function is equal to the total power of the desired signal, interference, and noise passing through the beamformer. Please note that when the array snapshot number *K* is large enough, the sample covariance matrix is approximately equal to the theoretical covariance matrix (i.e., **R** = E{**x**(*k*)**x**<sup>*H*</sup>(*k*)}). Furthermore, when the specified direction is exactly equal to the DOA of the desired signal, it is easy to prove that the optimization problems in Equations (3) and (4) are equivalent.
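Equation (5) amounts to one matrix product per batch of snapshots. A minimal NumPy sketch (the column-per-snapshot layout is our assumption):

```python
import numpy as np

def sample_covariance(X):
    """Equation (5): sample covariance estimate, where X is the M x K
    snapshot matrix whose k-th column is the snapshot x(k)."""
    M, K = X.shape
    return X @ X.conj().T / K
```

For *K* ≥ *M* independent snapshots, the resulting estimate is Hermitian and (almost surely) positive definite, so its inverse in Equation (4) exists.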

However, in practical applications such as underwater acoustic source localization, even if the number of available array snapshots is sufficient, there are still many unfavorable factors that significantly degrade the performance of the MVDR beamformer. For example, the specified direction rarely matches the DOA of the desired signal exactly, which leads to a direction error. In addition, when the acoustic signal propagates a long distance in the inhomogeneous ocean medium, the wavefront of the acoustic wave will no longer be a theoretical plane wave, and a so-called random wavefront fluctuation occurs. Other negative factors include errors in the mounting positions of the hydrophones and errors in the amplitude and phase gain of the hydrophones. Thus, the given steering vector is equal to the sum of the true steering vector and the steering vector error, that is,

$$\overline{\mathbf{a}} = \mathbf{a} + \mathbf{a}\_{\mathbf{e}}, \tag{6}$$

where **a¯** is the given steering vector with respect to the desired signal, **a** represents the actual steering vector of the desired signal, and **a<sub>e</sub>** denotes the steering vector error caused by the aforementioned unfavorable factors. From Equations (4) and (6), it can be found that when the steering vector error is not equal to zero, the constraint in Equation (4) becomes **w**<sup>*H*</sup>**a¯** = 1, which means that the MVDR beamformer will retain a certain signal corresponding to the given steering vector **a¯**, and not the desired signal. Even worse, the minimization of the objective function in Equation (4) will result in the power of the desired signal being greatly reduced as it passes through the MVDR beamformer. This phenomenon, in which the desired signal is cancelled, is often referred to as the signal self-cancellation of the MVDR beamformer.
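The effect described above is easy to reproduce numerically. The sketch below uses our own toy setting (a 10-element half-wavelength array, one 20 dB source at 30 degrees, a presumed look direction of 32 degrees, and the well-known closed-form MVDR weights **w** = **R̂**<sup>−1</sup>**a¯**/(**a¯**<sup>*H*</sup>**R̂**<sup>−1</sup>**a¯**)): the beamformer keeps unit gain towards the presumed steering vector while almost completely cancelling the actual signal.

```python
import numpy as np

# Toy reproduction of signal self-cancellation (all numbers are our own
# illustrative choices): 10-element half-wavelength ULA, one 20 dB source
# at 30 degrees, presumed look direction of 32 degrees.
M = 10
m = np.arange(M)
steer = lambda deg: np.exp(1j * np.pi * m * np.sin(np.deg2rad(deg)))

a_true = steer(30.0)   # actual steering vector of the desired signal
a_bar = steer(32.0)    # presumed (erroneous) steering vector

# theoretical covariance: 20 dB source plus unit-power white noise
R = 100.0 * np.outer(a_true, a_true.conj()) + np.eye(M)

# closed-form MVDR weights: w = R^-1 a / (a^H R^-1 a)
Ria = np.linalg.solve(R, a_bar)
w = Ria / (a_bar.conj() @ Ria)

gain_presumed = abs(w.conj() @ a_bar)   # exactly 1 by the constraint
gain_actual = abs(w.conj() @ a_true)    # far below 1: the signal is cancelled
```

A look-direction error of only 2 degrees is enough here for the desired signal to be suppressed by more than an order of magnitude.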

#### **3. SSC-MVDR Algorithm for DOA Estimation**

According to the analysis in the previous section, we already know that in the ideal case, that is, when there is no steering vector error, the steering vector model of the linear array can be expressed by Equation (2). Through further analysis, we can also find that the direction error, which is one of the unfavorable factors in practical applications, only changes the DOA of the desired signal without changing the steering vector model. However, other unfavorable factors, including random wavefront fluctuations, hydrophone position errors, and hydrophone amplitude phase errors, will affect the representation of the steering vector model.

In this paper, it is assumed that the MVDR beamformer has a certain degree of direction error, and the expression of the steering vector model is known. Specifically, the actual steering vector has the following form,

$$\overline{\mathbf{a}}(\theta) = \left[1, \alpha\_1 e^{j((2\pi/\lambda)d\sin(\theta) + \varphi\_1)}, \dots, \alpha\_{M-1} e^{j((M-1)(2\pi/\lambda)d\sin(\theta) + \varphi\_{M-1})}\right]^{T}, \tag{7}$$

where *αi*, *i* = 1, 2, ··· , *M* − 1 and *φi*, *i* = 1, 2, ··· , *M* − 1 are known constants, the former representing the amplitude deviation of the steering vector and the latter representing the phase deviation of the steering vector.

To propose the SSC-MVDR algorithm for DOA estimation, the following two cases are specifically analyzed. First, when the direction error is not equal to zero, the MVDR beamformer cancels the desired signal, that is, the beam response produces a sharp null at the desired signal; second, when the direction error is exactly equal to zero, the beam response of the MVDR beamformer produces a main lobe of a certain width around the desired signal. Furthermore, we also assume that the approximate interval of the DOA of the desired signal is known (in fact, a coarse estimate of the DOA of the desired signal can be obtained by a conventional beamformer; even in a complex, uncertain ocean environment, the conventional beamformer still exhibits sufficient robustness). If the direction interval of the desired signal is sufficiently narrow, we will find that in the above two cases, the beam response in the direction interval takes two different shapes. Specifically, in the first case, the beam response in the direction interval is a curve containing a steep null; in the second case, the beam response in the direction interval yields a relatively flat curve.

Inspired by the above analysis, we first define the direction interval of the desired signal as Θ, and discretely sample the direction interval to get *L* direction samples, that is, *ϑ<sup>i</sup>* ∈ Θ, *i* = 1, 2, ··· , *L*. Then, the beam response of the MVDR beamformer on the above *L* direction samples is calculated.

$$B(\vartheta\_i) = \left|\frac{\overline{\mathbf{a}}^H(\theta\_d)\hat{\mathbf{R}}^{-1}\overline{\mathbf{a}}(\vartheta\_i)}{\overline{\mathbf{a}}^H(\theta\_d)\hat{\mathbf{R}}^{-1}\overline{\mathbf{a}}(\theta\_d)}\right|, \quad i = 1, 2, \cdots, L,\tag{8}$$

where *B*(*ϑi*) is the beam response of the MVDR beamformer on the direction *ϑi*.

Although the shape of the beam response curve is intuitively distinguishable, how to quickly distinguish the shape of the beam response curve in the direction interval by calculation is a major problem faced by this algorithm. In this paper, we present a unit circle mapping method, whose main idea is to uniformly map all beam response sample values in the direction interval to a unit circle,

$$B\_m(\vartheta\_i) = \left|\frac{\overline{\mathbf{a}}^H(\theta\_d)\hat{\mathbf{R}}^{-1}\overline{\mathbf{a}}(\vartheta\_i)}{\overline{\mathbf{a}}^H(\theta\_d)\hat{\mathbf{R}}^{-1}\overline{\mathbf{a}}(\theta\_d)}\right| e^{j(2\pi/L)i}, \quad i = 1, 2, \cdots, L,\tag{9}$$

where *B<sub>m</sub>*(*ϑ<sub>i</sub>*) represents the unit circle mapped value of the beam response in the direction *ϑ<sub>i</sub>*. The essence of the unit circle mapping method is to convert a series of scalars into directional vectors. Specifically, the amplitude of the *i*th vector is equal to the *i*th beam response sample value, and the phase of the *i*th vector is equal to (2*π*/*L*)*i*. If all beam response sample values are equal to 1, the converted vectors lie exactly on the unit circle, which is why the method is called the unit circle mapping method. Next, all the unit circle mapped values are summed. In the summation process, the beam response curves of the two different shapes correspond to significantly different results. Specifically, for a beam response curve segment containing a steep null, the magnitude of the summation result is related to the depth of the null: the deeper the null, the greater the magnitude of the summation result. The phase of the summation result differs from the phase of the null on the unit circle by approximately 180 degrees. For the relatively flat beam response curve segment, the unit circle mapped values almost cancel each other during the summation, so that the summation result is approximately equal to zero. Therefore, we define the pseudo spatial power spectrum as follows,

$$P\_{\rm SSC-MVDR}(\theta) = 1\Big/\left|\sum\_{i=1}^{L} \left|\frac{\overline{\mathbf{a}}^{H}(\theta)\hat{\mathbf{R}}^{-1}\overline{\mathbf{a}}(\vartheta\_{i})}{\overline{\mathbf{a}}^{H}(\theta)\hat{\mathbf{R}}^{-1}\overline{\mathbf{a}}(\theta)}\right| e^{j(2\pi/L)i}\right|, \quad \theta \in \Theta. \tag{10}$$

It should be noted that the pseudo spatial power spectrum *P*SSC−MVDR(*θ*) is only defined in the direction interval Θ. When the direction *θ* is exactly equal to the DOA of the desired signal, the pseudo spatial power spectrum is expected to achieve a maximum. In practical applications, to achieve the desired estimation performance of the SSC-MVDR algorithm, a reasonable direction interval should be selected. Generally, the center of the direction interval can be set as the coarse estimate of the DOA of the acoustic source, and the width of the direction interval should be set to a suitable value to ensure that the true DOA of the acoustic source always falls within the direction interval, and meanwhile, the beam response segment in the direction interval contains only the null generated by the signal self-cancellation, and does not contain other unrelated nulls. Therefore, the direction interval width is not only related to the error of the coarse estimation of the DOA, but also to the specific position of the nulls in the beam response.
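Putting Equations (8)–(10) together, the pseudo spatial power spectrum can be sketched compactly in NumPy. The ideal steering model, the single 20 dB source, and the theoretical covariance matrix below are our own simplifying assumptions, not the paper's simulation setup:

```python
import numpy as np

def steer(theta_deg, M=10):
    # ideal half-wavelength ULA steering vector, cf. Equation (2)
    m = np.arange(M)
    return np.exp(1j * np.pi * m * np.sin(np.deg2rad(theta_deg)))

def ssc_mvdr_spectrum(R_inv, thetas):
    """Pseudo spatial power spectrum of Equation (10), evaluated on the
    direction samples `thetas` (degrees) of the interval Theta."""
    L = len(thetas)
    A = np.stack([steer(t) for t in thetas], axis=1)           # M x L
    circle = np.exp(1j * 2 * np.pi * np.arange(1, L + 1) / L)  # unit circle phases
    P = np.empty(L)
    for n, theta in enumerate(thetas):
        a_d = steer(theta)
        # beam response of Equation (8) over the whole interval
        B = np.abs(a_d.conj() @ R_inv @ A) / np.abs(a_d.conj() @ R_inv @ a_d)
        # unit circle mapping (Equation (9)) and summation (Equation (10))
        P[n] = 1.0 / np.abs(np.sum(B * circle))
    return P

# toy scene: one 20 dB source at 30 degrees in white noise,
# theoretical covariance matrix
a_true = steer(30.0)
R = 100.0 * np.outer(a_true, a_true.conj()) + np.eye(10)
thetas = np.arange(25.0, 35.5, 0.5)   # direction interval of width 10 degrees
P = ssc_mvdr_spectrum(np.linalg.inv(R), thetas)
```

On this toy scene the spectrum peaks near the true DOA of 30 degrees, because only there does the beam response stay flat over the interval; every mismatched candidate produces a deep self-cancellation null and hence a large unit-circle sum.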

The implementation principle and detailed steps of the SSC-MVDR algorithm are shown in Figure 1. First, the sample covariance matrix is calculated from the array snapshots, then the beam response of the MVDR beamformer is calculated, the beam response over the direction interval is mapped onto the unit circle, and finally the pseudo spatial power spectrum is calculated. Meanwhile, Figure 1 also gives examples of the beam response of the MVDR beamformer and of the unit circle mapping of the beam response curve segment over the direction interval. Specifically, Figure 1a,b show the results when there is no signal self-cancellation. In Figure 1a, the direction interval is indicated by two dashed lines, and the beam response curve segment in this direction interval is relatively flat. Therefore, in Figure 1b, all the mapping vector amplitudes are approximately equal. Moreover, since the phases of the mapping vectors are uniformly distributed in the range of 0 to 360 degrees, the magnitude of the sum of the mapping vectors will be very small, and the pseudo spatial power spectrum for this case will be very large, as shown by the higher red dashed line in Figure 1e. Figure 1c,d correspond to the case where there is signal self-cancellation. In Figure 1c, the direction interval is also indicated by two dashed lines, and the beam response curve segment in this direction interval contains a steep null. Therefore, in Figure 1d, the magnitudes of some mapping vectors are much smaller than those of the other mapping vectors. Since the phases of the mapping vectors are again uniformly distributed in the range of 0 to 360 degrees, the magnitude of the sum of the mapping vectors will be large, and the pseudo spatial power spectrum for this case will be very small, as shown by the lower red dashed line in Figure 1e.

**Figure 1.** Implementation principle and detailed steps of the SSC-MVDR algorithm. (**a**) example of the beam response of the MVDR beamformer when there is no signal self-cancellation. (**b**) example of the unit circle mapping of the beam response curve segment on the direction interval when there is no signal self-cancellation. (**c**) example of the beam response of the MVDR beamformer when there is signal self-cancellation. (**d**) example of the unit circle mapping of the beam response curve segment on the direction interval when there is signal self-cancellation. (**e**) example of the pseudo spatial power spectrum.

#### **4. Online Computation of SSC-MVDR Algorithm**

The computational complexity of an algorithm is one of the important factors to consider when applying it in practice. In the SSC-MVDR algorithm (i.e., Equation (10)), the inversion of the sample covariance matrix and the corresponding matrix operations are the main calculation steps. In the following, the online computation process of the SSC-MVDR algorithm is given, which provides a way to reduce the amount of calculation.

First, the iterative calculation process of the sample covariance matrix is as follows

$$\hat{\mathbf{R}}(k+1) = \gamma \hat{\mathbf{R}}(k) + \frac{1}{k+1} \mathbf{x}(k+1) \mathbf{x}^H(k+1), \tag{11}$$

where **R̂**(*k* + 1) and **R̂**(*k*) represent the sample covariance matrices at times *k* + 1 and *k*, respectively. *γ* is a constant less than 1 but very close to 1. When the array snapshots are non-stationary, the coefficient *γ* ensures that the SSC-MVDR algorithm still works reliably. Using the Woodbury matrix identity, the inverse of the sample covariance matrix at time *k* + 1 can be expressed as,

$$\hat{\mathbf{R}}^{-1}(k+1) = \gamma^{-1}(\hat{\mathbf{R}}^{-1}(k) - \frac{\mathbf{y}(k+1)\mathbf{y}^H(k+1)}{(k+1)\gamma + \mathbf{x}^H(k+1)\mathbf{y}(k+1)}),\tag{12}$$

where **y**(*k* + 1) represents the product of the inverse of the sample covariance matrix at time *k* and the array snapshot at time *k* + 1, that is, **y**(*k* + 1) = **R̂**<sup>−1</sup>(*k*)**x**(*k* + 1).
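The rank-1 updates of Equations (11) and (12) can be checked directly against a brute-force matrix inversion. In the sketch below, the data, the dimensions, and *γ* = 0.99 are all arbitrary illustrative choices:

```python
import numpy as np

# Numerical check of Equations (11) and (12)
rng = np.random.default_rng(0)
M, k, gamma = 6, 50, 0.99

# a well-conditioned covariance estimate R(k) and its inverse
X = rng.standard_normal((M, k)) + 1j * rng.standard_normal((M, k))
R_k = X @ X.conj().T / k
R_k_inv = np.linalg.inv(R_k)

x_new = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # snapshot x(k+1)

# Equation (11): rank-1 update of the covariance estimate
R_next = gamma * R_k + np.outer(x_new, x_new.conj()) / (k + 1)

# Equation (12): matching update of the inverse via the Woodbury identity,
# with y(k+1) = R^-1(k) x(k+1)
y = R_k_inv @ x_new
R_next_inv = (R_k_inv - np.outer(y, y.conj())
              / ((k + 1) * gamma + x_new.conj() @ y)) / gamma
```

The recursive inverse agrees with `np.linalg.inv(R_next)` to machine precision, while costing only vector products per snapshot instead of a full *O*(*M*³) inversion.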

Second, the symbol *g* is introduced to represent the generalized inner product of the steering vectors with respect to the inverse of the sample covariance matrix,

$$g\_{k+1}(\theta,\vartheta\_i) = \overline{\mathbf{a}}^H(\theta)\hat{\mathbf{R}}^{-1}(k+1)\overline{\mathbf{a}}(\vartheta\_i).\tag{13}$$

Substituting Equation (12) into Equation (13), the generalized inner product *g* can be equivalently expressed as,

$$g\_{k+1}(\theta,\vartheta\_i) = \gamma^{-1}(g\_k(\theta,\vartheta\_i) - q\_{k+1}(\theta,\vartheta\_i)),\tag{14}$$

where *q<sub>k+1</sub>*(*θ*, *ϑ<sub>i</sub>*) is defined as

$$q\_{k+1}(\theta,\vartheta\_i) = \frac{\overline{\mathbf{a}}^H(\theta)\mathbf{y}(k+1)\mathbf{y}^H(k+1)\overline{\mathbf{a}}(\vartheta\_i)}{(k+1)\gamma + \mathbf{x}^H(k+1)\mathbf{y}(k+1)}.\tag{15}$$

Finally, the pseudo spatial power spectrum at time *k* + 1 can be calculated by substituting Equation (14) into Equation (10), that is,

$$P\_{\rm SSC-MVDR}^{k+1}(\theta) = 1\Big/\left|\sum\_{i=1}^{L}\left|\frac{g\_k(\theta,\vartheta\_i) - q\_{k+1}(\theta,\vartheta\_i)}{g\_k(\theta,\theta) - q\_{k+1}(\theta,\theta)}\right|e^{j(2\pi/L)i}\right|, \quad \theta \in \Theta. \tag{16}$$

It can be seen from Equation (16) that the calculation of the pseudo spatial power spectrum at time *k* + 1 depends on the results of two functions, which are the generalized inner product at time *k* and the function *q* at time *k* + 1, respectively. Please note that the former is known during the calculation process at time *k* + 1 (because it has been obtained in the previous calculation), and the latter is calculated as Equation (15). Since it only involves vector operations, its computational complexity is relatively small.

#### **5. Simulation Results and Analysis**

In computer simulations, it is assumed that the linear array consists of 10 omnidirectional hydrophones, and the spacing of adjacent hydrophones is set to half the wavelength of the narrowband acoustic signal. In addition, assuming that there are two underwater acoustic sources in the far field, their directions of arrival are set to 30 degrees and 60 degrees, respectively. In the simulations below, the intensity of the acoustic source at 30 degrees is set to be variable for testing the DOA estimation performance of the SSC-MVDR algorithm, while the intensity of the acoustic source at 60 degrees is set to always be 30 dB (relative to noise) for testing the algorithm performance in a strong interference environment. Meanwhile, the received noise of the linear array is assumed to be spatially white Gaussian noise.

It should be noted that various unfavorable factors that may be encountered in the complex, uncertain ocean environment are also considered in the simulations, including direction error, random wavefront fluctuations, hydrophone position errors, and hydrophone amplitude and phase errors. Therefore, the steering vector model in Equation (7) is used, while assuming that the amplitude deviation coefficients and the phase deviation coefficients of the steering vector are known.

In Figures 2 and 3, the pseudo spatial power spectra of the SSC-MVDR algorithm are given under different acoustic source intensities and different snapshot numbers (i.e., the number of snapshots used to calculate the sample covariance matrix in Equation (5)). Specifically, Figure 2 corresponds to a friendly simulation environment where the signal-to-noise ratio (SNR) of the acoustic source in the direction of 30 degrees is set to 20 dB and the number of snapshots to 100, while Figure 3 corresponds to a poor simulation environment where the SNR of the same acoustic source is only −10 dB and the number of snapshots is only 30. For performance comparison, the spatial power spectra of the MVDR method [4], the MUSIC method [13] and the PM method [15] under the same simulation conditions are also given in Figures 2 and 3. Compared to the other algorithms, the SSC-MVDR algorithm exhibits sharper peaks near the true DOA of the acoustic signal in Figures 2 and 3.

**Figure 2.** Normalized spatial power spectra obtained by the MVDR method, the MUSIC method, the PM method and the SSC-MVDR method when the acoustic source intensity is 20 dB and the snapshot number is 100.

Figures 4 and 5 show the DOA estimation accuracy results of the SSC-MVDR algorithm. In Figure 4, the number of snapshots is fixed at 100, and the intensity of the acoustic source is gradually increased from −20 dB to 20 dB. In Figure 5, the acoustic source intensity is fixed at −10 dB, and the number of snapshots is gradually increased from 10 to 100. The DOA estimation accuracy is evaluated by the root mean square error (RMSE) of DOA estimation, which is defined as follows,

$$\text{RMSE} = 20 \log\_{10} \sqrt{\frac{1}{S} \sum\_{i=1}^{S} (\hat{\theta}\_i - \theta\_a)^2}. \tag{17}$$

In Equation (17), the RMSE is actually the base-10 logarithm of the root mean square error, so its unit is dB. The number of independent computer simulations is represented by *S*. In the following simulations (i.e., Figures 4–7), *S* = 100 is set, which means that each simulation result requires 100 independent runs. *θ̂<sub>i</sub>* represents the DOA estimate obtained by the *i*th computer simulation, and *θ<sub>a</sub>* is the corresponding true DOA value. Meanwhile, the results of the DOA estimation accuracy of the MVDR method [4], the MUSIC method [13], the Root-MUSIC method [34], the TLS-ESPRIT method [35] and the PM method [15] are also given in Figures 4 and 5. It can be seen from Figure 4 that the RMSE results of the SSC-MVDR algorithm are significantly lower than those of the other methods when the acoustic source intensity is in the range of −20 dB to −10 dB. When the number of snapshots is between 20 and 90, the RMSE results of the SSC-MVDR algorithm in Figure 5 are also significantly lower than those of the other algorithms.
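For reference, Equation (17) in code form (the base-10 logarithm follows from the dB unit; the function name is ours):

```python
import numpy as np

def rmse_db(doa_estimates, true_doa):
    """Equation (17): logarithmic root mean square error (in dB) of the
    DOA estimates collected over S independent simulation runs."""
    err = np.asarray(doa_estimates, dtype=float) - true_doa
    return 20.0 * np.log10(np.sqrt(np.mean(err ** 2)))
```

An estimator that is off by 1 degree on every run scores 0 dB; sub-degree errors give negative values, which is why lower curves in Figures 4–7 mean better accuracy.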

**Figure 3.** Normalized spatial power spectra obtained by the MVDR method, the MUSIC method, the PM method and the SSC-MVDR method when the acoustic source intensity is −10 dB and the snapshot number is 30.

**Figure 4.** DOA estimation accuracy results obtained by the MVDR method, the MUSIC method, the Root-MUSIC method, the TLS-ESPRIT method, the PM method and the SSC-MVDR method when the number of snapshots is fixed at 100, and the intensity of the acoustic source is gradually increased from −20 dB to 20 dB.

**Figure 5.** DOA estimation accuracy results obtained by the MVDR method, the MUSIC method, the Root-MUSIC method, the TLS-ESPRIT method, the PM method and the SSC-MVDR method when the acoustic source intensity is fixed at −10 dB, and the number of snapshots is gradually increased from 10 to 100.

Although an interference acoustic source in the far field has been considered in the above simulations (an interference source with an intensity of 30 dB and a direction of 60 degrees is always included in the simulation settings), it is still necessary to analyse the case of multiple interference acoustic sources. The direction estimation accuracy under multiple interference acoustic sources is shown in Figure 6. In Figure 6, the intensities of the interference acoustic sources are set to be the same, and the interference-to-noise ratio (INR) of each interference is gradually increased from 0 dB to 50 dB. In addition, *NI* represents the number of interference acoustic sources; for example, *NI* = 3 means that there are three interference acoustic sources in the far field at the same time, with directions of 60 degrees, 10 degrees, and −20 degrees, respectively. As can be seen from Figure 6, the three curves corresponding to different numbers of interferences almost coincide.

**Figure 6.** DOA estimation accuracy results obtained by the SSC-MVDR algorithm when there are multiple interfering sound sources with the source intensity gradually increased from 0 dB to 50 dB.

Finally, the influence of the direction interval, one of the important parameters of the SSC-MVDR algorithm, on the performance of the proposed algorithm is analyzed by simulation. In Figure 7, the horizontal axis is labelled as the window width, that is, the width of the direction interval Θ. The three curves in Figure 7 are the DOA estimation accuracy results obtained by the SSC-MVDR algorithm when the parameter *θ<sub>e</sub>* takes different values, where *θ<sub>e</sub>* represents the deviation of the center of the direction interval from the true DOA. It can be seen from Figure 7 that when the width of the direction interval is in the range of 10 to 20 degrees, the RMSEs obtained by the SSC-MVDR algorithm are significantly lower than those obtained with window widths outside this range. The reason is as follows: when the center of the direction interval deviates from the true DOA and the selected direction interval is too narrow, the actual DOA of the acoustic source falls outside the selected direction interval, so the obtained DOA estimate must be wrong. However, when the selected direction interval is too wide, the beam response segment in the direction interval contains unwanted or even unfavorable information, such as other nulls in the beam response that are unrelated to signal self-cancellation.

**Figure 7.** DOA estimation accuracy results obtained by the SSC-MVDR algorithm when the width of the direction interval varies between 1 and 40 degrees, and the deviation of the center of the direction interval from the true DOA is equal to 0, 2, and 4 degrees, respectively.

#### **6. Conclusions**

The MVDR beamforming-based underwater acoustic source localization techniques often encounter many unfavorable factors in the ocean environment, such as direction error, random wavefront fluctuation, hydrophone position error, and hydrophone gain error. These unfavorable factors lead to signal self-cancellation and severe performance degradation of the MVDR beamformer. Therefore, signal self-cancellation is generally considered a disadvantage of the MVDR beamformer and is suppressed. On the contrary, by exploiting the signal self-cancellation phenomenon, this paper proposes a high-precision DOA estimation method for underwater acoustic sources. First, the beam response of the MVDR beamformer in the direction interval is calculated according to the steering vector model. Then, the pseudo spatial power spectrum is calculated using the unit circle mapping technique. Finally, to reduce the computational complexity, an online calculation process for the pseudo spatial power spectrum is given. The computer simulation results show that the SSC-MVDR algorithm can obtain satisfactory direction estimation accuracy under conditions of low acoustic source intensity, strong interference and noise, and limited snapshot data. The simulations also give reasonable suggestions for the width of the direction interval.

**Author Contributions:** Conceptualization, J.W.; Methodology, F.W., Y.C.; Software, F.W.; Writing—original draft, F.W.; Writing—review and editing, F.W.

**Funding:** This research was funded by National Natural Science Foundation of China (61601209), Natural Science Foundation of Jiangxi Province, P.R. China (20171BAB202003), and Science and Technology Research Project of Education Department, Jiangxi Province, P. R. China (8140).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Channel Modelling and Estimation for Shallow Underwater Acoustic OFDM Communication via Simulation Platform**

#### **Xiaoyu Wang , Xiaohua Wang, Rongkun Jiang, Weijiang Wang, Qu Chen and Xinghua Wang \***

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China; wangxiaoyu@bit.edu.cn (X.W.); wangxiaohuabit@163.com (X.W.); jiangrongkun@bit.edu.cn (R.J.); wangweijiang@bit.edu.cn (W.W.); chenqu@bit.edu.cn (Q.C.) **\*** Correspondence: wangxinghuabit@163.com

Received: 30 December 2018; Accepted: 23 January 2019; Published: 28 January 2019

**Abstract:** The performance of underwater acoustic (UWA) communication is heavily dependent on channel estimation, which is predominantly researched by simulating UWA channels modelled in complex and dynamic underwater environments. In UWA channel modelling, the measurement-based approach is accurate; however, the acquisition of environment data and the simulation processes are scenario-specific and thus not cost-effective. To overcome these restraints, this article proposes a comprehensive simulation platform that combines UWA channel modelling with orthogonal frequency division multiplexing (OFDM) channel estimation, allowing users to model UWA channels for different ocean environments and simulate channel estimation with configurable input parameters. Based on the simulation platform, three independent simulations are conducted to determine the impacts of receiving depth, sea bottom boundary, and sea surface boundary on channel estimation. The simulations show that UWA channel estimation is greatly affected by the underwater environment. The effect can be mainly attributed to changes in acoustic ray paths, which result in fluctuating time delay and amplitude. With a 10 m receiving depth and a flat sea bottom, channel estimation achieves optimal performance. Further study indicates that the sea surface has stochastic effects on channel estimation; as the significant wave height (SWH) increases, the average performance of channel estimation improves.

**Keywords:** UWA communication; channel modelling; OFDM; channel estimation; simulation platform

#### **1. Introduction**

In recent years, underwater acoustic (UWA) communication has been widely employed in military affairs [1], ocean exploration [2], pollution monitoring [3], offshore oil drilling [4], etc. In view of these applications, UWA communication technology shows great potential as an area of research. Nevertheless, the nature of the UWA channel (such as low available bandwidth, time-varying multipath, and the low propagation speed of sound) hinders the efficiency of communication devices [5]. Orthogonal frequency division multiplexing (OFDM), a robust method of encoding digital data on multiple carrier frequencies, is generally used in UWA time-dispersive channels and can render inter-symbol interference (ISI) negligible by embedding a cyclic prefix [6]. Furthermore, in order to suppress the inter-channel interference (ICI) [7] introduced by Doppler spread in the OFDM system, knowledge of the channel impulse response (CIR) is essential, which can be acquired with the help of pilot signals [8]. Therefore, channel estimation is the key factor in the performance of UWA communication.

Although the estimation of the UWA channel remains a sophisticated problem due to complex underwater environments, channel modelling is an efficient way to perform preliminary evaluation and decide parameters such as the estimation algorithm, pilot interval, and number of carriers for the OFDM communication device [9]. UWA channel modelling can be classified into two generic approaches: geometry-based and measurement-based [10].

Geometry-based UWA channel modelling relies on mathematical analysis of physical characteristics. Zajic [11] proposed a geometry-based model for multiple-input multiple-output (MIMO) mobile-to-mobile (M-to-M) shallow UWA channels by taking both macro- and micro-scattering effects into account. Qarabaqi and Stojanovic [12] developed a statistical model of UWA channels that incorporates physical aspects of acoustic propagation with the effects of inevitable random channel variations. Naderi et al. [13] constructed a stochastic channel model for wideband single-input single-output (SISO) shallow UWA channels under the assumption of rough ocean surface and bottom. These geometry-based models generally display stochastic channel responses in consideration of the time-varying characteristics of UWA channels collected through measurement campaigns. Although geometry-based channels can achieve a high level of accuracy, the physical features of the channels are limited to specific cases. For example, the probability distribution function (PDF) of the UWA channel envelope was reported to be a Rice distribution in [14], a Rayleigh distribution in [15], a log-normal distribution in [16], and a K-distribution in [17].

The differing distributions above imply that precise UWA channels should be modelled with real environment data. BELLHOP [18,19] is a beam tracing model that takes ocean environment properties into account, such as the sea surface boundary, sea bottom boundary, and sound speed profile (SSP). Utilizing the precision of BELLHOP, numerous measurement-based UWA channels have been built with this model. In [20], Tomasi et al. compared real UWA channel data measured in an experiment with the channel obtained through the BELLHOP model. The result showed close agreement under calm ocean conditions, which confirmed the feasibility of the BELLHOP model. With strong winds, the measured channels were slightly worse than the result of the BELLHOP model; the deviation was caused by the lack of consideration of the sea surface. This also revealed the limitation of the BELLHOP model when dealing with stochastic parameters. In [21–23], shallow UWA channels were modelled on the eastern shore of Johor and in the Taiwan Strait through the BELLHOP model, with environment data sourced from various databases. These measurement-based channel modelling approaches provided an accurate way to analyse channel characteristics such as transmission loss, CIR, and ray tracing. However, environment data is scenario-specific and hard to acquire. In [24], Gul et al. designed a graphical user interface (GUI) with configurable parameters to model UWA channels and showed the channel effect on an acoustic signal. The GUI brought much convenience to constructing environment files and visualizing simulation results. Nevertheless, the accuracy of the modelling might deteriorate because empirical data, including the Munk SSP, was used in the GUI.

In this paper, three contributions are presented:


the OFDM channel estimation part. Based on this framework, users can analyse the estimation performance of different modelling channels, modulation schemes, and estimation algorithms to assess the implementation scheme of the UWA communication device.

3. Based on the simulation platform, three simulations are conducted. Realistic UWA channels in the East China Sea are modelled to analyse the factors that could influence the performance of channel estimation. On the one hand, deterministic channels are modelled with different receiving depth and types of sea bottom to compare the performance of channel estimation. The result shows that channels with 10 m receiving depth and flat sea bottom yield optimal performance of channel estimation. On the other hand, a batch of channels is modelled with stochastic sea surface in different significant wave heights (SWH) and the result of channel estimation performance is synthesized. Overall, the channels with high SWH have a good average performance.

The rest of this article is structured as follows. In Section 2, the article provides a brief overview of channel estimation in OFDM communication system. The implementation of simulation platform is demonstrated in Section 3. In Section 4, the effects of receiving depth, sea bottom boundary, and sea surface boundary on UWA channel estimation are analysed by modelling and comparison. Finally, a concise conclusion is made in Section 5.

#### **2. Channel Estimation in OFDM**

The main principle of OFDM is to transmit a data stream at a low rate through numerous orthogonal subcarriers simultaneously [25]. Figure 1 shows a typical end-to-end configuration of an OFDM communication system. The transmitted data is modulated onto *N* subcarriers and passed through the UWA channels.

**Figure 1.** Block diagram of orthogonal frequency division multiplexing (OFDM) system transceiver.

The CIR of a UWA channel can be expressed as [26],

$$h(n) = \sum\_{l=0}^{L-1} h\_l \delta(n-l) \tag{1}$$

where *L* is the length of the channel and *h<sub>l</sub>* is the coefficient of the *l*th tap. *L* can be calculated as *L* = *τ<sub>max</sub>*/*T<sub>s</sub>*, where *τ<sub>max</sub>* is the maximum delay of the channel and *T<sub>s</sub>* is the sampling interval.
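For illustration, a tapped-delay-line CIR like Equation (1) can be assembled from a list of multipath delays and amplitudes; the three-path channel and the sampling interval below are hypothetical.

```python
import numpy as np

def discrete_cir(delays_s, amplitudes, Ts):
    """Build the tapped-delay-line CIR of Eq. (1): each path is rounded
    to the nearest sampling instant l = round(delay / Ts)."""
    taps = np.round(np.asarray(delays_s) / Ts).astype(int)
    h = np.zeros(int(taps.max()) + 1, dtype=complex)
    for l, a in zip(taps, amplitudes):
        h[l] += a  # paths that land in the same tap accumulate
    return h

# hypothetical 3-path channel sampled at Ts = 1 ms
h = discrete_cir([0.0, 2e-3, 5e-3], [1.0, 0.5, 0.2], 1e-3)
print(h.real)  # nonzero taps at n = 0, 2, 5; channel length L = 6
```

The channel length is set by the largest delay, matching *L* = *τ<sub>max</sub>*/*T<sub>s</sub>* above.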

Now suppose the cyclic prefix (CP) length is no shorter than the maximum delay of the UWA channel, which means the current received OFDM symbol is not affected by the previous symbol. Then the received signal after the FFT can be expressed as,

$$\mathbf{Y} = \mathbf{X}\mathbf{H} + \mathbf{V} \tag{2}$$

where **Y**, **X**, **H**, and **V** denote the matrix of received symbol, transmitted symbol, channel frequency response, and noise in frequency domain, respectively. It can also be presented as,

$$
\begin{bmatrix} Y[0] \\ Y[1] \\ \vdots \\ Y[N-1] \end{bmatrix} = \begin{bmatrix} X[0] & 0 & \cdots & 0 \\ 0 & X[1] & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & X[N-1] \end{bmatrix} \begin{bmatrix} H[0] \\ H[1] \\ \vdots \\ H[N-1] \end{bmatrix} + \begin{bmatrix} V[0] \\ V[1] \\ \vdots \\ V[N-1] \end{bmatrix} \tag{3}
$$

In the channel estimation problem of the OFDM system, the matrix of transmitted pilot symbols **Xp** and the matrix of received pilot symbols **Yp** are usually given, and the estimate $\hat{\mathbf{H}}\_{\mathbf{p}}$ of the channel frequency response **Hp** needs to be found.

Least squares (LS), minimum mean square error (MMSE), and linear minimum mean square error (LMMSE) are the most traditional algorithms used in OFDM channel estimation. The LS algorithm seeks the estimate $\hat{\mathbf{H}}\_{\mathbf{p}}$ that minimizes the following cost function, where $\hat{\mathbf{Y}}\_{\mathbf{p}} = \mathbf{X}\_{\mathbf{p}} \hat{\mathbf{H}}\_{\mathbf{p}}$ is the pilot observation reconstructed from the estimate,

$$J\left(\hat{\mathbf{H}}\_{\mathbf{p}}\right) = \left\|\mathbf{Y}\_{\mathbf{p}} - \hat{\mathbf{Y}}\_{\mathbf{p}}\right\|^2 \tag{4}$$

The solution to the LS channel estimation is as follows [26,27],

$$\hat{\mathbf{H}}\_{\mathbf{LS}} = \left(\mathbf{X}\_{\mathbf{p}}^{\mathbf{H}} \mathbf{X}\_{\mathbf{p}}\right)^{-1} \mathbf{X}\_{\mathbf{p}}^{\mathbf{H}} \mathbf{Y}\_{\mathbf{p}} = \mathbf{X}\_{\mathbf{p}}^{-1} \mathbf{Y}\_{\mathbf{p}} \tag{5}$$
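Because **Xp** is diagonal, Equation (5) reduces to an elementwise division at the pilot subcarriers. The following sketch (with hypothetical pilots and channel, not the platform's code) illustrates this; in the noiseless case the LS estimate recovers the frequency response exactly.

```python
import numpy as np

def ls_estimate(Xp, Yp):
    """LS channel estimate of Eq. (5): for a diagonal pilot matrix,
    H_LS[k] = Yp[k] / Xp[k] at every pilot subcarrier k."""
    return Yp / Xp

# hypothetical setup: 64 unit-modulus QPSK pilots and a 4-tap channel
rng = np.random.default_rng(1)
N = 64
Xp = np.exp(1j * (np.pi / 2) * rng.integers(0, 4, N))
H = np.fft.fft(np.array([1.0, 0.5, 0.2, 0.1]), N)  # true frequency response
Yp = Xp * H                                        # noiseless received pilots
print(np.max(np.abs(ls_estimate(Xp, Yp) - H)))     # ~0, up to rounding error
```

With noise added to **Yp**, the same division passes the noise straight through, which is why LS needs no channel statistics but has the worst noise behaviour.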

The MMSE algorithm aims at finding the estimate $\hat{\mathbf{H}}\_{\mathbf{p}}$ that minimizes the mean square error, which can be expressed as,

$$J\left(\hat{\mathbf{H}}\_{\mathbf{p}}\right) = E\left\{ \left\| \mathbf{H}\_{\mathbf{p}} - \hat{\mathbf{H}}\_{\mathbf{p}} \right\|^2 \right\} \tag{6}$$

The solution to the MMSE channel estimation is [26,27],

$$\hat{\mathbf{H}}\_{\mathbf{MMSE}} = \mathbf{R}\_{\mathbf{H}\_{\mathbf{p}}\mathbf{H}\_{\mathbf{p}}} \left[ \mathbf{R}\_{\mathbf{H}\_{\mathbf{p}}\mathbf{H}\_{\mathbf{p}}} + \left( \mathbf{X}\_{\mathbf{p}} \mathbf{X}\_{\mathbf{p}}^{\mathbf{H}} \right)^{-1} N \sigma\_{n}^{2} \right]^{-1} \hat{\mathbf{H}}\_{\mathbf{LS}} \tag{7}$$

where $\mathbf{R}\_{\mathbf{H}\_{\mathbf{p}}\mathbf{H}\_{\mathbf{p}}}$ is the autocorrelation matrix of **Hp**, *N* is the number of subcarriers, and $\sigma\_n^2$ is the noise variance.

The MMSE algorithm is much more complicated than the LS algorithm because of the matrix inversion. Suppose that the output of the code modulation is equiprobable; then $\left(\mathbf{X}\_{\mathbf{p}}\mathbf{X}\_{\mathbf{p}}^{\mathbf{H}}\right)^{-1}$ can be replaced with its expectation $E\left\{\left(\mathbf{X}\_{\mathbf{p}}\mathbf{X}\_{\mathbf{p}}^{\mathbf{H}}\right)^{-1}\right\}$, and the solution of the LMMSE estimator is [28,29],

$$\hat{\mathbf{H}}\_{\mathbf{LMMSE}} = \mathbf{R}\_{\mathbf{H}\_{\mathbf{p}}\mathbf{H}\_{\mathbf{p}}} \left[ \mathbf{R}\_{\mathbf{H}\_{\mathbf{p}}\mathbf{H}\_{\mathbf{p}}} + \frac{\beta}{SNR} \mathbf{I} \right]^{-1} \hat{\mathbf{H}}\_{\mathbf{LS}} \tag{8}$$

where $\beta = E\left\{|\mathbf{X}\_{\mathbf{p}}|^2\right\} E\left\{1/|\mathbf{X}\_{\mathbf{p}}|^2\right\}$ is a coefficient associated with the code modulation scheme and $SNR = E\left\{|\mathbf{X}\_{\mathbf{p}}|^2\right\}/(N\sigma\_n^2)$ is the signal-to-noise ratio. Since the LMMSE algorithm offers performance similar to the MMSE algorithm at lower complexity, the LS and LMMSE algorithms are used to estimate the UWA channels in this paper.
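Equation (8) can be read as a regularized smoothing of the LS estimate. A minimal sketch, assuming a known channel autocorrelation matrix (the identity matrix used below is only a placeholder):

```python
import numpy as np

def lmmse_estimate(H_ls, R_hh, beta, snr):
    """LMMSE filtering of the LS estimate, Eq. (8):
    H_LMMSE = R_hh (R_hh + (beta/SNR) I)^(-1) H_LS."""
    N = R_hh.shape[0]
    W = R_hh @ np.linalg.inv(R_hh + (beta / snr) * np.eye(N))
    return W @ H_ls

# placeholder example: with R_hh = I, beta = 1 and SNR = 1,
# every subcarrier of the LS estimate is simply attenuated by 1/2
H_ls = np.ones(4, dtype=complex)
print(lmmse_estimate(H_ls, np.eye(4), 1.0, 1.0).real)
```

At high SNR the weight matrix approaches the identity and the LMMSE estimate converges to the LS estimate, which matches the shrinking β/SNR regularization term.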

#### **3. Implementation of Simulation Platform**

#### *3.1. Basic Framework*

Based on the GUI of MATLAB, the simulation platform is mainly composed of two parts, namely, UWA channel modelling and OFDM channel estimation. The user interface of the simulation platform is illustrated in Figure 2. Figure 2a shows the interface of UWA channel modelling, which is divided into four major components. In the zone of modelling parameters, users can set the position and depth of the transmitter and receiver, as well as the signal frequency, number of beams, and SWH. Figures of the normalized CIR and eigenray traces are presented in the middle. The status indicator displays the state of the script function in progress. Furthermore, the channel list is used to save or delete modelled channels. The interface of OFDM channel estimation is illustrated in Figure 2b. The parameters, such as modulation type, number of carriers, simulated channel, and algorithm, are freely set in the zone of estimation parameters. Figures of the simulation results are plotted in the lower part and can be saved with specified names if necessary.

**Figure 2.** The user interface of simulation platform. (**a**) underwater acoustic (UWA) channel modelling. (**b**) OFDM channel estimation.

Figure 3 shows the flow chart of using the simulation platform in this paper. When conducting a simulation, users should first enter proper channel parameters and click the 'Generate Channel' button repeatedly until enough channels are saved in the list. Then, the parameters of OFDM channel estimation and the UWA channels need to be configured. For deterministic channels, a UWA channel needs to be selected from the channel list, while for stochastic channels, users should choose a group of channels by clicking the 'Add Channel' button. After that, curves of bit error rate (BER) and mean square error (MSE) are plotted by clicking the 'Generate Figure' button. By repeating the process of modifying parameters and plotting curves, a performance comparison is produced. Eventually, the simulation figures can be saved in high resolution by clicking the 'Save Figure' button.

The design of the simulation platform puts forward a high-efficiency approach to research on UWA channels. One of its distinct advantages lies in the reduction of the workload for finding real data, building input files, and processing output data. Users without any previous experience in the BELLHOP model or MATLAB programming can still perform simulations for research. Besides, the GUI of the simulation platform makes using the BELLHOP model more intuitive and provides users with immediate visual feedback about the results of the program. Moreover, based on the real environment data acquired from open databases, the simulation platform provides users a flexible way to choose parameters. In light of the accuracy and convenience of the simulation platform, users can modify the position, depth, frequency, SWH, and even the number of beams to study the UWA channel. At the same time, the factors of OFDM modulation, pilot pattern, and estimation algorithms can also be analysed by changing these values. In addition, by combining UWA channel modelling with OFDM channel estimation, the simulation platform helps users analyse the characteristics of the UWA channel and pick the appropriate OFDM implementation scheme for the UWA communication device.

**Figure 3.** Flow chart of using simulation platform.

#### *3.2. Channel Modelling*

The most common models for solving the problem of UWA communication are propagation models. The wave equation, derived from the equations of state, continuity, and motion, is the theoretical basis for all mathematical models of acoustic propagation. The simplified wave equation is as follows,

$$
\nabla^2 \Phi = \frac{1}{c^2} \frac{\partial^2 \Phi}{\partial t^2} \tag{9}
$$

where ∇<sup>2</sup> is the Laplacian operator, Φ is the potential function, *c* is the speed of sound, and *t* is time.

Furthermore, there are five canonical models for solving the wave equation [9]: the ray theory, normal mode, multipath expansion, fast field, and parabolic equation models. Among these, the ray theory model is both applicable and practical at high frequencies, and is therefore widely used in UWA channel simulation. Based on the theory of Gaussian beams [30], BELLHOP is one of the most effective implementations of the ray model; it solves the ray equations with cylindrical symmetry [31],

$$\begin{aligned} \frac{dr}{ds} &= c\xi(s), \quad \frac{d\xi}{ds} = -\frac{1}{c^2} \frac{\partial c}{\partial r},\\ \frac{dz}{ds} &= c\zeta(s), \quad \frac{d\zeta}{ds} = -\frac{1}{c^2} \frac{\partial c}{\partial z} \end{aligned} \tag{10}$$

where *r*(*s*) and *z*(*s*) are the ray coordinates in cylindrical coordinates and *c*(*s*)[*ξ*(*s*), *ζ*(*s*)] is the tangent versor along the ray. The initial conditions are *r*(0) = *r<sub>s</sub>*, *z*(0) = *z<sub>s</sub>*, *ξ*(0) = cos(*θ<sub>s</sub>*)/*c<sub>s</sub>*, and *ζ*(0) = sin(*θ<sub>s</sub>*)/*c<sub>s</sub>*, where *θ<sub>s</sub>* is the launching angle, [*r<sub>s</sub>*, *z<sub>s</sub>*] is the source position, and *c<sub>s</sub>* is the sound speed at the source position. The ray coordinates are obtained by integrating Equation (10) along the arc length, and the travel time along a ray path Γ is given by

$$
\tau = \int\_{\Gamma} \frac{ds}{c(s)}\tag{11}
$$
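To make the integration concrete, the sketch below steps Equation (10) with a forward-Euler scheme for a range-independent sound speed profile (so ∂c/∂r = 0); the linear SSP, launch angle, and step size are hypothetical, and BELLHOP itself uses far more sophisticated numerics.

```python
import numpy as np

def trace_ray(c, dc_dz, r0, z0, theta0, ds, n_steps):
    """Forward-Euler integration of the 2-D ray equations (10)
    for a depth-only sound speed c(z), i.e. dc/dr = 0. Also
    accumulates the travel time of Eq. (11) along the path."""
    r, z = r0, z0
    c0 = c(z0)
    xi, zeta = np.cos(theta0) / c0, np.sin(theta0) / c0  # initial conditions
    tau = 0.0
    for _ in range(n_steps):
        cs = c(z)
        r += ds * cs * xi
        z += ds * cs * zeta
        zeta -= ds * dc_dz(z) / cs ** 2
        tau += ds / cs
    return r, z, tau

# hypothetical linear SSP c(z) = 1500 + 0.05 z m/s, 5-degree downward launch
r, z, tau = trace_ray(lambda z: 1500.0 + 0.05 * z, lambda z: 0.05,
                      r0=0.0, z0=10.0, theta0=np.radians(5.0), ds=1.0,
                      n_steps=2000)
print(r, z, tau)
```

In an iso-velocity ocean (dc/dz = 0) the scheme reproduces a straight ray, which is a convenient sanity check of the integrator.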

The BELLHOP model is designed to simulate two-dimensional acoustic ray tracing for a specific SSP in waveguides with flat or variable absorbing boundaries [18,19,32]. The program offers various output options, including ray coordinates, eigenray coordinates, acoustic pressure, travel time, and amplitudes. In order to generate a valid output of UWA channels, the parameters for modelling should be set correctly in the BELLHOP files (environment file, altimetry file, bathymetry file, etc.) according to the user's guide [32].

#### *3.3. Data Interface*

This article aims at modelling realistic UWA channels through the BELLHOP model, which means the simulation platform should construct the input files from real data. Although modelling in measurement-based approaches is a complex process with numerous parameters to be determined, there are many open databases available on the Internet that can be accessed through script functions. The data flow diagram of the simulation platform is illustrated in Figure 4.

**Figure 4.** Data flow diagram.

In the channel modelling, users enter basic channel parameters such as position, frequency, and number of beams to the GUI. After getting the data of position, the simulation platform will be able to read real data about sea surface boundary, SSP, and sea bottom boundary from the databases. Sea surface boundary is a stochastic parameter that can be generated from the Gauss-Lagrange model. The function of Gauss-Lagrange is as follows,

$$x(t,\mu) = \sum\_{j=0}^{N} \sqrt{S\_j \Delta \omega}\, \rho\_j R\_j \cos(k\_j \mu - \omega\_j t + \Theta\_j + \theta\_j) \tag{12}$$

where *ρ<sub>j</sub>* is the amplitude response and *θ<sub>j</sub>* is the phase response. WafoL [33], a MATLAB toolbox, is used to solve the function and generate Gauss-Lagrange waves. In order to make the waves closer to reality, the amplitudes of the Gauss-Lagrange waves should be adjusted according to the real SWH, which can be obtained from altimetry satellite missions via Aviso+ [34]. The speed of sound in sea water is the basic variable of the acoustic channel and is affected by many factors such as water temperature, salinity, and depth. World Ocean Atlas 2013 (WOA2013) [35] is a data product of the National Oceanic and Atmospheric Administration (NOAA) in which ocean properties can be accessed by time and position. With these real data, the SSP at any geographical location can be calculated by the UNESCO equation [36]. The sea bottom boundary can reflect and scatter the sound rays. Using the Google Maps API [37], the simulation platform gets a set of discrete depth samples in order to interpolate the sea bottom terrain.
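The SWH adjustment mentioned above can be sketched as follows: for a Gaussian sea, SWH is approximately 4σ of the surface elevation, so a synthetic elevation record can be rescaled to match the SWH read from Aviso+. The random record below is a hypothetical stand-in for a WafoL wave.

```python
import numpy as np

def scale_to_swh(eta, swh):
    """Rescale a zero-mean surface elevation record so that its significant
    wave height, SWH ~ 4 * std(elevation), matches the target value."""
    eta = np.asarray(eta, dtype=float)
    eta = eta - eta.mean()
    return eta * (swh / (4.0 * eta.std()))

# hypothetical elevation record scaled to the SWH of 1.29 m used in Section 4
rng = np.random.default_rng(2)
eta = scale_to_swh(rng.normal(size=10000), 1.29)
print(4.0 * eta.std())  # 1.29 by construction
```

Only the amplitude is rescaled; the spectral shape of the synthetic wave is left untouched.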

After getting the values of the sea surface boundary, SSP, and sea bottom boundary, the simulation platform is able to construct the BELLHOP files (environment file, altimetry file, bathymetry file, etc.) and calculate the eigenray coordinates, travel times, and amplitudes of the UWA channel. The eigenray traces are then plotted on the simulation platform with these coordinates. Furthermore, the CIR is calculated by accumulating the travel time and amplitude of each multipath. For the sake of convenient simulation in the OFDM communication system, the CIR is normalized and saved in the channel list. In OFDM channel estimation, the modelled channels are simulated with the Monte Carlo method based on the specified input parameters of channel estimation, and the BER and MSE curves are plotted for analysis.
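As a small illustration of the Monte Carlo post-processing (a hypothetical helper, not the platform's code), the MSE of a channel estimate can be averaged over runs and subcarriers and reported in dB:

```python
import numpy as np

def mse_db(H_est, H_true):
    """Mean square error of channel estimates in dB, averaged over
    Monte Carlo runs (rows) and subcarriers (columns)."""
    return 10.0 * np.log10(np.mean(np.abs(np.asarray(H_est) - H_true) ** 2))

# hypothetical example: a constant estimation error of magnitude 0.1
H_true = np.ones((4, 8), dtype=complex)
print(mse_db(H_true + 0.1, H_true))  # 10*log10(0.01) = -20 dB
```

Each point of an MSE curve like those in Section 4 is one such average at a fixed SNR.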

#### **4. Simulations and Analysis**

Transmitting control signals from the sea surface to underwater receivers in the shallow sea is one of the most common applications of UWA communication. This article focuses on this application and analyses the factors affecting the estimation of the UWA channel. Two shallow-sea scenarios, illustrated in Figure 5a, each with a range of 2 km southeast of Meizhou Island in the East China Sea, are selected for simulation on the proposed platform. Scenario A is located between 119.6590◦ E, 24.8385◦ N and 119.6689◦ E, 24.8541◦ N with an approximately flat sea bottom at a depth of about 70 m. Scenario B is located between 119.5708◦ E, 24.3453◦ N and 119.5609◦ E, 24.3297◦ N, where the depth of the sea bottom increases from 50 m to 80 m. The time of the databases is set at noon on June 22. According to WOA2013, with increasing sea depth, the temperature gradually decreases from 27.9 ◦C to 16.7 ◦C and the salinity slowly increases from 33.5 psu to 34.2 psu. The SSP calculated from temperature and salinity is shown in Figure 5b, varying from 1513 m/s to 1540 m/s. In addition, the average SWH can be found in Aviso+ with a value of 1.29 m.

Three simulations are set to model channels in different receiving depth, sea bottom boundaries, and sea surface boundaries, respectively, and the influence on channel estimation performance is analysed afterwards. The detailed simulation parameters are tabulated in Table 1.

**Figure 5.** (**a**) Geographical location of two scenarios in simulations. (**b**) sound speed profile (SSP) curve generated from WOA2013.



#### *4.1. Receiving Depth*

When sending control signals, the transmitting device is usually located just below the surface of the sea while the depth of the receiving device is not fixed. The first simulation is designed to model UWA channels with different receiving depths and examine the influence on OFDM channel estimation. In this simulation, the transducer is at a depth of 10 m and the acoustic signal is emitted evenly over a 30-degree emission angle. The depth of the receiving device is set to 10 m, 30 m, 50 m, and 70 m, respectively. In order to control the effects of the sea surface and sea bottom in this simulation, the UWA channels are modelled in the sea area of scenario A (flat bottom boundary) with a flat sea surface boundary.

Figure 6 illustrates the eigenray paths of the modelled channels during propagation. The color of the lines indicates whether the eigenrays hit the surface boundary or the bottom boundary of the sea; there are basically four cases: the black line represents an eigenray hitting both boundaries; the blue line represents an eigenray hitting the bottom only; the green line represents an eigenray hitting the surface only; and the red line represents an eigenray hitting neither the bottom nor the surface. As can be seen in the eigenray figures, the main eigenray type is the black line. At shallower receiving depths, the number of blue lines is greater than at deeper depths.

**Figure 6.** Eigenrays of channel with different receiving depth. The color of lines indicates whether eigenrays hit surface boundary or bottom boundary of the sea. Receiving depth of each figure: (**a**) 10 m; (**b**) 30 m; (**c**) 50 m; (**d**) 70 m.

Different types and lengths of eigenrays lead to different transmission losses and time delays. The normalized CIR is calculated by accumulating all eigenrays. Figure 7 shows the normalized CIRs of the channels with different receiving depths. Comparing the four normalized CIRs, the normalized CIR with 10 m receiving depth is more concentrated at low time delays and has a maximum amplitude of 0.9415, whereas the other normalized CIRs are scattered and have maximum amplitudes below 0.5.

**Figure 7.** Normalized channel impulse responses (CIRs) of channel with different receiving depth. Receiving depth of each figure: (**a**) 10 m; (**b**) 30 m; (**c**) 50 m; (**d**) 70 m.

Figure 8a shows the BER performance of the channels at different receiving depths. It is obvious that the channel estimation performance for the 10 m depth channel is much better than for the others. The performance gap between the 10 m depth channel and the other channels gradually increases with the SNR; at a 30 dB SNR, the gap for the LS algorithm reaches around 20 dB and the gap for the LMMSE algorithm reaches around 30 dB. Comparing the performance of the LS and LMMSE algorithms across channels, the difference between the two algorithms is larger in the 10 m depth channel. The MSE performance of the channels at different receiving depths is illustrated in Figure 8b. The curves of the LS algorithm show the same performance due to the nature of the algorithm. Furthermore, the MSE performance has features similar to those of the BER for the LMMSE algorithm.

**Figure 8.** (**a**) bit error rate (BER) performance of channels in different receiving depth, estimated by LS and LMMSE algorithms, respectively. (**b**) MSE performance of channels in different receiving depth, estimated by LS and LMMSE algorithms, respectively.

#### *4.2. Sea Bottom Boundary*

Seabed terrain has structures that result from common physical phenomena. In the shallow sea, the depth of the seabed usually changes slowly and the sea bottom boundary can be considered flat over short distances, as modelled in the previous simulation. Furthermore, the seabed may rise or fall rapidly in littoral areas and have protruding or sagged structures under specific conditions. These four special types of seabed are modelled in this simulation to analyse the influence of the sea bottom boundary on channel estimation. The rising and falling sea bottom boundaries are sampled from scenario B with the transmitting and receiving positions swapped. Protruding and sagged sea bottom boundaries are rare in real bathymetry data; therefore, the sea bottom boundary of scenario A is adjusted to produce these boundaries artificially. In this simulation, the receiving depth is fixed at 10 m, which was confirmed in the previous simulation to yield the best channel estimation performance. The sea surface boundary is set as flat to control the stochastic effect.

Figure 9 shows the eigenrays of the channels with the different sea bottom boundaries. The channel with a rising bottom has the most eigenrays: a rising bottom lengthens the eigenray transmission paths and gradually disperses the rays along the way. A falling bottom also disperses the eigenrays, but because it shortens the transmission paths, the number of eigenrays is much smaller. Owing to the seabed terrain, the eigenray transmission paths are shorter for the protruding bottom than for the sagged bottom.

Figure 10 illustrates the corresponding normalized CIRs. Overall, the CIRs of the channels with rising and sagged bottoms are longer than those of the channels with falling and protruding bottoms, and the CIR of the rising-bottom channel clearly contains more multipath components than that of the falling-bottom channel. Because of the seabed terrain, the relative delay of the maximum-amplitude path is smaller for the protruding bottom than for the sagged bottom.

Figure 11 compares the BER and MSE performance of the four channels. With the LS algorithm, the BER of all four channels is around 0.0075 at an SNR of 30 dB. This value is close to the worst BER in Figure 8a, suggesting that there may be a worst-case performance bound for UWA channel estimation with the LS algorithm. The LMMSE results differ slightly across channels: the falling-bottom channel is estimated with the best BER and the sagged-bottom channel with the worst. As shown in Figure 11b, the MSE behaves similarly to the BER for the LMMSE algorithm.

**Figure 9.** Eigenrays of channel with different sea bottom boundaries. The color of lines indicates whether eigenrays hit surface boundary or bottom boundary of the sea. Sea bottom boundaries of each figure: (**a**) rising bottom; (**b**) falling bottom; (**c**) protruding bottom; (**d**) sagged bottom.


**Figure 10.** Normalized CIRs of channel with different sea bottom boundaries. Sea bottom boundaries of each figure: (**a**) rising bottom; (**b**) falling bottom; (**c**) protruding bottom; (**d**) sagged bottom.

**Figure 11.** (**a**) BER performance of channels in different sea bottom boundaries, estimated by LS and LMMSE algorithms, respectively. (**b**) MSE performance of channels in different sea bottom boundaries, estimated by LS and LMMSE algorithms, respectively.

#### *4.3. Sea Surface Boundary*

In a real ocean environment, sea surface boundaries are usually not flat. The last simulation models UWA channels with different sea surface boundaries and analyses their impact on OFDM channel estimation. Based on the experience of the previous simulations, the receiving depth is fixed at 10 m and the channels are modelled in the sea area of scenario A with a flat sea bottom boundary. Because stochastic channel simulation is computationally expensive, only the LS algorithm is used for channel estimation in this simulation.

First, four UWA channels are modelled with different sea surface boundaries generated by the Gauss-Lagrange wave model with the measured SWH (1.29 m). Figure 12 shows the eigenrays for each boundary. As the figures show, the stochastic wave boundaries perturb the reflection angles in varying ways, which changes the number of reflections.
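Generating a stochastic surface with a prescribed SWH can be sketched as follows. This is not the Gauss-Lagrange model used in the paper, only a simple correlated-Gaussian stand-in; the grid size, the correlation length, and the relation SWH ≈ 4σ (standard for Gaussian seas) are the working assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                              # surface grid points (hypothetical range/spacing)
swh = 1.29                            # target significant wave height (m)
sigma = swh / 4.0                     # Gaussian seas: SWH is about 4x the elevation std

# White Gaussian heights smoothed by a moving average -> spatially correlated surface
corr_len = 25                         # hypothetical correlation length in samples
kernel = np.ones(corr_len) / corr_len
eta = np.convolve(rng.standard_normal(n + corr_len - 1), kernel, mode="valid")
eta *= sigma / eta.std()              # rescale so the realized SWH matches the target
```

Each call with a fresh seed yields one surface realization; repeating this is how a batch of stochastic boundaries (as used later with 20 realizations per SWH) would be produced.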

Furthermore, Figure 13 shows the normalized CIRs of the channels with the different sea surface boundaries. The effect of the sea surface appears as changes in time delay and response amplitude. The CIRs of the channels with sea surfaces 2, 3, and 4 are more concentrated than that with sea surface 1, and the CIR of the channel with sea surface 4 has the largest maximum amplitude (0.9702).

**Figure 12.** Eigenrays of channel with different sea surface boundaries. The color of lines indicates whether eigenrays hit surface boundary or bottom boundary of the sea. Sea surface boundaries of each figure: (**a**) Sea Surface 1; (**b**) Sea Surface 2; (**c**) Sea Surface 3; (**d**) Sea Surface 4.


**Figure 13.** Normalized CIRs of channel with different sea surface boundaries in 1.29 m SWH. Sea surface boundaries of each figure: (**a**) Sea Surface 1; (**b**) Sea Surface 2; (**c**) Sea Surface 3; (**d**) Sea Surface 4.

Figure 14a shows the BER performance for the different sea surface boundaries. Compared with the flat-surface channel in Figure 8a, the Gauss-Lagrange waves can either improve or degrade the BER performance of OFDM channel estimation. The channels with sea surfaces 1, 2, and 3 are degraded by between 1.50 dB and 19.38 dB, whereas the channel with sea surface 4 is improved by 10.86 dB. The MSE performance of the four channels, illustrated in Figure 14b, takes the same values as in the previous simulations.

From the figures above, the differences in BER are mainly caused by the stochastic effect of the sea surface on the reflection angles of the interfering eigenrays. The main energy of the CIRs comes from the blue eigenrays (those hitting only the bottom), which are identical in the four channels. The interfering eigenrays, mainly the black eigenrays that hit both boundaries, differ in amplitude and delay because of the stochastic surface. In the channel with sea surface 1, the interfering eigenrays undergo more rebounds than in the channel with sea surface 4; as a result, they arrive with shorter time delays and higher amplitudes, which degrades the channel estimation performance.

The result above reveals the stochastic effect of the sea surface and explains the deviation between modelled and real channels reported in [20]. To further investigate the statistical effect on OFDM channel estimation, batches of channels with 0.5 m, 1.29 m, and 2 m SWH are modelled and compared in terms of BER performance.

Figure 15a shows the BER curves of the channels with different SWH. The blue, green, and red curves represent the modelled channels with 0.5 m, 1.29 m, and 2 m SWH, respectively, with 20 curves per colour. At an SNR of 30 dB, the worst curves of all three colours have BER values similar to the worst curves in Figures 8a and 11a, which supports the existence of a worst-case performance bound for UWA channel estimation with the LS algorithm. Regarding the distribution of the curves, the blue, green, and red curves are concentrated mainly in the high-, middle-, and low-BER parts of the range, respectively. Figure 15b quantifies this by plotting the mean channel estimation BER curve for each SWH value. The mean BER performance of the stochastic channels is clearly worse than that of the flat-surface channel, and the average performance gradually improves as the SWH rises. At an SNR of 30 dB, the channel with 2 m SWH performs 7.95 dB better than the one with 0.5 m SWH and 8.06 dB worse than the flat-surface channel.
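Aggregating a batch of stochastic BER curves into the mean curves of Figure 15b amounts to a per-SNR-point average. The sketch below uses synthetic curves (hypothetical monotone shapes with a lognormal spread standing in for the stochastic surface effect), not the simulated data, and expresses the gap between two mean BER values in dB.

```python
import numpy as np

rng = np.random.default_rng(2)
snr_db = np.arange(0, 31, 5)                       # SNR grid in dB

def batch_of_curves(base_ber, n_curves=20):
    """20 synthetic BER-vs-SNR curves with a random per-realization spread."""
    trend = base_ber * 10.0 ** (-snr_db / 10.0)    # toy monotone decay with SNR
    spread = rng.lognormal(0.0, 0.5, (n_curves, 1))
    return trend[None, :] * spread

# One batch per SWH value (base BER levels are illustrative only)
curves = {swh: batch_of_curves(b) for swh, b in [(0.5, 0.5), (1.29, 0.3), (2.0, 0.1)]}
mean_curves = {swh: c.mean(axis=0) for swh, c in curves.items()}

# dB gap between two mean BER curves at the top SNR point
gap_db = 10 * np.log10(mean_curves[0.5][-1] / mean_curves[2.0][-1])
```

With real simulation outputs in place of `batch_of_curves`, the same per-column mean and `10*log10` ratio would reproduce figures such as the 7.95 dB gap quoted above.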

**Figure 14.** (**a**) BER performance of channels with different sea surface boundaries. (**b**) MSE performance of channels with different sea surface boundaries.

**Figure 15.** (**a**) Distribution of BER curves with different SWH. There are 20 curves for each SWH value. (**b**) Mean BER performance of channels with different SWH.

This result follows mainly from the effect of the sea surface and the worst-case performance bound. According to the analysis of Figures 12–14, the sea surface alters the time delays and amplitudes of the interfering eigenrays, which makes the estimation performance vary. As the SWH rises, this variation grows, so channel estimation is more likely to become either better or worse, while the worst-case bound prevents the performance from deteriorating further. Consequently, channels with high SWH achieve better average estimation performance. In addition, a negative effect of the sea surface is more likely than a positive one, which is why the rough-surface channels have worse average performance than the flat-surface channel.

#### **5. Conclusions**

In this article, we design a comprehensive simulation platform combining UWA channel modelling with OFDM channel estimation. The platform is presented in a GUI and interfaced with various databases, allowing the user to model realistic UWA channels in most areas of the ocean and to estimate them with configurable inputs. Three simulations are conducted on the platform: realistic UWA channels in the East China Sea are modelled to study the influence of receiving depth, sea bottom boundary, and sea surface boundary on OFDM channel estimation. The simulations show that different environmental factors have specific effects on ray tracing, changing the time delays and amplitudes of the paths and thereby the channel estimation performance. The results show that: (1) The UWA channel with a 10 m receiving depth has a more concentrated normalized CIR and better channel estimation performance than the channels with deeper receiving depths; at an SNR of 30 dB, the performance gap reaches around 20 dB for the LS algorithm and 30 dB for the LMMSE algorithm. (2) The UWA channels with complicated sea bottom boundaries yield poor channel estimation. Their BER with the LS algorithm is around 0.0075, similar to the worst BER in the first simulation; a worst-case performance bound exists for the LS algorithm, below which the estimation performance of UWA channels can hardly fall. (3) The sea surface modelled with Gauss-Lagrange waves affects only the interfering eigenrays and has a stochastic effect on estimation performance. As the SWH increases, the average performance gradually improves because of the worst-case bound and the growing range of the stochastic effect; nevertheless, the rough-surface channels perform worse on average than the flat one. At an SNR of 30 dB, the 2 m SWH channels achieve 7.95 dB better mean BER performance than the 0.5 m SWH channels and 8.06 dB worse than the flat-surface channel with the LS algorithm.

**Author Contributions:** Conceptualization, X.W. (Xiaoyu Wang) and R.J.; Formal analysis, X.W. (Xiaohua Wang) and Q.C.; Methodology, X.W. (Xiaohua Wang) and X.W. (Xinghua Wang); Resources, R.J. and W.W.; Software, X.W. (Xiaoyu Wang); Writing—original draft, X.W. (Xiaoyu Wang); Writing—review & editing, Q.C. and X.W. (Xinghua Wang).

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Experimental Investigation of Acoustic Propagation Characteristics in a Fluid-Filled Polyethylene Pipeline**

**Qi Li 1,2,3, Jiapeng Song 1,2,3 and Dajing Shang 1,2,3,\***


Received: 16 November 2018; Accepted: 4 January 2019; Published: 9 January 2019

**Abstract:** Fluid-filled polyethylene (PE) pipelines have a wide range of applications in, for example, water supply and gas distribution systems, and it is therefore important to understand the characteristics of acoustic propagation in such pipelines in order to detect and prevent pipe ruptures caused by vibration and noise. In this paper, using the appropriate wall parameters, the frequencies of normal waves in a fluid-filled PE pipeline are calculated, and the axial and radial dependences of sound fields are analyzed. An experimental system for investigating acoustic propagation in a fluid-filled PE pipeline is constructed and is used to verify the theoretical results. Both acoustic and mechanical excitation methods are used. According to the numerical calculation, the first-, second-, and third-order cutoff frequencies are 4.6, 10.4, and 16.3 kHz, which are close to the experimentally determined values of 4.7, 10.6, and 16 kHz. Sound above a cutoff frequency is able to propagate in the axial direction, whereas sound below this frequency is attenuated exponentially in the axial direction but can propagate along the wall in the form of vibrations. The results presented here can provide some basis for noise control in fluid-filled PE pipelines.

**Keywords:** fluid-filled polyethylene (PE) pipeline; noise control; acoustic propagation; cutoff phenomenon

#### **1. Introduction**

In pipeline systems, a number of different materials can be used for the pipe walls, with the most common being steel and thermoplastics such as polyethylene (PE). PE pipelines have a wide range of applications, including transportation of liquids on ships and aircraft, long-distance transportation of natural gas, storage and transportation of liquids in the chemical industry, and urban water supply. They are of particular importance in the last of these applications owing to their advantages of high strength, high resistance to corrosion and wear, good stability over a wide range of temperatures, and lack of toxicity [1]. However, fluid-filled PE pipelines often suffer from problems caused by excessive vibration and noise [2,3]. Vibration can cause long-term fatigue damage to the pipeline system, while noise not only reduces the stability and safety of the entire pipeline system, but can also have a deleterious effect on the environment for people in the vicinity of the pipeline. Burst water supply pipelines not only result in losses of large amounts of water and consequent serious disruption to daily life, but can also lead to secondary consequences such as traffic jams and even disasters such as landslides if they are not repaired rapidly. Therefore, it is of great importance to investigate the acoustic transmission characteristics of fluid-filled PE pipelines and to develop methods to control the noise and vibration that are generated in these systems.

Before investigating acoustic propagation in fluid-filled pipelines, it is useful to consider the problem of elastic vibrations of a cylindrical shell. In the nineteenth century, Rayleigh [4] investigated the vibration of cylindrical shells and obtained the free-vibration frequency of an infinite cylindrical shell in vacuum. More recently, the dispersion characteristics of pipelines have been extensively investigated [5–9]. Junger [10–12] and Muggeridge [13] investigated vibration problems for cylindrical shells in liquids. The first to investigate acoustic propagation in a fluid-filled pipeline was Lamb [14], who came to the conclusion that this propagation is influenced by the strength of longitudinal waves in the wall compared with that of bending waves. Later, Lin and Morgan [15] investigated the dispersion properties of a sound field in a fluid-filled pipeline, and analyzed the first four normal modes of waves in an axisymmetric rigid pipeline.

Using a short-pulse signal, Kwun et al. [16] experimentally investigated the dispersion of longitudinal waves in a liquid-filled cylindrical shell and found that the liquid in the pipeline slightly reduced the group velocity and cutoff frequency of the longitudinal mode in the tube wall. Horne et al. [17] conducted an experimental investigation of acoustic propagation in a liquid-filled pipeline, examining the effects of different pipe-wall materials on the sound field in the pipe. However, their experiment suffered from the limitations that the end of the pipeline was not muffled and the sound pressure was measured at only one point at a given time. Pan et al. [18] investigated acoustic propagation in a fluid-filled pipeline both experimentally and numerically. In their experiment, sound field measurements could not be obtained throughout the fluid, because they used only two PZT (Lead zirconate titanate piezoelectric ceramic) circular transducers mounted on the pipe wall, one at the transmitting end and the other at the receiving end. Lafleur [19], Aristegui et al. [20] and Baik et al. [21] conducted systematic theoretical and experimental investigations of acoustic propagation in liquid-filled elastic pipelines, but their experimental methods and equipment were not very different from those used in previous studies. To date, there have been few theoretical and experimental studies focusing on sound below the cutoff frequency. Many engineers are not aware of the low-frequency cutoff effect and the propagation path of low-frequency noise. Most pipeline mufflers are limited to liquid noise reduction alone, and do not deal with wall vibration [22,23], which is a serious omission.

There have been a number of systematic investigations of acoustic propagation in gas-filled pipelines [24,25], and the results obtained on noise in such systems can provide some guidance for studies of noise in liquid-filled pipelines. However, these results cannot be carried over completely to the liquid case because of the differences in the characteristic impedance and speed of motion of the respective media [26,27]. There have also been a few theoretical studies of noise generated in fluid-filled elastic pipelines by supersonic flow [28], wall vibration [29], and bubble oscillations [30], but there is a lack of corresponding experimental investigations.

The present study aims to improve on the results of previous work by taking full account of the existence of a low-frequency cutoff phenomenon in a fluid-filled pipeline, such that sound below the cutoff frequency is mainly propagated through the pipeline wall. It thereby also aims to remedy some shortcomings of previous attempts at pipeline noise reduction. Both theoretical and experimental investigations of acoustic propagation in a fluid-filled PE pipeline are conducted. The remainder of the paper is organized as follows. In Section 2, the eigenequation for the sound field in an elastic PE pipeline is obtained from a theoretical analysis, and the cutoff frequencies of a normal wave in the PE tube are calculated. Section 3 describes the experimental system and the scheme for determining the acoustic propagation characteristics of the fluid-filled PE pipeline. Section 4 discusses the experimental results; the general distribution law of the sound field and the propagation path of noise in the fluid-filled PE pipeline are analyzed. Finally, Section 5 presents the conclusions of this study.

#### **2. Theoretical Analysis**

#### *2.1. Eigenequation in the Pipeline*

It should first be noted that calculations in pipe acoustics have generally been performed under the assumption of perfectly soft or perfectly rigid boundary conditions. However, the characteristic impedance of a liquid, unlike that of air, is too large to be ignored, and so these ideal boundary conditions are no longer applicable here.

The infinitely long straight pipeline considered here has outer diameter *a* and inner diameter *b*, as shown in Figure 1 [31].

**Figure 1.** Infinitely long straight fluid-filled pipeline model.

In the following, the displacement scalar potential function is denoted by $\phi$, the vector potential function by $\vec{\Psi}$, the longitudinal-wave velocity in the wall by $c\_l$, the shear-wave velocity by $c\_s$, the shear modulus of the pipe-wall material by $\mu$, the wave velocity in the liquid in the pipe by $c\_0$, and the liquid density in the pipe by $\rho\_1$. Some previous studies have assumed an axisymmetric source in a plate [32] or in a pipe, using a cylindrical coordinate system [33], while some have assumed a point source [34,35]. In the present study, the assumption of axisymmetric excitation is an important one. Under axisymmetric excitation, with $\vec{\Psi} = (0, \psi\_\theta, 0)$, the wave equation in the wall can be represented by the following equations for two scalar potential functions:

$$\begin{aligned} \nabla^2 \phi &= \frac{1}{c\_l^2} \frac{\partial^2 \phi}{\partial t^2}, \\ \left(\nabla^2 - \frac{1}{r^2}\right) \psi\_\theta &= \frac{1}{c\_s^2} \frac{\partial^2 \psi\_\theta}{\partial t^2}, \end{aligned} \tag{1}$$

where

$$
\nabla^2 = \frac{\partial^2}{\partial r^2} + \frac{1}{r}\frac{\partial}{\partial r} + \frac{\partial^2}{\partial z^2}.
$$

The radial and axial components of the displacement are

$$\begin{array}{l} u\_{r} = \frac{\partial \phi}{\partial r} - \frac{\partial \psi\_{\theta}}{\partial z}, \\ u\_{z} = \frac{\partial \phi}{\partial z} + \frac{1}{r} \frac{\partial (r \psi\_{\theta})}{\partial r}, \end{array} \tag{2}$$

where $u\_r$ and $u\_z$ are the radial and axial components of the displacement in the wall, respectively, and the normal and tangential components of the stress are

$$\begin{cases} \delta\_{rr} = \lambda \Delta + 2\mu \frac{\partial u\_r}{\partial r}, \\ \delta\_{rz} = \mu \left(\frac{\partial u\_r}{\partial z} + \frac{\partial u\_z}{\partial r}\right), \end{cases} \tag{3}$$

where $\delta\_{rr}$ is the normal component and $\delta\_{rz}$ the tangential component of the stress in the wall, and $\lambda$ and $\mu$ are the Lamé coefficients.

*Appl. Sci.* **2019**, *9*, 213

Under the assumption of a simple harmonic vibration in the *z* direction, the displacement potential function can be expressed as

$$\phi = \Phi e^{i(k\_z z - \omega t)}, \psi\_\theta = \Psi e^{i(k\_z z - \omega t)}.$$

With the time dependence ignored, by substitution of these expressions into the wave equation, the relationship between the displacement potential function and the displacement and stress in the pipe wall can be obtained.

When *b* ≤ *r* ≤ *a*, the formal solution for the potential function in the wall is

$$\begin{aligned} \phi(r,z) &= [A J\_0(k\_l r) + B Y\_0(k\_l r)]e^{ik\_z z}, \; k\_l^2 + k\_z^2 = \left(\omega/c\_l\right)^2, \\ \psi\_\theta(r,z) &= [C J\_0(k\_t r) + D Y\_0(k\_t r)]e^{ik\_z z}, \; k\_t^2 + k\_z^2 = \left(\omega/c\_s\right)^2, \end{aligned} \tag{4}$$

where *A*, *B*, *C*, and *D* are constants.

The wave equation satisfied by the water potential function is

$$
\nabla^2 \phi\_1 = \frac{1}{c\_0^2} \frac{\partial^2 \phi\_1}{\partial t^2}.\tag{5}
$$

The radial and axial components of the displacement in the water are

$$\begin{array}{l} u\_{rf} = \frac{\partial \phi\_1}{\partial r}, \\ u\_{zf} = \frac{\partial \phi\_1}{\partial z}, \end{array} \tag{6}$$

where $u\_{rf}$ is the radial component and $u\_{zf}$ the axial component of the displacement in the water.

The normal stress in the water is

$$
\delta\_{rrf} = \rho\_1 \omega^2 \phi\_1, \tag{7}
$$

where *δrr f* is the normal stress in the water.

Similarly, under the assumption of a simple harmonic vibration in the *z* direction, the displacement potential function can be expressed as

$$\phi\_1 = \Phi\_1 e^{i(k\_z z - \omega t)}.$$

With the time dependence ignored, on substitution of these expressions into the wave equation, the relationship between the displacement potential function and the displacement and stress in the water can be obtained.

When 0 ≤ *r* ≤ *b*, the formal solution for the potential function in the water is

$$
\phi\_1(r, z) = E J\_0(k\_r r) e^{ik\_z z}, \; k\_r^2 + k\_z^2 = (\omega / c\_0)^2,\tag{8}
$$

where *E* is a constant. The boundary conditions are

$$\begin{cases} \left. \delta\_{rr} \right|\_{b} = \left. \delta\_{rrf} \right|\_{b}, \\ \left. \delta\_{rz} \right|\_{b} = 0, \\ \left. u\_{r} \right|\_{b} = \left. u\_{rf} \right|\_{b}, \end{cases} \qquad \left\{ \begin{array}{l} \left. \delta\_{rr} \right|\_{a} = 0, \\ \left. \delta\_{rz} \right|\_{a} = 0. \end{array} \right. \tag{9}$$

Substitution of the formal solutions from Equations (4) and (8) into the expressions for the stress and displacement, and then into the boundary conditions (9), gives the eigenequation

$$
\begin{bmatrix} P(a) & Q(a) & R(a) & S(a) & 0 \\ P(b) & Q(b) & R(b) & S(b) & -\frac{\rho\_1 \omega^2}{2\mu} J\_0(k\_r b) \\ M J\_1(k\_l a) & M Y\_1(k\_l a) & G J\_1(k\_t a) & G Y\_1(k\_t a) & 0 \\ M J\_1(k\_l b) & M Y\_1(k\_l b) & G J\_1(k\_t b) & G Y\_1(k\_t b) & 0 \\ k\_l J\_1(k\_l b) & k\_l Y\_1(k\_l b) & ik\_z k\_t J\_1(k\_t b) & ik\_z k\_t Y\_1(k\_t b) & k\_r J\_1(k\_r b) \end{bmatrix} \begin{bmatrix} A \\ B \\ C \\ D \\ E \end{bmatrix} = 0,\tag{10}
$$

where

$$\begin{aligned} P(r) &= -T J\_0(k\_l r) + \frac{k\_l}{r} J\_1(k\_l r), \quad & R(r) &= N \left[ J\_0(k\_t r) - \frac{1}{k\_t r} J\_1(k\_t r) \right], \\ Q(r) &= -T Y\_0(k\_l r) + \frac{k\_l}{r} Y\_1(k\_l r), \quad & S(r) &= N \left[ Y\_0(k\_t r) - \frac{1}{k\_t r} Y\_1(k\_t r) \right], \\ T &= \frac{1}{2} (k\_l^2 - k\_z^2), \quad G = k\_l (k\_l^2 - k\_z^2), & N &= -ik\_z k\_l^2, \quad M = 2ik\_z k\_l. \end{aligned}$$

#### *2.2. Calculation of the Normal Frequency*

For Equation (10) to have a nonzero solution, the determinant of the coefficient matrix must vanish, i.e.,

$$\begin{vmatrix} P(a) & Q(a) & R(a) & S(a) & 0 \\ P(b) & Q(b) & R(b) & S(b) & -\frac{\rho\_1 \omega^2}{2\mu} J\_0(k\_r b) \\ M J\_1(k\_l a) & M Y\_1(k\_l a) & G J\_1(k\_t a) & G Y\_1(k\_t a) & 0 \\ M J\_1(k\_l b) & M Y\_1(k\_l b) & G J\_1(k\_t b) & G Y\_1(k\_t b) & 0 \\ k\_l J\_1(k\_l b) & k\_l Y\_1(k\_l b) & ik\_z k\_t J\_1(k\_t b) & ik\_z k\_t Y\_1(k\_t b) & k\_r J\_1(k\_r b) \end{vmatrix} = 0. \tag{11}$$

This equation is the dispersion relation.

If $k\_z = 0$, then $k\_r = \omega/c\_0$, $k\_l = \omega/c\_l$, and $k\_t = \omega/c\_s$, and there is no sound propagation in the axial direction of the pipeline. Then $\omega$ can be obtained by substituting these values of $k\_r$, $k\_l$, and $k\_t$ into Equation (11), and the corresponding frequency is the normal frequency of the corresponding order of vibration of the fluid-filled elastic pipeline.
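Solving the full 5×5 determinant requires the wall parameters of Table 1, but a rough cross-check is possible with the pressure-release (ideal soft-wall) approximation, in which the axisymmetric cutoffs fall at the zeros of $J\_0$. In the sketch below, the sound speed (1480 m/s) and inner radius (0.125 m) are assumed values chosen for illustration, not taken from the paper.

```python
import numpy as np
from scipy.special import jn_zeros

c0 = 1480.0        # sound speed in water (m/s), assumed
b = 0.125          # inner radius of the pipe (m), hypothetical value

# Pressure-release wall: the pressure J0(kr*r) must vanish at r = b, so
# kr*b equals a zero of J0 and the cutoffs are f_n = c0 * j_{0,n} / (2*pi*b)
j0n = jn_zeros(0, 3)                  # first three zeros of J0
f_cut = c0 * j0n / (2 * np.pi * b)    # cutoff frequencies (Hz)
# f_cut is approximately [4532, 10402, 16307] Hz
```

Even this idealized boundary lands close to the reported 4.6, 10.4, and 16.3 kHz, which suggests the cutoff spacing is dominated by the Bessel-zero structure, while the elastic wall shifts the exact values.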

Table 1 shows an example of the normal frequencies of the first four orders of vibration calculated using Newton's iterative method with the wall parameters of the experimental liquid-filled PE pipeline (also shown in the table).



#### *2.3. Axial and Radial Dependence of the Sound Field*

The sound field in the tube can be analyzed in terms of the formal solution in Equation (8) for the displacement potential function in the water. If the radial wavenumber $k\_r$ is negligible, then only the axial dependence of the wave needs to be considered. When $k\_z$ is real, $e^{ik\_z z}$ is a periodic function, and the sound wave can propagate for long distances along the axial direction. When $k\_z$ is imaginary, $e^{ik\_z z}$ becomes a decaying exponential, and the normal wave is transformed into a nonuniform wave attenuated exponentially along the axial direction; it therefore has very little influence on the sound field far along the pipeline axis, and the sound wave cannot propagate for long distances in the pipeline.
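The real/imaginary $k\_z$ dichotomy can be made concrete: below the cutoff, $|k\_z| = \sqrt{k\_r^2 - k^2}$ and the field decays as $e^{-|k\_z| z}$. The short sketch below uses an assumed water sound speed of 1480 m/s and the 4.6 kHz first cutoff purely for illustration.

```python
import numpy as np

c0, f_cut = 1480.0, 4600.0           # water sound speed (m/s) and first cutoff (Hz)
k_r = 2 * np.pi * f_cut / c0         # radial wavenumber fixed by the cutoff

def axial_decay_db_per_m(f):
    """dB of attenuation per metre for a tone below the cutoff frequency."""
    k = 2 * np.pi * f / c0
    alpha = np.sqrt(k_r**2 - k**2)   # |k_z| when k_z is imaginary
    return 20 * np.log10(np.e) * alpha

att = axial_decay_db_per_m(3000.0)   # 3 kHz tone, below the 4.6 kHz cutoff
```

For a 3 kHz tone this gives roughly 130 dB of attenuation per metre: the wave is evanescent in the water column and effectively dies out within centimetres, consistent with the observation that such sound reaches the far field only via wall vibration.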

If the axial wavenumber $k\_z$ is negligible, then only the radial dependence of the wave needs to be considered. This dependence is given by the zeroth-order Bessel function $J\_0(k\_r r)$, and so the sound pressure is greatest close to the axis.

#### **3. Experimental Apparatus and Procedure**

To verify the theoretical results, experiments were carried out using the system shown in Figure 2. These experiments focused on the distribution of the sound field and the acoustic propagation behavior in the pipeline for different excitation sources when the liquid in the pipeline was stationary. The experimental conditions are listed in Table 2, and photographs of the experimental conditions and experiment apparatus are shown in Figure 3.


**Table 2.** Experimental conditions.

Normal waves in the pipeline can be analyzed under white-noise conditions. The white-noise frequency range was selected as 0–20 kHz according to the sampling frequency of the collector and the theoretically calculated normal frequencies. The variation of the sound field along the axial direction can be analyzed under single-frequency excitation: in experimental conditions 2 and 3, one frequency below and one above the cutoff frequency were selected to verify the cutoff effect of the sound in the pipeline. To compare the propagation path under mechanical excitation with that under the corresponding acoustic excitation, experimental conditions 4 and 5 involved transmitting single-frequency mechanical force signals at the same frequencies as the sound source.

The maximum sampling rate of the B&K pulse collector was 131,072 Hz. The sensors used in this experiment were a B&K8103 hydrophone and B&K4371 vibration sensor. Their specifications are given in Table 3.

**Figure 2.** Experimental system for investigating acoustic propagation characteristics of a fluid-filled PE pipeline.

**Figure 3.** Experimental apparatus and acquisition system: (**a**) B&K2713 power amplifier; (**b**) YE5859 charge amplifier; (**c**) B&K pulse collector; (**d**) Agilent 33522A signal source; (**e**) hydrophone bracket; (**f**) experimental acquisition system.



As mentioned in references [10–13], there have been many experimental investigations of acoustic propagation in fluid-filled elastic pipelines; however, these pipelines were rather short, and there was no special treatment of the end of the pipeline other than, in some cases, the simple addition of a flange, which caused inverse superposition of the sound field in the axial direction. These previous experiments used a single hydrophone, measuring the sound pressure spectrum at a single point only, and therefore it was not possible to obtain the distribution of the sound field along the entire pipeline.

To avoid the above problems, in the present experiment, an 18-m-long PE pipeline with two layers of anechoic tips of different lengths installed at the end was used, which completely eliminated echo and prevented inverse superposition of the sound field in the axial direction. As the source, a piston transducer was mounted on the front of the pipe through a flange, and vibration isolation material was interposed between the flange and the pipeline, thereby preventing direct excitation of the pipe wall. The wall parameters were the same as those used in the theoretical calculations (see Table 1). Five slots, each of length 1 m, in the axial direction were cut in the pipe wall at a distance of 1 m from the transducer, with a 1 m gap between them (see Figure 4).

**Figure 4.** Slots of the pipeline.

Four 8103 hydrophones were mounted on a bracket, and so the sound pressure at four different radial positions could be measured at the same time, as shown in Figure 2. The hydrophone bracket was made from polyvinyl chloride (PVC), which has characteristic impedance similar to that of water, greatly reducing the scattering of sound waves as they passed through the bracket. Three other hydrophones measured the near-field signal at 0, 0.05, 0.1, and 0.15 m from the axial center of the transducer. After the pipeline was filled with water, it was allowed to stand for more than 30 h to eliminate the effect of bubbles on the experiment.

After the fluid column (water) was excited by the transducer, the hydrophones were moved along the axial slots away from the source, and recordings were taken at 10 cm intervals; thus, each slot had 9 or 10 recording points. The use of the hydrophone bracket ensured that the radial positions of the hydrophones remained unchanged when the axial position was changed. In this way, the sound pressure distributions of the fluid column along both the axial and radial directions were measured.

The analysis bandwidth of the collector was set to 0–25,600 Hz, in accordance with the theoretical normal frequencies, and the corresponding sampling frequency was set to 65,536 Hz according to the Nyquist criterion, with a sampling time of 10 s. The power spectrum for each working condition was obtained through fast Fourier transform (FFT) processing of the time-domain signal. To determine the propagation path of the sound in the experimental system and compare it with the acoustic signal of the fluid in the pipeline at the same time, the vibration of the wall was also measured. Three rows of vibration sensors encircled the outside of the wall, each row containing four sensors, with a separation of 0.33 m between rows.
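The spectral processing described above (FFT of the time-domain record into a power spectrum) can be sketched as follows; the record length and the test tone are illustrative values, not parameters taken from the experiment:

```python
import numpy as np

FS = 65536          # sampling frequency (Hz), as in the experiment
N = 4096            # record length per FFT block (illustrative; the paper sampled 10 s)

# Synthetic hydrophone record: a 5.2 kHz tone in weak background noise
t = np.arange(N) / FS
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5200 * t) + 0.01 * rng.standard_normal(N)

# Power spectrum via FFT (frequency resolution FS / N = 16 Hz here)
X = np.fft.rfft(x)
power_db = 20 * np.log10(np.abs(X) / (N / 2) + 1e-12)
freqs = np.fft.rfftfreq(N, d=1 / FS)

peak = freqs[np.argmax(power_db)]  # main-frequency bin of the record
```

The peak of the spectrum recovers the tone frequency to within one frequency bin.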

The deployment of the hydrophones and vibration sensors and the corresponding labels are shown in Figure 5.

**Figure 5.** Deployment and labeling of hydrophones and vibration sensors.

#### **4. Results and Discussion**

#### *4.1. Behavior of Normal Waves in the Pipeline*

For working condition 1, the transmitting voltage level response of the transducer is shown in Figure 6. The measurement results from the hydrophones at different positions along the axial direction in the pipeline are shown in Figure 7. It can be seen that the acoustic signal in the pipeline exhibits a significant cutoff phenomenon, with a cutoff frequency of 4.7 kHz, which is close to the theoretically determined cutoff frequency of 4.6 kHz.

**Figure 6.** Transmitting voltage level response of transducer.

**Figure 7.** Measurement results from hydrophones at different positions along the axial direction under white noise excitation: (**a**) 0.05, 0.1, and 0.15 m from the source; (**b**) 1.1 m from the source; (**c**) 5.95 m from the source; (**d**) 0.1, 5, 6, and 7 m from the source.

The sound energy of the far field can be divided into four intervals from the spectrum: (1) Below 4.7 kHz; (2) 4.7–10.6 kHz; (3) 10.6–16 kHz; (4) 16 kHz and above. The boundary points of these intervals are the frequencies of the normal wave in the pipeline, which are basically consistent with the calculated frequencies of the corresponding orders, as shown in Table 4.


**Table 4.** Comparison of calculated and experimental normal frequencies.

As mentioned before, the experimental pipeline is slotted. PE is not a very rigid material and, as a result of the slotting process, the tube is deformed radially. This is the main reason for the error between the actual measured frequency and the theoretically calculated frequency. In addition, the parameters in Table 1 are the material elastic parameters of standard high-density PE, which are not necessarily exactly the same as those of the wall of the experimental pipeline, which also leads to an error between the theoretical and experimental results.

In interval (1), the curve of the power spectrum is very close to the background noise. Close to 4.7 kHz, however, the curve rises suddenly, exhibiting a cutoff phenomenon. In interval (2), the curve changes relatively gently, which is basically consistent with the corresponding band of the transmitting transducer's frequency response curve; as the distance between the measurement points and the source increases, the curve decreases monotonically. In interval (3), the curve changes more sharply, and many resonance peaks appear; as the frequency increases, the distance between adjacent peaks also increases. The appearance of the curve in interval (4) is similar to that in interval (3).

In terms of the behavior of the normal wave, interval (1) is below the cutoff frequency, and the normal wave is attenuated exponentially in the axial direction. The curve of the power spectrum in this frequency band is basically the same as the background, and it can be seen that the power spectrum decays exponentially with frequency in the near field.

Interval (2) lies between the first-order and second-order normal wave frequencies; only the first-order normal wave can propagate, and the curve of the power spectrum hardly changes with frequency. The trend in this section of the curve is related to the transmitting response of the source, with the curve monotonically decreasing along the axial direction as a result of absorption of sound waves by the tube wall.

The first and second orders of the normal wave propagate simultaneously in interval (3), and the first, second, and third orders propagate in interval (4), where there is strong interference leading to large fluctuations in the curve and to the appearance of many resonant peaks.
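The interval boundaries above are the normal-mode cutoff frequencies of the fluid column. As a rough check, assuming for simplicity a pressure-release outer boundary, the cutoffs follow $f\_n = j\_{0,n}\,c/(2\pi a)$, where $j\_{0,n}$ are the zeros of $J\_0$. The sketch below uses illustrative values $a = 0.125$ m and $c = 1480$ m/s, which are assumptions and not the Table 1 parameters:

```python
import math

# First four zeros of the Bessel function J0 (standard tabulated values)
J0_ZEROS = [2.4048, 5.5201, 8.6537, 11.7915]

def cutoff_frequencies(radius_m, sound_speed_ms):
    """Normal-mode cutoff frequencies of a fluid column with a
    pressure-release (soft) outer boundary: f_n = j_{0,n} * c / (2*pi*a)."""
    return [z * sound_speed_ms / (2 * math.pi * radius_m) for z in J0_ZEROS]

# Illustrative values (assumed): a = 0.125 m, c = 1480 m/s
f = cutoff_frequencies(0.125, 1480.0)
# f[0], f[1], f[2] come out near 4.5, 10.4, and 16.3 kHz, close to the
# measured interval boundaries of 4.7, 10.6, and 16 kHz
```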

The results of the vibration signal measurements are shown in Figure 8. The frequency response of the vibration signal is similar to that of the acoustic signal in its overall trend, with a cutoff phenomenon, and the division of the modal frequencies of each order is obvious. The vibration signal in the frequency band above the cutoff frequency is transmitted from the acoustic signal in the water.

**Figure 8.** Measurement results from vibration sensors on the outside of the pipeline wall: (**a**) 2 m from the source; (**b**) 3 m from the source.

#### *4.2. Variation of the Sound Field along the Axial Direction*

The variation of the sound field along the axial direction can be seen more clearly from analysis of the response to a single-frequency source compared with the response to the white noise in working condition 1. In working conditions 2 and 3, two representative single-frequency signals were used, 4.2 and 5.2 kHz, which are respectively below and above the cutoff frequency. The main frequency power spectrum was obtained as an average of the measurements by the four hydrophones on the bracket, and its variation along the axial distance is shown in Figure 9. It can be seen from Figure 9a that for a source frequency below the cutoff frequency, the power spectrum of the main frequency is attenuated very rapidly, indeed exponentially, as the distance increases. Acoustic signals below the cutoff frequency cannot propagate axially over long distances in the pipeline. For a source frequency above the cutoff frequency, as shown in Figure 9b, the power spectrum of the main frequency hardly changes with distance. There is only 4 dB attenuation from 3 to 10 m, and this attenuation is a result of acoustic absorption by the pipeline wall.
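The exponential decay below cutoff follows from the axial wavenumber $k\_z = \sqrt{k^2 - k\_r^2}$ becoming imaginary there, so the pressure decays as $e^{-|k\_z| z}$. A minimal sketch of the implied attenuation rate, assuming $c = 1480$ m/s (an assumed value):

```python
import math

def axial_attenuation_db_per_m(f_hz, f_cutoff_hz, c_ms=1480.0):
    """Attenuation rate of a normal mode driven below its cutoff.
    Below cutoff, kz = sqrt(k^2 - kr^2) is imaginary, so the pressure
    decays as exp(-|kz| z); in decibels this is 20*log10(e)*|kz| per metre."""
    k = 2 * math.pi * f_hz / c_ms
    kr = 2 * math.pi * f_cutoff_hz / c_ms   # radial wavenumber at cutoff
    if f_hz >= f_cutoff_hz:
        return 0.0                           # propagating: no modal attenuation
    kz_imag = math.sqrt(kr * kr - k * k)
    return 20 * math.log10(math.e) * kz_imag

# A 4.2 kHz source below the 4.7 kHz cutoff: tens of dB per metre,
# consistent with the very rapid decay seen in Figure 9a
att = axial_attenuation_db_per_m(4200.0, 4700.0)
```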

**Figure 9.** Variation of the main frequency average power spectrum along the axial direction: (**a**) 4.2 kHz source; (**b**) 5.2 kHz source.

#### *4.3. Variation of the Sound Field along the Radial Direction*

To explore the variation of the sound field in the radial direction, measurements were performed before and after the hydrophone bracket was raised by 1.5 cm, as shown schematically in Figure 10. The results of these measurements are shown in Figure 11.

**Figure 10.** Raising of hydrophone bracket.

**Figure 11.** Results of hydrophone measurements at different depths 5.05 m from the source: (**a**) Before and (**b**) after lifting of hydrophone bracket.

The frequency band between the first- and second-order normal frequencies was analyzed. Before lifting, the 2# and 3# hydrophones were at the same distance from the axis, and similarly for the 1# and 4# hydrophones. Therefore, in Figure 11a, the curves of the power spectra from the 2# and 3# hydrophones are the same, as are those from the 1# and 4# hydrophones. After lifting, the 3# hydrophone is nearest to the axis, followed in order by the 2#, 4#, and 1# hydrophones, which is consistent with the increasing strengths of the respective power spectra. This is in accordance with the theoretical radial dependence on the Bessel function *J*0(*krr*) in Equation (8).

The following is a quantitative analysis of the radial distribution. The distance between each pair of hydrophones is 50 mm, and the hydrophone bracket is initially at the radial center position. After lifting, the distances of the 3#, 2#, 4#, and 1# hydrophones from the axis are 10, 40, 60, and 90 mm, respectively. At the cutoff frequency, *kr* = *k*0, depending on the distance *r* from the axis, the theoretical differences between the sound pressure self-spectrum measured by the 3# hydrophone and those measured by the other three hydrophones can be calculated, and the results are compared with the experimental measurements in Table 5. For convenience of exposition, the distance between the 3# hydrophone and the axis is set as *r*0, and the differences between the sound pressure power spectrum of the 1#, 2#, and 4# hydrophones and the 3# hydrophone as *X*1, *X*2, and *X*3, respectively.


**Table 5.** Comparison of hydrophone measurements and theoretical values at different radial locations.

There is good agreement between theory and experiment, and the radial distribution of the normal wave in the tube is quantitatively confirmed to follow the Bessel function behavior.
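The theoretical level differences compared in Table 5 can be reproduced from the $J\_0(k\_r r)$ dependence of Equation (8). The sketch below assumes $c = 1480$ m/s (so $k\_r = 2\pi f\_c/c$ at the 4.7 kHz cutoff) and evaluates $J\_0$ by its power series:

```python
import math

def j0(x):
    """Bessel function J0 via its power series (accurate for |x| < ~3)."""
    term, total = 1.0, 1.0
    for m in range(1, 20):
        term *= -(x * x / 4.0) / (m * m)
        total += term
    return total

# Radial wavenumber at the cutoff frequency (assumed c = 1480 m/s, f_c = 4.7 kHz)
kr = 2 * math.pi * 4700.0 / 1480.0

# Hydrophone distances from the axis after lifting (m): 3#, 2#, 4#, 1#
r3, r2, r4, r1 = 0.010, 0.040, 0.060, 0.090

def level_diff_db(r_ref, r):
    """Theoretical level difference 20*log10(J0(kr*r_ref) / J0(kr*r))."""
    return 20 * math.log10(j0(kr * r_ref) / j0(kr * r))

X1 = level_diff_db(r3, r1)   # 3# vs 1#: largest difference
X2 = level_diff_db(r3, r2)   # 3# vs 2#: smallest difference
X3 = level_diff_db(r3, r4)   # 3# vs 4#
```

The ordering X1 > X3 > X2 reflects the monotonic decrease of $J\_0$ over this range of arguments.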

The sources of error are as follows. The magnitude of the lifting was controlled manually and therefore not very accurately, and this is the main source of error. If the lifting were slightly larger, the 3# and 4# hydrophones would be closer to the axis and would measure a higher amplitude, while the 1# and 2# hydrophones would be further from the axis and would measure a smaller amplitude, leading to an increase in *X*2. According to the properties of the Bessel function, the closer its argument is to the axis, the slower its rate of change; therefore, the power spectrum at the 4# hydrophone would increase more under lifting, reducing *X*3. Moreover, if the 1# hydrophone were closer to the upper slot, where the outside medium is air (which can be regarded as an absolutely soft boundary), the sound would be totally reflected, the amplitude would become higher, and *X*1 would decrease. In addition, the radial deformation of the pipeline due to slotting, scattering by the hydrophone bracket, and the fact that the hydrophones were not oriented strictly in the axial direction are all possible sources of error.

#### *4.4. Measurements under Mechanical Excitation*

Working conditions 4 and 5 used mechanical force excitation applied to the outside of the pipeline wall, directly beneath the position of the sound source in the previous working conditions, as shown in Figure 12. Corresponding to working conditions 2 and 3, the single excitation frequencies in working conditions 4 and 5 were 4.2 and 5.2 kHz, respectively. The measurement results from the hydrophones and vibration sensors are shown in Figures 13–16. In contrast to the acoustic source, the exciter applied a single-point excitation to the pipe wall. Therefore, when the sound propagated mainly along the wall, the source excitation and the exciter excitation produced completely different radial distributions, whereas when the sound propagated mainly through the liquid in the tube, the two produced the same radial distribution. This is an important basis for judging the propagation path of sound in a pipeline.

**Figure 12.** Exciter and excitation point in working conditions 4 and 5.

**Figure 13.** Mechanical excitation at a single frequency of 4.2 kHz. Measurement results from hydrophones at different distances from the excitation point: (**a**) 1.1 m; (**b**) 5.1 m.

Even if the pipeline wall is excited by a mechanical force, an acoustic signal below the cutoff frequency cannot propagate a long distance in the case of weak excitation. It can be seen from Figure 13b that the hydrophones 5.1 m from the excitation point have difficulty in picking up a signal. In Figure 13a, the acoustic signal at the main frequency exhibits a radial dependence that is different from the Bessel function: The hydrophone near the lower outer wall close to the excitation point receives a stronger signal.

**Figure 14.** Mechanical excitation at a single frequency of 4.2 kHz. Measurement results from vibration sensors at different distances from the excitation point: (**a**) 1.3 m; (**b**) 5.3 m.

From the vibration measurement results in Figure 14, it can be seen that a vibration signal at the main frequency can still be measured on the wall at 5.3 m, but it is attenuated compared with the measurement from the sensor at 1.3 m, and the hydrophones in the pipeline are unable to detect any signal at 5.3 m. It can be deduced that the signal at 4.2 kHz is propagated mainly through the pipe wall in the form of vibrations. The second and third peaks in Figure 14 result from frequency doubling. The exciter generates frequency doubling when transmitting a single-frequency signal, which is a consequence of its own physical structure and has no effect on the results of this experiment.

**Figure 15.** Mechanical excitation at a single frequency of 5.2 kHz. Measurement results from hydrophones at different distances from the excitation point: (**a**) 1.1 m; (**b**) 1.1 m (local); (**c**) 5.1 m; (**d**) 5.1 m (local).

**Figure 16.** Mechanical excitation at a single frequency of 5.2 kHz. Measurement results from vibration sensors at different distances from the excitation point: (**a**) 1.3 m; (**b**) 5.1 m.

For excitation at 5.2 kHz, which is above the cutoff frequency, Figure 15a,b shows the measurement results from hydrophones 1.1 m from the excitation point, and Figure 15c,d those from hydrophones 5.1 m from the excitation point, with Figure 15b,d being partial displays of the frequency band near the main frequency shown in Figure 15a,c, respectively. Figure 16 shows the measurement results from the wall vibration sensors 1.3 and 5.3 m from the excitation point.

The sound power at the main frequency suffers almost no attenuation in the axial direction from 1.1 to 5.1 m, and conforms to the Bessel function dependence in the radial direction. It can be deduced that the signal at 5.2 kHz is propagated mainly in the form of sound through the fluid in the pipeline.

#### **5. Conclusions**

The first four orders of normal frequencies in a fluid-filled PE pipeline were calculated, and the distributions of sound in the axial and radial directions were analyzed. The acoustic propagation characteristics of such a pipeline were also studied in an experimental system.

Both the theoretical and experimental investigations have revealed the following:


**Author Contributions:** Conceptualization, Q.L. and J.S.; methodology, Q.L.; software, J.S.; data validation, J.S.; formal analysis, D.S.; writing—original draft preparation, J.S.; writing—review and editing J.S. and D.S.; supervision and project administration, Q.L.

**Funding:** This research was funded by the Acoustic Science and Technology Laboratory, Harbin Engineering University (SSJSWDZC2018010) and by the National Natural Science Foundation of China (11874131).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Regularization Factor Selection Method for** *l***1-Regularized RLS and Its Modification against Uncertainty in the Regularization Factor**

#### **Junseok Lim 1,\* and Seokjin Lee <sup>2</sup>**


Received: 5 December 2018; Accepted: 4 January 2019; Published: 8 January 2019

#### **Featured Application: This algorithm can be applied to various kinds of sparse channel estimations, e.g., room impulse response, early reflection, and underwater channel response.**

**Abstract:** This paper presents a new *l*1-RLS method for sparse impulse response estimation. A new regularization factor calculation method is proposed for *l*1-RLS that requires no prior information about the true channel response. In addition, we derive a new model to compensate for uncertainty in the regularization factor. Estimation results for many different kinds of sparse impulse responses show that the proposed method, without a priori channel information, is comparable to the conventional method with a priori channel information.

**Keywords:** *l*1-regularized RLS; sparsity; room impulse response; total least squares; regularization factor

#### **1. Introduction**

Room impulse response (RIR) estimation arises in many applications that use acoustic signal processing. RIR identification [1] is fundamental for various applications such as room-geometry-related spatial audio [2–5], acoustic echo cancellation (AEC) [6], speech enhancement [7], and dereverberation [8]. As shown in [9], the RIR has relatively large magnitude values during the early part of the reverberation and fades to smaller values during the later part, which indicates that most RIR entries have values close to zero; the RIR therefore has a sparse structure. The sparse RIR model is useful for estimating RIRs in real acoustic environments when the source is given a priori [10]. There has been recent interest in adaptive algorithms for sparsity in various signals and systems [11–22]. Many adaptive algorithms based on least mean squares (LMS) [11,12] and recursive least squares (RLS) [14–17] have been reported with different penalty functions. In sparse estimation research, Eksioglu and Tanc [17] proposed a sparse RLS algorithm, *l*1-RLS, which is fully recursive like the plain RLS algorithm, together with a calculation method for its regularization factor. These recursive algorithms have potential for sparse RIR estimation; however, the regularization factor must be established before applying them, and its calculation requires information about the true sparse channel response for good performance. The authors in [18,19] have also proposed recursive regularization factor selection methods; however, these methods still need the true impulse response in advance.

In this paper, we propose a new regularization factor calculation method for the *l*1-RLS algorithm in [17]. The new calculation needs no prior information about the true channel response, which makes it possible to apply the *l*1-RLS algorithm in various room environments. In addition, we derive a new model equation for the *l*1-RLS in [17] with uncertainty in the regularization factor, and show that this model is similar to the total least squares (TLS) model, which compensates for uncertainty in the calculated regularization factor without the true channel response. For the performance evaluation, we simulate four different sparse channels and compare channel estimation performances. We show that, without any information about the true channel impulse response, the performance of the proposed algorithm is comparable to that of *l*1-RLS with that information.

This paper is organized as follows. In Section 2, we summarize the *l*1-RLS of [17]. In Section 3, we summarize the measure of sparsity. In Section 4, we propose a new method for the regularization factor calculation. In Section 5, we show that *l*1-RLS with uncertainty in the regularization factor can be modeled as a TLS problem. In Section 6, we summarize the *l*1-RTLS (recursive total least squares) algorithm as a solution for *l*1-RLS with uncertainty in the regularization factor. In Section 7, we present simulation results showing the performance of the proposed algorithm. Finally, we give the conclusion in Section 8.

#### **2. Summary of** *l***1-RLS**

In the sparse channel estimation problem of interest, the system observes a signal represented by an $M \times 1$ vector $\mathbf{x}(k) = [x\_k, \cdots, x\_{k-M+1}]^T$ at time instant $k$, performs filtering, and obtains the output $y(k) = \mathbf{x}^T(k)\mathbf{w}\_o(k)$, where $\mathbf{w}\_o(k) = [w\_k, \cdots, w\_{k-M+1}]^T$ is the $M$-dimensional actual system with a finite impulse response (FIR) structure. For system estimation, an adaptive filter applies an $M$-dimensional vector $\mathbf{w}(k)$ to the same signal vector $\mathbf{x}(k)$, produces an estimated output $\hat{y}(k) = \mathbf{x}^T(k)\mathbf{w}(k)$, and calculates the error signal $e(k) = y(k) + n(k) - \hat{y}(k) = \tilde{y}(k) - \hat{y}(k)$, where $n(k)$ is the measurement noise, $y(k)$ is the output of the actual system, and $\hat{y}(k)$ is the estimated output. In order to estimate the channel impulse response, an adaptive algorithm minimizes the cost function defined by

$$\mathbf{w} = \operatorname\*{argmin}\_{\mathbf{w}} \frac{1}{2} \sum\_{m=0}^{k} \lambda^{k-m} (e(m))^2. \tag{1}$$

From the gradient based minimization, Equation (1) becomes

$$\mathbf{R}(k)\mathbf{w}(k) = \mathbf{r}(k),\tag{2}$$

where $\mathbf{R}(k) = \sum\_{m=0}^{k} \lambda^{k-m}\mathbf{x}(m)\mathbf{x}^T(m)$ and $\mathbf{r}(k) = \sum\_{m=0}^{k} \lambda^{k-m}\tilde{y}(m)\mathbf{x}(m)$. This is the normal equation for the least squares solution. In particular, $\mathbf{w}\_o(k)$ is considered a sparse system when the number of nonzero coefficients $K$ is less than the system order $M$. In order to estimate the sparse system, most estimation algorithms exploit the non-zero coefficients of the system [11–17]. In [17], Eksioglu proposed a fully recursive *l*1-regularized algorithm by minimization of the objective function shown in Equation (3).

$$J\_k = \frac{1}{2}\varepsilon\_k + \gamma\_k \|\mathbf{w}\|\_1, \tag{3}$$

where $\varepsilon\_k = \sum\_{m=0}^{k} \lambda^{k-m}(e(m))^2$. From the minimization of Equation (3), a modified normal equation was derived, as shown in Equation (4).

$$\mathbf{R}(k)\mathbf{w}(k) = \mathbf{r}(k) - \gamma\_k \nabla^s \|\mathbf{w}(k)\|\_1 = \mathbf{\hat{p}}(k). \tag{4}$$

When we solve Equation (4), we should select the regularization factor as shown in Equation (5).

$$\gamma\_k = \frac{2\frac{tr\{\mathbf{R}^{-1}(k)\}}{M}}{\left\|\mathbf{R}^{-1}(k)\nabla^s f(\mathbf{w}(k))\right\|\_2^2} \times \left[ (f(\mathbf{w}(k)) - \rho) + \nabla^s f(\mathbf{w}(k))\mathbf{R}^{-1}(k)\varepsilon(k) \right],\tag{5}$$

where $f(\mathbf{w}(k)) = \|\mathbf{w}(k)\|\_1$ and the subgradient of $f(\mathbf{w})$ is $\nabla^s \|\mathbf{w}\|\_1 = \operatorname{sgn}(\mathbf{w})$. In Equation (5), the regularization factor contains a parameter, *ρ*, which should be set beforehand. In [17], the parameter was set as $\rho = f(\mathbf{w}\_{true}) = \|\mathbf{w}\_{true}\|\_1$, with $\mathbf{w}\_{true}$ denoting the impulse response of the true channel. There was no further discussion about how to set *ρ*. However, it is not practical to know the true channel in advance.
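For orientation, the plain normal equation of Equation (2) can be accumulated recursively and solved directly. A minimal sketch with an illustrative 4-tap system; the forgetting factor and data are assumptions, not values from the paper:

```python
import numpy as np

def rls_normal_equation(x_frames, y, lam=0.99):
    """Accumulate R(k) = sum lam^(k-m) x(m)x^T(m) and
    r(k) = sum lam^(k-m) y(m)x(m) recursively, then solve
    R(k) w(k) = r(k), the normal equation of Equation (2)."""
    M = x_frames.shape[1]
    R = 1e-6 * np.eye(M)          # small diagonal load for invertibility
    r = np.zeros(M)
    for xm, ym in zip(x_frames, y):
        R = lam * R + np.outer(xm, xm)
        r = lam * r + ym * xm
    return np.linalg.solve(R, r)

# Identify a short sparse FIR system from noisy observations
rng = np.random.default_rng(1)
w_true = np.array([0.0, 1.0, 0.0, -0.5])
X = rng.standard_normal((400, 4))                 # regressor vectors x(m)
y = X @ w_true + 0.01 * rng.standard_normal(400)  # noisy system output
w_hat = rls_normal_equation(X, y)
```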

#### **3. Measure of Sparseness**

In [20], the sparseness of a channel impulse response is measured by Equation (6).

$$\chi = \frac{L}{L - \sqrt{L}} \left( 1 - \frac{||\hat{\mathbf{w}}||\_1}{\sqrt{L} ||\hat{\mathbf{w}}||\_2} \right) \tag{6}$$

where $\|\hat{\mathbf{w}}\|\_p$ is the $p$-norm of $\hat{\mathbf{w}}$ and $L$ is the dimension of $\hat{\mathbf{w}}$. The range of *χ* is 0 ≤ *χ* ≤ 1, and its value depends on the sparseness of $\hat{\mathbf{w}}$: As $\hat{\mathbf{w}}$ becomes sparser, the sparsity *χ* approaches 1, and as $\hat{\mathbf{w}}$ becomes denser, *χ* approaches 0. We often obtain a small but nonzero value of *χ*, even for a dense channel. For example, Figure 1 shows the relation between the value of *χ* and the percentage of nonzero components in $\hat{\mathbf{w}}$ with *L* = 215. In Figure 1, we consider all possible numbers of nonzero components in $\hat{\mathbf{w}}$.

**Figure 1.** Sparsity (*χ*) vs. the percentage of nonzero coefficients in the channel impulse response.
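Equation (6) is straightforward to implement. The sketch below checks the two endpoint cases: a 1-sparse vector gives *χ* = 1, and a constant (maximally dense) vector gives *χ* = 0:

```python
import math

def sparsity(w):
    """Sparsity measure chi of Equation (6):
    chi = L/(L - sqrt(L)) * (1 - ||w||_1 / (sqrt(L) * ||w||_2))."""
    L = len(w)
    l1 = sum(abs(v) for v in w)
    l2 = math.sqrt(sum(v * v for v in w))
    return (L / (L - math.sqrt(L))) * (1.0 - l1 / (math.sqrt(L) * l2))

L = 215                                # dimension used in Figure 1
one_sparse = [0.0] * L; one_sparse[7] = 1.0
dense = [1.0] * L

chi_sparse = sparsity(one_sparse)      # endpoint: 1-sparse vector -> chi = 1
chi_dense = sparsity(dense)            # endpoint: constant vector -> chi = 0
```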

#### **4. New** *ρ* **Selection Method in the Sparsity Regularization Constant** *γ<sup>k</sup>*

Section 2 shows that the regularization constant $\gamma\_k$ in Equation (5) requires *ρ* to be set as $\rho = \|\mathbf{w}\_{true}\|\_1$, the *l*1-norm of the true system impulse response. Because this requirement makes Equation (5) impractical, we need a new method for selecting the constant; this section therefore proposes one.

For a practical constant selection, we can consider using the estimated vector $\hat{\mathbf{w}}$ instead of the true vector $\mathbf{w}\_{true}$, because $\hat{\mathbf{w}}$, the solution under the *l*1-norm penalty, will be closer to the sparse true vector than the solution of the conventional RLS. As the iterations proceed, $\hat{\mathbf{w}}$ converges to the true value; conventional RLS also converges, but the *l*1-penalized solution stays closer to the sparse true value. Therefore, we can use the sparse estimate $\hat{\mathbf{w}}$ instead of $\mathbf{w}\_{true}$ when setting *ρ*, and the resulting uncertainty is compensated through a TLS solution in the next section. When determining *ρ* from the estimate $\hat{\mathbf{w}}$, we choose between the averaged *ρ* and the current estimate $\|\hat{\mathbf{w}}\|\_1$. Table 1 summarizes the *ρ* selection steps.

The determination method for the *ρ* value shown in Table 1 is as follows. In Step 1, the sparsity of the estimate $\hat{\mathbf{w}}$ is calculated; the sparsity expresses the sparseness of $\hat{\mathbf{w}}$ as a number [23]. In Step 2, the *l*1-norm of the estimate $\hat{\mathbf{w}}$ is scaled, and the value is averaged with the previous *ρ* value. The scaling value approaches 1 as the sparsity *χ* approaches 1, but approaches $e^{-1} \approx 0.37$ as *χ* approaches 0. Therefore, the scaling leaves the *l*1-norm of a sparse $\hat{\mathbf{w}}$ essentially unchanged, while it reduces the *l*1-norm of a dense $\hat{\mathbf{w}}$. In Step 3, the smaller of the averaged *ρ* and the *l*1-norm of the estimate $\hat{\mathbf{w}}$ is selected as the new *ρ* value; the *ρ* value is replaced entirely if the *l*1-norm of the estimate is selected, and otherwise the previous trend is maintained. In Figure 1, the reference value 0.75 used in Step 3 means that fewer than 16% of all the impulse response taps are nonzero.


**Table 1.** *ρ* selection method in the sparsity regularization constant *γk*.
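The three steps can be sketched as follows. Note that the exact scaling function is not stated here, so the form $e^{\chi - 1}$ below is an assumption chosen only to match the stated limits (≈1 for a sparse estimate, ≈$e^{-1}$ for a dense one):

```python
import math

def sparsity(w):
    """Sparsity measure chi of Equation (6)."""
    L = len(w)
    l1 = sum(abs(v) for v in w)
    l2 = math.sqrt(sum(v * v for v in w))
    return (L / (L - math.sqrt(L))) * (1.0 - l1 / (math.sqrt(L) * l2))

def update_rho(w_hat, rho_prev):
    """One pass of the rho selection of Table 1 (a sketch).
    The scaling exp(chi - 1) is an assumed form matching the stated
    limits: ~1 as chi -> 1 and ~e^-1 ~ 0.37 as chi -> 0."""
    chi = sparsity(w_hat)                                  # Step 1: sparsity of the estimate
    l1 = sum(abs(v) for v in w_hat)
    rho_avg = 0.5 * (rho_prev + math.exp(chi - 1.0) * l1)  # Step 2: scale and average
    return min(rho_avg, l1)                                # Step 3: keep the smaller value
```

For a 1-sparse estimate, the scaling is ≈1 and the update reduces to taking the smaller of the averaged *ρ* and $\|\hat{\mathbf{w}}\|\_1$.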

#### **5. New Modeling for** *l***1-RLS with Uncertainty in the Regularization Factor**

If we set *ρ* = constant, the regularization factor becomes

$$\begin{array}{lcl}\widetilde{\gamma}\_{k} &=& \frac{2\frac{tr\{\mathbf{R}^{-1}(k)\}}{M}}{\left\|\mathbf{R}^{-1}(k)\nabla^{s}f(\mathbf{w}(k))\right\|\_{2}^{2}} \times \left[\left(f(\mathbf{w}(k))-\text{constant}\right)+\nabla^{s}f(\mathbf{w}(k))\mathbf{R}^{-1}(k)\varepsilon(k)\right] \\ &=& \frac{2\frac{tr\{\mathbf{R}^{-1}(k)\}}{M}}{\left\|\mathbf{R}^{-1}(k)\nabla^{s}f(\mathbf{w}(k))\right\|\_{2}^{2}} \times \left(f(\mathbf{w}(k))-\left\|\mathbf{h}\right\|\_{1}+\left\|\mathbf{h}\right\|\_{1}-\text{constant}\right) + \frac{2\frac{tr\{\mathbf{R}^{-1}(k)\}}{M}\nabla^{s}f(\mathbf{w}(k))\mathbf{R}^{-1}(k)\varepsilon(k)}{\left\|\mathbf{R}^{-1}(k)\nabla^{s}f(\mathbf{w}(k))\right\|\_{2}^{2}}.\end{array} \tag{7}$$

Then,

$$\tilde{\gamma}\_k = \gamma\_k + \frac{2\frac{tr\{\mathbf{R}^{-1}(k)\}}{M}(\|\mathbf{h}\|\_1 - \text{constant})}{\left\|\mathbf{R}^{-1}(k)\nabla^s f(\mathbf{w}(k))\right\|\_2^2} = \gamma\_k + \Delta\gamma. \tag{8}$$

Using Equation (8), Equation (4) becomes

$$\mathbf{R}(k)\mathbf{w}(k) = \mathbf{r}(k) - (\gamma\_k + \Delta\gamma)\nabla^s \|\mathbf{w}(k)\|\_1. \tag{9}$$

The subgradient $\nabla^s \|\mathbf{w}\|\_1 = \operatorname{sgn}(\mathbf{w})$ can be represented as

$$\nabla^s \|\mathbf{w}(k)\|\_1 = \begin{bmatrix} \ddots & & & \\ & \ddots & & \\ & & \frac{1}{|\mathbf{w}\_i|} & & \\ & & & \ddots & \\ & & & & \ddots \end{bmatrix} \mathbf{w}(k). \tag{10}$$

By applying Equation (10) to Equation (9),

$$\left(\mathbf{R}(k) + \Delta\gamma \begin{bmatrix} \ddots & & \\ & \frac{1}{|\mathbf{w}\_{i}|} & \\ & & \ddots \end{bmatrix}\right) \mathbf{w}(k) = \mathbf{r}(k) - \gamma\_{k} \nabla^s\|\mathbf{w}(k)\|\_1, \tag{11}$$

where $w\_i$ is the $i$-th element of $\mathbf{w}(k)$. This can then be simplified as

$$\left(\mathbf{R}(k) + \Delta\gamma \begin{bmatrix} \ddots & & & \\ & \frac{1}{|\mathbf{w}\_i|} & & \\ & & \ddots & \\ & & & \ddots \end{bmatrix}\right) \mathbf{w}(k) = \mathbf{\hat{p}}(k) \tag{12}$$

Equation (12) is very similar to the system model in Figure 2 that is contaminated by noise both in input and in output. Suppose that an example of the system in Figure 2 is represented as

$$\begin{bmatrix} \mathbf{x}\_{k} + \boldsymbol{n}\_{i,k} & \cdots & \mathbf{x}\_{k-N+1} + \boldsymbol{n}\_{i,k-N+1} \\ \mathbf{x}\_{k-1} + \boldsymbol{n}\_{i,k-1} & \cdots & \mathbf{x}\_{k-N} + \boldsymbol{n}\_{i,k-N} \\ \vdots & \ddots & \vdots \\ \mathbf{x}\_{k-N+1} + \boldsymbol{n}\_{i,k-N+1} & \cdots & \mathbf{x}\_{k-2N+2} + \boldsymbol{n}\_{i,k-2N+2} \end{bmatrix} \times \mathbf{w}(k) = \begin{bmatrix} y\_{k} + \boldsymbol{n}\_{o,k} \\ y\_{k-1} + \boldsymbol{n}\_{o,k-1} \\ \vdots \\ y\_{k-N+1} + \boldsymbol{n}\_{o,k-N+1} \end{bmatrix},\tag{13}$$

where $x\_k$ is $x(k)$, $n\_{i,k}$ is $n\_i(k)$, and $n\_{o,k}$ is $n\_o(k)$. Equation (13) is simplified as

$$\mathbf{A}\mathbf{w}(k) = \mathbf{b}.\tag{14}$$

**Figure 2.** The model of a noisy input and noisy output system.

If we multiply Equation (14) by $\mathbf{A}^H$ and average it, we get

$$E\left(\mathbf{A}^H \mathbf{A}\right)\mathbf{w}(k) = E\left(\mathbf{A}^H \mathbf{b}\right). \tag{15}$$

We can rewrite Equation (15) as follows

$$\begin{bmatrix} r\_{xx}(0) + \sigma\_{\eta}^{2} & r\_{xx}(1) & \cdots & r\_{xx}(N-1) \\ r\_{xx}(1) & r\_{xx}(0) + \sigma\_{\eta}^{2} & \cdots & r\_{xx}(N-2) \\ \vdots & \vdots & \ddots & \vdots \\ r\_{xx}(N-1) & r\_{xx}(N-2) & \cdots & r\_{xx}(0) + \sigma\_{\eta}^{2} \end{bmatrix} \mathbf{w}(k) = \begin{bmatrix} r\_{xy}(0) \\ r\_{xy}(1) \\ \vdots \\ r\_{xy}(N-1) \end{bmatrix} \tag{16}$$

Then, it can be represented as

$$\left(\mathbf{R} + \sigma\_{\eta}^2 \mathbf{I}\right)\mathbf{w}(k) = \breve{\mathbf{p}}(k). \tag{17}$$

When we compare Equation (12) with Equation (17), the two system models have almost the same form. Therefore, it is feasible that the TLS method can be applied to Equation (12) [24–30]. Therefore, we expect to obtain almost the same performance as *l*1-RLS with the true channel response if we apply the TLS method by the regularization factor with the new *ρ* in Table 1. In the next section, we summarize *l*1-RTLS (recursive total least squares) algorithm in [29].
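The key feature of the TLS model, that input noise inflates the input correlation matrix by $\sigma\_i^2\mathbf{I}$ as in Equations (16) and (17), can be checked numerically. A minimal sketch with assumed noise levels:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 200000, 4
sigma_i = 0.3                                # assumed input-noise std

x = rng.standard_normal((N, M))              # clean input regressors
x_noisy = x + sigma_i * rng.standard_normal((N, M))

R_clean = x.T @ x / N                        # sample estimate of R
R_noisy = x_noisy.T @ x_noisy / N            # sample estimate of R + sigma_i^2 * I

# The bias appears only on the diagonal and approaches sigma_i^2 = 0.09
diag_bias = np.mean(np.diag(R_noisy - R_clean))
```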

#### **6. Summary of** *l***1-RTLS for the Solution of** *l***1-RLS with Uncertainty in the Regularization Factor**

Lim, one of the authors of this paper, has proposed the TLS solution for *l*1-RLS known as *l*1-RTLS [30]. In this section, we summarize *l*1-RTLS in [30] for the solution of Equation (11).

The TLS system model assumes that both input and output are contaminated by additive noise as Figure 2. The output is given by

$$
\tilde{y}(k) = \tilde{\mathbf{x}}^T(k)\mathbf{w}\_o + n\_o(k), \tag{18}
$$

where the output noise $n\_o(k)$ is Gaussian white noise with variance $\sigma\_o^2$. The noisy input vector of the system is modeled by

$$
\widetilde{\mathbf{x}}(k) = \mathbf{x}(k) + \mathbf{n}\_i(k) \in \mathbb{C}^{M \times 1}, \tag{19}
$$

where $\mathbf{n}\_i(k) = [n\_i(k), n\_i(k-1), \cdots, n\_i(k-M+1)]^T$ and the input noise $n\_i(k)$ is Gaussian white noise with variance $\sigma\_i^2$. For the TLS solution, we form the augmented data vector

$$\overline{\mathbf{x}}(k) = \left[\widetilde{\mathbf{x}}^T(k), \widetilde{\mathbf{y}}(k)\right]^T \in \mathbb{R}^{(M+1)\times 1}.\tag{20}$$

The correlation matrix is represented as

$$
\overline{\mathbf{R}} = \begin{bmatrix}
\widetilde{\mathbf{R}} & \mathbf{p} \\
\mathbf{p}^T & c
\end{bmatrix},
\tag{21}
$$

where $\mathbf{p} = E\left\{\widetilde{\mathbf{x}}(k)\tilde{y}(k)\right\}$, $c = E\left\{\tilde{y}(k)\tilde{y}(k)\right\}$, $\mathbf{R} = E\left\{\mathbf{x}(k)\mathbf{x}^T(k)\right\}$ and $\widetilde{\mathbf{R}} = E\left\{\widetilde{\mathbf{x}}(k)\widetilde{\mathbf{x}}^T(k)\right\} = \mathbf{R} + \sigma\_i^2\mathbf{I}$. In [27,28], the TLS problem becomes finding the eigenvector associated with the smallest eigenvalue of $\overline{\mathbf{R}}$. Equation (22) is the typical cost function for finding this eigenvector.

$$J(k) = \frac{1}{2}\widetilde{\mathbf{w}}^T(k)\overline{\mathbf{R}}(k)\widetilde{\mathbf{w}}(k),\tag{22}$$

where $\overline{\mathbf{R}}(k)$ is the sample correlation matrix at the $k$-th instant, and $\widetilde{\mathbf{w}}(k) = \left[\hat{\mathbf{w}}^T(k), -1\right]^T$, in which $\hat{\mathbf{w}}(k)$ is the estimate of the unknown system at the $k$-th instant. We modify the cost function by adding a penalty function in order to reflect prior knowledge about the sparsity of the true system.

$$J(k) = \frac{1}{2}\widetilde{\mathbf{w}}^T(k)\overline{\mathbf{R}}(k)\widetilde{\mathbf{w}}(k) + \lambda\left(\widetilde{\mathbf{w}}^T(k)\widetilde{\mathbf{w}}(k-1) - 1\right) + \gamma\_k f(\widetilde{\mathbf{w}}(k)),\tag{23}$$

where $\lambda$ is the Lagrange multiplier and $\gamma\_k$ is the regularization parameter as in [13]. We solve $\nabla\_{\widetilde{\mathbf{w}}} J(k) = 0$ and $\nabla\_{\lambda} J(k) = 0$ simultaneously. From $\nabla\_{\widetilde{\mathbf{w}}} J(k) = 0$:

$$2\overline{\mathbf{R}}(k)\widetilde{\mathbf{w}}(k) + \lambda\widetilde{\mathbf{w}}(k-1) + \gamma\_k \nabla^s f(\widetilde{\mathbf{w}}(k)) = 0,\tag{24}$$

$$
\nabla\_{\lambda} J(k) = 0: \quad \widetilde{\mathbf{w}}^{\top}(k)\widetilde{\mathbf{w}}(k-1) = 1,\tag{25}
$$

where the subgradient of $f(\widetilde{\mathbf{w}}) = \|\widetilde{\mathbf{w}}\|\_1$ is $\nabla^s\_{\widetilde{\mathbf{w}}}\|\widetilde{\mathbf{w}}\|\_1 = \mathrm{sgn}(\widetilde{\mathbf{w}})$. From Equation (24), we obtain

$$
\widetilde{\mathbf{w}}(k) = -\frac{\lambda}{2} \overline{\mathbf{R}}^{-1}(k) \widetilde{\mathbf{w}}(k-1) - \gamma\_k \overline{\mathbf{R}}^{-1}(k) \nabla^s f(\widetilde{\mathbf{w}}(k)).\tag{26}
$$

Substituting Equation (26) in Equation (25), we get

$$\left(-\frac{\lambda}{2}\overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1) - \gamma\_k \overline{\mathbf{R}}^{-1}(k)\nabla^s f(\widetilde{\mathbf{w}}(k))\right)^T \times \widetilde{\mathbf{w}}(k-1) = 1,\tag{27}$$

or

$$\lambda = -2\,\frac{1 + \gamma\_k \nabla^s f(\widetilde{\mathbf{w}}(k))^T \overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1)}{\widetilde{\mathbf{w}}^T(k-1)\overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1)}. \tag{28}$$

Substituting *λ* in Equation (26) by Equation (28) leads to

$$\widetilde{\mathbf{w}}(k) = \frac{1 + \gamma\_k \nabla^s f(\widetilde{\mathbf{w}}(k))^T \overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1)}{\widetilde{\mathbf{w}}^T(k-1)\overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1)} \times \overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1) - \gamma\_k \overline{\mathbf{R}}^{-1}(k)\nabla^s f(\widetilde{\mathbf{w}}(k)). \tag{29}$$

Equation (29) can be expressed in a simple form as

$$
\widetilde{\mathbf{w}}(k) = \alpha\,\overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1) - \gamma\_k \overline{\mathbf{R}}^{-1}(k)\nabla^s f(\widetilde{\mathbf{w}}(k)), \tag{30}
$$

where $\alpha = \frac{1 + \gamma\_k \nabla^s f(\widetilde{\mathbf{w}}(k))^T \overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1)}{\widetilde{\mathbf{w}}^T(k-1)\overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1)}$. Because asymptotically $\|\widetilde{\mathbf{w}}(k)\| = 1$ as $k \to \infty$, Equation (29) can be approximated by the following two equations:

$$\widetilde{\mathbf{w}}(k) \simeq \overline{\mathbf{R}}^{-1}(k)\widetilde{\mathbf{w}}(k-1) - \gamma\_k\left(\widetilde{\mathbf{w}}^{T}(k-1)\overline{\mathbf{R}}^{-1}(k-1)\widetilde{\mathbf{w}}(k-1)\right)\overline{\mathbf{R}}^{-1}(k)\nabla^{s}f(\widetilde{\mathbf{w}}(k-1)).\tag{31}$$

$$
\widetilde{\mathbf{w}}(k) = \widetilde{\mathbf{w}}(k) / \|\widetilde{\mathbf{w}}(k)\|. \tag{32}
$$

Finally, we obtain the estimated parameter of the unknown system as

$$
\hat{\mathbf{w}}(k) = -\widetilde{\mathbf{w}}\_{1:M}(k) / \widetilde{\mathbf{w}}\_{M+1}(k). \tag{33}
$$

For Equation (23), we can use the modified regularization factor $\gamma\_k$ from [30]:

$$\gamma\_k = \frac{2\,\mathrm{tr}\left(\overline{\mathbf{R}}^{-1}(k)\right)/M}{\left\|\overline{\mathbf{R}}^{-1}(k)\nabla^s f\left(\hat{\mathbf{w}}\_{aug}(k)\right)\right\|\_2^2} \times \left[\left(f\left(\hat{\mathbf{w}}\_{aug}(k)\right) - \rho\right) + \nabla^s f\left(\hat{\mathbf{w}}\_{aug}(k)\right)^T\overline{\mathbf{R}}^{-1}(k)\varepsilon(k)\right],\tag{34}$$

where $\hat{\mathbf{w}}\_{aug}(k) = \left[\hat{\mathbf{w}}^T(k), -1\right]^T$, $\hat{\mathbf{w}}\_{aug,RLS}(k) = \left[\hat{\mathbf{w}}\_{RLS}^T(k), -1\right]^T$, $\varepsilon(k) = \hat{\mathbf{w}}\_{aug}(k) - \hat{\mathbf{w}}\_{aug,RLS}(k)$, and $\hat{\mathbf{w}}\_{RLS}(k)$ is the parameter estimated by recursive least squares (RLS). As $f(\hat{\mathbf{w}}) = \|\hat{\mathbf{w}}\|\_1$, the subgradient of $f(\hat{\mathbf{w}}\_{aug}(k))$ is

$$\nabla^{s}\left\|\hat{\mathbf{w}}\_{aug}(k)\right\|\_{1} = \mathrm{sgn}\left(\hat{\mathbf{w}}\_{aug}(k)\right). \tag{35}$$

As mentioned in Section 4, we apply the new constant $\rho$ in Table 1 to the regularization factor $\gamma\_k$ in Equation (34) instead of $\|\mathbf{w}\_{true}\|\_1$, where $\mathbf{w}\_{true}$ is the true system impulse response.
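To make the recursion of Equations (31)–(33) concrete, the following is a minimal sketch on a toy problem. It assumes a fixed, noise-free augmented correlation matrix with matched input/output noise power and a small constant γ in place of the data-driven γ<sub>k</sub> of Equation (34); all names and values are illustrative, not the paper's implementation.

```python
import numpy as np

def l1_rtls_step(R_bar_inv, w_aug, gamma_k):
    # Eq. (31): inverse-power-style update with an l1 subgradient penalty,
    # followed by the renormalization of Eq. (32)
    quad = w_aug @ R_bar_inv @ w_aug          # w^T R_bar^{-1} w
    w_new = R_bar_inv @ w_aug - gamma_k * quad * (R_bar_inv @ np.sign(w_aug))
    return w_new / np.linalg.norm(w_new)

M = 4
w_true = np.array([0.0, 2.0, 0.0, -1.0])      # sparse toy system
sigma2 = 1e-3                                  # assumed matched noise power
C = np.hstack([np.eye(M), w_true[:, None]])    # white unit-power input
R_bar = C.T @ C + sigma2 * np.eye(M + 1)       # augmented correlation, Eq. (21)
R_bar_inv = np.linalg.inv(R_bar)

w_aug = np.append(np.ones(M), -1.0) / np.sqrt(M + 1)
for _ in range(50):
    w_aug = l1_rtls_step(R_bar_inv, w_aug, gamma_k=1e-5)

w_hat = -w_aug[:M] / w_aug[M]                  # Eq. (33): de-normalize
```

The iterate converges toward the minor eigenvector of the augmented correlation matrix, and Equation (33) recovers the system taps from it.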

#### **7. Simulation Results**

This section confirms the performance of the proposed algorithm in sparse channel estimation. In the first experiment, the channel estimation performance is compared with other algorithms using randomly generated sparse channels, following the same scenario as the experiments in [17]. The true system vector $\mathbf{w}\_{true}$ has 64 dimensions. In order to generate the sparse channel, we set the number of nonzero coefficients, S, out of the 64 coefficients and randomly position them. The values of the coefficients are drawn from an *N*(0, 1/S) distribution, where *N*( ) denotes the normal distribution. In the simulation, we estimate the channel impulse response by the proposed algorithms, i.e., *l*1-RLS using the *ρ* in Table 1 and *l*1-RTLS using the *ρ* in Table 1. For comparison, we also estimate the channel impulse response by *l*1-RLS using the true channel response; in addition, we execute the regular RLS algorithm in an oracle setting (oracle-RLS), where the positions of the true nonzero system parameters are assumed to be known. For the estimated channels, we calculate the mean square deviation (MSD), where MSD $= E\left\{\|\hat{\mathbf{w}} - \mathbf{w}\_{true}\|^2\right\}$, $\hat{\mathbf{w}}$ is the estimated channel response and $\mathbf{w}\_{true}$ is the true channel response. For the performance evaluation, we simulate the algorithms on sparse channels with S = 4, 8, 16, and 32.
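The sparse-channel setup and the MSD metric described above can be sketched as follows (function names are illustrative; averaging the single-realization MSD over independent runs approximates the expectation):

```python
import numpy as np

def random_sparse_channel(length=64, S=8, rng=None):
    # S nonzero taps at random positions, values drawn from N(0, 1/S)
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(length)
    support = rng.choice(length, size=S, replace=False)
    w[support] = rng.normal(0.0, np.sqrt(1.0 / S), size=S)
    return w

def msd(w_hat, w_true):
    # squared deviation ||w_hat - w_true||^2 for one realization
    return float(np.sum((np.asarray(w_hat) - np.asarray(w_true)) ** 2))

w_true = random_sparse_channel(length=64, S=8)
```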

Figure 3 illustrates the MSD curves. For S = 4, Figure 3a shows that the estimation performance of *l*1-RTLS using the regularization factor with the *ρ* in Table 1 is almost the same as that of *l*1-RLS using regularization with the true channel impulse response. However, the performance of *l*1-RLS using the regularization factor with the *ρ* in Table 1 gradually degrades and shows a kind of uncertainty accumulation effect. For the other values of S, we can observe the same trend in the MSD curves. Therefore, we can confirm that the new regularization factor selection method and the new modeling for *l*1-RLS can estimate the sparse channel as well as *l*1-RLS using regularization with the true channel impulse response. In all the simulation scenarios, the oracle-RLS algorithm produces the lowest MSD, as expected.

**Figure 3.** Steady-state MSD for S = 4, 8, 16, and 32 when applying the new *ρ* method to the regularization factor (-o-: *l*1-RLS with the true channel response, -×-: *l*1-RLS with the new *ρ* method, -\*-: proposed *l*1-RTLS with the new *ρ* method, --: oracle-RLS).

Table 2 summarizes the steady-state MSD values as S varies from 4 to 32. The results show that the proposed *l*1-RTLS with the new *ρ* is comparable to *l*1-RLS with the true channel.

In the second experiment, we compare channel estimation performance using a room impulse response (RIR). The size of the room is (7.49, 6.24, 3.88) m. The position of the sound source is (1.53, 0.96, 1.12) m and the position of the receiver is (1.81, 5.17, 0.71) m. The reverberation time T60 is set to 100 ms and 400 ms, and the impulse response of the room is generated using the program in [31]. We focus on the direct and early reflection parts of the RIR because they have a sparse property. This is the part that is estimated in AEC applications [32], and it is also related to localization and clarity in room acoustics [33–35]. Comparing the impulse response (IR) generated by setting T60 = 100 ms to the 64-coefficient channel used in the first experiment, it is equivalent to S = 4. In the same manner, the IR generated by setting T60 = 400 ms is equivalent to S = 10.


**Table 2.** MSD (mean square deviation) comparison.

Table 3 summarizes the steady-state MSD values. The results also show the same trend as Table 2. In RIR estimation, the proposed *l*1-RTLS with the new *ρ* is also comparable to *l*1-RLS with the true channel.


**Table 3.** MSD (mean square deviation) comparison in sparse RIR estimations.

#### **8. Conclusions**

In this paper, we have proposed a regularization factor for recursive adaptive estimation. The regularization factor needs no prior knowledge of the true channel impulse response. We have also reformulated the recursive estimation algorithm as an *l*1-RTLS type. This formulation is robust to uncertainty in the regularization factor without a priori knowledge of the true channel impulse response. Simulations show that the proposed regularization factor and *l*1-RTLS algorithm provide performance comparable to *l*1-RLS with knowledge of the true channel impulse response.

**Author Contributions:** Conceptualization, J.L.; Methodology, J.L.; Validation, J.L. and S.L.; Formal analysis, J.L.; Investigation, J.L. and S.L.; Writing—original draft preparation, J.L.; Writing—review and editing, S.L.; Visualization, J.L.; Project administration, J.L.; Funding acquisition, J.L. and S.L.

**Funding:** This research received no external funding.

**Acknowledgments:** This research was supported by Agency for Defense Development (ADD) in Korea (UD160015DD).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Evaluation of Cracks in Metallic Material Using a Self-Organized Data-Driven Model of Acoustic Echo-Signal**

### **Xudong Teng 1,2,†, Xin Zhang 3,†, Yuantao Fan <sup>4</sup> and Dong Zhang 1,\***


Received: 24 November 2018; Accepted: 22 December 2018; Published: 28 December 2018

**Abstract:** Non-linear acoustic techniques are an attractive approach for evaluating early fatigue as well as cracks in materials. However, their accuracy is greatly restricted by the external non-linearities of ultrasonic measurement systems. In this work, an acoustical data-driven deviation detection method based on statistical probability models, called the consensus self-organizing models (COSMO), was introduced to study the evolution of localized crack growth. Using the pitch-catch technique, the frequency spectra of acoustic echoes collected from different locations of a specimen were compared, resulting in a Hellinger distance matrix from which statistical parameters such as the *z*-score, *p*-value and T-value were constructed. It is shown that the statistical significance (*p*-value) of the COSMO method has a strong relationship with crack growth. In particular, the T-value, a logarithm-transformed *p*-value, increases proportionally with the growth of cracks and can thus be applied to locate the position of cracks and monitor the deterioration of materials.

**Keywords:** crack growth; acoustic echo; COSMO; *p*-value

#### **1. Introduction**

Nonlinear ultrasonic behaviors, such as harmonics, frequency mixing, and resonance frequency shifts, have been proven to be sensitive to structural imperfections and early degradation of materials [1–6]. In the early stage of damage, material fatigue can induce a number of micro-cracks with a typical length of 1–100 μm through continuous loading cycles; the micro-cracks then grow further, coalesce with other micro-cracks and eventually form macro-cracks [7]. Since fatigue cracks are localized [8], not uniformly distributed in the structure, the generated nonlinear response basically depends on the configuration of the crack area related to the localized hysteretic deformation. Clapping between the contacting surfaces and dissipative mechanisms such as frictional sliding [7,8] lead to a much stronger nonlinearity than in the surrounding material, which behaves linearly [8]. However, the non-linear effects induced by localized cracks in materials are not pronounced enough to be conveniently measured and analyzed [9,10]. In addition, the power amplifier, transducers, and coupling media used in an ultrasonic testing system also introduce external non-linearity. Since it is difficult to separate the structure-induced non-linearity in materials from this external non-linearity, non-linear ultrasonic technology is not widely applied to accurately evaluate and locate structural imperfections in practical applications [11].

In terms of the uncertainties in real-life testing conditions, "big data" sets collected from acoustic echo-signals of large amounts of damaged material show different but distinguishable statistical characteristics compared with intact material [12–17]. Some statistical models, also called data-driven models, relate the degree of damage to the probability of detection (PoD). They were first introduced by the National Aeronautics and Space Administration (NASA) in 1973 and were soon accepted as a standard method [18,19]. Later, Lu and Meeker further developed statistical methods to estimate a time-to-failure distribution for a broad class of degradation structures [16]. Gebraeel employed Bayesian updating to develop a closed-form residual-life distribution for the monitored device [20]. Gang Qi et al. proposed a framework to systematically evaluate material damage based on large data sets collected using acoustic emission (AE) [12]. Zhou et al. investigated AE relative energy, amplitude distribution and amplitude spectrum to discern the delamination damage mechanism of composites [21]. Kůs et al. employed a model-based clustering (Hellinger divergence) method to classify certain attributes of the original data obtained directly from acoustic emission signals and formed normed frequency spectra to perform physical separation tasks of AE random signals [22].

However, most data-driven methods are probabilistic and rely on historical degradation data or empirical knowledge [23,24], which requires a large amount of experimental data to construct reference curves or preset feature thresholds. Hence, in our previous work [25], we first introduced the consensus self-organizing model (COSMO), which requires neither domain knowledge nor supervision to extract useful features, to detect a single flaw located at a fixed position in steel specimens based on acoustical echo-signals. Nevertheless, as the contacting surfaces of the flaw produced by an electrical discharge machine were totally separated in [25], evaluating actual fatigue cracks with the COSMO method could potentially be problematic. In this paper, four cracks distributed along the length direction of a steel specimen were investigated, and the cracks' growth (produced using fatigue testing [26]) was further discussed in detail, which is probably in accord with the mechanisms involved in contact acoustic nonlinearity and hysteretic nonlinearity. Both numerical simulations and experimental measurements showed that COSMO models are effective for NDT inspection, as well as health monitoring of regular metallic structures.

#### **2. The Consensus Self-Organizing Models (COSMO)**

The COSMO method identifies the typical variability within a group of systems and evaluates the likelihood of any individual being significantly different from the majority. In an ultrasonic testing system, a group of acoustic echo-signals is collected using the pitch-catch method from N different locations (i.e., the total number of samples) on a measured object. The spectral density of each acoustic signal is then obtained so that the spectral densities of two acoustic signals can be compared [27,28], using the Hellinger distance $d\_{i,j}$ [22–24]:

$$d\_{i,j} = \frac{1}{\sqrt{2}} \sqrt{\sum\_{l=1}^{m} \left(\sqrt{p\_i(l)} - \sqrt{q\_j(l)}\right)^2},\tag{1}$$

where $p\_i$ and $q\_j$ are normalized histograms representing the spectral densities of two acoustic signals, and *m* is the number of sampled points of each acoustic signal. Two acoustic signals with different spectral densities yield a greater Hellinger distance than two similar ones. From the perspective of clustering analysis, samples with a larger sum of distances to all peers are prone to be outliers.
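A minimal sketch of the Hellinger distance of Equation (1) between two normalized spectra (nonnegative vectors summing to 1); the function name is illustrative:

```python
import numpy as np

def hellinger(p, q):
    # Eq. (1): 0 for identical histograms, 1 for histograms with disjoint support
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)
```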

For all pairs of these histograms, Hellinger distance was computed, resulting in a symmetric distance matrix D:

$$D = \begin{pmatrix} 0 & d\_{1,2} & \cdots & d\_{1,N} \\ d\_{2,1} & 0 & \cdots & d\_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ d\_{N,1} & \cdots & d\_{N,N-1} & 0 \end{pmatrix}.$$

Within D, the row with the minimum sum of distances is chosen as the most central pattern *c*, representing the most typical testing sample within the group. Based on the most central pattern *c*, the *z*-score is computed for each sample *k*, representing the fraction of all samples that are farther away from *c* than sample *k*, that is,

$$z(k) = \frac{\left|\left\{i = 1, 2, \ldots, N : d\_{i,c} > d\_{k,c}\right\}\right|}{N}, \quad k = 1, 2, \ldots, N,\tag{2}$$

where $d\_{k,c}$ is the distance between sample *k* and *c*; $d\_{i,c}$ denotes the distance between sample *i* and *c*; N is the total number of samples. A sample with a *z*-score close to 0 indicates a large distance to its peers. If there is no micro-crack at a particular position, the *z*-scores of a set of testing samples from this position should be uniformly distributed between (0, 1). We approximate the distribution of the average $\bar{z}$ of *n* samples using a normal distribution with mean 0.5 and variance 1/12*n*, i.e., [23,24]

$$\bar{z} \sim N\left[\frac{1}{2}, \left(\frac{1}{12n}\right)^{1/2}\right].\tag{3}$$

Therefore, the one-sided *p*-value, which is the probability that an observation picked from a normal distribution with parameters $\left(1/2, (1/12n)^{1/2}\right)$ falls in the interval $(-\infty, \bar{z}]$, can be computed as:

$$p = \frac{1}{\sqrt{2\pi}\left(1/\sqrt{12n}\right)} \int\_{-\infty}^{\bar{z}} e^{-\frac{\left(z - 1/2\right)^2}{2\left(1/\sqrt{12n}\right)^2}}\, dz. \tag{4}$$

Based on a uniformity test of the *z*-scores over an area, the resulting *p*-value is used to estimate whether the inspected region contains any micro-flaws.
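The distance matrix, central pattern and *z*-score computation above can be sketched as follows. The sketch follows the textual definition of Equation (2) (fraction of samples farther from *c* than sample *k*, so outliers get *z* near 0); the function names and toy spectra are illustrative:

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance of Eq. (1) between normalized histograms
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

def cosmo_z_scores(spectra):
    """spectra: (N, m) array of normalized spectral densities, one per location."""
    N = len(spectra)
    D = np.array([[hellinger(spectra[i], spectra[j]) for j in range(N)]
                  for i in range(N)])                 # symmetric distance matrix
    c = int(np.argmin(D.sum(axis=1)))                 # most central pattern
    d_c = D[:, c]                                     # distances to pattern c
    # Eq. (2): fraction of samples farther from c than sample k
    return np.array([np.sum(d_c > d_c[k]) / N for k in range(N)])

# four similar spectra and one outlier: the outlier gets the smallest z-score
spectra = np.array([[0.5, 0.5]] * 4 + [[0.95, 0.05]])
z = cosmo_z_scores(spectra)
```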

#### **3. Crack Identification Based on COSMO**

In this section, we employ the COSMO model to identify and locate cracks in steel specimens by numerical simulation and experimental tests.

#### *3.1. Numerical Simulation*

A two-dimensional model of a steel board embedded with a single crack was simulated using commercial software (Comsol Multiphysics V4.3a, COMSOL, Inc., Palo Alto, CA, USA). Figure 1 shows the schematic illustration of the ultrasonic measurement system in the simulation. Two longitudinal transducers with 60° wedges, one used as the transmitter and one as the receiver, carried out the ultrasonic inspection on the top surface of the steel board, which had a length of 220 mm and a height of 30 mm. The distance between the two transducers was kept constant to make sure the first back-wall echo was fully collected by the receiver. The single crack, with a depth of *d*, was located at the middle of the steel board along the x direction, i.e., x = 110 mm.

**Figure 1.** Schematic illustration of scanning on a simulated specimen with a single crack.

A 0.5 MHz continuous sinusoidal signal with a signal-to-noise ratio (SNR) of 15 dB was applied as the excitation signal. Both the transmitter and the receiver were moved simultaneously to scan the steel board along the x direction. The received waveforms were recorded at eight different positions, spaced 20 mm apart on top of the simulated steel board, and the corresponding spectra were then analyzed. Figure 2a depicts the spectra at the eight positions of the simulated steel board with crack depth *d* = 2 mm.

**Figure 2.** (**a**) Spectrums and (**b**) *z*-score distribution of echoes at eight monitoring points (depth of crack d = 2 mm) along x direction.

According to the COSMO algorithm, the spectra at the eight observation positions were saved and a group of *z*-scores was calculated by Equation (2) after every scan; 30 groups of *z*-scores were then obtained by repeating the scan 30 times. Figure 2b clearly shows that the *z*-scores of the observation points are distributed almost evenly between 0.3 and 1, except at *x* = 110 mm, where most *z*-scores lie below 0.4, right at the crack's position. This shows that the distribution of *z*-scores could be used to locate and identify cracks or defects in materials, i.e., the *z*-scores of damaged regions tend to fall below 0.4. However, these conclusions need to be subjected to hypothesis tests to reach statistical significance, which determines whether a null hypothesis can be rejected or retained.

Figure 3a shows the significance level calculated by Equation (4) for crack depths of 2, 4, and 5 mm. It can be seen from Figure 3a that the *p*-value is much smaller than 0.1 around the crack region from 90 to 150 mm, which suggests that the imperfection of the structure in this region is significant. To make the comparisons and analysis clear, an indicator called the deviation level is defined as

$$T = -\lg(p),\tag{5}$$

i.e., the logarithm-transformed *p*-value; a small T-value indicates a low probability of a crack. The T-value curve of the significance testing is shown in Figure 3b. The maximum T-value occurs around the position of the crack (*x* = 110 mm) for crack depths of *d* = 2, 4 and 5 mm, respectively. Furthermore, the maximum T-value increases with crack growth, e.g., the maximum T-value is close to 4 when the crack depth is 2 mm, and the peak T-value rises to 10 when the crack depth equals 5 mm. This result indicates that a higher T-value is strongly correlated with the crack depth, which could become an index exhibiting the evolution of crack growth inside materials.
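A minimal numeric sketch of Equations (4) and (5), using only the standard library (the helper names are illustrative, not from the paper):

```python
import math

def cosmo_p_value(z_scores):
    # Eq. (4): one-sided p-value of the average z-score under the null
    # hypothesis z_bar ~ N(1/2, 1/12n)
    n = len(z_scores)
    z_bar = sum(z_scores) / n
    sigma = math.sqrt(1.0 / (12.0 * n))
    # P(Z <= z_bar) for Z ~ N(1/2, sigma^2), via the error function
    return 0.5 * (1.0 + math.erf((z_bar - 0.5) / (sigma * math.sqrt(2.0))))

def deviation_level(p):
    # Eq. (5): T = -lg(p)
    return -math.log10(p)
```

*z*-scores piling up near 0 over a region drive *p* toward 0 and T upward, matching the behavior observed around the crack.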

**Figure 3.** The calculated level of significance testing (**a**) *p*-value curve and (**b**) deviation level T-value curve can be plotted for crack depth of 2, 4 and 5 mm by Equations (4) and (5) along *x* direction on the simulated board.

When the crack depth changed from 0 mm to 5 mm, the peaks of the T-value around the cracks were obtained, and the relationship between the maximum T-value and crack depth d was plotted, as shown in Figure 4. As the crack depth increased from 0 to 1 mm, the slope of the curve sharply increased. When the crack depth was less than 1 mm, the T-value was no larger than 5; this is basically considered the formation stage of the crack, due to the relatively small change of crack depth, and is called stage I. As the crack depth gradually expanded from 1.5 mm to 3.7 mm, the maximum T-value increased slowly from 5 to 7 (stage II). When the crack depth was larger than 4 mm (stage III), the curve increased rapidly up to 10, about twice as high as in stage I, which means that the small cracks had already expanded into macro-cracks. Therefore, the peaks of the T-value might track the progression of damage and evaluate the evolution of crack growth.

**Figure 4.** Deviation level T-value grows monotonically as depth of crack increases.

#### *3.2. Experimental Measurement*

A specimen made of Q235 (see Table 1) with dimensions 800 mm × 250 mm × 20 mm was used in the experimental measurement, as shown in Figure 5. Four sections embedded with cracks with average depths of 6 mm, 2 mm, 1 mm, and 0.5 mm were manufactured in the specimen, located mainly at 150 mm, 300 mm, 450 mm, and 600 mm, respectively, and denoted by B, C, D and E. Sections A and F represent undamaged regions located at the two ends of the specimen. A portable TOFD ultrasonic detector (PXUT-920, Nantong Union Digital Tech., China) was used to excite a narrow-pulse acoustic signal 200 ns in width and to store the echo signals from the inspected cracked regions. Scanning was carried out manually by a scanner unit with one pair of 5 MHz normal transducers, i.e., the transmitter and the receiver, with 60° wedges for longitudinal waves. The two transducers, spaced 62 mm apart, were located equidistant from the center of the crack region, and scanning was done by moving the scanner along the length of the steel plate parallel to the crack region. The echo signal sampled by the detector, containing 1496 points, was acquired every 0.5 mm along the length direction. After one scan, a total of 1600 echo signals (A-scans) were obtained and stored in the ultrasonic detector.
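The spectra compared by the COSMO method can be obtained from the stored A-scans, e.g., as normalized magnitude spectra. A minimal sketch with random stand-in data (the paper does not specify its exact preprocessing, so the function name and normalization are assumptions):

```python
import numpy as np

def ascan_to_histogram(a_scan):
    # magnitude spectrum normalized to unit mass, usable as p_i in Eq. (1)
    spectrum = np.abs(np.fft.rfft(a_scan))
    return spectrum / spectrum.sum()

rng = np.random.default_rng(0)
scans = rng.standard_normal((1600, 1496))   # stand-in for the 1600 stored A-scans
hists = np.array([ascan_to_histogram(s) for s in scans])
```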

**Table 1.** Mechanical characteristics of Q235 carbon steel. (Provided by HBO Windpower Equipment Co., Ltd, Nantong, China).


**Figure 5.** (**a**) A specimen with four cracks with average depths of (**b**) 6 mm, (**c**) 2 mm, (**d**) 1 mm, and 0.5 mm, located at Sections B, C, D, and E, respectively. Sections A and F are without cracks.

The COSMO method was applied to analyze this dataset of echo-signals recorded by the TOFD ultrasonic detector. First, the Hellinger distance matrix D was constructed using Equation (1), and the row with the minimum sum was chosen in the matrix D so that the *z*-scores could be determined using Equation (2). Figure 6a shows the *z*-score distribution for 30 scans. The *z*-scores in the undamaged sections are much larger than those in the regions with cracks. For example, the *z*-scores for positions A and F are about 0.4~1, while those for positions B, C, D and E are below 0.4. The results suggest that the *z*-score is closely related to the cracks in the specimen, consistent with the simulated results.

**Figure 6.** (**a**) *z*-score distribution (**b**) the curve of T-value of specimen with cracks and along *x* direction.

Using Equations (1)–(5), the deviation level T-value is calculated for the significance analysis. It can be observed from Figure 6b that the T-values at B, C, D and E are high compared to those in the uncracked regions. For instance, the T-value at x = 150 mm rises to 13, and the T-value at x = 600 mm increases sharply from 0 to 7, while the T-values of sections without cracks are almost 0, far less than the T-values in the crack regions. In addition, the T-value increases almost linearly with the depth of the cracks. The relationship between the peaks of the T-value and the crack depth is plotted in Figure 7. The peak T-value increases quickly up to 5 as the crack grows from 0 to 1.5 mm, which corresponds exactly to stage I. When the crack depth is larger than 2 mm, the slope of the curve decreases but is still larger than in the simulated results. It is worth noting that the change of slope is not distinct enough to easily recognize stage II or III when the crack depth exceeds 1.5 mm, different from the simulated curve, which might be attributed to multi-physical mechanisms.

**Figure 7.** The relationship curve of specimen between T-value peaks and crack depth.

#### **4. Conclusions**

Defects or cracks can significantly increase acoustic non-linearity, and the nonlinear acoustical parameters can thereby be exploited to evaluate the state of material damage. However, the harmonics are usually too weak to be detected in early fatigue, so the non-linear ultrasonic technique is rarely used to quantify crack growth. In this work, the COSMO method was applied to compare the spectra of different positions obtained by ultrasonic scanning in order to obtain the distribution of *z*-scores as well as the corresponding significance level at every scanning position. The results show that: (1) the *z*-scores at locations with cracks are distributed below 0.4, while the *z*-scores at locations without cracks are above 0.4; (2) the deviation level T-value at locations with cracks is much larger than at locations without cracks, and the T-value becomes larger with increasing crack depth; (3) based on the quantitative relation between the T-value and the crack depth, the COSMO model can be used for online evaluation and monitoring of structural health. However, the COSMO model is still a simple model that does not consider other factors, such as the shape and mechanical properties of the structure, as well as the requirements on data size. Therefore, the reliability of COSMO needs to be further improved to reach a practical solution for non-destructive evaluation in the future.

**Author Contributions:** D.Z., conceived and designed the experiments. X.T., and X.Z., performed the experiments. X.T., X.Z., and Y.F., analyzed the data. X.T., X.Z., and D.Z., wrote the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (Grant no's., 81627802, 11674173 and 11874216), QingLan Project, and the Fundamental Research Funds for the Central Universities.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Influence of Piano Key Vibration Level on Players' Perception and Performance in Piano Playing**

#### **Matthias Flückiger \* , Tobias Grosshauser and Gerhard Tröster**

Electronics Laboratory, ETH Zurich, Gloriastrasse 35, 8092 Zurich, Switzerland; tobias@grosshauser.de (T.G.); troester@ife.ee.ethz.ch (G.T.)

**\*** Correspondence: matthias.flueckiger@gmail.com

Received: 29 October 2018; Accepted: 16 December 2018; Published: 19 December 2018

**Abstract:** In this study, the influence of piano key vibration levels on players' personal judgment of instrument quality and on the dynamics and timing of the players' performance of a music piece excerpt is examined. In an experiment, four vibration levels were presented to eleven pianists playing on a digital grand piano with a grand piano-like key action. Evaluating the players' judgment of instrument quality revealed strong integration effects of auditory and tactile information. The players perceived differences in the sound of the instrument when the vibration level in the keys was changed, and the results indicate a sound-dependent optimum of the vibration levels. Analyzing the influence of the vibration levels on the timing and dynamics accuracy of the pianists' musical performances, we could not observe systematic differences that depend on the vibration level.

**Keywords:** piano playing; vibrotactile feedback; interaction; musical performance; auditory perception; sensors; actuators

#### **1. Introduction**

Playing the piano is a complex multi-modal task, where the pianist controls the instrument through his or her intention and perceived instrument feedback. There are four main musician-musical instrument interaction modalities: visual feedback, auditory feedback, force feedback, and vibrotactile feedback. The interaction with a musical instrument can be modeled as a feedback controller [1], where the musician's brain controls his or her body, arms, and fingers to modify the instrument's behavior based on changes in sensory inputs. This closed-loop model implies that if the instrument's feedback is altered, the pianist will adapt his or her playing to compensate for and retain the desired instrument behavior. Vibrotactile feedback can support the precise control of finger force, as shown by Ahmaniemi [2] with a basic force repetition experiment on a rigid sensor box. Furthermore, Goebl and Palmer [3] demonstrated that tactile sensations from the finger-key surface interaction support some pianists to improve timing accuracy and precision of finger movements.

In piano playing, vibrotactile feedback is perceived through the fingers in contact with the keys and the feet in contact with the pedals. The keybed and soundboard vibrations excite the piano keys and pedals [4]. Askenfelt and Jansson [5] measured the vibrations of a depressed piano key and a depressed piano pedal. Piano key vibrations comprise broadband and tonal parts [4,6]. The tonal parts come from the string vibrations, and the broadband parts come from mechanical impacts (e.g., hammer-string impact and key–keybed impact) of the piano action when the piano key is played [4,6].

The levels of the vibrations' tonal part can rise to the micrometer range [7]; the vibrations are close to the limits of human vibration perception and are often sensed subconsciously [8]. However, piano key vibrations can be detected up to the middle octave of the keyboard [9]. Further, the vibration levels vary considerably among different pianos, but it remains unclear if pianists can perceive these differences [7].

Keane and Dodd [8] found that the ratio of broadband and tonal parts of piano key vibrations influenced the instrument's perceived sound. An upright piano was mechanically modified to reduce the broadband parts' amplitude, with the expectation to improve the instrument's quality. The pianists preferred the modification with regard to tone and loudness in an evaluation study. Interestingly, the participants did not report differences with regard to touch or vibrations.

Fontana et al. [10] showed by ratings of evaluation criteria that realistic piano key vibrations rendered on a digital keyboard are preferred to a no-vibration condition. In the same study, key vibrations did not show a significant effect on pianists' timing and dynamics accuracy during a scale playing task.

Going beyond the state of the art, this study investigates the influence of four piano key vibration levels on pianists' personal judgment of an instrument's sound, control, and feel. We designed the experiment such that the control of vibration levels was independent of the sound of the instrument, and aimed to explore connections between vibrotactile feedback and the perceived quality of the instrument. To test whether pianists adapt their playing to vibrotactile feedback, and to analyze whether the vibrations support the control of finger forces, the effect of the vibration levels on timing and dynamics accuracy in pianists' performances is also studied in this paper.

#### **2. Methods**

#### *2.1. Equipment*

The pianists played on an AvantGrand N3X, a digital hybrid grand piano from Yamaha. This instrument was chosen because it simulates piano key vibrations, features state-of-the-art grand piano sound-rendering algorithms, and has a piano action resembling that of acoustic grand pianos. Musical instrument digital interface (MIDI) messages and the headphone audio output of the instrument were recorded. The pianists played with closed-back headphones to block the small amount of sound that vibrating keys radiate.

#### *2.2. Experiment Design*

The target was to control the key vibration levels independent of the sound and to cover the level range of piano key vibrations of acoustic concert grand pianos. Independent control could not be achieved with the built-in vibrotactile feedback rendering system of the AvantGrand N3X; therefore, it was extended as illustrated in Figure 1.

The mono audio output signal of the AvantGrand N3X was processed with a digital signal processor (DSP). Through a combined approach of vibrometer measurements and subjective evaluation by playing on the instrument, the DSP's filter stage was tuned to create vibration level *V*<sub>3</sub> (see Figure 1), which approached the maximum vibration levels previously measured on acoustic grand pianos [7]. After implementation of *V*<sub>3</sub>, vibration levels *V*<sub>2</sub> and *V*<sub>1</sub> were created by attenuating the signal in steps of 6 dB. The no-vibration condition *V*<sub>0</sub> completed the levels of the experiment. As shown in Figure 1, the vibration levels cover the range of acoustic grand pianos for notes A2, A3, and A4. For notes A0 and A1, the levels are more than 10 dB lower. The chosen music piece avoided the lowest notes. The deviations of the vibration level curves in Figure 1 are due to non-idealities of the excitation system.
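As a quick numerical illustration of the level spacing described above (a sketch, not code from the study): each 6 dB attenuation step roughly halves the vibration amplitude, so *V*<sub>2</sub> and *V*<sub>1</sub> sit at about 50% and 25% of the *V*<sub>3</sub> amplitude, respectively.

```python
# Illustrative sketch: derive the relative amplitudes of the vibration
# levels from the 6 dB steps described in the text (V0 is no vibration).

def db_to_amplitude_ratio(db: float) -> float:
    """Convert a gain in dB to a linear amplitude ratio."""
    return 10 ** (db / 20)

# Relative amplitudes, taking V3 as the 0 dB reference.
levels = {name: db_to_amplitude_ratio(db)
          for name, db in [("V3", 0.0), ("V2", -6.0), ("V1", -12.0)]}

for name, ratio in levels.items():
    print(f"{name}: {ratio:.3f} x V3 amplitude")
```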

The experiment was created to study the influence of four vibration levels on players' personal judgment of the instrument quality and on the dynamics and timing of the players' performance of a music piece excerpt. The experiment was designed so that the participants were unaware of the independent variable, and the session was split into three parts to steer the pianists' attention to different instrument properties. Free verbalizations were used to assess the players' judgment, allowing for unrestrained and possibly unexpected answers. Since the influence of the piano key vibrations on musical performance and the players' judgments was assumed to be small (natural levels are close to the threshold of vibration sensation [7–9]), numerous repetitions were included in the protocol, and participants with high levels of playing experience were selected.

(**a**) Block diagram of the extended vibrotactile feedback rendering system

(**b**) Comparison of the measured tonal part of the vibration levels to levels measured on acoustic grand pianos

**Figure 1.** (**a**) Block diagram of the vibrotactile feedback rendering system used to generate the key vibrations; the mono audio output of the N3X was filtered and attenuated with a DSP. Thereafter, the signal was power-amplified to drive the transducer of the built-in vibrotactile rendering system of the N3X. (**b**) Comparison of the tonal part of the vibration levels *V*<sub>1</sub>, *V*<sub>2</sub>, *V*<sub>3</sub> to vibration levels of four acoustic concert grand pianos. The comparison is based on vibrometer measurements of *forte* keystrokes [7]. (*V*<sub>0</sub> is not shown because it corresponds to no vibrations.)

#### *2.3. Participants*

Eleven pianists participated in the study: seven piano students, 22–26 years of age, with an average playing experience of 17 years; and four professional pianists, 31–40 years of age, with an average playing experience of 26 years. None of them reported having auditory or tactile impairments.

#### *2.4. Procedure*

The session for each participant lasted around 1.5 h. The participants were asked to prepare an interpretation of a music piece excerpt. The excerpt was 15 bars long, and the participants were instructed to adhere to the tempo, dynamics, accents, and pedaling information. The excerpt was taken from *Klage* by Gretchaninov [11] and was edited to cover the dynamic range from *pianissimo* to *fortissimo* (see Figure A1 in Appendix A). The participants were not informed about the purpose of the experiment beforehand; they were only told that their judgments of various settings of the instrument would be evaluated.

The experiment comprised a warm-up (with a duration of around 5–10 min), questionnaires, and three parts (A, B, and C) with a duration of roughly 20 min each.

During the warm-up, the pianists were free to play whatever they wanted and were instructed to evaluate the instrument in a way comparable to choosing an instrument for a concert or for purchase. After this familiarization, the pianists were asked to express their first impression by answering a set of questions about the sound and touch of the AvantGrand N3X and by comparing the instrument to their main instrument.

During the main parts of the experiment (A, B, and C), the pianists were asked to repeat the music piece excerpt accurately in 12 direct comparisons of different instrument settings. After each trial, the pianists were asked to indicate a personal preference ("better", "worse", or "similar") of the current setting relative to the previous setting and to describe their impression in a few words. The participants were told that the differences between the comparisons could be small and that some might be perceptually irrelevant. In part A, the pianists were told that a slight adjustment of the instrument (not further specified, in order not to suggest an answer or category) was made between each repetition; in part B, it was claimed that a small adjustment to the sound was made between each trial; and in part C, the pianists' attention was directed to the keyboard by asking about the instrument's control and feel.

In fact, the only independent variable throughout the experiment was the key vibration level (*V*<sub>0</sub>, *V*<sub>1</sub>, *V*<sub>2</sub>, *V*<sub>3</sub>). The sequence of parts (A, B, and C) was the same for all participants. Each part had a different randomized sequence of vibration levels. All pairs of levels were compared twice in each part, once in each order. In total, nine trials per vibration level and per participant were recorded.

At the very end of each experiment session, personal information was collected and the participant was asked about his or her experience and preference of piano key vibrations in piano playing, before we disclosed and explained the purpose of the experiment.

#### **3. Results**

#### *3.1. Influence of Vibration Levels on Perceived Instrument Sound, Control, and Feel*

In the analysis of the preference ratings ("better", "worse", or "similar"), we observed high variance and, for some participants, contradictory ratings. We therefore present the ratings across all parts and participants here, because it was not possible to draw conclusions from the ratings per participant or per part. The result is presented in Figure 2.

**Figure 2.** Analysis of the preference ratings ("better" (1), "worse" (−1), or "similar" (0)) of the vibration levels across all parts and all participants. The ratings are based on direct comparisons between all pairs of levels. The ratings are relative. For example, a positive value for *V*<sub>0</sub>:*V*<sub>1</sub> indicates a preference of *V*<sub>0</sub> over *V*<sub>1</sub> and a negative value a preference of *V*<sub>1</sub> over *V*<sub>0</sub>. The shaded area marks the standard deviation of the ratings.

The high variance of the ratings in Figure 2 reflects the closeness of the key vibration levels to the limits of human perception. Additional factors that may have disturbed the ratings include the player's mood and fatigue, self-evaluation of the playing, the difficulty of the task, and the imposed expectation of a difference between the settings. However, a visual comparison of the mean values in Figure 2 indicates a tendency in the preference of the players toward vibration level *V*<sub>2</sub>. This preference is confirmed by the evaluation of the players' verbal self-reports presented hereafter.

The free verbalizations were analyzed with an approach presented by Pate et al. [12], where concepts by Dubois [13] were applied to musical instrument evaluation.

Based on the context and for each participant, the meaning, category, and preference of each statement were identified via linguistic tools such as reformulations, oppositions, and comparatives. Thereafter, the statements were classified into positive and negative statements for three categories: sound, control, and feel. Statements covering multiple categories were split before classification. Statements not indicating a preference or a perceived difference were counted as "no difference". The results are summarized in Table 1. Significance was evaluated with Pearson's *χ*<sup>2</sup>-tests at a confidence level of 95%. The test was performed on all statements (positive and negative) per category.

**Table 1.** Evaluation per vibration level derived from the free verbalizations of the pianists. The number of positive and negative statements per category was counted for each vibration level (*V*<sub>0</sub>, *V*<sub>1</sub>, *V*<sub>2</sub>, *V*<sub>3</sub>). The frequency counts were evaluated with *χ*<sup>2</sup>-tests. The *χ*<sup>2</sup>-statistics and *p*-values are given for each category; *p* < 0.05 is highlighted in bold.


Although we did not alter the sound throughout the experiment, Table 1 shows that the vibration levels (*V*<sub>0</sub>, *V*<sub>1</sub>, *V*<sub>2</sub>, *V*<sub>3</sub>) have an influence on the pianists' sound perception. Pairwise testing with Bonferroni correction showed that the significance of the *χ*<sup>2</sup>-test in the sound category arises from the difference between *V*<sub>0</sub> and *V*<sub>2</sub>.
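The underlying test statistic can be sketched in a few lines. The 2 × 4 contingency table below (positive/negative statements × vibration levels) uses invented counts for illustration; the actual counts are those reported in Table 1.

```python
# Sketch of a Pearson chi-squared test of independence on a 2 x 4
# contingency table (positive/negative statements x vibration levels
# V0..V3). The counts are invented for illustration only.

def chi_squared_statistic(table):
    """Return the Pearson chi-squared statistic for a 2D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# rows: positive / negative statements; columns: V0, V1, V2, V3
counts = [[10, 12, 20, 14],   # positive (illustrative)
          [15, 10,  6, 13]]   # negative (illustrative)
print(f"chi2 = {chi_squared_statistic(counts):.2f}")
```

In practice the statistic would be compared against the *χ*² distribution with the appropriate degrees of freedom, e.g. via a statistics package.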

The phenomenon of vibrotactile feedback causing a difference in sound perception is known as integration of auditory and tactile information [14–16] or weak synesthesia [17]. The preference of *V*<sub>2</sub> over *V*<sub>3</sub>, confirming the evaluation of the preference ratings in Figure 2, was surprising because vibration level *V*<sub>3</sub> is closer to the levels of acoustic instruments (see Figure 1). An explanation lies in five statements about vibration level *V*<sub>3</sub> that were classified as negative. These statements criticized the balance of the perceived sound as having "too much bass" or being "unbalanced".

The results for control and feel are not significant according to the *χ*<sup>2</sup>-tests. For vibration level *V*<sub>1</sub>, there were twice as many "no difference" statements as for any other level, which indicates that *V*<sub>1</sub> is the most difficult to differentiate.

Only two participants consciously noticed a change in vibration levels during the experiment, when vibration level *V*<sub>3</sub> was compared to *V*<sub>0</sub> and vice versa. Both recognized the vibrations during the last part, when the keyboard was the focal point.

To find possible explanations for the differences presented above, we analyzed the verbal self-reports of the participants in more specific categories. We observed that the key vibrations influence the timbre and the perceived loudness of the bass keys. Also, the timbre of treble notes was judged more pleasant when playing with *V*<sub>2</sub> or *V*<sub>3</sub>. Some participants also noted a sensation of space when playing with higher vibration levels (*V*<sub>2</sub>, *V*<sub>3</sub>) and described it as a room or reverb effect of the sound. In contrast, when pianists played with vibration levels *V*<sub>0</sub> or *V*<sub>1</sub>, the sound was sometimes described as dry. Comparisons to acoustic instruments and e-pianos also align with this observation.

A critical aspect for discussion is that the sequence of parts (A, B, and C) was the same for all participants, which might have influenced the results. We designed the experiment protocol to steer the players' attention to different multi-modal aspects to discover unexpected connections between vibrotactile feedback and the players' judgment of the instrument quality.

In part A the participants could freely describe their impressions and we did not suggest any quality criteria for the comparisons. Unbiased comparisons are only possible within the first part of the experiment. Ten out of eleven participants naturally made statements about sound and control in the first part, which justifies the suggestion of these criteria in the following parts.

We decided to put part C at the end of the experiment session because we did not want to risk that the participants were already consciously aware of the vibrations when comparing the levels with regard to sound. In part C, the participants focused on the keyboard; therefore, we expected that the participants were most likely to recognize the vibrations in this part (which happened in two cases).

Finally, the difficulty of the evaluation task might also have altered the judgments of the pianists. In each trial, the participant played the music piece excerpt for around 30 s, communicated his or her impression with regard to the previous setting, and sometimes also answered clarifying questions from the experimenter. Thereafter, he or she performed the music piece for the next comparison.

#### *3.2. Influence of Piano Key Vibration Levels on Musical Performance*

To analyze the MIDI-based performance data, a custom data structure was used. The structure groups notes played at the same time (±40 ms) into clusters, removes accidentally played wrong notes, and assigns each note to the left or right hand.

Key velocity *υ*, a measure of a keystroke's excitation strength, was directly extracted from the MIDI messages. For the calculation of the inter-onset interval *τ*—the time interval between two subsequent note onsets—only notes played by the left hand were considered. The tempo of each trial was normalized.
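A minimal sketch of this preprocessing might look as follows. The event format (onset times in seconds) and the function names are assumptions for illustration, not the authors' implementation; wrong-note removal and hand assignment are omitted.

```python
# Hedged sketch of the MIDI preprocessing described above: group note
# onsets that fall within 40 ms of a cluster's first onset into one
# cluster, then compute inter-onset intervals between cluster onsets.

CLUSTER_WINDOW = 0.040  # 40 ms grouping window, as in the text

def cluster_onsets(onsets, window=CLUSTER_WINDOW):
    """Group sorted onset times played 'at the same time' into clusters."""
    clusters = []
    for t in sorted(onsets):
        if clusters and t - clusters[-1][0] <= window:
            clusters[-1].append(t)   # within the window of the cluster start
        else:
            clusters.append([t])     # start a new cluster
    return clusters

def inter_onset_intervals(clusters):
    """Inter-onset interval tau between subsequent cluster onsets."""
    starts = [c[0] for c in clusters]
    return [b - a for a, b in zip(starts, starts[1:])]

onsets = [0.000, 0.015, 0.500, 0.505, 1.020]  # toy data: three chords
clusters = cluster_onsets(onsets)
taus = inter_onset_intervals(clusters)
```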

To compare trials by the distribution of key velocity *υ* (and analogously for inter-onset interval *τ*), histogram intersection was used. Histogram intersection was introduced by Swain and Ballard [18] to identify objects by color similarity in computer vision. Histogram intersection is defined as [18]

$$H_1(\upsilon) \cap H_2(\upsilon) = \sum_{i=1}^{n} \min\left(h_{1i}(\upsilon), h_{2i}(\upsilon)\right),\tag{1}$$

where *H*<sub>1</sub> and *H*<sub>2</sub> represent two trials by normalized discrete distributions of key velocity *υ* with *n* bins *h*<sub>1*i*</sub>, *h*<sub>2*i*</sub>. Equation (1) measures the overlap of two histograms in the range [0, 1]: a value of 1 corresponds to perfect overlap, and 0 means no overlap. In contrast to an evaluation based on mean values only, histogram intersection also identifies differences in the distributions' shape or offset.
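A direct implementation of Equation (1) is straightforward; the following sketch assumes both histograms are normalized and share the same binning.

```python
# Histogram intersection of Equation (1) for two normalized discrete
# distributions with identical binning.

def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms, in the range [0, 1]."""
    assert len(h1) == len(h2), "histograms must share the same bins"
    return sum(min(a, b) for a, b in zip(h1, h2))

identical = histogram_intersection([0.25, 0.5, 0.25], [0.25, 0.5, 0.25])
disjoint = histogram_intersection([1.0, 0.0], [0.0, 1.0])
shifted = histogram_intersection([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
# identical -> 1.0 (perfect overlap), disjoint -> 0.0, shifted -> 0.5
```

Note that, unlike a comparison of means, a shifted but otherwise identical distribution already lowers the overlap, which is exactly the sensitivity to shape and offset mentioned above.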

To judge significance, two tests were required to reject the null hypothesis: a non-parametric Friedman analysis of variance at a 95% confidence level, in combination with pairwise Wilcoxon signed-rank tests with Bonferroni correction. We used non-parametric tests because the evaluated quantities do not necessarily follow a normal distribution, and because of the sample size.
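The Friedman statistic for such repeated-measures data can be sketched as follows (a pure-Python illustration; in practice a statistics package such as scipy.stats would also provide the pairwise Wilcoxon signed-rank tests).

```python
# Sketch of the non-parametric Friedman test statistic for repeated
# measures: n subjects (rows) x k conditions (columns). Values are
# ranked within each row (ties get average ranks), and the statistic
# grows when one condition consistently ranks higher than the others.

def friedman_statistic(data):
    """data: list of per-subject rows, one value per condition."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1              # extend the current tie group
            avg_rank = (i + j + 2) / 2  # 1-based average rank for ties
            for m in range(i, j + 1):
                ranks[order[m]] = avg_rank
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return (12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1))
            - 3.0 * n * (k + 1))

# Perfectly consistent ordering across 3 subjects and 3 conditions:
q = friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]])  # -> 6.0
```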

Two approaches were used to compare the distributions of key velocity *υ* and inter-onset interval *τ* by histogram intersection. The distribution of both parameters was calculated for each trial and was analyzed for each participant separately.

The first approach considered whether the pianists adapted their playing to the vibration levels (e.g., if a pianist perceived an overemphasis of bass notes and therefore played the bass notes with less finger force than before). Such a force adaption manifests in the shape of the distribution of key velocity *υ*. An increase in the "amount of adaption" was expected with increasing vibration levels. The "amount of adaption" for key velocity *υ* was measured as follows.

Let *H*<sub>*Pk*,*Vm*,*i*</sub>(*υ*) denote the normalized histograms describing the distribution of key velocity *υ* for trial *i* ∈ {1, ..., 9}, vibration level *V*<sub>*m*</sub> with *m* ∈ {0, 1, 2, 3}, and pianist *P*<sub>*k*</sub> with *k* ∈ {1, ..., 11}. The "amount of adaption" *A*<sub>*Pk*,*Vn*</sub> of pianist *P*<sub>*k*</sub> to vibration level *V*<sub>*n*</sub> with *n* ∈ {1, 2, 3} was then estimated as the histogram intersection relative to the *V*<sub>0</sub> condition, *H*<sub>*Pk*,*V*0,*i*</sub>(*υ*) ∩ *H*<sub>*Pk*,*Vn*,*j*</sub>(*υ*), for all combinations of trials *i*, *j* ∈ {1, ..., 9}. The same procedure was conducted for the inter-onset interval *τ*.
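The estimate described above amounts to collecting all pairwise histogram intersections between one pianist's *V*<sub>0</sub> trials and his or her *V*<sub>n</sub> trials; a minimal sketch with invented toy histograms:

```python
# Hedged sketch of the "amount of adaption" estimate: histogram
# intersections between every V0 trial and every Vn trial of one
# pianist. The 3-bin velocity histograms below are toy data.

def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

def amount_of_adaption(trials_v0, trials_vn):
    """All pairwise intersections between V0 trials and Vn trials."""
    return [histogram_intersection(h0, hn)
            for h0 in trials_v0 for hn in trials_vn]

# Two toy trials per condition (normalized 3-bin velocity histograms):
v0_trials = [[0.2, 0.6, 0.2], [0.3, 0.5, 0.2]]
vn_trials = [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
overlaps = amount_of_adaption(v0_trials, vn_trials)  # 2 x 2 = 4 values
```

Lower overlap values between the two conditions would indicate stronger adaption of the playing to the vibration level.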

Figure 3 shows the "amount of adaption" of the pianists' playing to the feedback levels for both performance parameters. The differences in Figure 3 are not significant. There is no general tendency that the participants adapt their playing to the key vibration level. Nonetheless, when analyzing the influence of the vibration levels per participant individually, three participants showed significant differences for key velocity *υ* and three for inter-onset interval *τ*.

(**a**) "Amount of adaption" *A*<sub>*Pk*,*Vn*</sub>(*υ*) of key velocity *υ*

(**b**) "Amount of adaption" *A*<sub>*Pk*,*Vn*</sub>(*τ*) of inter-onset interval *τ*

**Figure 3.** "Amount of adaption" to the feedback levels *V*<sub>*n*</sub> (*n* ∈ {1, 2, 3}) relative to the no-vibration condition *V*<sub>0</sub>, across all pianists *P*<sub>*k*</sub> and for both performance parameters. The differences in the amount of adaption between the vibration levels are not significant for either parameter. The line in the center of the box plot marks the median, the box extends from the first to the third quartile, and the whiskers mark the value range.

Possible explanations for the majority of pianists not adapting their playing to the feedback levels include that the combined task of playing, judging the impression, and adapting their playing was too difficult, that the levels were too small to cause a reaction, or that the method was not accurate enough to unveil such differences.

The second approach investigated how accurately the pianists could repeat the music piece excerpt when playing with different vibration levels. If key vibrations support the precise control of finger forces, a lower variance in the distribution of key velocity *υ* could be expected. In consequence, the shape of the distribution of key velocity *υ* would be altered, and hence a difference in repeatability could be detected. Likewise, if the pianist's tempo was more stable, a different shape of the distribution of the inter-onset interval *τ* would occur. Indirect causes are also possible (e.g., the pianist feels more comfortable playing and therefore plays with higher repeatability). The time point of the trials during the experiment was not taken into account. We decided to analyze and present the data for each pianist individually hereafter, because we observed a strong dependency of the repeatability on the player.

The repeatability *R*<sub>*Pk*,*V*0</sub>(*υ*) for participant *P*<sub>*k*</sub> playing with vibration level *V*<sub>0</sub> was computed by comparing the distributions of key velocity *υ* via *H*<sub>*Pk*,*V*0,*i*</sub>(*υ*) ∩ *H*<sub>*Pk*,*V*0,*j*</sub>(*υ*) for all combinations of trials *i*, *j* ∈ {1, ..., 9} with *i* ≠ *j*. Similar procedures were conducted for vibration levels *V*<sub>1</sub>, *V*<sub>2</sub>, and *V*<sub>3</sub>, and for the inter-onset interval *τ*.
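This repeatability estimate can be sketched analogously: all distinct pairs of trials recorded under the same vibration level are compared, excluding the trivial self-comparisons, which would always yield an overlap of 1. The toy histograms below are invented for illustration.

```python
# Sketch of the repeatability estimate: histogram intersections between
# all distinct pairs of trials (i != j) recorded under the SAME
# vibration level. Comparing a trial with itself would trivially give 1.

from itertools import combinations

def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

def repeatability(trials):
    """Pairwise overlaps for all distinct trial pairs of one condition."""
    return [histogram_intersection(hi, hj)
            for hi, hj in combinations(trials, 2)]

# Three toy trials of one condition (normalized 3-bin histograms):
trials = [[0.2, 0.6, 0.2], [0.3, 0.5, 0.2], [0.2, 0.5, 0.3]]
scores = repeatability(trials)  # 3 choose 2 = 3 pairwise overlaps
```

Higher overlap values indicate that the pianist reproduced the velocity (or timing) distribution more consistently across trials.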

The resulting repeatability per vibration level and per participant is presented in Figure 4 for key velocity *υ* and inter-onset interval *τ*. The differences in repeatability were significant for a majority of the participants. However, no consistent tendency or pattern in repeatability occurred among the pianists in Figure 4.

Therefore, the vibration levels in our experiment do not have a conclusive influence on the pianists' repeatability, and the measured MIDI data do not support the hypothesis that key vibrations assist the precise control of finger force. Consequently, the observations do not confirm the results of Ahmaniemi [2] or Galica et al. [19] for the piano-playing case. Galica et al. [19] showed that even unconscious vibratory stimulation applied to the soles of the feet can lower the variance of kinematic interactions.

**Figure 4.** Estimated repeatabilities per participant *P*<sub>*k*</sub> and vibration level *V* for both performance parameters. No consistent tendency occurred among the pianists, but the vibration levels had a significant influence (marked with \*) on repeatability for a majority of the participants.

In summary, the repeatability estimates *R*<sub>*Pk*,*Vm*</sub>(*υ*) and *R*<sub>*Pk*,*Vm*</sub>(*τ*) for the vibration levels (*V*<sub>0</sub>, *V*<sub>1</sub>, *V*<sub>2</sub>, *V*<sub>3</sub>) depended on the player. The pianists in this study were more accurate in repeating key velocity *υ* (median of *R*<sub>*Pk*,*Vm*</sub>(*υ*) ∈ [0.85, 0.88] for all *k*, *m*) than in repeating the inter-onset interval *τ* (median of *R*<sub>*Pk*,*Vm*</sub>(*τ*) ∈ [0.66, 0.77] for all *k*, *m*). For the inter-onset interval *τ*, the intra-individual variance was also considerably larger (see Figure 4), although we normalized the tempo of each trial before the analysis.

As a concluding aspect of interest, no categorical differences (in repeatability or playing adaption to feedback levels) were found between the group of students (*P*<sub>1</sub> to *P*<sub>7</sub> in Figure 4) and the group of professional pianists (*P*<sub>8</sub> to *P*<sub>11</sub> in Figure 4).

#### **4. Conclusions**

By systematically investigating the players' personal judgment of the instrument quality as a function of the vibration level in the keys, we observed strong integration effects of auditory and tactile information. The results illustrate the strong multi-modal effects in piano playing. The subjects perceived differences in the sound of the instrument when the vibration level in the keys was changed. The preference of vibration level *V*<sub>2</sub> over *V*<sub>3</sub> indicates an optimum, or "sweet spot", of piano key vibration levels, which depends on the instrument's sound and sound balance.

In line with the results of Keane and Dodd [8], the vibration levels in this experiment significantly affected the instrument's judged sound quality but not its control and feel. However, in contrast to the design of the present experiment, Keane and Dodd [8] reduced the level of the broadband part of piano key vibrations of an acoustic instrument.

An interesting direction for future research is to determine the vibration level differences that pianists can differentiate. This would help to understand whether an instrument can be identified based on its vibrotactile feedback alone, while the instrument's auditory and force feedback are kept constant. Some participants in this experiment perceived a spatial sensation, described as a room or reverb effect on the sound, when playing with higher vibration levels (*V*<sub>2</sub>, *V*<sub>3</sub>). For several applications, it could be interesting to understand the conditions that can cause such an illusion.

We did not find systematic differences when analyzing the influence of the vibration levels on the timing and dynamics accuracy of the pianists' musical performances. We cannot exclude that such an influence exists, but with the proposed measures, "amount of adaption" and repeatability, we could not detect such a relation. Furthermore, the basic result of Ahmaniemi [2], that vibrotactile feedback assists the precise control of finger forces, could not be confirmed in our case. For future studies, we suggest including a larger number of participants. This could help to identify groups reacting similarly to key vibrations. Future studies might also include multiple experiments over a certain range of time to exclude influences of the physical and mental state on the day of testing. Finally, an analysis on a note-by-note basis could clarify whether, for example, the vibrotactile feedback of a long-lasting bass note helps the precise control of the dynamics in subsequent keystrokes. We could not generalize such a relation with the data of the presented experiment.

If our results can be confirmed on acoustic instruments, our findings from the perception part of the experiment suggest that piano manufacturers should design the vibrations in the piano keys in balance with the sound of the lower notes of the instrument. Furthermore, it would be interesting to investigate the just-noticeable difference of piano key vibration levels, which might possibly be around 6 dB. Further research in this area could help to answer the question of whether the tonal parts of the Steinway & Sons and the Yamaha concert grand pianos (the tonal parts for notes A2, A3, and A4 differ by more than 6 dB [7]) can be differentiated by the player based on their vibrotactile feedback alone. Of course, in such an experiment the vibrotactile feedback should be rendered on the same instrument; otherwise, cues from the auditory or kinematic sensations might dominate the perceived impression.

**Author Contributions:** M.F. designed, accomplished, and evaluated this study. T.G. and G.T. contributed as consultants. All discussed the results.

**Funding:** This research has been pursued as part of the "Musician's behavior based on multi-modal real-time feedback" project, Grant No. 166588, funded by the Swiss National Science Foundation (SNSF).

**Acknowledgments:** The authors are indebted to Anders Askenfelt for contributing to the experiment's design, for inspiring discussions, and for offering advice about the topics of the presented study. They would also like to thank Yamaha for generously providing the AvantGrand N3X.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Figure A1.** Music sheet of the study. The excerpt was taken from *Klage* by the composer Gretchaninov [11]. The excerpt was edited to cover a broad dynamic range and to include accents.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **MRI Compatible Planar Material Acoustic Lenses**

### **Daniel Tarrazó-Serrano <sup>1</sup> , Sergio Castiñeira-Ibáñez <sup>1</sup> , Eugenio Sánchez-Aparisi 2, Antonio Uris <sup>1</sup> and Constanza Rubio 1,\***


Received: 10 October 2018; Accepted: 13 December 2018; Published: 15 December 2018

**Abstract:** Zone plate lenses are used in many areas of physics where planar geometry is advantageous in comparison with conventional curved lenses. There are several types of zone plate lenses, such as the well-known Fresnel zone plates (FZPs) or the more recent fractal and Fibonacci zone plates. The selection of the lens material plays a very important role in beam modulation control. This work presents a comparison between FZPs made from different materials in the ultrasonic range, in order to use them as magnetic resonance imaging (MRI) compatible lenses. Three different MRI-compatible polymers are considered: acrylonitrile butadiene styrene (ABS), polymethyl methacrylate (PMMA), and polylactic acid (PLA). Numerical simulations based on the finite element method (FEM) and experimental results are shown. The focusing capabilities of brass lenses and polymer zone plate lenses are compared.

**Keywords:** MRI; zone plates; ultrasonic lenses

#### **1. Introduction**

The development of systems for modulating and focusing energy has been a field of great interest for scientists and engineers. A lens is a device able to perform this energy modulation. Lenses allow beam forming, propagation control, and focusing of the energy that impinges on them. These effects are produced by refractive and diffractive phenomena. Transmission efficiency is one of the most important aspects, particularly when there is low impedance contrast between the lens and the host medium. Due to their wide versatility, lenses have been used in different areas. For example, they have been applied in sonochemistry [1], construction [2], and the pharmaceutical industry [3].

Acoustic lenses, depending on the physics involved in the beam formation, can be divided into different groups, including refractive lenses and diffractive lenses. One example of a lens based on the refraction phenomenon is the sonic crystal lens, made of periodic distributions of rigid cylinders [4]. Due to the subsonic sound speed inside the crystal, these lenses act similarly to those in optical systems. Another example of this type of acoustic lens is one that modifies the refractive index using labyrinths; these are the so-called gradient-index lenses [5–7].

The other subtype of lens, based on the diffraction phenomenon, relies on constructive interference of the pressure field. An example of this type is the fractal lens, which is able to generate different foci depending on its fractal geometrical properties [8]. Fresnel zone plates (FZPs) have an improved focusing capacity. Among the different ways to implement FZPs, one of the most common and easiest is to alternate transparent and blocking zones, which results in a Soret-type FZP [9]. To obtain these blocking areas, materials that are opaque to sound are required, which is accomplished by selecting materials that have a high impedance contrast with the host medium. There are studies that have implemented Soret FZPs (SZPs) for ultrasound based on this type of lens [10].

A material that has a high impedance contrast with respect to water, and that therefore allows the creation of opaque zones for a Soret-type lens, is brass. However, this type of material has limitations, especially when used in fields such as bioengineering. The use of acoustic lenses in medicine for high intensity focused ultrasound (HIFU) treatment is one of the current lines of research, and magnetic resonance imaging (MRI) is the technique most used for guiding HIFU treatment [11].

MRI is a technique used for imaging soft tissue structures in a non-invasive way. The image is obtained by aligning and relaxing the magnetic moments of the atoms of the elements introduced into the MRI scanner. Tissues are exposed to a strong, time-independent external magnetic field. Thus, metallic elements cannot be introduced into the resonance zone, both because of their interference with the image and because they could damage the MRI system. To avoid interaction with the electromagnetic field, non-metallic materials should be used in the construction of lenses. One of these materials is polylactic acid (PLA) [12]. The MRI environment requires materials such as PLA for medical instruments and patient supports, and PLA has recently been used for this purpose with demonstrated reliability [13]. The HIFU transducer is embedded within a specially designed table that fits into the MRI device. This integrated system has a degassed water bath where the transducer is located, and the patient lies on top of the system [14]. Although the transducer and the lens must be immersed in this water bath, degradation of the PLA will occur over long-term immersion: PLA degrades in water after a period ranging from months to a year [15]. However, in MRI systems the lens and the transducer are not permanently submerged; the system is extracted from the water bath after 20 to 25 min, so the time before water-induced degradation becomes significant is prolonged considerably.

In this work, lenses made from three materials compatible with MRI environments are compared: acrylonitrile butadiene styrene (ABS), polymethyl methacrylate (PMMA) and polylactic acid (PLA). Furthermore, an SZP built in brass is included; although this material is not MRI compatible, it is the closest to the ideal SZP that can be implemented in practice, and an ideal Soret lens is also added to the comparison. Results are obtained and compared both numerically and experimentally. Numerical results have been obtained using the commercial software COMSOL Multiphysics 4.3a by COMSOL Inc. (Sweden) [16]. In this work, it has been verified that the transmission capacity, which is related to the ratio of the impedances of the medium and the lens, directly influences the focusing capacity.

#### **2. Methodology and Theoretical Analysis**

Fresnel zone plates are structures of concentric circular regions, known as Fresnel regions. Consecutive regions have a *π* phase shift between them. This produces a coherent contribution that yields high intensity levels at the focal length (*FL*), the location on the axial coordinate where the focus is placed. The number of Fresnel regions, including both acoustically opaque and transparent sections, is defined as *N*. The working frequency is defined as *f*<sub>0</sub>, and the radial distance (*rn*) of each Fresnel zone can be obtained using Equation (1), valid for plane-wave incidence.

$$r\_n = \sqrt{n\lambda F\_L + \left(\frac{n\lambda}{2}\right)^2} \qquad n = 1, 2, \ldots, N \tag{1}$$

In this work, underwater transmission is considered and the lenses are designed for ultrasound applications. Therefore, piston sources have to be considered when FZPs are implemented. To account for spherical wave incidence, Equation (2) has been used, where *d* is the separation between the point source and the lens.

$$d + F\_L + \frac{n\lambda}{2} = \sqrt{d^2 + r\_n^2} + \sqrt{F\_L^2 + r\_n^2} \tag{2}$$
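As a quick check of these design formulas, Equation (1) can be evaluated directly and Equation (2) solved numerically for *rn*. The sketch below is a minimal Python illustration; the source-lens separation `d` is an assumed, illustrative value, since the one used in the paper is not restated here.

```python
import math

def plane_wave_radii(wavelength, FL, N):
    """Fresnel zone radii r_n for plane-wave incidence, Eq. (1)."""
    return [math.sqrt(n * wavelength * FL + (n * wavelength / 2) ** 2)
            for n in range(1, N + 1)]

def spherical_wave_radius(wavelength, FL, d, n, hi=1.0):
    """Solve Eq. (2) for r_n by bisection (point source at distance d)."""
    target = d + FL + n * wavelength / 2
    path = lambda r: math.sqrt(d**2 + r**2) + math.sqrt(FL**2 + r**2)
    lo = 0.0
    for _ in range(100):                 # path(r) grows monotonically with r
        mid = 0.5 * (lo + hi)
        if path(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c, f0 = 1500.0, 250e3          # sound speed in water, working frequency
lam = c / f0                   # wavelength, 6 mm
FL = 8.33 * lam                # focal length used in the paper (~50 mm)
d = 0.10                       # source-lens separation: illustrative value only
print(plane_wave_radii(lam, FL, 11)[-1])       # outermost plane-wave radius
print(spherical_wave_radius(lam, FL, d, 11))   # spherical-incidence counterpart
```

Because bisection only requires evaluating the path-length mismatch, the same routine works for any zone index `n` without rearranging Equation (2).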

*Appl. Sci.* **2018**, *8*, 2634

The acoustic wave has to propagate through the host medium, then cross the Fresnel regions, and afterwards continue through the host medium. A three-layer configuration therefore has to be considered (Figure 1). Acoustic impedance (*Z*) is defined as the product of the medium density (*ρ*) and the sound propagation velocity (*c*) in it. It is thus necessary to consider the input (*Zin*) and output (*Zout*) acoustic impedances, and the pressure transmission coefficient (*t*) must be calculated. This coefficient is a clear indicator of the blocking capacity of the elements of the FZP; *t* is defined as the relation between the transmitted field and the incident field. Density (*ρ*), sound propagation velocity (*c*) and acoustic impedance (*Zmat*) values are shown in Table 1. Using these values in Equation (3), *Zin* can be obtained [17].

$$Z\_{in} = Z\_{\text{mat}} \frac{Z\_{\text{out}} + jZ\_{\text{mat}} \tan(k\_{\text{mat}}d)}{Z\_{\text{mat}} + jZ\_{\text{out}} \tan(k\_{\text{mat}}d)} \tag{3}$$

where *kmat* is the wave number in the lens material, defined as *kmat* = *ω*/*c* with *ω* = 2*π f*<sub>0</sub>. Once *Zin* is obtained, the reflection coefficient is defined in Equation (4).

$$r\_{in} = \frac{Z\_{in} - Z\_{water}}{Z\_{in} + Z\_{water}} \tag{4}$$

The equation that relates the field balance as a function of the impedance and reflection coefficient of the system is defined in Equation (5) and gives *t* values depending on the material.

$$|t| = \frac{|p\_t^+|}{|p\_{in}^+|} = \sqrt{(1 - |r\_{in}|^2)}\tag{5}$$

**Figure 1.** Transmission diagram of the implemented lenses.

**Table 1.** Density and sound speed values. Acrylonitrile butadiene styrene (ABS), polylactic acid (PLA), and polymethyl methacrylate (PMMA).


Considering the transmission coefficient values obtained (0.23 for brass, 0.51 for PLA-Air-PLA and more than 0.95 for ABS, PLA and PMMA), it can be affirmed that lenses made entirely of MRI-compatible materials will focus less energy at the *FL* than the brass FZP or the ideal SZP. Therefore, one solution is proposed to obtain the desired impedance contrast: an FZP that includes an air chamber inside the structure, implemented using a 3D printer. Both lenses, full-PLA and air-chamber, have then been compared.
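The impedance chain of Equations (3)–(5) can be evaluated numerically. The sketch below is a minimal Python illustration for a single layer immersed in water; the material densities and sound speeds are approximate textbook values (the entries of Table 1 are not reproduced here), so the resulting coefficients only approximately match those quoted above.

```python
import cmath
import math

def input_impedance(Z_mat, Z_out, c_mat, thickness, f0):
    """Eq. (3): input impedance of a single layer backed by Z_out."""
    k_mat = 2 * math.pi * f0 / c_mat          # wave number in the layer
    t = cmath.tan(k_mat * thickness)
    return Z_mat * (Z_out + 1j * Z_mat * t) / (Z_mat + 1j * Z_out * t)

def transmission(Z_mat, c_mat, thickness, f0, Z_water=1.5e6):
    """Eqs. (4)-(5): reflection coefficient, then |t| into water."""
    Z_in = input_impedance(Z_mat, Z_water, c_mat, thickness, f0)
    r_in = (Z_in - Z_water) / (Z_in + Z_water)
    return math.sqrt(1 - abs(r_in) ** 2)

f0 = 250e3
# Approximate material data (assumed, not the exact Table 1 values):
print(transmission(8500 * 4400, 4400, 1e-3, f0))   # brass, 1 mm: ~0.23
print(transmission(1240 * 2220, 2220, 5e-3, f0))   # PLA, 5 mm: > 0.9
```

With these assumed properties the single-layer model already reproduces the qualitative split reported in the text: brass blocks most of the pressure wave, while the polymer layer is nearly transparent.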

#### *2.1. Numerical Model*

The finite element method (FEM) has been used to obtain a numerical solution of the physical problem. FEM allows us to study the physical phenomena involved in the interaction of waves with FZPs. Therefore, a mathematical model that replicates the conditions of the problem has been implemented. This method also allows us to determine the pressure distribution of the diffracted fields generated by the FZP driven by a piston emitter, which causes interference phenomena. From the mesh generated by FEM, a partial differential equation solution is obtained at each node [18]. In this case, the acoustic Helmholtz equation is considered (Equation (6)). To solve it, standard values for water, namely a medium density of *ρ* = 1000 kg/m<sup>3</sup> and a sound propagation velocity of *c* = 1500 m/s, have been used. The working frequency of the FZPs is 250 kHz, related to the angular frequency by *ω* = 2*π f*<sub>0</sub>. Finally, *p* corresponds to the acoustic pressure.

$$\nabla \cdot \left( -\frac{1}{\rho\_0} (\nabla p) \right) = \frac{\omega^2 p}{\rho\_0 c^2} \tag{6}$$

A full 3D model would require high computational resources. To simplify the model and reduce this computational cost, as shown in previous works [19,20], advantage is taken of the axisymmetry of the geometry. The model is therefore simplified by implementing only a semi-lens; a complete solution is obtained by rotating it about its symmetry axis. This procedure reduces the number of degrees of freedom needed to obtain the numerical results, significantly diminishing the calculation time.

The boundary conditions defined in the numerical models are shown in Figure 2. The outer contours of the model are defined as radiation boundaries to emulate an infinitely large medium, so that the Sommerfeld condition is satisfied. An acoustic impedance domain definition has been used for all opaque Fresnel regions of each lens. In the case of the SZP lens, the contours are considered infinitely rigid, applying the Neumann condition (the normal sound velocity at the contour is zero).

**Figure 2.** Scheme of the finite element method (FEM) conditions.

#### **3. Experimental Set-Up**

The results obtained from the theoretical models must be validated against other solutions, such as numerical models and experimental measurements. In this sense, obtaining experimental results is fundamental to validate the numerical models. A complex measurement and acquisition system is needed to perform the experiments, given the technical difficulty of controlling underwater devices. The Center for Physics Technologies: Acoustics, Materials and Astrophysics of the Universitat Politècnica de València has a robotized and automated system for high-precision ultrasound measurements. The robot is built to match the size of the immersion tank where the tests are carried out, which contains distilled and degassed water and measures 0.5 m wide by 0.5 m high by 1 m long. These dimensions mean that the immersion tank must contain around 200 L of distilled water to be functional, allowing both transducers and devices to be completely submerged and avoiding reflections due to the impedance changes produced by a change of medium.

The measurement system is composed of a fixed emitter and a hydrophone coupled to the robotic system. This system obtains reliable and precise results that allow the evaluation of the acoustic phenomena involved in these types of lenses. A plane immersion piston transducer built by Imasonic, with a central working frequency of 250 kHz and an active diameter of 32 mm, has been used as the emitter. A Precision Acoustics 1.0 mm Needle Hydrophone is used as the receiver. This hydrophone is capable of measuring high frequencies, even at very weak signal levels. The sensitivity of the hydrophone is 850 nV/Pa (−241.4 dB re 1 V/μPa) with a tolerance of ±3 dB. The frequency response is flat to ±2 dB between 3 and 12 MHz and to ±4 dB between 200 kHz and 15 MHz; the bandwidth ranges from 5 kHz to 15 MHz. Figure 3 shows the experimental set-up during a measurement. Two different configurations are used to generate and amplify the signals. The first uses an external function generator connected to a high-power amplifier. The second uses a pulse generator (Panametrics 5077PR) with an integrated amplifier, which generates pulses with frequencies between 100 kHz and 20 MHz, a pulse repetition frequency (PRF) from 100 Hz to 5 kHz and a pulse amplitude between 100 and 400 V.

**Figure 3.** Experimental set-up.

All the results shown below are obtained for a working frequency of 250 kHz. For the experimental comparison, three lenses have been implemented, two made of PLA and one of brass. Every lens considered in this work was designed with 11 Fresnel zones and an outer radius of 88.8 mm. The thickness of the brass lens was 1 mm; for manufacturing reasons, the other lenses had a total thickness of 5 mm. Figure 4 shows the PLA and brass lenses. The two PLA lenses are identical except for an inner air chamber, introduced to achieve a higher impedance contrast; since the chamber is internal, the two lenses cannot be distinguished by the naked eye.

**Figure 4.** Implemented lenses, (**a**) PLA and (**b**) brass.

#### **4. Results**

Intensity gain maps and longitudinal axis cuts have been calculated to compare all the lenses coherently. One parameter which can be used to evaluate the focusing capacity of a lens is the intensity gain (*G*). The intensity gain relates the intensity with the lens (*I*) to the intensity without the lens (*I*<sub>0</sub>), as shown in Equation (7).

$$G(dB) = 10 \cdot \log\_{10}(I/I\_0) \tag{7}$$
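Equation (7) is a simple decibel ratio; a minimal sketch, with purely illustrative intensity values:

```python
import math

def intensity_gain_db(I, I0):
    """Eq. (7): intensity gain in dB, relative to the field without the lens."""
    return 10 * math.log10(I / I0)

# A focal intensity 20x the free-field intensity (illustrative numbers only)
print(intensity_gain_db(20.0, 1.0))   # ~13 dB
print(intensity_gain_db(1.0, 1.0))    # 0 dB: lens adds nothing
```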

Intensity gain values have been calculated from Equation (7). Figure 5 shows the intensity gain for longitudinal cuts along the Z axis, for both numerical and experimental results. It can be seen from Figure 5a that a higher impedance contrast, as in the case of the ideal SZP or the brass FZP, gives rise to higher gain levels. As expected, the lowest gains are obtained with materials whose impedance contrast values lie between 1 and 2, while for impedance contrast values lower than 1 the intensity gain increases. The ABS, PLA, PLA-Air-PLA and PMMA polymers, according to the values shown in Table 1, are not able to achieve as high an intensity gain as the brass FZP. Comparing the experimental results obtained for brass and PLA (Figure 5b) with the numerical ones (Figure 5a) shows good agreement. From Figure 5b, it is observed that the air-chamber PLA lens has a higher intensity gain than the full-PLA lens. This can be explained by the introduction of the air layer: due to its low acoustic impedance, it blocks the transmission of the ultrasonic waves, bringing the behavior of the lens closer to that of an ideal SZP. Nevertheless, a focal length displacement of 1.66*λ* is observed in the FZP built with the air chamber and PLA. In this case, the displacement is due to the new three-layer configuration (PLA-Air-PLA): the resolution of the 3D printer and the wall width needed to avoid porosities mean that there is a PLA interface between the host medium and the air chamber.

**Figure 5.** Intensity gain longitudinal cuts for (**a**) FEM results and (**b**) experimental results.

Figure 6 shows four intensity gain maps, the first three obtained experimentally and the fourth numerically. The experimental ones correspond to PLA, PLA-Air-PLA and brass, while the numerical one has been obtained using an ideal SZP. It has been verified that the results obtained with brass resemble those of the ideal SZP, due to the rigidity of the material. On the other hand, the PLA lens results show a diminished intensity gain. This intensity gain level can be improved using a PLA-Air-PLA lens. All the lenses are designed with a focal length located at 8.33*λ* for a working frequency of 250 kHz. When the lens is able to block destructive interference, it is possible to locate the focus at *FL*; this occurs for brass and for the ideal SZP. Owing to its lack of blocking capability, the full-PLA FZP cannot impede the transfer of the incident pressure wave, generating aberrations at the *FL*.

**Figure 6.** Intensity gain maps for experimental measurements and ideal SZP numerically obtained (FEM).

#### **5. Conclusions**

Non-metallic materials can be used for the construction of acoustic lenses. Three alternative materials, compatible with magnetic resonance, have been proposed instead of brass lenses. It has been verified that the higher the impedance contrast of the materials, the higher the intensity gain levels. The PMMA lens has a higher intensity level than the ABS and PLA ones, because it has a slightly higher impedance contrast. However, the use of an air chamber inside the PLA lens increases the intensity gain levels, since impedance contrast values of less than one imply blocking of the waves. PLA is a biocompatible material and is cheaper than PMMA, and 3D printing opens the field to new MRI-compatible lens designs. Moreover, since PLA is biodegradable, it is an environmentally friendly material, an important point in procedures that generate waste. Nevertheless, the manufacture of PLA lenses requires great care because of microporosities that could appear; pores can allow water to enter the lens, drastically reducing the blocking capacity. In addition, polymers such as PLA, ABS or PMMA are more affordable than metal plates, which will lower the cost of producing HIFU treatment devices based on acoustic lenses. For this reason, PLA is proposed as an MRI-compatible material with great potential for therapeutic applications of ultrasound focusing.

**Author Contributions:** A.U. and C.R. coordinated the theoretical development, participating in the establishment of the theory principles used in this work, as well as in the drafting of the manuscript. D.T.-S. coordinated experimental development. S.C.-I. developed part of the theory used and designed some characterization. E.S.-A. participated in the analysis of the state of art.

**Funding:** This research was funded by the Spanish Ministerio de Economía y Competitividad (MINECO), grant TEC2015-70939-R.

**Acknowledgments:** This work has been supported by Spanish MINECO (TEC2015-70939-R).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

### **Theoretical and Numerical Estimation of Vibroacoustic Behavior of Clamped Free Parabolic Tapered Annular Circular Plate with Different Arrangement of Stiffener Patches**

#### **Abhijeet Chatterjee 1,\*, Vinayak Ranjan 2, Mohammad Sikandar Azam <sup>1</sup> and Mohan Rao <sup>3</sup>**


Received: 13 September 2018; Accepted: 2 November 2018; Published: 8 December 2018

**Abstract:** This paper compares the vibroacoustic behavior of a tapered annular circular plate with different parabolically varying thicknesses and different combinations of rectangular and concentric stiffener patches, keeping the mass of the plate and the patches constant, for a clamped-free boundary condition. Both numerical and analytical methods are used to analyze the plate. The finite element method (FEM) is used to determine the vibration characteristics, and both the Rayleigh integral and FEM are used to determine the acoustic behavior of the plate. It is observed that the Case II plate, with parabolically decreasing–increasing thickness variation, shows a reduction in frequency parameter in comparison to the other cases for the different stiffener patches. For the acoustic response, the variation of peak sound power level for different combinations of stiffener patches is investigated at different taper ratios. It is found that the Case II plate shows the maximum sound power level among all variations, both for the unloaded tapered plate and for the plate with 2 rectangular and 4 concentric stiffener patches. By contrast, the Case III plate, with parabolically increasing–decreasing thickness variation, is least prone to acoustic radiation for the different combinations of rectangular and concentric stiffener patches. Furthermore, it is shown that at low forcing frequencies the average radiation efficiency remains the same for the different combinations of stiffener patches, but at higher forcing frequencies a higher taper ratio causes higher radiation efficiency, with the radiation peak shifting towards lower frequencies as the stiffness changes with increasing taper ratio. Finally, design options for peak sound power actuation and reduction for different combinations of stiffener patches with different taper ratios are suggested.

**Keywords:** thick annular circular plate; Rayleigh integral; finite element modeling; rectangular and concentric stiffener patches; taper ratio; thickness variation

#### **1. Introduction**

Tapered annular circular plates with different combinations of rectangular and concentric patches have many engineering applications. They are used in many structural components, e.g., buildings, diaphragms and deck plates in launch vehicles, diaphragms of turbines, aircraft and missiles, naval structures, nuclear reactors, optical systems, ships, automobiles and other vehicles, the space shuttle, etc. These tapered plates with different combinations of rectangular and concentric patches are found to have greater resistance to bending, buckling and vibration in comparison to plates of uniform thickness.

Tapered plates with different thickness variations have drawn the attention of many researchers in this field. Moreover, tapered plates with different combinations of rectangular and concentric patches can alter the dynamic characteristics of structures through a change in stiffness. Hence, for practical design purposes, the vibration and acoustic characteristics of such tapered plates are equally important. Several existing works have investigated the vibration response [1–9] of circular or annular plates of tapered or uniform thickness. In terms of acoustic behavior, many researchers have also contributed. Lee and Singh [10] used thin and thick plate theories to determine the sound radiation from out-of-plane modes of a uniform-thickness annular circular plate. Thompson [11] used the Bouwkamp integral to determine the mutual and self-radiation impedances of both annular and elliptical pistons. Levine and Leppington [12] analyzed the sound power generation of a circular plate of uniform thickness using an exact integral representation. Rdzanek and Engel [13] determined the acoustic power output of a clamped annular plate using an asymptotic formula. Wodtke and Lamancusa [14] minimized the acoustic power of circular plates of uniform thickness using damping layer placement. Wanyama [15] studied the acoustic radiation from linearly-varying circular plates. Lee and Singh [16] used the flexural and radial modes of a thick annular plate to determine the self and mutual radiation. Cote et al. [17] studied the vibroacoustic behavior of an unbaffled rotating disk. Jeyraj [18] used an isotropic plate with arbitrarily varying thickness to determine its vibroacoustic behavior using the finite element method (FEM).
Ranjan and Ghosh [19] studied the forced response of a thin plate of uniform thickness with attached discrete dynamic absorbers. Bipin et al. [20] analyzed an isotropic plate with attached discrete patches and point masses with different thickness variations and taper ratios to determine its vibroacoustic response. Lee and Singh [21] investigated annular disk acoustic radiation using structural modes through analytical formulations. Rdzanek et al. [22] investigated the sound radiation and sound power of a planar annular membrane for axially-symmetric free vibrations. Doganli [23] determined the sound power radiated from clamped annular plates of uniform thickness. Nakayama et al. [24] investigated the acoustic radiation of a circular plate for a single sound pulse. Hasegawa and Yosioka [25] determined the acoustic radiation force on a solid elastic sphere. Lee and Singh [26] used a simplified disk brake rotor to investigate acoustic radiation through a semi-analytical method. Thompson et al. [27,28] used a modal approach for different boundary conditions to calculate the average radiation efficiency of a rectangular plate. Rayleigh [29] determined the sound radiation from flat finite structures. Maidanik [30] analyzed the total radiation resistance of ribbed and simple plates using a simplified asymptotic formulation. Heckl [31] used the wave number domain and the Fourier transform to analyze the acoustic power. Williams [32] expressed the wave number as a series in ascending powers to estimate the sound radiation from a planar source. Keltie and Peng [33] analyzed the sound radiation from a plane using cross-modal coupling. Snyder and Tanaka [34] demonstrated the importance of cross-modal contributions for a pair of modes to the total sound power output using modal radiation efficiency. Martini et al. [35] investigated the structural and elastodynamic behavior of rotary transfer machines using a finite element model.
Croccolo et al. [36] determined the lightweight design of modern transfer machine tools using the finite element model. Martini and Troncossi [37] determined the upgrade of an automated line for plastic cap manufacture based on experimental vibration analysis. Pavlovic et al. [38] investigated the modal analysis and stiffness optimization: the case of a tool machine for ceramic tile surface finishing using FEM.

The reviewed literature converges on a common point that has inspired the present paper: the comparison of the vibroacoustic behavior of a parabolic tapered annular circular plate with rectangular and concentric patches attached at different positions. The significance of the paper lies in this comparison being made for a clamped-free tapered plate while keeping the mass of the plate and patches constant. The paper therefore presents a vibroacoustic analysis of a clamped-free parabolic tapered annular circular plate with different arrangements of rectangular and concentric stiffener patches at different positions, with different taper ratios, under time-varying harmonic excitation.

#### **2. Mathematical Modeling and Analysis**

#### *2.1. Plate Free Vibration*

Let us consider a plate with outer radius 'a' and inner radius 'b', as shown in Figure 1. A modal analysis is performed to estimate the natural frequencies and mode shapes of the plate, which are given by the following equation:

$$\left( [K] - \omega^2 [M] \right) \psi\_{mn} = 0 \tag{1}$$

where [*M*] is the mass matrix, [*K*] is the stiffness matrix, *ψmn* is the mode shape and *ω* is the corresponding natural frequency of the plate in rad/s. The non-dimensional frequency parameter *λ*<sup>2</sup> is given by the following equation:

$$
\lambda^2 = \omega a^2 \sqrt{\frac{\rho h}{D}} \tag{2}
$$

where D = *Eh*<sup>3</sup>/12(1 − *υ*<sup>2</sup>) is the flexural rigidity, a = outer radius, E = Young's modulus of elasticity, *υ* = Poisson's ratio, h = thickness of the plate and *ρ* = density of the plate.
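As a quick numerical illustration of Equation (2), the frequency parameter can be evaluated directly; the plate properties below are illustrative values for a steel plate, not data taken from the paper.

```python
import math

def flexural_rigidity(E, h, nu):
    """D = E h^3 / (12 (1 - nu^2))."""
    return E * h**3 / (12 * (1 - nu**2))

def frequency_parameter(omega, a, rho, h, D):
    """Non-dimensional frequency parameter lambda^2, Eq. (2)."""
    return omega * a**2 * math.sqrt(rho * h / D)

# Illustrative steel plate (assumed values):
E, nu, rho = 210e9, 0.3, 7850.0   # Pa, -, kg/m^3
a, h = 0.5, 0.01                  # outer radius, thickness (m)
D = flexural_rigidity(E, h, nu)
print(frequency_parameter(2 * math.pi * 50.0, a, rho, h, D))  # at 50 Hz
```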

**Figure 1.** Sound radiation investigated for thick annular circular plate in Z direction enclosed in a sphere.

#### *2.2. Analytical and Numerical Formulation for Acoustic Radiation from Tapered Annular Circular Plate*

Consider an annular circular plate of inner radius 'b' and outer radius 'a' in flexural vibration, set on a flat rigid baffle of infinite extent, as shown in Figure 1. Acoustic scattering from the edges of the vibrating structure is neglected in this study. Let P be the sound pressure amplitude, Ss be the surface of the sound source, *q* be the free-field Green's function, *ls* and *lp* be the position vectors of the source and receiver, and *f* be the surface normal vector at *ls*; the structural sound radiation can then be obtained from the Rayleigh integral [10], as given by Equation (3):

$$P(l\_p) = \int\_{S\_s} \left( \frac{\partial q}{\partial f}(l\_p, l\_s) P(l\_s) - \frac{\partial P}{\partial f}(l\_s)\, q(l\_p, l\_s) \right) ds(l\_s) \tag{3}$$

*Appl. Sci.* **2018**, *8*, 2542

The sound pressure, radiated from non-planar source in far and free field environment based on plane wave approximation can be expressed by Equation (4):

$$P(l\_p) = \frac{\rho\_0 c\_0 B}{4\pi} \int\_{S\_s} \frac{e^{iB|l\_p - l\_s|}\, \dot{u}(l\_s)}{|l\_p - l\_s|} (1 + \cos \eta)\, dS \tag{4}$$

Let *ρ*<sub>0</sub> be the mass density of air, *c*<sub>0</sub> be the speed of sound in air, B be the corresponding acoustic wave number, and U̇ and u̇ be the corresponding vibratory velocity amplitude and spatially dependent vibratory velocity amplitude in the *z* direction at *ls*. Then, for a normal plane [10], the modal sound pressure Pmn for an annular plate in the (m, n)th mode is obtained by simplifying Equation (4) with the Hankel transform, and is expressed by Equations (5) and (6):

$$P\_{mn}(R, \alpha, \beta) = \frac{\rho\_0 c\_0 B\_{mn} e^{iB\_{mn} R\_d}}{2 R\_d} \cos n\beta\, (-i)^{n+1} A\_n \left[ \dot{u}(l) \right] (1 + \cos \eta) \tag{5}$$

$$A\_n\left[\dot{u}(l)\right] = \int\_0^\infty \dot{u}(l)\, J\_n(B\_r l)\, l\, dl; \quad B\_r = B\sin\alpha; \quad R\_d = \left|l\_p - l\_s\right| \tag{6}$$

where J<sub>n</sub> is the Bessel function of order n, (α, β) are the cone and azimuthal angles of the observation positions, respectively, η is the angle between the surface normal vector and the vector from the source position to the receiver position, and A<sub>n</sub> is the Hankel transform. Under the far-field condition, Rd in the denominator is approximated by R, where R = |*lp*| is taken as the radius of the sphere. Consider that on a sphere Sv the observation positions are represented by points at equal angular increments (Δ*ϕ*, Δ*α*), where 'Δ*ϕ*' represents a small increment in the circumferential direction of the plate; at all of the observation positions, the sound pressure is given by Equations (4)–(6). The modal sound power Smn for the (m, n)th mode [10,16] in the far field is given by Equation (7):

$$S\_{mn} = \left(D\_{mn} S\_{v}\right)\_{s} = \frac{1}{2} \int \int \frac{P\_{mn}^2}{\rho\_0 c\_0} R^2 \sin \alpha\, d\alpha\, d\beta \tag{7}$$

where the acoustic intensity is represented by Dmn and the area of the control surface by Sv. The radiation efficiency σmn of the plate [10] is given by Equation (8):

$$\sigma\_{mn} = \frac{S\_{mn}}{\rho\_0 c\_0\, 2\pi(a^2 - b^2) \left\langle \dot{u}\_{mn}^2 \right\rangle\_{ts}}; \qquad \left\langle \dot{u}\_{mn}^2 \right\rangle\_{ts} = \frac{1}{2\pi(a^2 - b^2)} \int\_b^a \int\_0^{2\pi} \dot{u}^2\, l\, d\varphi\, dl \tag{8}$$

where ⟨u̇<sup>2</sup><sub>mn</sub>⟩<sub>ts</sub> is the spatially averaged mean-square velocity over the two normal surfaces of the plate. Considering the plate thickness (h), the modal sound power is represented by the sum of the sound radiation [16] from the two normal surfaces of the plate (at Z = 0.5 h and −0.5 h), as given by Equations (9)–(11):

$$P\_{mn}(R, \alpha, \beta) = (1+\cos\alpha)P\_{mn}^{s}(R, \alpha, \beta) + (1-\cos\alpha)P\_{mn}^{o}(R, \alpha, \beta) \tag{9}$$

$$P\_{mn}^{s}(R, \alpha, \beta) = \frac{\rho\_0 c\_0 B\_{mn} e^{iB\_{mn}R}}{2R} e^{-iB\_{mn} \left(\frac{h}{2}\right) \cos \alpha} \cos n\beta\, (-i)^{n+1} A\_n \left[\dot{u}(l)\right] \tag{10}$$

$$P\_{mn}^{o}(R, \alpha, \beta) = \frac{\rho\_0 c\_0 B\_{mn} e^{iB\_{mn}R}}{2R} e^{-iB\_{mn} \left(\frac{h}{2}\right) \cos \alpha} \cos n(\beta + \phi)\, (-i)^{n+1} A\_n \left[\dot{u}(l)\right] \tag{11}$$

where Bmn is the acoustic wave number corresponding to the (m, n)th mode, and the superscripts s and o in Equations (10) and (11) denote the source side and the side opposite to the source, respectively.

For the numerical analysis, ANSYS (ANSYS, Inc., Canonsburg, PA, USA) is used as the tool. The plates with rectangular and concentric patches are modeled in ANSYS with SOLID185 8-node brick elements having three degrees of freedom at each node. The mesh is not exactly identical for all the cases of thickness variation and stiffener arrangement. The numbers of elements and nodes for the uniform unloaded plate end up being 5883 and 1664, respectively. For plates with the different cases of thickness variation and different stiffeners, the mesh was kept as close as possible to that of the unloaded plate. For the vibration analysis, the model of the Case I plate with 1 rectangular stiffener consists of 5685 elements with 1638 nodes, whereas the Case I plate with 1 concentric stiffener has 5524 elements and 1618 nodes. For the other combinations of rectangular and concentric stiffeners with different parabolic thicknesses, a variation of 5% of the mesh from that of the unloaded plate is allowed. The numerical results obtained using FEM are compared with the existing literature. The structure is modeled such that the total volume of the plate plus stiffeners equals the total volume of the uniform unloaded plate; as a result, the total mass of the plate plus stiffeners equals the total mass of the uniform unloaded plate, so the mass is constant for all cases. FLUID30 and FLUID130 elements are used to create the acoustic medium around the plate: FLUID30 for the fluid-structure interaction, and FLUID130 on the outer sphere surface to impose a condition of infinite space around the source and to prevent the back reflection of sound waves towards the source. For the acoustic analysis, the numbers of elements and nodes for the uniform unloaded plate end up being 14,680 and 3639, respectively.
For a Case I plate with 1 rectangular stiffener, after proper convergence of the model, the numbers of elements and nodes are 14,124 and 3465, respectively, while for a Case I plate with 1 concentric stiffener they are 13,934 and 3345, respectively. Again, for the other combinations of rectangular and concentric stiffeners with different parabolic thickness, a variation of up to 5% of the mesh from that of the unloaded plate is taken. The plate vibrates in air with density *ρ*<sub>0</sub> = 1.21 kg/m<sup>3</sup>; at 20 °C, the speed of sound *c*<sub>0</sub> is taken as 343 m/s. The structural damping coefficient of the plate is assumed to be 0.01.

#### *2.3. Thickness Variation of the Plate*

In this study, three different parabolic thickness variations of the plate are considered for analysis, as reported in Figure 2. The thickness varies in the radial direction, keeping the total mass of the plate plus patches constant. In the radial direction the plate thickness is given by h<sub>x</sub> = h[1 − Tx{f(x)}<sup>n</sup>], where h is the maximum thickness of the plate and

$$f(\mathbf{x}) = \begin{cases} 0, & \mathbf{x} = b \\ 1, & \mathbf{x} = a \end{cases} \qquad f(\mathbf{x}) = \frac{\mathbf{x} - b}{a - b}, \quad b \le \mathbf{x} \le a \tag{12}$$

The taper parameter or taper ratio (Tx) is given by the equation:

$$T\_x = \left(1 - \frac{h\_{\rm min}}{h}\right) \tag{13}$$

The Case I plate of Figure 2, with parabolically decreasing thickness variation, is given by the equation:

$$h\_{\mathbf{x}} = h \left\{ 1 - T\_{\mathbf{x}} \left( \frac{\mathbf{x} - b}{a - b} \right)^{n} \right\} \tag{14}$$

The Case II (parabolically decreasing–increasing) and Case III (parabolically increasing–decreasing) thickness variations of Figure 2 are given by the equations:

$$h\_{\mathbf{x}} = h \left\{ 1 - T\_{\mathbf{x}} \left( 1 - abs \left( 1 - 2 \frac{\left( \mathbf{x} - b \right)}{\left( a - b \right)} \right) \right)^{n} \right\} \tag{15}$$

$$h\_{\mathbf{x}} = h \left\{ 1 - T\_{\mathbf{x}}\, abs \left( 1 - 2 \frac{(\mathbf{x} - b)}{(a - b)} \right)^{n} \right\} \tag{16}$$

where n = 2 for the parabolic thickness variation. The total volume of the plate plus patches, equal to that of the unloaded plate, is kept constant and is given by the equation:

$$\text{Volume} = \pi (a^2 - b^2) \text{h} = \int\limits\_{b}^{a} (a^2 - b^2) h\_x dx \tag{17}$$
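The three taper profiles of Equations (14)–(16) can be sketched directly; a minimal Python implementation is shown below. The function name and the example dimensions (a = 0.5 m, b = 0.1 m, h = 5 mm) are illustrative assumptions, not values from Table 1.

```python
import numpy as np

def thickness_profile(x, a, b, h, Tx, case, n=2):
    """Thickness h_x along the radius for the three taper cases of Figure 2.
    a, b: outer and inner radii; h: maximum thickness; Tx: taper ratio
    (Eq. 13); n = 2 gives the parabolic variation used in the paper."""
    f = (x - b) / (a - b)                      # f(x): 0 at x = b, 1 at x = a (Eq. 12)
    if case == 1:                              # Eq. (14): parabolically decreasing
        return h * (1 - Tx * f**n)
    if case == 2:                              # Eq. (15): decreasing-increasing
        return h * (1 - Tx * (1 - np.abs(1 - 2 * f))**n)
    if case == 3:                              # Eq. (16): increasing-decreasing
        return h * (1 - Tx * np.abs(1 - 2 * f)**n)
    raise ValueError("case must be 1, 2 or 3")

# Illustrative (assumed) dimensions: a = 0.5 m, b = 0.1 m, h = 5 mm, Tx = 0.75
a, b, h, Tx = 0.5, 0.1, 0.005, 0.75
x = np.linspace(b, a, 201)
h1 = thickness_profile(x, a, b, h, Tx, case=1)
# Eq. (13): the minimum thickness recovers the taper ratio, h_min = h (1 - Tx)
assert np.isclose(1 - h1.min() / h, Tx)
```

For Case II the minimum thickness h(1 − Tx) occurs at mid-span, and for Case III at the two edges, matching the decreasing–increasing and increasing–decreasing descriptions.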

In this paper, the frequency parameters, sound power levels, average radiation efficiency and peak sound power level of parabolically tapered plates are compared. The out-of-plane (m, n)th modes in the *Z* direction are considered for plates with rectangular and concentric stiffener patches attached at different positions and with different parabolically varying thickness. The plate is tapered with taper ratios of 0.25, 0.50 and 0.75. The mass of the plate plus rectangular or concentric patches is kept constant throughout the analysis. An inner-clamped, outer-free boundary condition is used. Three arrangements of plates with different combinations of rectangular or concentric stiffener patches are considered, as shown in Figure 3. The combinations are chosen so that the mass of the plate plus patches equals the mass of the unloaded plate; hence the mass is the same in all three cases. The specifications and material properties of the annular circular plate with attached rectangular and concentric stiffener patches are reported in Table 1. The Rayleigh integral is used for the sound power calculation, and ANSYS is used as the computational tool.
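The Rayleigh-integral sound power calculation mentioned above can be sketched in discretized form via the standard radiation-resistance-matrix formulation (Elliott–Johnson style). This is a generic sketch, not the paper's exact implementation; the function names, the uniform-element-area assumption, and the rms-amplitude convention are ours.

```python
import numpy as np

def radiation_resistance_power(centres, S, v, freq, rho0=1.21, c0=343.0):
    """Radiated sound power of a baffled vibrating surface from elemental
    normal velocities, via the radiation-resistance-matrix discretization of
    the Rayleigh integral. centres: (N, 3) element centres [m]; S: uniform
    element area [m^2]; v: (N,) complex rms normal velocities [m/s]."""
    omega = 2 * np.pi * freq
    k = omega / c0                                  # acoustic wave number
    R = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    # M_ij = (omega^2 rho0 S^2 / 4 pi c0) sin(kR)/(kR); np.sinc(x) = sin(pi x)/(pi x)
    M = (omega**2 * rho0 * S**2 / (4 * np.pi * c0)) * np.sinc(k * R / np.pi)
    return float(np.real(np.conj(v) @ M @ v))       # W = v^H M v

def sound_power_level(W, W_ref=1e-12):
    """Sound power level in dB re 10^-12 W, the reference used in the paper."""
    return 10 * np.log10(W / W_ref)
```

The diagonal of M uses sinc(0) = 1, the low-frequency limit of an element's self-radiation resistance; for a single small element the power reduces to the monopole formula ω²ρ₀S²|v|²/(4πc₀).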

**Figure 2.** Plate with different parabolic varying thickness variations.

**Table 1.** The specifications and the material properties of an annular circular plate with different attachment of rectangular and concentric stiffener patches.


**Figure 3.** Plate with different arrangements of rectangular and concentric stiffener patches with (0, 3) modes.

#### **3. Results and Discussion**

#### *3.1. Validation of Natural Frequency Parameter and Acoustic Power Calculation*

In this paper, the natural frequency parameter of a uniform unloaded annular circular plate is validated against the published results of Lee et al. [10] and is reported in Table 2. In reference [10], Lee et al. provide solutions for the natural frequency parameter of a uniform annular circular plate using thick- and thin-plate theories. In our study, the results were calculated using FEM with the same plate dimensions as those of Lee et al. From Table 2, it is clear that the results obtained here are almost equal to the published results [10]. For the acoustic power calculation, the computed analytical and numerical results are compared with the published experimental results [10] in Figure 4, and a good agreement is seen.

**Table 2.** Validation and comparison of natural frequency parameter *λ*<sup>2</sup> of uniform clamped-free annular circular plate obtained in the present work with that of the published result of Lee et al. [10].


**Figure 4.** Comparison of sound power level analytically, numerically and experimentally for unloaded plate with uniform thickness for taper ratio Tx = 0.00.

#### *3.2. Effect of Natural Frequency Parameter (λ2) of Plate with Different Combinations of Rectangular and Concentric Stiffener Patches with Different Taper Ratios*

In this paper, the effect of the natural frequency parameter (*λ*<sup>2</sup>) is investigated for annular plates with different attachments of rectangular and concentric stiffener patches at different positions. The analysis covers the different cases of parabolic thickness variation, keeping the mass of the plate plus patches constant. Table 3 numerically compares the first four natural frequency parameters of a uniform unloaded plate for taper ratio Tx = 0.00 with those of plates with different attachments of rectangular and concentric stiffener patches, along with the percentage variation of *λ*<sup>2</sup>. It is clear from Table 3 that plates with the different arrangements of rectangular stiffener patches have the same natural frequency parameters as the unloaded plate for taper ratio Tx = 0.00. However, for concentric stiffener patches, the natural frequency parameter decreases as patches are added and is lowest for 4 concentric stiffener patches. In Table 3, the negative variation of the frequency parameter is calculated as [*λ*<sup>2</sup>(stiffener) − *λ*<sup>2</sup>(original)]/*λ*<sup>2</sup>(original) × 100. Figures 5 and 6 compare the negative % variation of the natural frequency parameter across modes for a plate with taper ratio Tx = 0.00 and different attachments of rectangular and concentric stiffener patches. Figure 5 shows that, due to the lower stiffness associated with these modes, the (0, 2) mode of the plate with 4 rectangular stiffener patches and the (0, 0) mode of the plate with 1 rectangular stiffener patch show the lowest percentage variation of *λ*<sup>2</sup>. However, due to the greater stiffness associated with the (0, 1) mode of the plate with 1 rectangular stiffener patch, it shows the highest percentage variation of *λ*<sup>2</sup>.
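The percentage-variation measure used in Table 3 is a one-line computation; the sketch below implements it, with illustrative values that are not taken from the paper's tables.

```python
def pct_variation(lam2_stiffener, lam2_original):
    """Percentage variation of the natural frequency parameter relative to
    the unloaded plate: (lambda^2_stiffener - lambda^2_original)
    / lambda^2_original * 100. Negative values indicate a reduction."""
    return (lam2_stiffener - lam2_original) / lam2_original * 100.0
```

For example, `pct_variation(9.5, 10.0)` gives −5.0, which Table 3 would report as a 5% negative variation.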
Furthermore, from Figure 6 it is observed that for concentric stiffener patches, the (0, 1) mode shows the highest percentage variation of *λ*<sup>2</sup> for all combinations of patches, due to the greater stiffness associated with this mode. For the remaining modes (0, 0), (0, 2) and (0, 3), the stiffness decreases and, as a result, so does the percentage variation of *λ*<sup>2</sup>. Figure 7 numerically compares the natural frequency parameters *λ*<sup>2</sup> across modes for an unloaded plate and for plates with 4 rectangular and 4 concentric stiffener patches for taper ratio Tx = 0.00. It is clear from Figure 7 that the natural frequency parameters with 4 rectangular stiffener patches are almost the same as those of the unloaded plate, whereas the plate with 4 concentric stiffener patches shows a small decrease in the frequency parameter. Tables 4–6 numerically compare the first four natural frequency parameters *λ*<sup>2</sup> of plates with different combinations of rectangular and concentric stiffener patches for the different cases of thickness variation and taper ratios. It is observed from Tables 4–6 that the natural frequency parameter for a plate with concentric patches reduces more than for rectangular patches as the taper ratio increases, for all thickness variations. This may be due to the lower stiffness of the plate associated with concentric patches. Furthermore, the frequency parameter for a Case II plate (parabolically decreasing–increasing thickness variation) reduces more than for a Case I plate (parabolically decreasing thickness variation) for all combinations of rectangular and concentric stiffener patches, owing to the lower stiffness of the Case II plate. The frequency parameter for a Case III plate (parabolically increasing–decreasing thickness variation) with different attachments of rectangular and concentric stiffener patches is almost the same as that of the uniform unloaded plate, due to the greater stiffness of the Case III plate. However, for all combinations of rectangular and concentric stiffener patches, plates with the different parabolic thickness variations alter their modes at higher taper ratios.


**Table 3.** Numerical comparison of first four natural frequency parameter *λ*<sup>2</sup> of uniform unloaded plate with different attachment of rectangular and concentric stiffener patches for taper ratio, Tx = 0.00.

**Figure 5.** Comparison of % variation of natural frequency parameter with different modes for plate with taper ratio, Tx = 0.00 and with different attachments of rectangular stiffener patches.

**Figure 6.** Comparison of % variation of natural frequency parameter with different modes for a uniform plate with taper ratio, Tx = 0.00 and with different attachment of concentric stiffener patches.

**Figure 7.** Comparison of variation of different frequency parameter with different modes for an unloaded plate and for a plate with 4 rectangular and 4 concentric stiffener patches.



**Tables 4–6.** Numerical comparison of the first four natural frequency parameters *λ*<sup>2</sup> of plates with different combinations of rectangular and concentric stiffener patches for the different cases of thickness variation and taper ratios.

#### *3.3. Acoustic Response Solution of Tapered Annular Circular Plate with Different Combination of Rectangular and Concentric Stiffener Patches with Different Taper Ratios*

In this paper, the sound power level (dB, reference = 10<sup>−12</sup> watts) of an annular circular plate with different attachments of rectangular and concentric stiffener patches is estimated. The sound power level is analyzed for all cases of parabolic thickness variation under transverse vibration, with the taper ratio ranging from 0.00 to 0.75. The sound power level is investigated by applying a 1 N concentrated load under time-varying harmonic excitation at different nodes, over a harmonic frequency range of 0–8000 Hz, to determine the sound radiation characteristics. The Case I plate with parabolically decreasing thickness variation is taken for the convergence study. Figures 8 and 9 compare the sound power level for a Case I plate obtained analytically and numerically for taper ratio Tx = 0.75 with 4 rectangular stiffener patches and 4 concentric stiffener patches, respectively, for different modes. A good agreement of the computed results is seen in the comparison of sound power, as depicted in Figures 8 and 9. Figures 10–12 show the numerical comparison of the sound power level for a Case I plate with different combinations of rectangular stiffener patches for different taper ratios and modes under forced excitation. From Figure 10, for a sound power level up to 30 dB, no broad range of frequencies is obtained for any taper ratio for the plate with 1 rectangular stiffener patch. However, for a sound power level up to 40 dB, all taper ratios Tx = 0.00, 0.25, 0.50 and 0.75 give a broad range of frequencies, in frequency band A only, for the plate with 1 rectangular stiffener patch, as reported in Figure 10.

**Figure 8.** Comparison of sound power level analytically and numerically for annular plate attached with 4 rectangular stiffener patches and having parabolic decreasing thickness variations (Case I) for taper ratio Tx = 0.75.

**Figure 9.** Comparison of sound power level analytically and numerically for annular plate attached with 4 concentric stiffener patches and having parabolic decreasing thickness variations (Case I) for taper ratio Tx = 0.75.

**Figure 10.** Numerical comparison of sound power level for annular plate attached with 1 rectangular stiffener patch and having parabolic decreasing thickness variations (Case I) for different taper ratios Tx.

**Figure 11.** Numerical comparison of sound power level for annular plate attached with 2 rectangular stiffener patches and having parabolic decreasing thickness variations (Case I) for different taper ratios Tx.

**Figure 12.** Numerical comparison of sound power level for annular plate attached with 4 rectangular stiffener patches and having parabolic decreasing thickness variations (Case I) for different taper ratios Tx.

It is noteworthy that for a sound power level up to 50 dB, a broader range of frequencies is obtained in the different frequency bands B, C and D, as reported in Figure 10. From Figure 11, it is apparent that for a sound power level up to 20 dB, no broad range of frequencies is obtained for the plate with 2 rectangular stiffener patches. However, for a sound power level up to 30 dB, a broad range of frequencies is obtained in frequency band A only, with taper ratios Tx = 0.00, 0.25, 0.50 and 0.75 as the available design alternatives. For a sound power level up to 50 dB, a broader range of frequencies is obtained in frequency bands B and C, as reported in Figure 11. From Figure 12, for a sound power level up to 10 dB, no broad range of frequencies is obtained for the plate with 4 rectangular stiffener patches. However, for a sound power level up to 20 dB, a broad range of frequencies is obtained in frequency band A only, with all taper ratios Tx = 0.00, 0.25, 0.50 and 0.75; this is therefore the available design alternative. For a sound power level up to 40 dB, more design options are available in frequency bands B, C and D, as reported in Figure 12. Furthermore, Figures 13–15 show the numerical comparison of the sound power level for a Case I plate with different combinations of concentric stiffener patches for different taper ratios. From Figure 13, for a sound power level up to 30 dB, there are no design options at any taper ratio for the plates with 1 and 4 concentric stiffener patches, and for 2 concentric stiffener patches no sound power level up to 10 dB is found.
However, for a sound power level up to 40 dB, all taper ratios Tx = 0.00, 0.25, 0.50 and 0.75 are design options in frequency bands A and B for the plate with the 1 concentric stiffener patch combination, as reported in Figure 13. It is noteworthy that for a sound power level up to 50 dB, more design options are available in frequency bands C, D and E, as reported in Figure 13. From Figure 14, it is apparent that for a sound power level up to 20 dB, only frequency band A offers taper ratios Tx = 0.00, 0.25, 0.50 and 0.75 as available design alternatives for the plate with the 2 concentric stiffener patches combination. However, for a sound power level up to 30 dB, wider frequency bands B, C and D are obtained for the different taper ratios, as reported in Figure 14. From Figure 15, a sound power level up to 40 dB is possible in frequency band A only, with all taper ratios Tx = 0.00, 0.25, 0.50 and 0.75, and this is therefore the available design alternative for the plate with the 4 concentric stiffener patches combination. However, for a sound power level up to 60 dB, a broader range of frequencies, denoted B and C, is obtained for all taper ratios, as reported in Figure 15.
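The band-reading applied to Figures 10–15 ("for a sound power level up to X dB, frequency bands … are available") amounts to finding the contiguous frequency intervals where the SPL curve stays at or below a cap. A minimal helper, with illustrative data not taken from the figures:

```python
import numpy as np

def bands_below(freqs, spl, cap_db):
    """Contiguous frequency bands in which the sound power level stays at or
    below cap_db. freqs: sorted frequencies in Hz; spl: matching sound power
    levels in dB. Returns a list of (f_start, f_end) tuples."""
    freqs = np.asarray(freqs, dtype=float)
    below = np.asarray(spl, dtype=float) <= cap_db
    bands, start, prev = [], None, None
    for f, ok in zip(freqs, below):
        if ok and start is None:          # band opens
            start = f
        elif (not ok) and start is not None:
            bands.append((start, prev))   # band closes at the previous point
            start = None
        prev = f
    if start is not None:                 # band still open at the last point
        bands.append((start, prev))
    return bands
```

Running this over a computed SPL curve for each taper ratio reproduces the "design alternative" bands discussed above.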

**Figure 13.** Numerical comparison of sound power level for annular plate attached with 1 concentric stiffener patch and having parabolic decreasing thickness variations (Case I) for different taper ratios Tx.

**Figure 14.** Numerical comparison of sound power level for annular plate attached with 2 concentric stiffener patches and having parabolic decreasing thickness variations (Case I) for different taper ratios Tx.

**Figure 15.** Numerical comparison of sound power level for annular plate attached with 4 concentric stiffener patches and having parabolic decreasing thickness variations (Case I) for different taper ratios Tx.

It may be inferred from Figures 10–15 that the different combinations of rectangular and concentric stiffener patches play a significant role in sound power reduction in the different frequency bands. For a Case I plate, the 4 rectangular stiffener patches combination gives the maximum sound power level reduction compared with the 1 and 2 rectangular stiffener patch combinations; likewise, the plate with 4 concentric stiffener patches shows the lowest sound power among the concentric combinations. However, the stiffness contribution of the various taper ratios has a very limited impact on sound power level reduction compared with that of the modes and excitation locations. Furthermore, from Figures 10–15, it is observed that for excitation frequencies up to 2000 Hz, the different combinations of rectangular and concentric stiffener patches and the stiffness variation due to different taper ratios do not have a significant effect on sound power radiation for the clamped-free boundary condition. However, when the excitation frequency increases beyond 2000 Hz and up to the first peak, the sound power level is higher only for higher taper ratios for a Case I plate with either 1 rectangular or 1 concentric stiffener patch, and variation of the sound power level peaks with taper ratio is observed for plates with 2 rectangular or 2 concentric stiffener patches and with 4 rectangular or 4 concentric stiffener patches. For a Case II plate, beyond a forcing frequency of 2000 Hz, the highest sound power level is associated with the 2 rectangular stiffener patches combination, while for a Case III plate, the sound power level decreases for all combinations of rectangular stiffener patches with increasing taper ratio.
A similar effect is observed for plates with concentric stiffener patches, where the plate with 2 concentric stiffener patches has the highest radiated power. Again, for a Case III plate, the sound power decreases for all combinations of concentric stiffener patches. Furthermore, the different modes influence the sound power peaks, as evident from Figures 10–15. The sound power level peaks obtained for modes (0, 0) and (0, 1) remain almost the same for different taper ratios. However, no such similarity between modes (0, 0) and (0, 1) is observed for plates with concentric stiffener patches. The sound power level peaks shift towards lower frequencies with increasing taper ratio for all combinations of rectangular and concentric stiffener patches. For frequencies beyond 4000 Hz and up to 8000 Hz, plates with the different combinations of rectangular and concentric stiffener patches and the different taper ratios alter their stiffness at the higher forcing frequencies, and the acoustic power curves tend to intersect each other in this high-forcing region. Table 7 compares the peak sound power level of plates having the different parabolic thickness variations with different combinations of rectangular and concentric stiffener patches for a taper ratio Tx = 0.75. It is interesting to note that the 4 rectangular stiffener patches combination shows the lowest peak sound power level among all cases of thickness variation, with the lowest peak of 77 dB obtained for a Case III plate, whereas the highest peak sound power level of 84 dB is obtained for a Case II plate with the 2 rectangular stiffener patches combination. A similar effect is again observed for the concentric stiffener patch combinations.
The lowest sound power of 76 dB is observed for the plate with the 4 concentric stiffener patches combination, and the highest of 83 dB for the plate with the 1 concentric stiffener patch combination. Figures 16–21 show the numerical comparison of sound power levels for Case I, Case II and Case III plates with different combinations of rectangular and concentric stiffener patches for taper ratio Tx = 0.75. From Figures 16–21, it is observed that for all cases of thickness variation, and for excitation frequencies up to 2000 Hz, the parabolic thickness variation does not have any significant effect on sound power radiation. Furthermore, from Figures 16–18, it is seen that beyond an excitation frequency of 2000 Hz and up to the first peak, a Case II plate with 2 rectangular stiffener patches shows the highest radiated power of 84 dB, compared with 82 dB for a Case I plate with the 1 rectangular stiffener patch combination. However, at this forcing frequency the Case III plate remains unaffected and shows the lowest peak sound level of all cases of thickness variation, so the Case III plate is the lowest sound power radiator among all cases with rectangular stiffener patches. A similar effect is observed for the concentric stiffener patch combinations: beyond 2000 Hz, a Case II plate with 2 concentric stiffener patches is a very good sound radiator at 83 dB, compared with 82 dB for the plate with the 1 concentric stiffener patch combination, while a Case III plate with any combination of concentric stiffener patches is found to be a poor sound radiator.


**Table 7.** Numerical comparison of peak sound power level and radiation efficiency for annular plates having different parabolically varying thickness with different combinations of rectangular and concentric stiffener patches.

**Figure 16.** Numerical comparison of sound power level for annular plate having parabolic decreasing thickness variation (Case I) for different attachments of rectangular stiffener patches for taper ratio Tx = 0.75.

**Figure 17.** Numerical comparison of sound power level for annular plate having parabolic decreasing increasing thickness variation (Case II) for different attachments of rectangular stiffener patches for taper ratio Tx = 0.75.

**Figure 18.** Numerical comparison of sound power level for annular plate having parabolic increasing decreasing thickness variations (Case III) for different combinations of rectangular stiffener patches for taper ratio Tx = 0.75.

**Figure 19.** Numerical comparison of sound power level for annular plate having parabolic decreasing thickness variations (Case I) for different combinations of concentric stiffener patches for taper ratio Tx = 0.75.

**Figure 20.** Numerical comparison of sound power level for annular plate having parabolic decreasing increasing thickness variations (Case II) for different combinations of concentric stiffener patches for taper ratio Tx = 0.75.

**Figure 21.** Numerical comparison of sound power level for annular plate having parabolic increasing decreasing thickness variations (Case III) for different combinations of concentric stiffener patches for taper ratio Tx = 0.75.

Figures 22 and 23 compare the radiation efficiency (σmn) obtained analytically and numerically for a Case I plate with 4 rectangular stiffener patches and 4 concentric stiffener patches, respectively, with parabolically decreasing thickness variation for taper ratio Tx = 0.75. A good agreement of results is seen in the comparison of radiation efficiency, as reported in Figures 22 and 23. Figures 24 and 25 show the variation of radiation efficiency with taper ratio Tx for the different combinations of rectangular and concentric stiffener patches for a Case I plate with parabolically decreasing thickness variation. It is seen that for all combinations, the effect of the taper ratio on the radiation efficiency is independent of the exciting frequency up to 1000 Hz, but beyond 1000 Hz, at a given forcing frequency, a higher taper ratio yields a higher radiation efficiency. The sound power level peaks also shift towards lower frequencies as the taper ratio increases. For frequencies beyond 2000 Hz, the different taper ratios alter the plate stiffness at the higher forcing frequencies, and the radiation efficiency curves tend to intersect each other in this high-forcing region. It is interesting to note that the radiation curve tends to unity in the frequency band 6800–7200 Hz, where a clear peak is seen for all combinations of rectangular and concentric stiffener patches. Furthermore, from Figures 24 and 25 it is seen that the radiation efficiency increases with the taper ratio for all combinations of rectangular and concentric stiffener patches. Of these combinations, a Case II plate with 2 rectangular or 2 concentric stiffener patches delivers the highest radiation efficiency, whereas a Case I plate with either 1 or 4 rectangular or concentric stiffener patches is a moderate radiator, as depicted in Table 7.
However, at higher forcing frequencies, plates with 4 rectangular and 4 concentric stiffener patches show the least radiation efficiency for all cases of thickness variation (Cases I, II and III), as evident from Table 7. Therefore, it is worth noting that a Case III plate shows the lowest radiation efficiency (σmn) of all the thickness variations and is a poor radiation emitter for all combinations of rectangular and concentric stiffener patches. Figure 26 numerically compares the radiation efficiency for plates with 4 rectangular and with 4 concentric stiffener patches for taper ratio Tx = 0.75; the two plates show almost the same radiation efficiency. Figure 27 numerically compares the sound power level for the same two plates, which show almost the same peak for taper ratio Tx = 0.75. Hence, the effect of stiffness variation, along with the modes, is negligible for both combinations.
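The radiation efficiency compared above follows the standard definition σ = W / (ρ₀c₀S⟨v²⟩), with ⟨v²⟩ the area-averaged mean-square normal velocity. A minimal sketch, using the air properties of Section 2.2; the function name and element-wise inputs are our own conventions:

```python
import numpy as np

def radiation_efficiency(W, areas, v_rms):
    """Radiation efficiency sigma = W / (rho0 * c0 * S * <v^2>), where S is
    the total radiating area and <v^2> the area-averaged mean-square normal
    velocity. W: radiated sound power [W]; areas: (N,) element areas [m^2];
    v_rms: (N,) rms normal velocities [m/s]."""
    rho0, c0 = 1.21, 343.0          # air properties used in this study
    S = areas.sum()
    v2_avg = np.sum(areas * v_rms**2) / S
    return W / (rho0 * c0 * S * v2_avg)
```

A surface radiating like an ideal piston at high frequency gives σ → 1, consistent with the curves tending to unity near 6800–7200 Hz.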

**Figure 22.** Comparison of radiation efficiency (σmn) analytically and numerically for annular plate attached with 4 rectangular stiffener patches and having parabolic decreasing thickness variations (Case I) for taper ratio Tx = 0.75.

**Figure 23.** Comparison of radiation efficiency (σmn) analytically and numerically for annular plate attached with 4 concentric stiffener patches and having parabolic decreasing thickness variations (Case I) for taper ratio Tx = 0.75.

**Figure 24.** Numerical comparison of radiation efficiency (σmn) for annular plate having parabolically decreasing thickness variations (Case I) with different attachment of (a) 1 rectangular stiffener patch (b) 2 rectangular stiffener patches (c) 4 rectangular stiffener patches for taper ratio Tx =0.75.

**Figure 25.** Numerical comparison of radiation efficiency (σmn) for annular plate having parabolically decreasing thickness variations (Case I) with different attachment of (a) 1 concentric patch (b) 2 concentric patches and (c) 4 concentric patches for taper ratio Tx = 0.75.

**Figure 26.** Numerical comparison of radiation efficiency (σmn) for annular plate attached with 4 rectangular and 4 concentric stiffener patches and having parabolic decreasing thickness variations (Case I) for taper ratio Tx = 0.75.

**Figure 27.** Numerical comparison of sound power level (dB) for annular plate attached with 4 rectangular and 4 concentric stiffener patches having parabolic decreasing thickness variations (Case I) for taper ratio Tx = 0.75.

#### *3.4. Peak Sound Power Level Variation with Different Taper Ratios for All Combinations of Rectangular and Concentric Stiffener Patches Attached to a Plate*

The peak sound power level is estimated for annular plates with different attachments of rectangular and concentric stiffener patches, for the different parabolic thickness variations and taper ratios, as reported in Figures 28 and 29, respectively. The peak sound power level is reported at the first peak, which corresponds to the (0, 0) mode of the plate. From Figure 28 it is seen that for a Case I plate with the 1 rectangular stiffener patch combination, the peak sound power level increases with increasing taper ratio, whereas for the 2 and 4 rectangular stiffener patch combinations, the peak sound power level varies with taper ratio. For a Case I plate, the maximum peak sound power level is obtained at taper ratio Tx = 0.75 for the plate with the 1 rectangular stiffener patch combination. Furthermore, the peak is a minimum at Tx = 0.25 and a maximum at Tx = 0.50 for the plate with 4 rectangular stiffener patches, whereas for the plate with 2 rectangular stiffener patches, the peak is a minimum at Tx = 0.25 and a maximum at Tx = 0.75. Similarly, for a Case II plate, the maximum peak sound power level is obtained at Tx = 0.75 and the minimum at Tx = 0.50 for the plate with 2 rectangular stiffener patches. Also, for a Case II plate, it is interesting to note that the peak sound power level increases with taper ratio for the plate with 1 rectangular stiffener patch and is maximum at Tx = 0.75. However, for the plate with 4 rectangular stiffener patches, the peak is a minimum at Tx = 0.75 and a maximum at Tx = 0.50. Furthermore, for a Case III plate, the peak sound power level decreases with increasing taper ratio for all combinations of rectangular stiffener patches.
For a Case III plate, the maximum peak sound power level is obtained at taper ratio Tx = 0.25 for the plate with 2 rectangular stiffener patches, and the minimum at Tx = 0.75 for the plate with 4 rectangular stiffener patches. From Figure 29, it is observed that for a Case I plate the peak is a maximum at Tx = 0.75 for the plate with 1 concentric stiffener patch and a minimum at Tx = 0.50 for 2 concentric stiffener patches. For a Case II plate, the highest peak is seen at Tx = 0.75 for 2 concentric stiffener patches. Similarly, for a Case III plate, the lowest peak is observed at Tx = 0.75 for the plate with 4 concentric stiffener patches. For the unloaded tapered plate, the highest peak sound power level for a Case II plate is seen at taper ratio Tx = 0.75. Furthermore, the peak sound power level increases for a Case I plate at Tx = 0.75 and decreases for a Case III plate at Tx = 0.75.
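The "first peak corresponding to the (0, 0) mode" can be extracted from a computed SPL-versus-frequency curve with a simple local-maximum scan; this helper and its data are illustrative, not the paper's procedure.

```python
def first_peak(freqs, spl, f_min=0.0):
    """Return (frequency, level) of the first local maximum of a sound power
    level curve at or above f_min, i.e., the peak associated with the (0, 0)
    mode in Section 3.4. freqs and spl are equal-length sequences; returns
    None if no interior peak exists."""
    for i in range(1, len(spl) - 1):
        if freqs[i] >= f_min and spl[i - 1] <= spl[i] > spl[i + 1]:
            return freqs[i], spl[i]
    return None
```

Applying this to each taper ratio's SPL curve yields the peak levels compared in Figures 28 and 29.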

It is thus evident that different combinations of rectangular and concentric stiffener patches have a significant impact on the peak sound power level corresponding to the (0, 0) mode. Furthermore, different combinations of rectangular and concentric stiffener patches with different taper ratios provide design options for the peak sound power level. For example, for peak sound power reduction in a Case III plate, taper ratio Tx = 0.75 with the 4 rectangular stiffener patches and 4 concentric stiffener patches combination, as well as taper ratio Tx = 0.50 with the 2 rectangular stiffener patches combination, may be options. Similarly, for sound power actuation, taper ratio Tx = 0.75 with the 1 rectangular stiffener patch and 1 concentric stiffener patch combination for a Case I plate, and with the 2 rectangular stiffener patches and 4 concentric stiffener patches combination for a Case II plate, may be alternative solutions. For unloaded tapered plates, a Case III plate with taper ratio Tx = 0.75 may be considered a poor sound emitter, while Case I and Case II plates with Tx = 0.75 may be considered the strongest sound emitters.

*Appl. Sci.* **2018**, *8*, 2542

**Figure 28.** Comparison of peak sound power level (dB) for (**a**) Case I, (**b**) Case II, and (**c**) Case III plates having different parabolic thickness variations with different attachments of rectangular stiffener patches.

**Figure 29.** Comparison of peak sound power level (dB) for (**a**) Case I, (**b**) Case II, and (**c**) Case III plates having different parabolic thickness variations with different combinations of concentric stiffener patches.

#### **4. Conclusions**

A comparison is made of the vibroacoustic behavior of tapered annular circular plates having different parabolically varying thicknesses with different combinations of rectangular and concentric stiffener patches, keeping the mass of the plate plus patches constant, for a clamped-free boundary condition. It is observed that, due to the lower stiffness associated with the Case II plate, its non-dimensional frequency parameter reduces more than that of a Case I plate. For a Case III plate, the non-dimensional frequency parameter is the same as that of the unloaded plate. Regarding the acoustic behavior, it is observed that the different combinations of rectangular and concentric stiffener patches and the mode variations have a more significant impact on the sound power level than the stiffness variation due to the taper ratio. For a sound power level of up to 50 dB, a plate with parabolically varying thickness offers all taper ratios, Tx = 0.00, 0.25, 0.50 and 0.75, over a broad range of frequencies as design options in different frequency bands for the different combinations of rectangular and concentric stiffener patches. It is further shown that a plate with the 4 rectangular and 4 concentric stiffener patches combination shows the minimum sound power level for all cases of thickness variation, whereas the highest power is obtained for a Case II plate with the 2 rectangular and concentric stiffener patches combination. It is interesting to note that a Case III plate has the lowest sound power level among all variations and is thus the weakest sound radiator. Further, different combinations of rectangular and concentric stiffener patches with different taper ratios provide design options for the peak sound power level. For example, for peak sound power reduction, taper ratio Tx = 0.75 with the 4 rectangular stiffener patches and 4 concentric stiffener patches combination, and taper ratio Tx = 0.50 with the 2 rectangular stiffener patches combination, for a Case III plate may be the options. Similarly, for sound power actuation, taper ratio Tx = 0.75 with the 1 rectangular stiffener patch and 1 concentric stiffener patch combination for a Case I plate, and with the 2 rectangular stiffener patches and 4 concentric stiffener patches combination for a Case II plate, may be alternative solutions. Furthermore, for unloaded tapered plates, a Case III plate with taper ratio Tx = 0.75 may be considered a poor sound emitter, while Case I and Case II plates with Tx = 0.75 may be considered the strongest sound emitters.

**Author Contributions:** V.R. and M.R. supervised the research. A.C. and M.S.A. developed the research concept, developed the theory, and performed the analysis. M.S.A. collected the data. A.C. wrote the paper. V.R. and M.R. revised the manuscript and made important technical and editorial suggestions. A.C. provided the APC funding.

**Funding:** The work was carried out at the Indian Institute of Technology (ISM) Dhanbad, India. The APC was funded by the corresponding author.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Time-Domain Hydro-Elastic Analysis of a SFT (Submerged Floating Tunnel) with Mooring Lines under Extreme Wave and Seismic Excitations**

#### **Chungkuk Jin \* and Moo-Hyun Kim**

Department of Ocean Engineering, Texas A&M University, Haynes Engineering Building, 727 Ross Street, College Station, TX 77843, USA; m-kim3@tamu.edu

**\*** Correspondence: kenjin0519@gmail.com; Tel.: +1-979-204-3454

Received: 19 October 2018; Accepted: 22 November 2018; Published: 26 November 2018

**Abstract:** Global dynamic analysis of a 700-m-long SFT section considered in the South Sea of Korea is carried out for survival random wave and seismic excitations. To solve the tunnel-mooring coupled hydro-elastic responses, an in-house time-domain-simulation computer program is developed. The hydro-elastic equation of motion for the tunnel and mooring lines is based on a rod-theory-based finite element formulation with the Galerkin method and a fully coupled full matrix. A dummy-connection-mass method is devised to conveniently connect objects and mooring lines with linear and rotational springs. Hydrodynamic forces on the submerged floating tunnel (SFT) are evaluated by the modified Morison equation for a moving object, so that the hydrodynamic forces caused by wave or seismic excitations can be computed at the instantaneous positions at every time step. In the case of a seabed earthquake, both the dynamic effect transferred through the mooring lines and the seawater-fluctuation-induced seaquake effect are considered. For validation purposes, the hydro-elastic analysis results obtained by the developed numerical simulation code are compared with those by a commercial program, OrcaFlex, and excellent agreement is found between them. For the given design condition, extreme storm waves cause higher hydro-elastic responses and mooring tensions than the severe seismic case.

**Keywords:** submerged floating tunnel (SFT); mooring line; coupled dynamics; hydro-elastic responses; wet natural frequencies; mooring tension; seismic excitation; wave excitation; seaquake

#### **1. Introduction**

The submerged floating tunnel (SFT) is an innovative solution used to cross deep waterways [1,2]. The SFT consists mainly of a tunnel for vehicle transportation and mooring lines for station-keeping. The tunnel is usually positioned at a certain submergence depth, typically greater than 20 m, with positive net buoyancy that is balanced by mooring lines anchored in the seabed [3,4].

Considering that wave/current/wind effects are greatly reduced, the cost is almost constant along the length [5], and ship passage is not obstructed by the structure, the SFT has been regarded as a competitive alternative to floating bridges and immersed tunnels. In this regard, since Norway's first patent in 1923 [6], many proposals and case studies have been published worldwide, which include Høgsfjord/Bjørnafjord in Norway [7–9], the Strait of Messina in Italy [10], Funka Bay in Japan [11,12], Qiandao Lake in China [13,14], and the Mokpo-Jeju SFT in Korea [15]. Even though no SFT has actually been installed anywhere in the world despite extensive research [16], the first construction of an SFT is being considered by the Norwegian Public Road Administration (NPRA) with global interest [17].

To provide sufficient confidence in the concept, its feasibility under diverse catastrophic environmental conditions, such as extreme waves and earthquakes, must be studied extensively in advance. Along this line, numerous studies have been carried out to verify the structural safety of the SFT under wave and seismic excitations. Regarding wave-excitation effects, Kunisu et al. [18] evaluated the effect of mooring-line configurations on SFT dynamic responses, including possible snap loading. Lu et al. [11] and Hong et al. [19] focused on slack-mooring phenomena at various buoyancy-weight ratios (BWRs) of the SFT and inclination angles of the mooring lines. Long et al. [3] conducted parametric studies to investigate the effects of the BWR and mooring-line stiffness. Dynamic motions at varying BWRs and the corresponding comfort index were investigated by Long et al. [20]. Seo et al. [21] compared experimental results with a simplified numerical approach for a segment of the SFT. Chen et al. [22] evaluated the influence of VIV (vortex-induced vibration) of the mooring lines on SFT dynamic responses using a simplified numerical model. With regard to seismic-excitation effects, Di Pilato et al. [4] carried out a coupled dynamic analysis to investigate the effect of wave and seismic excitations. Martinelli et al. [13] suggested detailed procedures to generate artificial seismic excitations and performed the corresponding structural analysis. Dynamic responses at various shore connections under transverse earthquakes were investigated by Xiao and Huang [23]. Martinelli et al. [24] and Wu et al. [25] focused on the hydrodynamic fluid-structure interaction induced by vertical fluid fluctuations, known as the seaquake. Mirzapour et al. [26] derived simplified analytical solutions for 2D and 3D cases and computed SFT dynamic responses under diverse stiffness conditions. Muhammad et al. [6] compared the dynamic effects induced by wave and seismic excitations.

During the past decade, various SFT-related studies have been carried out in the second author's research lab. Cifuentes et al. [27] compared the dynamics of a moored SFT segment in regular waves for various BWRs and mooring types between experimental results and numerical simulations. For the numerical simulations, both the commercial program OrcaFlex and the in-house program CHARM3D (Coupled Hull And Riser Mooring 3D) were used for cross-checking. Lee et al. [16] further investigated the dynamics of the short tunnel segment under irregular waves and random seabed earthquakes. Then, initial studies of the hydro-elastic responses of a long SFT with many mooring lines under random waves and seabed earthquakes were conducted by Jin and Kim [28] and Jin et al. [29] by using the commercial software OrcaFlex. However, when using OrcaFlex for seismic excitations, indirect modeling with many seabed dummy masses has to be introduced instead of directly inputting the dynamic boundary conditions at the anchor points.

In this research, to add the capability of hydro-elastic analysis of a long SFT with many mooring lines to the in-house coupled dynamic-analysis program, a new approach called the 'dummy-connection-mass method' is developed. The equation of motion for the line element is derived from rod theory, and the finite element modelling is implemented by using the Galerkin method. Linear and rotational springs are employed to conveniently connect several objects with given connection conditions. The Adams–Moulton implicit integration method, combined with the Adams–Bashforth explicit scheme, is used for time-domain integration so that stable and time-efficient numerical integration can be performed without iteration. The newly developed program is applied to calculate the hydro-elastic responses of a 700-m-long SFT (with both ends fixed) with many mooring lines under extreme random waves or severe random earthquakes. The results from the newly developed program are cross-checked against those from the OrcaFlex program. In the case of a seabed earthquake, the seabed motions are transferred to the SFT through the mooring lines and through seawater fluctuations called the seaquake, which is extensively discussed in Section 4 based on the produced numerical results. In the present study, the effect of seismic-induced acoustic pressure is not considered since the resulting frequency range is much higher [30], and it is thus of little importance for the mooring design.

#### **2. Configuration of the System**

Figure 1 shows 2D and 3D views of the entire structure, and Table 1 summarizes the major design parameters of the tunnel and mooring lines. The tunnel, which has a diameter of 23 m and a length of 700 m, is made of high-density concrete. Since the structure in this study is a section of a 30-km-long SFT, a fixed-fixed boundary condition is applied at both ends, assuming that strong fixtures (or towers) will be built at 700-m intervals, as shown in Figure 1. Considering that the water depth of the planned site is 100 m, the submergence depth, i.e., the vertical distance between the free surface and the tunnel centerline, is set to 61.5 m. The BWR is fixed at 1.3, and the tunnel thickness is 2.3 m. The modeled tunnel thickness is greater than the real value so as to represent the equivalent tunnel bending stiffness including the inner compartment structures. The axial and bending stiffnesses are calculated from the data given in Table 1.

Chain mooring lines with a nominal diameter of 180 mm are used. High static and dynamic mooring tensions are expected for the given BWR and wave condition [28]. In addition, the maximum mooring tension should be smaller than the MBL (minimum breaking load) divided by the safety factor (SF); thus, chain may be the best choice considering the high MBL of 30,689 kN for Grade R5. As shown in Figure 1, four 60-degree-inclined mooring lines are installed at 25-m intervals toward the center locations. The lengths of the mooring lines are 51.1 m for lines #1 and #2 and 37.8 m for lines #3 and #4. The wet natural frequencies of the tunnel hydro-elastic responses coupled with the mooring lines are calculated and presented in Table 2.
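As a rough illustration of how the BWR of Table 1 translates into static mooring pretension, the following sketch balances the net buoyancy of one 25-m mooring interval against the vertical components of the four line tensions. The seawater density, the angle convention (60° measured from horizontal), and equal load sharing among the four lines are assumptions made for illustration, not data from the paper.

```python
import math

# Static balance for one 25-m mooring interval of the SFT (illustrative
# sketch; density, angle convention, and equal load sharing among the
# four lines are assumptions, not data from the paper).
RHO_SW = 1025.0                       # seawater density [kg/m^3] (assumed)
G = 9.81                              # gravitational acceleration [m/s^2]
D = 23.0                              # tunnel outer diameter [m]
BWR = 1.3                             # buoyancy-weight ratio
INTERVAL = 25.0                       # mooring-station spacing [m]
INCLINATION = math.radians(60.0)      # line inclination from horizontal (assumed)
N_LINES = 4                           # mooring lines per station

buoyancy_per_m = RHO_SW * G * math.pi * D**2 / 4.0       # [N/m]
weight_per_m = buoyancy_per_m / BWR                      # [N/m], from BWR = B/W
net_uplift = (buoyancy_per_m - weight_per_m) * INTERVAL  # net upward force [N]

# Vertical equilibrium: the vertical components of the four line
# pretensions balance the net uplift of the interval.
pretension = net_uplift / (N_LINES * math.sin(INCLINATION))  # [N] per line

print(f"net uplift per station : {net_uplift / 1e3:.0f} kN")
print(f"static pretension/line : {pretension / 1e3:.0f} kN")
```

With these assumptions the per-line pretension comes out in the thousands of kN, consistent in order of magnitude with the static tensions plotted against the allowable tension (MBL/SF) in Figure 5.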

**Figure 1.** 2D and 3D views of the entire structure.




**Table 2.** Wet natural frequencies of the tunnel hydro-elastic responses coupled with mooring lines.

#### **3. Numerical Model**

Tunnel-mooring coupled dynamic analysis was conducted by using the in-house program CHARM3D. This in-house code has been developed by the second author's research lab over the past two decades for coupled dynamic simulations of complex offshore structures with mooring lines and risers [34,35]. In addition, its capability has been expanded for various applications, including multiple bodies connected by lines, wind turbines [36], dynamic positioning [37], and ice-structure interactions [38]. The program is further extended in this paper to study SFT hydro-elastic dynamics under seismic excitations. In addition, some of the computed results are compared with those from the widely used commercial program OrcaFlex for cross-checking. In the following equations, bold variables represent vectors or matrices.

#### *3.1. Governing Equations of Dynamic Simulation*

The entire structure is modelled by rod elements, and the rod theory suggested by Garrett [39] is used. The behavior of a rod element is determined by the position of the rod centerline. The equation of motion is solved in a generalized coordinate system whose tangential direction follows the line profile; therefore, coordinate transformations, which would increase computation time, are not required. In addition, geometric nonlinearity is considered without specific assumptions about the shape or orientation of the lines [34]. The equation of motion and the extensibility condition are presented in Equations (1) and (2).

$$-\left(EI\mathbf{r}''\right)'' + \left(\lambda\mathbf{r}'\right)' + \mathbf{q} = m\ddot{\mathbf{r}}\tag{1}$$

$$\frac{1}{2}(\mathbf{r'} \cdot \mathbf{r'} - 1) = \frac{T}{A\_I E} \approx \frac{\lambda}{A\_I E} \tag{2}$$

where **r**(*s*, *t*) is the position vector defining the space curve as a function of arc length *s* and time *t*, *E* is Young's modulus, *I* is the second moment of the sectional area, *λ* is the Lagrange multiplier, **q** and *m* are the distributed load and mass per unit length, *T* is the tension, and *AI* is the cross-sectional area filled with the material. The dot and apostrophe denote time and spatial derivatives, respectively. The distributed load includes the weight of the rod and the hydrostatic and hydrodynamic loads induced by the surrounding fluid. The hydrostatic load is subdivided into buoyancy and the force induced by hydrostatic pressure. The hydrodynamic force is estimated by the Morison equation for moving objects, which consists of a linear wave inertia force and a nonlinear wave drag force. Thus, the Morison equation, given by Equation (3), makes it possible to compute the wave force per unit length at the instantaneous rod-element positions at each time step.

$$\mathbf{F}_{d} = -C_{A}\rho A_{E}\ddot{\mathbf{r}}^{n} + C_{M}\rho A_{E}\dot{\mathbf{V}}^{n} + \frac{1}{2}C_{D}\rho D\left|\mathbf{V}^{n} - \dot{\mathbf{r}}^{n}\right|\left(\mathbf{V}^{n} - \dot{\mathbf{r}}^{n}\right)\tag{3}$$

where $C_M$, $C_A$, and $C_D$ are the inertia, added-mass, and drag coefficients, $\rho$ is the density of water, $A_E$ is the cross-sectional area of the element, $D$ is the outer diameter, and $\mathbf{V}^n$ and $\dot{\mathbf{V}}^n$ represent the velocity and acceleration of a fluid particle normal to the rod centerline. The inertia coefficient of the tunnel and mooring lines is 2.0, considering that the added mass is the same as the displaced mass [40]. The drag coefficient of the tunnel is a function of the Reynolds number, the KC (Keulegan-Carpenter) number, and the relative surface roughness; the representative value of 0.55 is used here based on experimental results (e.g., [31]). The drag coefficient of the mooring lines is 2.4 for stud-less chain [32]. It was shown in Cifuentes et al. [27] that using the Morison equation for SFT dynamics is accurate enough compared with a 3D diffraction/radiation panel program. Here, the Morison equation is further modified to include the hydrodynamic force induced by vertical pressure variations during earthquake excitations, i.e., the seaquake effect, as supported by Islam and Ahmad [41], Martinelli et al. [24], Mousavi et al. [42], and Wu et al. [25]. In the equation, the inertia and drag force terms are modified by introducing the seismic velocity $\mathbf{v}_g^n$ and acceleration $\dot{\mathbf{v}}_g^n$, as shown in Equation (4). Only the vertical component of the seismic velocity and acceleration is considered for the seaquake simulations.

$$\mathbf{F}_{d} = -C_{A}\rho A_{E}\ddot{\mathbf{r}}^{n} + C_{M}\rho A_{E}\left(\dot{\mathbf{V}}^{n} + \dot{\mathbf{v}}_{g}^{n}\right) + \frac{1}{2}C_{D}\rho D\left|\mathbf{V}^{n} + \mathbf{v}_{g}^{n} - \dot{\mathbf{r}}^{n}\right|\left(\mathbf{V}^{n} + \mathbf{v}_{g}^{n} - \dot{\mathbf{r}}^{n}\right)\tag{4}$$
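For illustration, the modified Morison force of Equation (4) can be evaluated per unit length as in the following sketch. The coefficient values follow the text ($C_M$ = 2.0 and $C_D$ = 0.55 for the tunnel, with $C_A = C_M - 1$), while the density and diameter defaults are illustrative choices.

```python
import numpy as np

def morison_force(V_n, dV_n, r_dot_n, r_ddot_n, v_g_n=0.0, dv_g_n=0.0,
                  C_A=1.0, C_M=2.0, C_D=0.55, rho=1025.0, D=23.0):
    """Morison force per unit length on a moving cylinder, Equation (4).

    All kinematic arguments are components normal to the rod centerline;
    setting v_g_n = dv_g_n = 0 recovers the wave-only form of Equation (3).
    C_M = 2.0 and C_D = 0.55 follow the paper's tunnel values; rho and D
    are illustrative defaults.
    """
    A_E = np.pi * D**2 / 4.0                       # displaced area per unit length
    rel_vel = V_n + v_g_n - r_dot_n                # relative normal velocity
    inertia = C_M * rho * A_E * (dV_n + dv_g_n)    # fluid inertia (incl. seaquake)
    added_mass = -C_A * rho * A_E * r_ddot_n       # added-mass reaction of the rod
    drag = 0.5 * C_D * rho * D * np.abs(rel_vel) * rel_vel  # quadratic drag
    return added_mass + inertia + drag             # force per unit length [N/m]

# Example: 1 m/s steady relative flow past a stationary tunnel section
f_drag_only = morison_force(V_n=1.0, dV_n=0.0, r_dot_n=0.0, r_ddot_n=0.0)
print(f"drag-only force: {f_drag_only / 1e3:.2f} kN/m")
```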

Therefore, the final form of the equation of motion is given by Equations (5)–(9):

$$m\ddot{\mathbf{r}} + C_{A}\rho A_{E}\ddot{\mathbf{r}}^{n} + \left(EI\mathbf{r}''\right)'' - \left(\widetilde{\lambda}\mathbf{r}'\right)' = \widetilde{\mathbf{w}} + \widetilde{\mathbf{F}}_{d}\tag{5}$$

$$\widetilde{\mathbf{F}}_{d} = C_{M}\rho A_{E}\left(\dot{\mathbf{V}}^{n} + \dot{\mathbf{v}}_{g}^{n}\right) + \frac{1}{2}C_{D}\rho D\left|\mathbf{V}^{n} + \mathbf{v}_{g}^{n} - \dot{\mathbf{r}}^{n}\right|\left(\mathbf{V}^{n} + \mathbf{v}_{g}^{n} - \dot{\mathbf{r}}^{n}\right)\tag{6}$$

$$\widetilde{\lambda} = \widetilde{T} - EI\kappa^2 \tag{7}$$

$$\widetilde{\mathbf{w}} = \mathbf{w} + \mathbf{B} \tag{8}$$

$$\widetilde{T} = T + P \tag{9}$$

where $\kappa$ is the local curvature, $\widetilde{\mathbf{w}}$ is the wet weight of the rod per unit length, which comprises the weight **w** and the buoyancy **B**, $\widetilde{T}$ is the effective tension in the rod, and *P* is the hydrostatic pressure (a scalar) at the position **r** on the rod. Therefore, Equation (5), combined with the stretching condition given in Equation (2), constitutes the governing equations for the dynamic simulations.

The governing equations are further formulated by Galerkin finite element method [39,43]. The position vector and Lagrange multiplier for a single element of the length *L* are expressed as follows:

$$\mathbf{r}(s,t) = \sum_{m} A_{m}(s)\,\mathbf{U}_{m}(t)\tag{10}$$

$$\lambda(s,t) = \sum_{n} P_{n}(s)\,\lambda_{n}(t)\tag{11}$$

where *Am* and *Pn* are shape functions defined on the interval 0 ≤ *s* ≤ *L*. The weak form of the governing equation is generated by using the Galerkin method and integration by parts:

$$\int_{0}^{L}\left[A_{m}\left(m\ddot{\mathbf{r}} + C_{A}\rho A_{E}\ddot{\mathbf{r}}^{n}\right) + EI\,A_{m}''\,\mathbf{r}'' + \widetilde{\lambda}\,A_{m}'\,\mathbf{r}' - A_{m}\left(\widetilde{\mathbf{w}} + \widetilde{\mathbf{F}}_{d}\right)\right]ds = EI\,\mathbf{r}''\,A_{m}'\Big|_{0}^{L} + \left[\widetilde{\lambda}\,\mathbf{r}' - \left(EI\,\mathbf{r}''\right)'\right]A_{m}\Big|_{0}^{L}\tag{12}$$

$$\int_{0}^{L} P_{n}\left\{\frac{1}{2}\left(\mathbf{r}'\cdot\mathbf{r}' - 1\right) - \frac{\lambda}{A_{I}E}\right\}ds = 0\tag{13}$$

where first and second terms of the right-hand side in Equation (12) are related to moment and force at the boundary. Cubic and quadratic shape functions, which are continuous on the element, are defined for the position vector and Lagrange multiplier, respectively:

$$\begin{array}{ll} A_1 = 1 - 3\xi^2 + 2\xi^3, & A_2 = L(\xi - 2\xi^2 + \xi^3), \\ A_3 = 3\xi^2 - 2\xi^3, & A_4 = L(-\xi^2 + \xi^3), \\ P_1 = 1 - 3\xi + 2\xi^2, & P_2 = 4\xi(1 - \xi), \\ P_3 = \xi(2\xi - 1) \end{array} \tag{14}$$

where *ξ* = *s*/*L*. The position vector, tangent of the position vector, and Lagrange multiplier are chosen to be continuous at the node between the neighboring elements. Therefore, the parameters **U***<sup>m</sup>* and *λ<sup>n</sup>* can be written as:

$$\begin{array}{ll} \mathbf{U}\_{1} = \mathbf{r}(0,t), & \mathbf{U}\_{2} = \mathbf{r}'(0,t),\\ \mathbf{U}\_{3} = \mathbf{r}(L,t), & \mathbf{U}\_{4} = \mathbf{r}'(L,t),\\ \lambda\_{1} = \lambda(0,t), & \lambda\_{2} = \lambda(L/2,t), \quad \lambda\_{3} = \lambda(L,t) \end{array} \tag{15}$$
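A short numerical check of the shape functions in Equation (14) and the nodal conditions in Equation (15) can be sketched as follows; the standard quadratic Lagrange form P2 = 4ξ(1 − ξ) is assumed for the middle multiplier node, and the element length is arbitrary.

```python
import numpy as np

L = 2.0  # element length [m]; arbitrary for this check

def A(xi):
    """Cubic (Hermite-type) shape functions of Equation (14)."""
    return np.array([1 - 3*xi**2 + 2*xi**3,
                     L * (xi - 2*xi**2 + xi**3),
                     3*xi**2 - 2*xi**3,
                     L * (-xi**2 + xi**3)])

def P(xi):
    """Quadratic shape functions of Equation (14); P2 = 4*xi*(1 - xi) assumed."""
    return np.array([1 - 3*xi + 2*xi**2,
                     4 * xi * (1 - xi),
                     xi * (2*xi - 1)])

xi = np.linspace(0.0, 1.0, 101)
# Partition of unity: rigid translations are represented exactly.
assert np.allclose(A(xi)[0] + A(xi)[2], 1.0)
assert np.allclose(P(xi).sum(axis=0), 1.0)
# Nodal conditions of Equation (15): A picks out r and r' at s = 0, L,
# and P picks out lambda at s = 0, L/2, L.
assert np.allclose(A(0.0), [1, 0, 0, 0]) and np.allclose(A(1.0), [0, 0, 1, 0])
assert np.allclose(P(0.0), [1, 0, 0]) and np.allclose(P(0.5), [0, 1, 0])
print("shape-function checks passed")
```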

The position and its tangent vectors are obtained at both ends of the element, while the Lagrange multiplier is computed at both ends and at the middle point of the element. The final finite element formulation of the governing equation for the 3-dimensional problem is presented in Equation (16).

$$\left(M_{ijlk} + M_{ijlk}^{a}\right)\ddot{\mathbf{U}}_{jk} + \left(K_{ijlk}^{1} + \lambda_{n}K_{nijlk}^{2}\right)\mathbf{U}_{jk} = \mathbf{F}_{il}\tag{16}$$

For **U***jk*, subscript *j* denotes the dimension, which is 1–3 for the 3-dimensional problem, and subscript *k* runs over 1–4 as given in Equation (15). In Equations (17)–(21), the general mass, added mass, general stiffness (from the bending stiffness and rod tension), and external force matrices are defined with the Kronecker delta *δij*:

$$M\_{ijlk} = \int\_0^L m A\_l A\_k \delta\_{ij} ds \tag{17}$$

$$M_{ijlk}^{a} = C_{A}\rho A_{E}\left[\int_{0}^{L} A_{l}A_{k}\,\delta_{ij}\,ds - \left(\int_{0}^{L} A_{l}A_{k}A_{s}'A_{t}'\,ds\right)\mathbf{U}_{it}\mathbf{U}_{js}\right]\tag{18}$$

$$K_{ijlk}^{1} = \int_{0}^{L} EI\,A_{l}''A_{k}''\,\delta_{ij}\,ds\tag{19}$$

$$K_{nijlk}^{2} = \int_{0}^{L} P_{n}A_{l}'A_{k}'\,\delta_{ij}\,ds\tag{20}$$

$$\mathbf{F}_{il} = \int_{0}^{L}\left(\widetilde{w}_{i} + \widetilde{F}_{di}\right)A_{l}\,ds\tag{21}$$

In addition, the stretching condition can be formulated as given in Equation (22):

$$G_{m} = A_{mil}\mathbf{U}_{kl}\mathbf{U}_{ki} - B_{m} - C_{mt}\lambda_{t}\tag{22}$$

where

$$A_{mil} = \frac{1}{2}\int_{0}^{L} P_{m}A_{i}'A_{l}'\,ds\tag{23}$$

$$B\_m = \frac{1}{2} \int\_0^L P\_m ds\tag{24}$$

$$C\_{mt} = \frac{1}{A\_I E} \int\_0^L P\_m P\_t ds\tag{25}$$
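As a sanity check on the element matrices above, the single-component mass matrix of Equation (17) can be evaluated by Gauss quadrature and compared with the classical consistent mass matrix of a cubic (Hermite) element; the values of m and L below are illustrative.

```python
import numpy as np

m, L = 100.0, 2.0  # mass per unit length [kg/m] and element length [m]; illustrative

def A(xi):
    """Cubic shape functions of Equation (14)."""
    return np.array([1 - 3*xi**2 + 2*xi**3,
                     L * (xi - 2*xi**2 + xi**3),
                     3*xi**2 - 2*xi**3,
                     L * (-xi**2 + xi**3)])

# Equation (17) for a single displacement component: M_lk = int_0^L m A_l A_k ds,
# evaluated with 4-point Gauss-Legendre quadrature (exact up to degree 7,
# while A_l * A_k is only degree 6 in xi).
nodes, weights = np.polynomial.legendre.leggauss(4)
xi = 0.5 * (nodes + 1.0)              # map [-1, 1] -> [0, 1]
w = 0.5 * L * weights                 # ds = L dxi
M = sum(wi * m * np.outer(A(x), A(x)) for wi, x in zip(w, xi))

# Classical consistent mass matrix of a cubic beam element for comparison.
M_ref = (m * L / 420.0) * np.array([[156.0, 22*L, 54.0, -13*L],
                                    [22*L, 4*L**2, 13*L, -3*L**2],
                                    [54.0, 13*L, 156.0, -22*L],
                                    [-13*L, -3*L**2, -22*L, 4*L**2]])
assert np.allclose(M, M_ref)
print("Equation (17) quadrature matches the classical consistent mass matrix")
```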

A dummy 6-DOF rigid body with negligible properties is introduced to conveniently connect the tunnel and mooring lines. The dummy mass refers to the negligible mass (1 kg at prototype scale) of the dummy rigid body, which is used only for connection purposes. Force and moment are transferred from the tunnel and mooring lines through the rigid body by means of linear and rotational springs of very large stiffness. The force and moment transmitted from a mooring line to the rigid body are computed as follows [43]:

$$\widetilde{\mathbf{F}}_{P} = \widetilde{\mathbf{K}}\left(\widetilde{\mathbf{T}}_{P}\widetilde{\mathbf{u}}_{P} - \widetilde{\mathbf{u}}_{I}\right) + \widetilde{\mathbf{C}}\left(\widetilde{\mathbf{T}}_{P}\dot{\widetilde{\mathbf{u}}}_{P} - \dot{\widetilde{\mathbf{u}}}_{I}\right)\tag{26}$$

where $\widetilde{\mathbf{K}}$ and $\widetilde{\mathbf{C}}$ represent the coupling stiffness and damping matrices, $\widetilde{\mathbf{T}}_P$ denotes a transformation matrix between the rigid-body origin and the connection location, and $\widetilde{\mathbf{u}}_P$ and $\widetilde{\mathbf{u}}_I$ are the displacements of the rigid body and the connection location. Very large stiffness values are used in the coupling stiffness matrix to tightly connect the lines, and the damping matrix is not utilized in the simulations. In this way, the entire stiffness matrix that couples the tunnel elements with the mooring lines is created, as shown in Figure 2.

**Figure 2.** Stiffness matrix for the simulated SFT (lines #1–#16 are for the tunnel and lines #17–#N are for the mooring lines; n(1) denotes the number of sub-elements of line #1, and k is the number of the 6-DOF rigid body).

Newton's iteration method is used in the static analysis of the SFT. The Adams–Moulton implicit integration method, which is second-order accurate, is used for time-domain integration. Since the instantaneous velocity and acceleration are required to calculate the hydrodynamic force from the Morison equation, the Adams–Bashforth explicit scheme is combined with the Adams–Moulton implicit scheme to avoid iteration.
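The predictor-corrector idea described above can be sketched on a scalar test equation as follows; this is an illustrative reduction, not the paper's full rod-theory implementation.

```python
import numpy as np

def ab2_am2_step(f, t, y, f_prev, dt):
    """One predictor-corrector step: 2nd-order Adams-Bashforth predicts,
    2nd-order Adams-Moulton (trapezoidal rule) corrects with the predicted
    state, so no implicit iteration is needed. f_prev is f at the previous
    time step. Scalar sketch of the scheme described in the text."""
    f_n = f(t, y)
    y_pred = y + dt * (1.5 * f_n - 0.5 * f_prev)       # AB2 explicit predictor
    y_next = y + 0.5 * dt * (f_n + f(t + dt, y_pred))  # AM2 corrector
    return y_next, f_n

# Test problem: y' = -y, y(0) = 1, exact solution y(t) = exp(-t).
f = lambda t, y: -y
dt, t, y = 0.01, 0.0, 1.0
f_prev = -np.exp(dt)       # exact slope at t = -dt for a clean start-up
for _ in range(100):       # integrate to t = 1
    y, f_prev = ab2_am2_step(f, t, y, f_prev, dt)
    t += dt
print(f"error at t = 1: {abs(y - np.exp(-1.0)):.2e}")  # second-order accurate
```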

#### *3.2. Theory of OrcaFlex*

A similar approach is used to model the whole structure in OrcaFlex, a well-known commercial program. The tunnel and mooring lines are modelled by line elements, and the line-element theory is based on the lumped-mass method. A line element consists of a series of nodes and segments. Force properties, including weight, buoyancy, and drag, are lumped at the nodes. Stiffness components, i.e., the axial, bending, and torsional stiffnesses, are represented by massless springs [44]. The equation of motion is expressed in Equation (27).

$$\mathbf{M}(\mathbf{p}, \mathbf{a}) + \mathbf{C}(\mathbf{p}, \mathbf{v}) + \mathbf{K}(\mathbf{p}) = \mathbf{F}(\mathbf{p}, \mathbf{v}, t) \tag{27}$$

where **M**(**p**, **a**), **C**(**p**, **v**), and **K**(**p**) are the mass, damping, and stiffness matrices, and **F**(**p**, **v**, *t*) is the external force vector, which is the hydrodynamic force in this case. The symbols **p**, **v**, **a**, and *t* denote the position, velocity, and acceleration vectors, and time, respectively. The hydrodynamic force is also computed by the same Morison equation for a moving object, with consideration of the relative velocity and acceleration. The advantages of the developed program over OrcaFlex for the present application can be summarized as follows: (i) in OrcaFlex, the hydrodynamic force generated by the seaquake effect is not included; (ii) CHARM3D uses higher-order rod finite elements compared with the lumped-mass-based OrcaFlex; (iii) the seabed movements can be directly input in the developed program.

#### *3.3. Environmental Conditions*

Simultaneous random-wave and seismic excitations are considered for the global performance analysis. The same wave and seismic time histories are inputted into both programs for cross-checking. The JONSWAP wave spectrum is used to generate time histories of random waves. The significant wave height and peak period for the 100-year-storm condition are 11.7 m and 13.0 s, respectively. The enhancement parameter is 2.14, which is the average value in Korea [45]. Random waves are generated by superposing 100 component waves with randomly perturbed frequency intervals to avoid signal repetition. The lowest and highest cut-off frequencies of the input spectrum are 0.3 rad/s and 2.3 rad/s, respectively. The wave direction is perpendicular to the longitudinal direction of the tunnel. A 3-h simulation is carried out to analyze the statistics of the dynamic behaviors and mooring tensions under the storm condition. Figure 3 shows the theoretical JONSWAP wave spectrum and the spectrum reproduced from the time histories of wave elevation, together with the time histories of wave elevation produced by the JONSWAP wave spectrum.
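The wave-generation procedure described above (superposing 100 components with randomly perturbed frequency intervals over 0.3–2.3 rad/s) can be sketched as follows; the Goda-type JONSWAP approximation and the random seed are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def jonswap(w, Hs=11.7, Tp=13.0, gamma=2.14):
    """Goda-type JONSWAP spectrum S(w) [m^2 s], normalized so that its
    integral approximately equals Hs**2 / 16 (assumed approximation)."""
    wp = 2.0 * np.pi / Tp
    sigma = np.where(w <= wp, 0.07, 0.09)
    peak = np.exp(-((w - wp) ** 2) / (2.0 * sigma**2 * wp**2))
    S = (5.0 / 16.0) * Hs**2 * wp**4 * w**-5.0 * np.exp(-1.25 * (wp / w) ** 4)
    return S * gamma**peak * (1.0 - 0.287 * np.log(gamma))

rng = np.random.default_rng(0)
N = 100                                   # component waves, as in the text
w_edges = np.linspace(0.3, 2.3, N + 1)    # cut-off frequencies [rad/s]
dw = np.diff(w_edges)
w_i = w_edges[:-1] + rng.random(N) * dw   # randomly perturbed component frequencies
amp = np.sqrt(2.0 * jonswap(w_i) * dw)    # deterministic component amplitudes
phase = rng.uniform(0.0, 2.0 * np.pi, N)  # random phases

t = np.arange(0.0, 3 * 3600.0, 0.5)       # 3-h record sampled at 0.5 s
eta = (amp[:, None] * np.cos(w_i[:, None] * t[None, :] - phase[:, None])).sum(axis=0)
Hs_realized = 4.0 * eta.std()             # realized significant wave height
print(f"target Hs = 11.7 m, realized Hs = {Hs_realized:.1f} m")
```

The perturbed frequencies make the signal non-repeating over the 3-h record, and the realized significant wave height recovers the 11.7-m target to within a few percent.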

Regular (sinusoidal) and recorded irregular seismic-excitation data are also employed. The amplitude of the regular seismic motion in the vertical direction is 0.01 m at diverse frequencies from 0.781 rad/s to 7.805 rad/s. Figure 4 shows the time histories of seismic displacements and the corresponding spectra for the recorded irregular seismic excitations in three directions, which were obtained from the USGS [46]. The earthquake occurred 78 km WNW of Ferndale, California, USA in 2014, with a magnitude of 6.8 on the Richter scale. The seismic displacements in three directions are inputted at each anchor point of the mooring lines and at the two fixed ends of the tunnel at every time step. The hydrodynamic force from the seaquake effect is also computed for the tunnel and mooring lines.

**Figure 3.** Wave time histories produced by JONSWAP wave spectrum (**a**) and theoretical JONSWAP wave spectrum and reproduced spectrum from wave time histories using FFT (fast Fourier transform) for validation (**b**).

**Figure 4.** Time histories of real seismic excitations in the longitudinal = x (**a**), transverse = y (**b**), and vertical = z (**c**) directions and the corresponding spectra.

#### **4. Results and Discussions**

#### *4.1. Static Analysis*

The developed code is first cross-checked with OrcaFlex in the static condition before the dynamic simulations. Because the static displacements of the tunnel are only affected by the weight, buoyancy, and stiffness components of the tunnel and mooring lines, a direct comparison can be made after the initial modeling of the entire SFT system. Figure 5 shows the vertical displacements of the tunnel and the mooring tension in the static condition. The results produced by the developed program coincide well with OrcaFlex's results. The reference dashed line in the tension figure indicates the allowable tension (minimum breaking load divided by the safety factor).

**Figure 5.** Submerged floating tunnel (SFT) vertical displacement (**a**) and mooring tension (**b**) in the static condition.

#### *4.2. Dynamic Behaviors under Extreme Wave Excitations*

Dynamic simulations under the 100-year-storm condition (Hs = 11.7 m and Tp = 13.0 s) are performed for three hours. As mentioned before, the same wave time histories are inputted into both programs to directly compare the dynamic results, and both computer programs produce almost identical results. Figure 6 shows the envelopes of the maximum and minimum SFT displacements and mooring tension. The maximum horizontal and vertical responses and mooring tension occur at the middle location. The horizontal responses are larger than the vertical responses since the 1st natural frequency of horizontal motion is closer to the input wave spectrum than that of vertical motion. The mooring-tension results show that the shorter mooring lines (line #3) have higher mooring tension than the longer mooring lines (line #1). The maximum mooring tension at the middle section is smaller than the MBL (minimum breaking load) divided by the SF (safety factor), which is presented in Figure 6b as a pink line. Recall that the MBL is 30,689 kN for Grade R5, which is obtained from the DNV regulation [33]. The SF of 1.67 is used as recommended by API RP 2SK [47]. Even under the extreme 100-year-storm condition, the maximum mooring tension is still smaller than the allowable tension.

Figures 7–9 show the time histories and corresponding spectra of the horizontal/vertical responses of the tunnel and the mooring tension at the middle section. The response spectra indicate that wave-induced motions are dominant, since the lowest natural frequencies in both directions (1.92 and 3.12 rad/s for the horizontal and vertical directions) lie away from the dominant input-wave spectral range of Figure 3. This means that the contribution from structural elastic resonances is negligible. For the mooring tension, under the given BWR = 1.3, snap loadings characterized by extraordinarily high peaks do not occur, as shown in the time series. However, it should be noted that snap loadings tend to occur at lower BWRs [28]. Obviously, smaller dynamic motions and mooring tensions can be obtained by further increasing the submergence depth [28]. The relevant statistics obtained from the time series are summarized in Table 3.
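
Response spectra of this kind can be checked with a one-sided FFT amplitude spectrum of the displacement record. The following is a minimal Python sketch (not the authors' code) applied to a synthetic signal with an assumed sampling interval and a single component at the peak wave frequency:

```python
import numpy as np

def one_sided_spectrum(x, dt):
    """Return angular frequencies (rad/s) and one-sided amplitude spectrum of x(t)."""
    n = len(x)
    X = np.fft.rfft(x) / n            # normalized complex spectrum
    amp = 2.0 * np.abs(X)             # fold negative frequencies into positive ones
    amp[0] /= 2.0                     # the DC bin is not doubled
    return 2.0 * np.pi * np.fft.rfftfreq(n, dt), amp

# Synthetic check: a pure response at the peak wave frequency (Tp = 13 s)
dt = 0.1
t = np.arange(0.0, 1300.0, dt)        # an integer number of wave periods
w_p = 2.0 * np.pi / 13.0              # 0.483 rad/s
x = 0.24 * np.sin(w_p * t)            # assumed response amplitude of 0.24 m
w, amp = one_sided_spectrum(x, dt)
peak_w = w[np.argmax(amp)]            # recovers the input frequency and amplitude
```

Choosing a record length that is an integer number of periods avoids spectral leakage; for the three-hour irregular records, windowing or spectral smoothing would normally be applied.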

**Figure 6.** Envelopes of the maximum and minimum displacements of the tunnel (**a**) and mooring tension (**b**) in the 100-year storm condition.

**Figure 7.** Time histories (**a**) and spectrum (**b**) of horizontal displacement of the tunnel in the middle location under the 100-year storm waves.

**Figure 8.** Time histories (**a**) and spectrum (**b**) of vertical displacement of the tunnel in the middle location under the 100-year storm waves.

**Figure 9.** Time histories (**a**) and spectrum (**b**) of mooring tension (#3) in the middle location under the 100-year storm waves.

**Table 3.** Statistics of the SFT motions and mooring tensions at the middle location under 100-year irregular wave excitations (from the time series of Figures 7–9).


#### *4.3. Dynamic Behaviors under Severe Seismic Excitations*

Regular and irregular seismic excitations are used for the SFT dynamic analysis. Since the fixed–fixed boundary condition is applied at both ends of the tunnel, both ends as well as all anchoring points are assumed to move together with the seismic motions. As a result, seismic time histories are input at every anchor location of the mooring lines and at both ends of the tunnel. The hydrodynamic forces generated by sea-water fluctuations under vertical seismic motions are computed using the modified Morison equation (e.g., Islam and Ahmad [41], Martinelli et al. [24], Mousavi et al. [42], and Wu et al. [25]). This effect is well known and is called seaquake. Consequently, there are two mechanisms causing SFT dynamics under seabed seismic motions: first, the seismic motions transferred through the mooring lines; second, the vertical sea-water fluctuations acting directly on the tunnel. In this paper, the former is called the earthquake effect and the latter the seaquake effect. To investigate the seaquake effect, regular seismic cases in the vertical direction only are simulated and the resulting SFT dynamics are analyzed. Subsequently, strong real seismic displacements are applied to the SFT system to check the global performance and structural robustness.
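
The Morison-type load per unit length can be sketched in the relative-velocity form, in which the fluid (e.g., seaquake-induced) kinematics and the structure's own motion both enter. This is a minimal Python illustration; the diameter, coefficients, and kinematic values are placeholders, not the paper's design data:

```python
import math

def morison_force_per_length(rho, D, Cm, Cd, u, u_dot, x_dot, x_ddot):
    """Morison load per unit length on a circular cylinder (relative-velocity form).

    u, u_dot      : fluid velocity and acceleration (e.g., seaquake-induced)
    x_dot, x_ddot : structure velocity and acceleration
    """
    A = math.pi * D**2 / 4.0                              # cross-sectional area
    inertia = rho * Cm * A * u_dot - rho * (Cm - 1.0) * A * x_ddot
    rel = u - x_dot
    drag = 0.5 * rho * Cd * D * rel * abs(rel)            # quadratic relative drag
    return inertia + drag

# Illustrative values only (not the paper's target model)
f = morison_force_per_length(rho=1025.0, D=23.0, Cm=2.0, Cd=1.0,
                             u=0.5, u_dot=0.2, x_dot=0.1, x_ddot=0.05)
```

When the structure moves exactly with the fluid, the relative drag and fluid-inertia terms cancel and the load vanishes, which is why only the *vertical* water fluctuations (seaquake) add forcing beyond what the moorings transmit.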

Figure 10 shows the tunnel's vertical motion amplitudes at the mid-section and the corresponding vertical responses of mooring line #1 at its center under regular (sinusoidal) seismic excitations. The vertical motions of the tunnel are greatly amplified at 3.12 rad/s and 4.89 rad/s, the 1st and 3rd natural frequencies. The amplified tunnel motions at those frequencies directly induce large mooring dynamics, as shown in Figure 10b. A small peak can also be observed at 5.78 rad/s, the lowest natural frequency of mooring line #1 itself.

The hydrodynamic force from the seaquake acts directly on the tunnel at the earthquake frequencies, whereas the seismic excitations are delivered to the tunnel through the mooring lines, as discussed earlier. The resulting tunnel response also induces hydrodynamic force on the tunnel, so there are phase effects between the two components. We can see that the tunnel dynamics are significantly reduced after including the seaquake effect compared to the earthquake-only case. The reason can be found in Figure 11 by plotting the contribution of each constituent component separately. In the figure, the phase of the tunnel response induced by the earthquake is opposite to that induced by the seaquake at the tunnel's natural frequencies, 3.12 rad/s and 4.89 rad/s. There is therefore a cancellation effect between the two components, so that the total vertical response amplitude is reduced compared to the earthquake-only case. On the other hand, when the earthquake frequency is greater than 5.7 rad/s, the two components become in phase, so the tunnel vertical responses are increased compared to the earthquake-only case, although the resulting increment is small. The seaquake effects are not generated by horizontal seismic motions if the seabed is flat, since horizontal seabed motions do not induce fluctuating motions of the seawater.
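
The phase-cancellation argument can be illustrated with two sinusoidal components of equal frequency; the amplitudes and phases below are illustrative only, not simulation output:

```python
import numpy as np

# Two response components at the tunnel's lowest vertical natural frequency:
# an earthquake-transmitted component and a seaquake-induced component.
t = np.linspace(0.0, 20.0, 2001)
w = 3.12                                    # rad/s
eq = 0.05 * np.sin(w * t)                   # earthquake-only component
sq_anti = 0.03 * np.sin(w * t + np.pi)      # seaquake in anti-phase (cancellation)
sq_in = 0.03 * np.sin(w * t)                # seaquake in phase (reinforcement)

amp_cancel = np.max(np.abs(eq + sq_anti))   # smaller than the earthquake-only 0.05
amp_add = np.max(np.abs(eq + sq_in))        # larger than the earthquake-only 0.05
```

Opposite phases subtract amplitudes (0.05 − 0.03), while equal phases add them (0.05 + 0.03), mirroring the reduction below 5.7 rad/s and the slight increase above it.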

**Figure 10.** Amplitudes of vertical displacements of the tunnel (**a**) and mooring line #1 (**b**) at the middle location under regular seismic excitations of various frequencies (Eq: earthquake only considered; Eq + Sq: both earthquake and seaquake considered).

**Figure 11.** Time histories of vertical displacements of the tunnel at the middle section by respective force components under regular seismic excitations of 3.12 rad/s (**a**), 4.89 rad/s (**b**), and 5.78 rad/s (**c**) (Eq: earthquake only considered; Eq + Sq: both earthquake and seaquake considered; Sq: seaquake only considered; time histories of seismic excitations are multiplied by 10 for better visualization).

Figures 12–14 show the time histories of the horizontal/vertical responses of the tunnel and the corresponding mooring tensions at the tunnel's middle section under the real seismic excitations given in Figure 4. The earthquake-only case is compared with the earthquake-plus-seaquake case. Firstly, in the earthquake-only case, the tunnel responses are greater than the input seismic motions, by a factor of about three horizontally and about two vertically. The horizontal responses are amplified more because their lowest natural frequency is closer to the dominant frequency range of the seismic excitations than that of the vertical response. The corresponding tunnel-response spectra show a first small peak at the seismic frequency, the highest peak at the lowest natural frequency, and another small peak at the third-lowest natural frequency. Mooring tensions are mostly influenced by the SFT horizontal and vertical motions at their lowest natural frequencies, with virtually no contribution near the seismic frequencies. The maximum tensions for this earthquake case are much smaller than those caused by the extreme wave excitations considered previously. However, the earthquake-induced tunnel dynamics can be amplified significantly more when the lowest natural frequencies of the tunnel's elastic responses are closer to the dominant seismic frequencies. In the figure, the same dynamic simulation results by OrcaFlex are also given for cross-checking. The two independent computer programs produced almost identical results.

In the spectral plots of Figures 12–14, the spectra of the tunnel responses and mooring tensions after adding the seaquake effects are also given. In Figure 12, there is little change in the SFT horizontal motions, since the seaquake mainly influences the vertical responses, as pointed out earlier. In Figure 13, there is a large reduction in the vertical-response spectrum at its lowest natural frequency (3.12 rad/s) after including the seaquake effect. This is due to the phase-cancellation effects, as discussed in the previous regular-earthquake case of Figure 11a,b. This reduction is directly reflected in the tension: in Figure 14, the tension spectral amplitude is greatly reduced near 3.12 rad/s but remains the same at the lowest natural frequency of the horizontal response, 1.92 rad/s. The same trend can be seen in the corresponding time-series comparisons (Figure 15) of the vertical tunnel responses and mooring tensions for the two cases (with and without the seaquake effect). The relevant statistics obtained from the time series are summarized in Table 4. The inclusion of the seaquake effect reduces both the vertical SFT responses and the mooring tensions, as discussed earlier.

**Figure 12.** Time histories (without seaquake) (**a**) and spectra (**b**) of horizontal tunnel responses at the middle location under seismic excitations.

**Figure 13.** Time histories (without seaquake) (**a**) and spectra (**b**) of vertical tunnel responses at the middle location under seismic excitations.

**Figure 14.** Time histories (without seaquake) (**a**) and spectra (**b**) of mooring tension #4 at the middle location under seismic excitations.

**Figure 15.** Time histories of vertical responses of the tunnel (**a**) and mooring tension #4 (**b**) in the middle location under seismic excitations with and without seaquake effect.

**Table 4.** Statistics of the SFT motions and mooring tensions at the middle location under irregular seismic excitations (Eq: earthquake, Sq: seaquake).


#### **5. Conclusions**

Global performance analysis of the SFT was carried out for survival random wave and seismic excitations. To solve the tunnel-mooring coupled hydro-elastic responses, an in-house time-domain simulation program was developed. The hydro-elastic equation of motion for the tunnel and mooring was based on a rod-theory-based finite element formulation with the Galerkin method. The dummy-connection-mass method was devised to conveniently connect multiple segmented objects and mooring lines with linear and rotational springs. Considering the slender shape of the structure, hydrodynamic forces were computed by the modified Morison equation. The numerical results produced by the developed program were in good agreement with those of the commercial program OrcaFlex, which is based on the lumped-mass method. The extreme wave excitations caused maximum SFT dynamic motions of 24 cm and 6 cm in the horizontal and vertical directions, and the corresponding mooring tensions remained below the allowable level. Snap motions and loadings of the mooring lines were not observed. Under regular seismic excitations, large resonant responses of the tunnel were observed at the 1st and 3rd natural frequencies. In the case of a seabed earthquake, the seabed motions are transferred to the SFT through the mooring lines and through seawater fluctuations, called seaquake. When the latter is also considered, the horizontal responses are not affected but the vertical responses become significantly reduced, especially at the lowest natural frequency. After analyzing the behaviors of the two contributions, it was found that the reduction was caused by the phase-cancellation effect. However, in other cases, the two phases could reinforce each other and increase the total responses of the SFT. Under extreme irregular seismic excitations, maximum SFT dynamic motions of 7 cm and 2 cm were generated and the corresponding mooring tensions were about 30% smaller than in the extreme wave case. However, when the frequencies of the seismic excitations are closer to the SFT natural frequencies, larger dynamic amplifications are expected.

**Author Contributions:** All authors contributed equally to this article, including the design of the target model, validation of the numerical modeling, simulations, analysis, and writing.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2017R1A5A1014883).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **The Influence of Dynamic Tissue Properties on HIFU Hyperthermia: A Numerical Simulation Study**

#### **Qiaolai Tan 1,2, Xiao Zou 1, Yajun Ding 3, Xinmin Zhao 1 and Shengyou Qian 1,\***


Received: 2 September 2018; Accepted: 10 October 2018; Published: 16 October 2018

**Abstract:** Accurate temperature and thermal dose prediction are crucial to high-intensity focused ultrasound (HIFU) hyperthermia, which has been used successfully for the non-invasive treatment of solid tumors. In the conventional method of prediction, the tissue properties are usually set as constants. However, the temperature rise induced by HIFU irradiation in tissues causes changes in the tissue properties that in turn affect the acoustic and temperature fields. Herein, an acoustic–thermal coupling model based on the Westervelt equation and the Pennes bioheat transfer equation is presented to predict the temperature and thermal damage zone in tissue, and the individual influence of each dynamic tissue property and the joint effect of all of the dynamic tissue properties are studied. The simulation results show that the dynamic acoustic absorption coefficient has the greatest influence on the temperature and thermal damage zone among all of the individual dynamic tissue properties. In addition, compared with the conventional method, the dynamic acoustic absorption coefficient leads to a higher focal temperature and a larger thermal damage zone; on the contrary, the dynamic blood perfusion leads to a lower focal temperature and a smaller thermal damage zone. Moreover, the conventional method underestimates the focal temperature and the thermal damage zone, compared with the simulation that was performed using all of the dynamic tissue properties. The results of this study will help doctors develop more accurate clinical protocols for HIFU treatment planning.

**Keywords:** HIFU; dynamic tissue property; Westervelt equation; thermal damage zone

#### **1. Introduction**

Cancer is one of the most serious diseases threatening human life and health. According to cancer statistics released by the National Cancer Center of China, 3.804 million new cancer cases were diagnosed and 2.296 million cancer deaths were reported in 2014 [1]. Traditional therapies for cancer include surgical resection, chemotherapy, and radiotherapy. In recent years, alternative therapies such as microwave ablation, laser ablation, cryoablation, and high-intensity focused ultrasound (HIFU) hyperthermia have also developed rapidly [2,3]. HIFU therapy is a non-invasive technology in which an ultrasound beam carrying sufficient energy is focused onto the target area to cause a local temperature rise high enough to make the lesion tissue undergo coagulative necrosis without damaging the overlying or surrounding tissue [4,5]. It has many advantages, being non-invasive, non-contact, non-ionizing, and low-cost [6,7], and has been used successfully in clinics to treat solid malignant tumors, including cancers of the prostate, liver, kidney, breast, and pancreas [8]. The clinical success of HIFU hyperthermia depends on delivering an accurate thermal dose at the lesion location. Unfortunately, it is difficult to measure the thermal dose accurately at depth in the tissue in most clinical situations. Instead, a numerical simulation method is usually used to predict the transient temperature profiles and thermal dose to assess the thermal damage that will occur in tissue during HIFU ablation [9].

In the conventional method, the numerical simulation of HIFU hyperthermia is usually based on the Westervelt acoustic equation and the Pennes bioheat transfer equation, with the tissue properties set as constants. However, several experimental studies have observed that tissue properties vary with temperature [10–13]. Moreover, the temperature-dependent tissue properties in turn affect the acoustic field and temperature field. Several researchers have incorporated some temperature-dependent tissue properties in numerical studies of HIFU hyperthermia [9,14,15]. For example, Hallaj [14] studied the effect of dynamic sound speed in the liver, with and without a fat layer, undergoing HIFU surgery. Christopher [15] examined the importance of the thermal lens effect with a phased-array transducer in the liver with a fat layer, considering dynamic sound speed and a dynamic acoustic absorption coefficient in a three-dimensional HIFU hyperthermia model. Guntur [9] studied the influence of temperature-dependent thermal parameters on temperature during HIFU irradiation by comparing the conventional prediction of temperature and the thermal damage zone with that for different thermal parameters (i.e., specific heat capacity and thermal conductivity) at given temperatures. However, only one or two dynamic tissue properties were considered in the above studies; to our knowledge, other dynamic tissue properties such as density and blood perfusion have never been considered. Furthermore, although the joint effect of two dynamic tissue properties was investigated in the above studies, the individual influence of each tissue property on HIFU hyperthermia is still unclear. Therefore, we first study the evolution of the acoustic and temperature fields with each dynamic tissue property independently, and clarify the physical significance of each tissue property.
In addition, we develop an acoustic–thermal model to evaluate the joint effect of all of the dynamic tissue properties on temperature distribution and thermal damage, including sound speed, acoustic absorption coefficient, non-linearity parameter, specific heat capacity, thermal conductivity, density, and blood perfusion. The results provide a more accurate prediction of temperature distribution and thermal damage and give insight into the complex dynamic processes during HIFU hyperthermia, which is useful for doctors in treatment planning.

#### **2. Theory**

#### *2.1. Acoustic Model for Ultrasound Wave Propagation*

Generally, the Westervelt equation [16,17] is used to model the ultrasound wave propagation in the thermoviscous medium:

$$
\nabla^2 p - \frac{1}{c^2} \frac{\partial^2 p}{\partial t^2} + \frac{\delta}{c^4} \frac{\partial^3 p}{\partial t^3} + \frac{\beta}{\rho c^4} \frac{\partial^2 p^2}{\partial t^2} = 0 \tag{1}
$$

where ∇2, *p*, *c*, and *t* are the Laplace operator, acoustic pressure, sound speed, and time, respectively; the non-linearity coefficient *β* is related to the non-linearity parameter *B*/*A* by *β* = 1 + (*B*/2*A*); and *δ* = 2*αc*<sup>3</sup>/*ω*<sup>2</sup> is the acoustic diffusivity accounting for the thermoviscous effect in the fluid [18], where *ω* is the acoustic angular frequency and *α* is the acoustic absorption coefficient.
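
For concreteness, *β* and *δ* can be computed directly from these definitions. This is a small Python sketch; the absorption, sound speed, frequency, and *B*/*A* values below are illustrative, not the paper's tabulated parameters:

```python
import math

def nonlinearity_coefficient(B_over_A):
    """beta = 1 + (B/A)/2, from the definition in the text."""
    return 1.0 + B_over_A / 2.0

def acoustic_diffusivity(alpha, c, f):
    """delta = 2*alpha*c^3/omega^2 for absorption alpha (Np/m) at frequency f (Hz)."""
    omega = 2.0 * math.pi * f
    return 2.0 * alpha * c**3 / omega**2

# Illustrative liver-like values at 1 MHz (assumed)
beta = nonlinearity_coefficient(6.8)
delta = acoustic_diffusivity(alpha=4.5, c=1570.0, f=1.0e6)
```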

The acoustic field is computed from the Westervelt equation in two-dimensional (2D) cylindrical coordinates using the finite-difference time-domain (FDTD) method. The *z*-axis is the acoustic axis of the ultrasonic transducer, and *r* is the radial coordinate measured from the *z*-axis. The excitation of the ultrasonic transducer is:

$$p(t) = p\_0 \sin(\omega t) \tag{2}$$

where *p*<sup>0</sup> is the amplitude of acoustic pressure on the ultrasonic transducer.

An absorbing boundary condition (ABC) is imposed at the edge of the computation domain to prevent or minimize the reflection from the edges of the domain, and a first-order Mur's absorption boundary condition is employed [19]:

$$\frac{\partial p}{\partial x} - \frac{1}{c} \frac{\partial p}{\partial t} = 0 \tag{3}$$

where *x* denotes *z* or *r* in their own ultrasonic wave propagation direction.
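
A one-dimensional FDTD sketch shows the first-order Mur condition absorbing an outgoing pulse. This is a 1-D Python illustration at Courant number 1, not the paper's 2-D axisymmetric solver; grid size and pulse width are assumed:

```python
import numpy as np

c, dx = 1500.0, 1e-4
dt = dx / c                       # Courant number 1 (exact propagation in 1-D)
n = 400
x = np.arange(n) * dx
p_prev = np.exp(-(((x - 0.5 * n * dx) / (10 * dx)) ** 2))  # Gaussian pulse
p_curr = p_prev.copy()            # zero initial velocity: pulse splits in two
C2 = (c * dt / dx) ** 2
k = (c * dt - dx) / (c * dt + dx)     # Mur coefficient (zero at Courant 1)

for _ in range(800):
    p_next = np.empty_like(p_curr)
    # standard second-order wave-equation update in the interior
    p_next[1:-1] = (2.0 * p_curr[1:-1] - p_prev[1:-1]
                    + C2 * (p_curr[2:] - 2.0 * p_curr[1:-1] + p_curr[:-2]))
    # first-order Mur ABC, the discrete form of Eq. (3), at both edges
    p_next[0] = p_curr[1] + k * (p_next[1] - p_curr[0])
    p_next[-1] = p_curr[-2] + k * (p_next[-2] - p_curr[-1])
    p_prev, p_curr = p_curr, p_next

residual = float(np.max(np.abs(p_curr)))   # both half-pulses have left the domain
```

After 800 steps, both halves of the pulse have crossed the boundaries and the residual field is near machine precision, demonstrating the near-reflectionless edge.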

#### *2.2. Thermal Energy Model for Tissue Heating*

The transfer of heat in the tissue under HIFU irradiation is modeled using the Pennes bioheat transfer equation [20]:

$$
\rho_t C_t \frac{\partial T}{\partial t} = k \nabla^2 T - W_b C_b (T - T_a) + Q_{ext} \tag{4}
$$

where *Ct* and *ρt* are the specific heat and density of tissue, respectively; *Cb*, *Wb*, and *Ta* are the specific heat, perfusion rate, and ambient temperature of blood, respectively. *Qext* is the ultrasound heat deposition term, which can be calculated by numerical integration of the time average over one acoustic period [14]:

$$Q_{ext} = \frac{2\alpha}{\rho c \omega^2} \left\langle \left(\frac{\partial p}{\partial t}\right)^2 \right\rangle \tag{5}$$
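
Equation (5) can be checked numerically for a pure tone, for which it reduces to *Qext* = *αp*0²/(*ρc*). The Python sketch below uses the paper's source amplitude but illustrative tissue values for *α*, *ρ*, and *c*:

```python
import numpy as np

def heat_deposition(p_samples, dt, alpha, rho, c, omega):
    """Q_ext of Eq. (5): scaled time average of (dp/dt)^2 over one acoustic period."""
    dpdt = np.gradient(p_samples, dt)
    return 2.0 * alpha / (rho * c * omega**2) * np.mean(dpdt**2)

f = 1.0e6
omega = 2.0 * np.pi * f
dt = 1.0 / (200.0 * f)                      # 200 samples per period
t = np.arange(0.0, 1.0 / f, dt)             # one acoustic period
p0 = 1.4e5                                  # source amplitude used in the paper
alpha, rho, c = 4.5, 1050.0, 1570.0         # illustrative liver-like values
Q = heat_deposition(p0 * np.sin(omega * t), dt, alpha, rho, c, omega)
Q_exact = alpha * p0**2 / (rho * c)         # analytic value for the pure tone
```

The numerical average matches the analytic value to well under 1%, the residual coming from the finite-difference estimate of the derivative.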

To evaluate the performance of the HIFU treatment, the thermal dose is usually used to estimate the tissue damage. The thermal dose depends on the final time *tf* and the temperature level *T*, following the formulation developed by Sapareto and Dewey [21]:

$$t\_{43} = \int\_0^{t\_f} R^{(T-43)} dt \approx \sum\_{0}^{t\_f} R^{(T-43)} \Delta t \tag{6}$$

where *t*43 is the thermal dose equivalent time at 43 °C. *R* = 2 if *T* ≥ 43 °C, and *R* = 4 if 37 °C < *T* < 43 °C. A threshold isothermal dose of 240 min at 43 °C is usually selected to predict the size of the thermal lesion region.
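
A direct implementation of the Sapareto–Dewey sum of Equation (6) is straightforward. The Python sketch below uses illustrative temperature histories; no dose is accrued at or below 37 °C:

```python
def thermal_dose_cem43(temps_c, dt_min):
    """Cumulative equivalent minutes at 43 C, the discrete sum of Eq. (6).

    temps_c : temperature samples in deg C
    dt_min  : sampling interval in minutes
    R = 2 for T >= 43 C, R = 4 for 37 C < T < 43 C.
    """
    dose = 0.0
    for T in temps_c:
        if T >= 43.0:
            dose += 2.0 ** (T - 43.0) * dt_min
        elif T > 37.0:
            dose += 4.0 ** (T - 43.0) * dt_min
    return dose

# Ten minutes held exactly at 43 C accrues 10 equivalent minutes, while a
# single minute at 53 C (2^10 = 1024 min) already exceeds the 240-min threshold
d_hold = thermal_dose_cem43([43.0] * 10, dt_min=1.0)
d_hot = thermal_dose_cem43([53.0], dt_min=1.0)
```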

Another way to quantify the tissue thermal damage is to use the Arrhenius equation [22]:

$$
\Omega = \int\_0^{t\_f} A \exp\left(\frac{-E\_a}{R\_a T}\right) dt\tag{7}
$$

where *A*, *Ea*, and *Ra* are the frequency factor, activation energy, and universal gas constant, respectively. For liver thermal damage, *A* = 9.4 × 10<sup>104</sup> s<sup>−1</sup>, *Ea* = 6.68 × 10<sup>5</sup> J mol<sup>−1</sup>, and *Ra* = 8.31 J mol<sup>−1</sup> K<sup>−1</sup> [23]. The undamaged fraction of the tissue and the damaged fraction can be estimated by *fu* = exp(−Ω) and *fd* = 1 − *fu*, respectively [24].
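
The Arrhenius damage integral of Equation (7), with the cited liver constants, can be sketched as follows (the temperature histories are illustrative; temperatures are converted from °C to kelvin inside the exponent):

```python
import math

def arrhenius_damage(temps_c, dt_s, A=9.4e104, Ea=6.68e5, Ra=8.31):
    """Discrete Arrhenius damage integral Omega of Eq. (7) for liver."""
    omega = 0.0
    for T in temps_c:
        omega += A * math.exp(-Ea / (Ra * (T + 273.15))) * dt_s
    return omega

# One minute at body temperature accrues essentially no damage, while a
# single second at 80 C drives Omega far past unity (f_u = exp(-Omega) -> 0)
omega_37 = arrhenius_damage([37.0] * 60, dt_s=1.0)
omega_80 = arrhenius_damage([80.0], dt_s=1.0)
f_u = math.exp(-omega_80)
```

The extreme steepness of the exponential is what makes the damage front so sharp in space: a few degrees' difference separates essentially intact tissue from fully coagulated tissue.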

#### *2.3. Dynamic Tissue Properties*

The acoustic and thermal parameters of tissue are strongly dependent on tissue temperature, and many experimental data have been obtained [10–13]. The data for the acoustic absorption coefficient and sound speed were derived from measurements in liver tissue by Damianou [10] and Bamber [11], respectively. The polynomials fitting the acoustic absorption coefficient and sound speed in liver tissue to the experimental data are [15]:

$$\begin{aligned} \alpha_{liver} &= 5.5367 - 2.9950 \times 10^{-1}T + 3.3357 \times 10^{-2}T^2 - 1.6058 \times 10^{-3}T^3 + 3.4382 \times 10^{-5}T^4 \\ &\quad - 3.2486 \times 10^{-7}T^5 + 1.1181 \times 10^{-9}T^6 & 30\ ^\circ \text{C} \le T \le 90\ ^\circ \text{C} \end{aligned} \tag{8}$$

*Appl. Sci.* **2018**, *8*, 1933

$$\begin{aligned} c_{liver} &= 1529.3 + 1.6856T + 6.1131 \times 10^{-2} T^2 - 2.2967 \times 10^{-3} T^3 \\ &\quad + 2.2657 \times 10^{-5} T^4 - 7.1795 \times 10^{-8} T^5 & 30\ ^\circ \text{C} \le T \le 90\ ^\circ \text{C} \end{aligned} \tag{9}$$

In this study, the experimental data for the change in the non-linearity parameter with temperature in the liver tissue are derived from measurements by Choi [12], and the experimental data for the changes in the specific heat capacity, thermal conductivity, and density with temperature in liver tissue are derived from measurements by Guntur [13]. We obtain the expressions of the temperature-dependent non-linearity parameter, specific heat capacity, thermal conductivity, and density respectively by the least squares polynomial fitting their experimental data in liver tissue:

$$
\begin{aligned} \left(\frac{B}{A}\right)_{liver} &= 6.68 - 0.41448\,T + 0.03364\,T^2 - 0.00101\,T^3 + 1.34407 \times 10^{-5}T^4 \\ &\quad - 6.35346 \times 10^{-8}T^5 & 30\ ^\circ \text{C} \le T \le 75\ ^\circ \text{C} \end{aligned} \tag{10}
$$

$$\begin{aligned} C_{liver} &= 3600 + 53.55552T - 3.96009T^2 + 0.10084T^3 - 0.00106T^4 \\ &\quad + 4.01666 \times 10^{-6}T^5 & 20\ ^\circ \text{C} \le T \le 90\ ^\circ \text{C} \end{aligned} \tag{11}$$

$$\begin{aligned} K_{liver} &= 0.84691 - 0.02094T + 3.89971 \times 10^{-4} T^2 - 5.47451 \times 10^{-7} T^3 \\ &\quad - 4.14455 \times 10^{-8} T^4 + 2.97188 \times 10^{-10} T^5 & 20\ ^\circ \text{C} \le T \le 90\ ^\circ \text{C} \end{aligned} \tag{12}$$

$$\begin{aligned} \rho_{liver} &= 1084.09352 - 2.97434T + 0.0042T^2 + 0.00293T^3 - 6.14447 \times 10^{-5}T^4 \\ &\quad + 3.33019 \times 10^{-7}T^5 & 20\ ^\circ \text{C} \le T \le 90\ ^\circ \text{C} \end{aligned} \tag{13}$$

The polynomials above are plotted in Figure 1; each is valid only within its own temperature range, and in this study their use is strictly restricted to those ranges.
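
For example, the fit of Equation (8) can be evaluated with a standard polynomial routine. This Python sketch lists the coefficients lowest-order first; clipping the input to the fit's stated range is an added safeguard, not something the authors state:

```python
import numpy as np

# Coefficients of Eq. (8), lowest order first: alpha_liver(T),
# valid for 30 C <= T <= 90 C
ALPHA_COEFFS = [5.5367, -2.9950e-1, 3.3357e-2, -1.6058e-3,
                3.4382e-5, -3.2486e-7, 1.1181e-9]

def alpha_liver(T):
    """Temperature-dependent absorption coefficient, clipped to the fit's range."""
    T = np.clip(T, 30.0, 90.0)
    return np.polynomial.polynomial.polyval(T, ALPHA_COEFFS)

a37 = alpha_liver(37.0)    # baseline value near body temperature
a80 = alpha_liver(80.0)    # markedly larger at ablation temperatures
```

The fit rises steeply at high temperature, which is exactly the behavior that drives the enhanced focal heating discussed in Section 3.1.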

**Figure 1.** Temperature-dependent tissue properties in liver tissue.

The variation of the blood perfusion rate with temperature and thermal damage can be described by:

$$W_{b,liver}(T,\Omega) = W_{b,0}\, f_T\, f_u \tag{14}$$

where *Wb*,0 is the constitutive blood perfusion rate, 18.2 kg m−<sup>3</sup> s−<sup>1</sup> for liver, and *fT* is a dimensionless function that accounts for vessel dilation at slightly elevated temperatures, which can be approximated as [23,24]:

$$f_T = \begin{cases} 4 + 0.6\,(T - 42) & 37\ ^\circ \text{C} \le T \le 42\ ^\circ \text{C} \\ 4 & T \ge 42\ ^\circ \text{C} \end{cases} \tag{15}$$

The blood perfusion rate increases as the temperature rises, but as tissue coagulation develops, it decreases to zero owing to the thermal damage factor [25].
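
Equations (14) and (15) combine into a simple perfusion model. The Python sketch below is a direct transcription; temperatures below 37 °C are outside Equation (15)'s stated range and are not handled:

```python
import math

W_B0 = 18.2   # constitutive liver perfusion rate, kg m^-3 s^-1 (from the text)

def f_T(T):
    """Vessel-dilation factor of Eq. (15): equals 1 at 37 C, rises to 4 at 42 C."""
    if T >= 42.0:
        return 4.0
    return 4.0 + 0.6 * (T - 42.0)

def perfusion(T, Omega):
    """W_b(T, Omega) = W_b0 * f_T * f_u with f_u = exp(-Omega), Eqs. (14)."""
    return W_B0 * f_T(T) * math.exp(-Omega)

w_body = perfusion(37.0, 0.0)      # baseline: f_T = 1, undamaged tissue
w_warm = perfusion(42.0, 0.0)      # dilation quadruples the perfusion
w_dead = perfusion(60.0, 50.0)     # coagulated tissue: perfusion shut down
```

This captures the two competing effects described above: mild heating increases perfusion (more convective cooling), while coagulation eliminates it.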

In order to study the effects of dynamic tissue properties on HIFU hyperthermia, we compare simulations using dynamic tissue properties with the conventional method using constant tissue properties. In this study, the constant values of the tissue properties were obtained by evaluating the above fitting formulas at 37 °C. The values of the acoustic and thermal parameters are listed in Tables 1 and 2, respectively.

**Table 1.** Values of acoustic parameters in this study (37 °C).

| Material | *ρ* (kg m−3) | *c* (m s−1) | *α* (Np m−1 MHz−1) | *β* |
| --- | --- | --- | --- | --- |

#### *2.4. Description of the Simulation*

The HIFU transducer is a spherical cap with an aperture radius *a* of 35 mm, a focal length *F* of 62.64 mm, and a center frequency *f* of 1 MHz. The transducer and liver tissue are placed in the water at 37 ◦C, and a geometric configuration of the physical model is shown in Figure 2.

**Figure 2.** Geometric configuration of the physical model. The liver tissue is a cylinder with a radius of 35 mm and a length of 50 mm, and is placed at *z*<sup>1</sup> = 40 mm.

As the tissue temperature rises, the tissue properties change dynamically. These properties need to be updated according to the temperature, and the updated properties are fed back into the calculation of the acoustic and temperature fields. The flowchart in Figure 3 shows how the coupled calculation of the acoustic and temperature fields is carried out under such dynamic conditions. The acoustic and temperature fields are coupled by the heat deposition term *Qext*, which is computed from the acoustic pressure. In the practical simulation, the temperature field is calculated periodically, and the resulting temperature data are used to update the tissue properties using the functions above at each spatial point in the tissue domain. The updated tissue properties are then used as input to recalculate the acoustic field. Therefore, the acoustic field, temperature field, and tissue properties are mutually influenced. In this study, the acoustic parameters and acoustic field are updated every 1 s unless otherwise noted, and the thermal parameters are updated in real time. This coupling method relies on the tissue properties changing slowly enough within the given update interval.

**Figure 3.** Flowchart of the iterative method for coupling acoustic pressure and temperature calculation.

For this study, the acoustic and temperature fields are calculated on a polar cylindrical grid using the explicit finite-difference time-domain (FDTD) method, as described by Hallaj [26]. The spatial grid for the simulation is Δ*z* = Δ*r* = 10−<sup>4</sup> m. The time steps for the acoustic and temperature field simulations are 10−<sup>8</sup> s and 0.01 s, respectively [26]. All of the simulations are performed in MATLAB based on the FDTD method.
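
The update scheduling described above (thermal parameters every thermal step, acoustic field on a coarser interval) can be sketched as a loop. In this Python sketch the three solver routines are toy placeholders, not the authors' FDTD code; only the scheduling is illustrated:

```python
import numpy as np

def update_properties(T):
    return {"alpha": 3.6 + 0.05 * (float(np.mean(T)) - 37.0)}  # toy T-dependence

def solve_acoustics(props):
    return 1.0e6 * props["alpha"] / 3.6        # toy heat deposition, W/m^3

def step_bioheat(T, Q, dt):
    return T + dt * Q * 1.0e-7                 # toy uniform heating step

def run_coupled(T0=37.0, total_time=3.0, dt_thermal=0.01, acoustic_period=1.0):
    T = np.full((8, 8), T0)
    props = update_properties(T)
    Q = solve_acoustics(props)                 # initial acoustic solve
    t, next_acoustic, n_acoustic = 0.0, acoustic_period, 1
    while t < total_time - 1e-12:
        T = step_bioheat(T, Q, dt_thermal)     # thermal step every 0.01 s
        props = update_properties(T)           # thermal parameters in "real time"
        t += dt_thermal
        if t >= next_acoustic - 1e-12:         # acoustic field refreshed every 1 s
            Q = solve_acoustics(props)
            next_acoustic += acoustic_period
            n_acoustic += 1
    return T, n_acoustic

T_final, n_acoustic = run_coupled()
```

The coarse acoustic refresh is justified because the acoustic solve is by far the most expensive step, while the tissue properties drift slowly between refreshes.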

#### **3. Result and Discussion**

In this manuscript, we focus on the effect of each dynamic tissue property independently of the others, and compare these effects with the conventional method of keeping the tissue properties constant. When the effect of one dynamic tissue property is studied, the other tissue properties remain constant unless otherwise noted. In the following study, the amplitude of the acoustic pressure on the sound source face *p*0 is 1.4 × 10<sup>5</sup> Pa unless otherwise noted.

#### *3.1. Dynamic Acoustic Absorption Coefficient*

Simulations are carried out that consider only the change of the acoustic absorption coefficient with temperature. To obtain more accurate simulation results, the acoustic absorption coefficient and acoustic field are updated here every 0.2 s. Figure 4a depicts the axial profile of the acoustic absorption coefficient during 3 s of HIFU irradiation. The acoustic absorption coefficient near the ultrasonic focus increases with HIFU irradiation time. Figure 4b illustrates the axial distribution of the peak acoustic pressure at t = 3 s. Clearly, the peak acoustic pressures are almost the same for dynamic and constant *αliver*. This is because the temperature strongly affects the acoustic absorption coefficient only near the ultrasonic focus, as shown in Figure 4a. In Figure 4c, the maximum value of *Qext* is 8.938 × 10<sup>7</sup> W/m<sup>2</sup> for constant *αliver* and 2.179 × 10<sup>8</sup> W/m<sup>2</sup> for dynamic *αliver*; this follows from *Qext* being proportional to the acoustic absorption coefficient, according to Formula (5). Figure 4d contrasts the evolution of the focal temperature with time for dynamic and constant *αliver*. Before t = 1 s, the rate of focal temperature rise is almost the same for the two simulations, which may be due to the small change in the acoustic absorption coefficient during the early HIFU irradiation stage, as shown in Figure 4a. After t = 1 s, the focal temperature for dynamic *αliver* rises much faster than that for constant *αliver*. At t = 3 s, the focal temperature is 65.94 °C for constant *αliver* and 85.53 °C for dynamic *αliver*. Figure 4e plots the shape of the thermal damage zone, representing the region heated for more than 240 min of equivalent time at 43 °C. The thermal damage zone is an ellipse of 0.51 cm × 0.12 cm for constant *αliver*, and an ellipse of 0.6 cm × 0.16 cm for dynamic *αliver*.
These phenomena indicate that dynamic *αliver* has a greater effect on the focal temperature and thermal damage zone as the HIFU irradiation time increases, compared with constant *αliver*: a greater acoustic absorption coefficient is related to a greater value of *Qext*, a higher focal temperature, and a larger thermal damage zone. Figure 4f describes the axial profile of the thermal dose, and the black dotted line denotes the value log10(240) min. The axial length AB of the thermal damage zone is 0.51 cm for constant *αliver*, and the axial length CD is 0.6 cm for dynamic *αliver*, consistent with the axial lengths of the thermal damage zones in Figure 4e. Meanwhile, the thermal dose at the ultrasonic focus for dynamic *αliver* is much greater than that for constant *αliver*.

**Figure 4.** (**a**) The evolution of *αliver* at t = 1 s and 3 s. The effects of the dynamic acoustic absorption coefficient on: (**b**) *p* (**c**) *Qext* (**d**) *T* at the ultrasonic focus (**e**) thermal damage zone (**f**) *t*<sup>43</sup> at t = 3 s.

#### *3.2. Dynamic Non-linearity Parameter*

A simulation is then carried out considering only the change of the non-linearity parameter with temperature. The HIFU irradiation time is set to 5 s to ensure the validity of the dynamic non-linearity parameter in the range of 30 °C to 75 °C, and the non-linearity parameter and acoustic field are updated every 0.5 s. In Figure 5b, the axial profile of the peak acoustic pressure at 5 s is almost identical for dynamic and constant (*B*/*A*)*liver*. This can be attributed to the increase of the non-linearity parameter with HIFU irradiation time being confined to the vicinity of the ultrasonic focus, as shown in Figure 5a. According to Formula (5), the value of *Qext* for dynamic (*B*/*A*)*liver* is almost the same as that for constant (*B*/*A*)*liver*. Consequently, the evolution of the focal temperature with time and the thermal damage zone are almost identical for dynamic and constant (*B*/*A*)*liver*, as shown in Figure 5c,d. It can be concluded that the dynamic acoustic non-linearity parameter has little effect on HIFU hyperthermia.

**Figure 5.** (**a**) The evolution of (*B*/*A*)*liver* at t = 2 s and 5 s. The effects of the dynamic non-linearity parameter on: (**b**) *p* and (**c**) *T* at the ultrasonic focus, and (**d**) the thermal damage zone at t = 5 s.

#### *3.3. Dynamic Sound Speed, Specific Heat Capacity, Thermal Conductivity, and Density*

Simulations are carried out using the dynamic sound speed, dynamic specific heat capacity, dynamic thermal conductivity, and dynamic density, respectively, with a HIFU irradiation time of 10 s. Figure 6 describes the axial profiles of these four dynamic properties; the temperature affects them only in the vicinity of the ultrasonic focus.

**Figure 6.** The axial profiles of (**a**) *cliver* (**b**) *Cliver* (**c**) *Kliver*, and (**d**) *ρliver* at t = 2 s and 10 s.

Figure 7a shows that the axial profiles of the peak acoustic pressure for simulations with dynamic *Cliver*, dynamic *Kliver*, and dynamic *ρliver* are almost the same as that for the simulation with constant tissue properties. At the ultrasonic focus, the peak acoustic pressure with dynamic *cliver* is a little greater than that with constant tissue properties, which is consistent with previously reported results [14]. Figure 7b demonstrates the evolution of the focal temperature with time for simulations using dynamic *cliver*, dynamic *Cliver*, dynamic *Kliver*, dynamic *ρliver*, and constant tissue properties, respectively. Before t = 2 s, the rate of focal temperature rise is almost the same in all five cases, which may be due to the very small change in the tissue properties during the early HIFU irradiation stage, as shown in Figure 6. After t = 2 s, Figure 6 shows that the sound speed and density decrease with increasing HIFU irradiation time, whereas the specific heat capacity and thermal conductivity show the opposite trend. According to Formula (5), the values of focal *Qext* for dynamic *cliver* and dynamic *ρliver* are both greater than that for constant tissue properties. Consequently, the focal temperature rises faster for dynamic *cliver* and dynamic *ρliver* than for constant tissue properties, as shown in Figure 7b. The focal temperature rises more slowly for dynamic *Cliver* than for constant tissue properties. This can be explained by the physical meaning of specific heat capacity, defined as the amount of energy required to raise the temperature of a unit mass of tissue by 1 °C [27]: for the same amount of heat energy and mass, the larger the specific heat capacity, the smaller the temperature rise.

The focal temperature for dynamic *Kliver* rises more slowly than that for constant *Kliver*, and at 10 s it is 5.33 °C lower than that for constant tissue properties. This is because a greater thermal conductivity means that more thermal energy is lost from the treated area through thermal diffusion [28]. Therefore, it can be concluded that a greater thermal conductivity leads to a slower focal temperature rise, which is similar to Guntur's result [9]. The maximum focal temperatures for simulations using dynamic *cliver*, dynamic *Cliver*, dynamic *ρliver*, and constant tissue properties are 86.66 °C, 84.58 °C, 88.62 °C, and 85.47 °C, respectively, indicating that dynamic *cliver*, dynamic *Cliver*, and dynamic *ρliver* have little effect on the temperature during HIFU hyperthermia. This is mainly due to the local variation of the tissue properties near the ultrasonic focus. Figure 7c shows that the thermal damage zones for dynamic *cliver*, dynamic *Cliver*, dynamic *Kliver*, dynamic *ρliver*, and constant tissue properties are almost the same, which is also confirmed by Figure 7d. It is interesting to note that the maximum focal temperature for dynamic *Kliver* is lower than that for constant *Kliver*, yet the thermal damage zones are almost the same. This is mainly because the size of the thermal damage zone depends on the thermal dose exceeding 240 min at 43 °C rather than on the maximum focal temperature.

**Figure 7.** The effects of dynamic sound speed, dynamic specific heat capacity, dynamic thermal conductivity, and dynamic density on: (**a**) *p* and (**b**) *T* at the ultrasonic focus, (**c**) the thermal damage zone, and (**d**) *t*<sup>43</sup> at t = 10 s.

#### *3.4. Dynamic Blood Perfusion*

The simulation is then performed considering only the dynamic change of blood perfusion, with a HIFU irradiation time of 10 s. As shown in Figure 8a, the blood perfusion first increases, then remains unchanged, and finally decreases to zero. This is because the initial temperature rise causes the blood vessels to dilate, increasing blood perfusion. As the temperature continues to increase, the tissue damage fraction grows, so the blood perfusion decreases; once the tissue undergoes coagulation necrosis, the blood perfusion drops to zero. Figure 8b describes the axial profile of the dynamic blood perfusion at t = 2 s and 10 s. Compared with the other tissue properties, temperature affects blood perfusion over a wider region of tissue around the central axis. At t = 10 s, the axial profile of the peak acoustic pressure for dynamic *Wb*,*liver* is almost the same as that for constant *Wb*,*liver*, as shown in Figure 8c. In Figure 8d, the temperature rise for dynamic *Wb*,*liver* is slower at first and then faster than that for constant *Wb*,*liver*. This is because the blood perfusion first increases to four times its constant value, then remains unchanged, and finally drops to zero, as shown in Figure 8a. Figure 8e shows that the thermal damage zone is an ellipse of 0.97 cm × 0.26 cm for constant *Wb*,*liver* and an ellipse of 0.89 cm × 0.24 cm for dynamic *Wb*,*liver*. Thus, the thermal damage zone for dynamic *Wb*,*liver* is smaller than that for constant *Wb*,*liver*, which is also verified by the axial thermal dose distribution of Figure 8f.
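The perfusion behaviour just described (dilation up to four times the baseline, then shutdown at coagulation necrosis) can be caricatured with a simple piecewise rule. This is an illustrative model of ours, not the paper's actual perfusion law; in particular, the linear ramp between 37 °C and 45 °C is an arbitrary assumption:

```python
def dynamic_perfusion(wb0, temp_c, damage_fraction):
    """Illustrative piecewise perfusion model (hypothetical form):
    mild heating dilates vessels up to 4x the baseline wb0, accumulated
    damage scales the flow down, and full necrosis stops it entirely."""
    if damage_fraction >= 1.0:   # coagulation necrosis: perfusion drops to zero
        return 0.0
    if temp_c <= 37.0:           # baseline below body temperature
        boost = 1.0
    else:                        # assumed linear ramp reaching the 4x plateau at 45 degC
        boost = min(4.0, 1.0 + 3.0 * (temp_c - 37.0) / 8.0)
    return wb0 * boost * (1.0 - damage_fraction)

w_heated = dynamic_perfusion(0.5, 45.0, 0.0)  # plateau: 4x baseline -> 2.0
w_dead = dynamic_perfusion(0.5, 60.0, 1.0)    # necrotic tissue -> 0.0
```

In a full simulation, a rule of this shape would be re-evaluated at each time step from the local temperature and accumulated damage, feeding back into the bioheat equation's perfusion term.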

**Figure 8.** (**a**) Dynamic blood perfusion change with time. (**b**) Axial profile of dynamic blood perfusion at t = 2 s and 10 s. The effects of dynamic blood perfusion on: (**c**) *p* and (**d**) *T* at the ultrasonic focus and (**e**) thermal damage zone; (**f**) *t*<sup>43</sup> at t = 10 s.

#### *3.5. Considering All Dynamic Tissue Properties*

In the above results, the influence of each tissue property on HIFU hyperthermia was studied independently. In this section, therefore, a simulation using all of the dynamic tissue properties is performed to explore their joint influence on HIFU hyperthermia, by comparison with the simulations using dynamic *αliver* and constant tissue properties. Note that the non-linearity parameter above 75 °C is replaced by its value at 75 °C to simplify the physical model, because (i) the dynamic non-linearity parameter was found to have little effect on the acoustic pressure, temperature, and thermal damage zone in our calculation (Figure 5b–d); and (ii) above 75 °C, the biological tissue has already coagulated. To ensure that all of the dynamic tissue properties remain valid within their respective temperature ranges, the HIFU irradiation time is set to 3 s.

At t = 3 s, Figure 9a shows that the peak acoustic pressure for the simulation using all of the dynamic tissue properties is larger than that using dynamic *αliver* or constant tissue properties, owing to the influence of the dynamic sound speed, while the peak acoustic pressures are almost the same for dynamic *αliver* and constant tissue properties. In Figure 9b, the peak value of *Qext* is greatest for the simulation using all of the dynamic tissue properties, followed by that using dynamic *αliver*, and smallest for that using constant tissue properties. In Figure 9c, before t = 1 s, the focal temperature is almost the same for the simulations using dynamic *αliver*, all of the dynamic tissue properties, and constant tissue properties; after t = 1 s, the focal temperature rises fastest for dynamic *αliver*, followed by all of the dynamic tissue properties, and slowest for constant tissue properties. The maximum focal temperatures for all of the dynamic tissue properties, dynamic *αliver*, and constant tissue properties are 81.56 °C, 85.53 °C, and 65.94 °C, respectively; the maximum focal temperature for all of the dynamic tissue properties is thus lower than that for dynamic *αliver*, although the peak value of *Qext* for all of the dynamic tissue properties is greater. Based on the above results, this is mainly due to the combined influence of dynamic *Cliver*, dynamic *Kliver*, and dynamic *Wb*,*liver* on the focal temperature, especially that of dynamic *Kliver*. In Figure 9d, the thermal damage zone is an ellipse of 0.57 cm × 0.16 cm for all of the dynamic tissue properties, an ellipse of 0.6 cm × 0.16 cm for dynamic *αliver*, and an ellipse of 0.51 cm × 0.12 cm for constant tissue properties.
Consequently, it can be concluded that the simulation using constant tissue properties significantly underestimates the focal temperature and thermal damage zone compared with the simulations using all of the dynamic tissue properties or dynamic *αliver*. Meanwhile, although the dynamic acoustic absorption coefficient plays the most important role for the focal temperature and thermal damage zone, the other dynamic tissue properties ought to be considered as well.

**Figure 9.** The effects of all of the dynamic tissue properties on: (**a**) *p*, (**b**) *Qext*, (**c**) *T* at the ultrasonic focus, and (**d**) thermal damage zone.

#### **4. Conclusions**

The influence of each dynamic tissue property on HIFU hyperthermia was studied independently, based on reported experimental data for the dynamic tissue properties. The findings of the present study suggest that the acoustic pressure is insensitive to the dynamic tissue properties. The numerical results also show that the dynamic acoustic absorption coefficient significantly affects the temperature and thermal damage zone, whereas the dynamic non-linearity parameter has almost no effect on them. The thermal damage zone for dynamic *Wb*,*liver* is smaller than that for constant *Wb*,*liver*, and the influence of the dynamic sound speed, specific heat capacity, and density on the thermal damage zone is slight. It is also worth mentioning that the maximum focal temperature for dynamic *Kliver* is lower than that for constant *Kliver*, but the thermal damage zones for the two cases are almost the same. Among all of the individual dynamic tissue properties, the dynamic acoustic absorption coefficient has the greatest influence on the temperature and thermal damage zone. Knowing the influence of each dynamic tissue property deepens our understanding of the principles of HIFU therapy. Beyond the individual properties, a simulation considering all of the dynamic tissue properties was performed to explore their combined influence on HIFU hyperthermia. The numerical results show that the maximum focal temperature and thermal damage zone for the simulation using all of the dynamic tissue properties increase compared with those for the simulation using constant tissue properties, implying that the constant-property simulation underestimates both.

Moreover, it is interesting to point out that the thermal energy absorbed by the tissue in the simulation using all of the dynamic tissue properties is greater than that using dynamic *αliver* alone, yet the maximum focal temperature and thermal damage zone are smaller. Consequently, when doctors develop a more accurate clinical protocol for HIFU treatment planning, it is necessary to consider all of the dynamic tissue properties when assessing the size of the thermal damage zone, so as not to damage normal tissue.

**Author Contributions:** S.Q. conceived and designed the research idea and the framework; Q.T. and X.Z. (Xiao Zou) performed the simulations; Q.T. and X.Z. (Xiao Zou) wrote the paper; S.Q., Q.T. and Y.D. analyzed the data, S.Q. and X.Z. (Xinmin Zhao) modified the paper.

**Acknowledgments:** This work is partially supported by the National Natural Science Foundation of China (No. 11474090, 11774088, 11174077, 61502164), Hunan Provincial Natural Science Foundation of China (No. 2016JJ3090), Scientific Research Fund of Hunan Provincial Education Department (No. 16B155), Aid program for Science and Technology Innovative Research Team in Higher Educational Institutions of Hunan Province, Science and Technology Research Program of Chenzhou City (No. CZ2014039) and Research Program of Xiangnan University (No. 2014XJ63).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Fingerprinting Acoustic Localization Indoor Based on Cluster Analysis and Iterative Interpolation**

### **Shuopeng Wang 1,\*, Peng Yang 1,2 and Hao Sun 1,2**


Received: 6 August 2018; Accepted: 4 October 2018; Published: 10 October 2018

**Abstract:** Fingerprinting acoustic localization usually requires tremendous time and effort for database construction in the sampling phase and for reference point (RP) matching in the positioning phase. To improve the efficiency of this localization process, an iterative interpolation method is proposed that reduces the initial RPs needed for the required positioning accuracy by generating virtual RPs in the positioning phase. Meanwhile, a two-stage matching method based on cluster analysis is proposed to reduce the computation of RP matching. The reported results show that, on the premise of ensuring positioning accuracy, the two-stage matching method based on feature-clustering partition reduces the average RP matching amount to 30.14% of that of the global linear matching method. Meanwhile, the iterative interpolation method guarantees the positioning accuracy with only 27.77% of the initial RPs needed by the traditional method.

**Keywords:** fingerprinting acoustic localization; iterative interpolation; K-Means clustering; two-stage matching; adjacent RPs

#### **1. Introduction**

With the development of signal processing technology and artificial intelligence, voice interaction has been gaining extensive attention in the smart device field [1–3]. Nowadays, the autonomous robot, as a representative of intelligent equipment, is expected to interact with people in a human-like way [4], and voice interaction can effectively improve its intelligence level. During the human–robot interaction (HRI) process, acoustic localization technology can provide the necessary reference for a robot's pose adjustment to enhance HRI reliability [5,6]. In recent years, great advances in theory and application have been made in the acoustic localization field. Most existing acoustic localization methods are parametric positioning methods, which are based on space geometrical propagation models of the acoustic signal [6–11]. Usually, these models are simplified with the following assumptions about the sound source and the transmission channel:


The geometry model acoustic localization methods can achieve acceptable results outdoors, where the actual signal propagation model is similar to the ideal assumptions mentioned above. However, for indoor circumstances, the signal propagation model may be altered by the multipath effect, shadowing effect, fading effect, and delay distortion from walls, floors, furniture, or ceilings [12,13]. Meanwhile, it is difficult to provide compensation for model distortion analytically [14,15].

Different from the acoustic localization methods based on a geometry model, the fingerprinting acoustic localization method adopted in our previous work [16], as a non-parametric location approach, can effectively accomplish the sound positioning task following the idea of environment perception. Compared with the preconditions of the parametric localization methods mentioned above, the main requirement of the fingerprinting localization method, avoiding dramatic environment changes in the target area, is easier to satisfy in practical applications [17,18].

Many studies indicate that the positioning accuracy of non-parametric positioning methods largely depends on the sampling density [19,20]. Therefore, for high-resolution indoor positioning, a considerable amount of sampling work is needed for database construction in the offline phase. Additionally, during the online phase, the involved algorithms need a considerable amount of data, memory, and computational resources to estimate the target position in real time.

Interpolation is a mathematical tool to estimate the value of a function at a certain point from available values at other points. Interpolation methods for scattered data are widely implemented in mathematical, industrial, and manufacturing applications. Radial basis function (RBF) [21], linear [22], inverse distance weighting (IDW) [23], and kriging [24] are well-known interpolation methods for positioning database expansion. Even though these methods effectively reduce the sampling workload while preserving the positioning accuracy [25], many problems in the existing interpolation methods remain to be solved. In our previous work [26], the interpolation methods were executed globally, which resulted in a rapid expansion of the virtual RP quantity and increased the calculation amount for RP matching [27]. Meanwhile, conventional interpolation methods usually rely on the experience of the implementer and cannot accurately reckon the quantity of virtual RPs needed. In this paper, we propose an iterative interpolation method that refines the interpolation scope and, at the same time, monitors the interpolation process to avoid unnecessary virtual RPs. The estimation result of each iteration is compared with that of the previous iteration, and the interpolation ends when the difference between the two adjacent estimates is less than the given threshold value.

Selectively matching the target point (TP) with a subset of the RPs can reduce the matching task and thus improve the positioning efficiency of fingerprinting acoustic localization. Therefore, the positioning database is divided into a certain number of sub-databases in the offline phase, and the matching scope can then be shrunk through a search for the adjacent sub-database [28]. For the database division, the coordinate-space partition was investigated first; it is easy to implement and can reduce the influence of outliers. However, this method suffers from uncertain partitioning results and large positioning errors, mainly because the division result is greatly influenced by subjective judgment: the adjacent RPs of the TP may be divided into different sub-databases, which causes RP matching errors. In machine learning, cluster analysis, as a precursor of other learning tasks, is often used to classify unlabeled samples. RPs with similar features can be assigned to the same sub-database automatically by a clustering technique, accomplishing the database partition [29].

The rest of the paper is organized as follows: In Section 2, the general process of the fingerprinting acoustic localization is briefly introduced. In Section 3, the positioning database partition by cluster analysis and the adjacent RPs searching based on the two-stage matching method are stated, and then the iterative interpolation method is proposed to generate the virtual RPs for ensuring the target position estimation accuracy with few initial RPs. Section 4 presents the implementation details and evaluates the performance of the novel methods from the results obtained. Finally, some conclusions are drawn in Section 5.

#### **2. The Fingerprinting Localization Model**

The fingerprinting localization method is a database matching approach. As Figure 1 shows, it uses the position information and the related features measured in the target region to establish the positioning database. In the actual positioning, the signal captured by the positioning system is matched against the samples in the positioning database, and the samples most similar to the target signal are selected to accomplish the position estimation.

**Figure 1.** Illustration of the fingerprinting localization process.

As introduced in our previous work [16], the fingerprinting acoustic localization method requires an offline phase to construct the positioning database and an online phase for acoustic target location [30].

In the offline phase, the positioning database can be constructed by the coordinate of position marks and the corresponding features. The coordinates of the position marks are usually determined according to the site environment of the positioning area and the location accuracy requirements of the task. As the location-related feature in this work, time difference of arrival (TDOA) is widely used in real-time acoustic positioning applications for its low computational complexity and small data size [31–35]. Finally, samples, also known as position fingerprints, are formed by the coordinate of position marks and their corresponding features. The fingerprints in each position mark are collected and the location fingerprint database is established.
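The paper does not specify its TDOA estimator, but a common minimal choice is the peak of the cross-correlation between two microphone channels; the sketch below (function name and simulated signals are ours) uses plain cross-correlation, with GCC-PHAT being a typical refinement in reverberant rooms:

```python
import numpy as np

def tdoa(sig_a, sig_b, fs):
    """Estimate the time difference of arrival (s) between two microphone
    signals from the peak of their full cross-correlation."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # lag in samples
    return lag / fs

fs = 100_000                                   # 100 kHz sampling, as in Section 4
rng = np.random.default_rng(0)
src = rng.standard_normal(2048)
d = 25                                         # simulated 25-sample delay
mic_b = np.concatenate([src, np.zeros(d)])
mic_a = np.concatenate([np.zeros(d), src])     # same signal arriving later
delay = tdoa(mic_a, mic_b, fs)                 # 25 / 100000 = 2.5e-4 s
```

Stacking such pairwise delays across the microphone pairs yields the TDOA feature vector stored in each fingerprint.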

In the online phase, the feature vector of the observed target signal is matched with each sample of the positioning database. A specific number of samples are selected as the adjacent RPs according to their similarity with the target. Finally, the position of the target is calculated from the adjacent RPs by a specific position estimation algorithm. Weighted K-nearest neighbor (WKNN) [36], the estimation algorithm used in the RADAR system, is usually adopted for the fingerprinting localization process.
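A minimal sketch of the WKNN estimate: average the coordinates of the k RPs closest to the target in feature space, weighted by inverse feature distance. The inverse-distance weighting and the toy one-dimensional fingerprints are our illustrative assumptions, not necessarily the exact variant used here:

```python
import numpy as np

def wknn(target_feat, rp_feats, rp_coords, k=3, eps=1e-9):
    """Weighted K-nearest-neighbor position estimate."""
    d = np.linalg.norm(rp_feats - target_feat, axis=1)  # feature-space distances
    idx = np.argsort(d)[:k]                             # k adjacent RPs
    w = 1.0 / (d[idx] + eps)                            # closer RPs weigh more
    w /= w.sum()
    return w @ rp_coords[idx]                           # weighted coordinate average

rp_feats = np.array([[0.0], [1.0], [2.0], [10.0]])      # toy 1-D fingerprints
rp_coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
est = wknn(np.array([1.5]), rp_feats, rp_coords, k=3)   # lands between RPs 1 and 2
```

The distant RP at feature 10.0 is excluded by the k-selection, and the two nearest RPs dominate the weighted average.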

#### **3. The Proposed Fingerprinting Acoustic Localization Approach**

In the traditional fingerprinting localization approach, the target signal has to be matched with all samples in the database to select its adjacent RPs for location estimation. Therefore, the large-scale positioning database on which the fingerprinting localization accuracy depends increases the complexity of the matching operation and reduces the positioning efficiency. Aiming at this contradiction in the traditional approach, this paper makes some improvements. As Figure 2 shows, after the database construction in the offline sampling phase, the entire positioning database is divided into sub-databases. Then, in the online positioning phase, the matching scope is narrowed by the adjacent sub-database matching stage, and the adjacent RP search can be accomplished with a small amount of matching computation.

Meanwhile, the offline phase of the fingerprinting localization approach requires a great sampling effort, as the mobile sensor has to be placed at every position mark in the location area. To reduce the initial sampling effort, the database can be constituted from sparse samples in the target area and extended afterwards by interpolation functions. As Figure 2 shows, an iterative interpolation method is presented to further refine the interpolation scope and avoid unnecessary virtual RPs. At the same time, the estimation result of each iteration is compared with that of the previous iteration, and the interpolation ends when the difference between the two adjacent estimates is less than the given threshold value.

The novel acoustic localization approach consists of three main stages:


#### *3.1. Database Partition by Clustering Method*

Clustering is a process of grouping elements according to some specific feature, called the cluster key, such as the TDOA value chosen in this paper. The prototype-based clustering method has the advantages of being simple, fast, and efficient for classifying big datasets. As one of the prototype-based clustering algorithms, the K-Means algorithm is a classical and efficient algorithm for cluster analysis [37].

The database partition process by the K-Means clustering algorithm includes three steps. Suppose there is a database *D* with *N* samples that needs to be partitioned into *K* (*K* < *N*) groups. The clustering algorithm for the sound-position fingerprint database can be described as Algorithm 1.

**Algorithm 1:** The K-Means clustering algorithm for positioning database partition.

**Input:** database *D* = [*S*<sub>1</sub>, *S*<sub>2</sub>, ··· , *S*<sub>N</sub>]<sup>T</sup>; cluster class number *K*.
**Output:** *D* = {*C*<sub>1</sub>, *C*<sub>2</sub>, ··· , *C*<sub>K</sub>}

1. Randomly select *K* samples from *D* as the initial cluster centers {*μ*<sub>1</sub>, *μ*<sub>2</sub>, ··· , *μ*<sub>K</sub>};
2. **while** *flag* > 0 **do**
3. set the cluster-center update flag: *flag* = 0;
4. *C*<sub>i</sub> = ∅ (*i* = 1, 2, ..., *K*);
5. **for** *j* = 1, 2, ..., *N* **do**
6. calculate the distance between *S*<sub>j</sub> and each cluster center *μ*<sub>i</sub>: *d*<sub>ji</sub> = ||*S*<sub>j</sub> − *μ*<sub>i</sub>||<sub>2</sub><sup>2</sup>;
7. determine the cluster mark of *S*<sub>j</sub> by the nearest cluster center: *λ*<sub>j</sub> = arg min<sub>i∈{1,2,··· ,K}</sub> *d*<sub>ji</sub>;
8. classify the sample *S*<sub>j</sub> into the corresponding cluster: *C*<sub>λj</sub> = *C*<sub>λj</sub> ∪ {*S*<sub>j</sub>};
9. **for** *i* = 1, 2, ..., *K* **do**
10. calculate the new cluster center *μ*′<sub>i</sub> = (1/|*C*<sub>i</sub>|) Σ<sub>*S*∈*C*<sub>i</sub></sub> *S*;
11. **if** *μ*′<sub>i</sub> ≠ *μ*<sub>i</sub> **then** update the current value of *μ*<sub>i</sub> to *μ*′<sub>i</sub> and set *flag* = *flag* + 1; **else** keep the current value of *μ*<sub>i</sub>.

Firstly, *K* samples are randomly selected from the positioning database *D* as the initial cluster centers [*μ*<sub>1</sub>, *μ*<sub>2</sub>, ··· , *μ*<sub>K</sub>]. Then, the remaining samples are assigned to the most similar clusters according to their similarity with each cluster center in feature space. Next, each cluster center is updated by *μ*<sub>i</sub> = (1/|*C*<sub>i</sub>|) Σ<sub>*S*∈*C*<sub>i</sub></sub> *S*, where *S* denotes the samples clustered to *C*<sub>i</sub>. The clustering process is repeated until the cluster centers stop updating, and finally *D* = {*C*<sub>1</sub>, *C*<sub>2</sub>, ··· , *C*<sub>K</sub>}.
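The loop of Algorithm 1 can be sketched in a few lines. This is a generic K-Means pass over an N × M fingerprint matrix; the random initialization seed and the empty-cluster guard are implementation choices of ours:

```python
import numpy as np

def kmeans_partition(D, k, seed=0, max_iter=100):
    """Partition the fingerprint matrix D (N samples x M features) into k
    clusters: assign each sample to its nearest center, then recompute the
    centers, repeating until the centers stop moving (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)]  # step 1
    for _ in range(max_iter):
        dists = np.linalg.norm(D[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)                   # steps 5-8
        new_centers = np.array([D[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])         # steps 9-10
        if np.allclose(new_centers, centers):               # flag stayed at 0
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs partition cleanly into two clusters
pts = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.vstack([pts, pts + 10.0])
labels, centers = kmeans_partition(D, 2)
```

The returned labels define the sub-databases *C*<sub>1</sub>, ..., *C*<sub>K</sub>, and the centers are the *μ*<sub>i</sub> used in the first matching stage below.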

#### *3.2. Two-Stage RPs Matching*

The vocal target can be located in the positioning area after the RP sampling and the database construction. In the positioning phase, a two-stage matching algorithm is proposed to compare the feature vector of the vocal target *F* = [*f*<sub>1</sub>, *f*<sub>2</sub>, ··· , *f*<sub>M</sub>] with each sample in the database *D* to find the adjacent RPs with the minimum matching error.

The Euclidean distance between target point *F* and cluster center *μ<sup>i</sup>* of cluster *C<sup>i</sup>* can be defined by:

$$Dis_{i} = \|F - \mu_{i}\|_{2}, \quad i = 1, 2, \dots, K. \tag{1}$$

The adjacent cluster can be chosen through:

$$C_{a} = C_{\arg\min_{i \in \{1, 2, \dots, K\}} Dis_{i}} \tag{2}$$

Then, as shown in Figure 3b, the adjacent RPs can be searched according to the Euclidean distance *dis*<sub>j</sub> between the target point and each sample of the adjacent cluster in feature space.

The distance can be defined as:

$$dis_{j} = \left\| F - F_{a}^{j} \right\|_{2}, \quad j = 1, 2, \dots, n_{c}. \tag{3}$$

where *F*<sub>a</sub><sup>j</sup> is the feature vector of the *j*th RP in the adjacent cluster *C*<sub>a</sub>, and *n*<sub>c</sub> denotes the total number of samples in *C*<sub>a</sub>. The adjacent RP set *D*<sub>a</sub> can be gathered by:

$$D_{a} = D_{a} \cup S_{\arg\min_{j \in \{1, 2, \dots, n_{c}\}} dis_{j}} \tag{4}$$

Because the complexity of the matching process is far greater than that of the other parts of the location process, and the computational complexity of those other parts is almost the same across methods, the computational complexity of the matching operations is investigated in this paper. Let *N* denote the total number of RPs in the database and *K* the number of clusters. The complexity of the linear matching process is *O*(*N*), while the average complexity of the matching process based on cluster analysis is *O*(*N*/*K*). Compared with the conventional matching method, the proposed approach can thus reduce the complexity of the matching process to 1/*K* of its original value.
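Equations (1)–(4) reduce to two nearest-neighbor searches: first over the K cluster centers, then only inside the winning cluster. A minimal sketch, with the data layout and names being our assumptions:

```python
import numpy as np

def two_stage_match(target, centers, clusters, n_adjacent=4):
    """Stage 1: pick the cluster whose center is closest to the target in
    feature space (Eqs. (1)-(2)). Stage 2: rank only that cluster's RPs by
    feature distance and keep the closest ones (Eqs. (3)-(4))."""
    a = int(np.argmin(np.linalg.norm(centers - target, axis=1)))
    order = np.argsort(np.linalg.norm(clusters[a] - target, axis=1))
    return a, order[:n_adjacent]      # adjacent cluster, adjacent RP indices

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
clusters = [np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]),   # RPs of cluster 0
            np.array([[9.0, 9.0], [10.0, 11.0]])]             # RPs of cluster 1
a, adj = two_stage_match(np.array([0.5, 0.5]), centers, clusters, n_adjacent=2)
```

Here only the 3 RPs of the winning cluster are ranked instead of all 5, which is exactly the O(N/K) saving described above.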

Firstly, as Figure 3a shows, the adjacent cluster is determined based on the Euclidean distance between the target point and each cluster center in feature space.

**Figure 3.** Two-stage RPs matching process: (**a**) the adjacent cluster matching process; and (**b**) the adjacent RPs matching process. The purple, orange, yellow, blue and green dots are the RPs clustered to different clusters marked as *c*1 to *c*5; the dark blue dots are the cluster centers marked as *μ*<sup>1</sup> to *μ*5; the red dot denotes the TP; *DIS* means the Euclidean distance between TP and each cluster center; and *dis* means the Euclidean distance between TP and each RP in adjacent cluster.

#### *3.3. Location Estimation Based on Iterative Interpolation*

To reduce the sampling effort, global interpolation methods are usually used to improve the positioning accuracy when only sparse sample points are collected. However, global interpolation usually results in a rapid expansion of the virtual RP quantity and cannot accurately reckon the quantity of virtual RPs required for satisfactory positioning accuracy. In this paper, we propose an iterative interpolation method to avoid unnecessary virtual RPs and further improve the location efficiency.

In the online positioning process, the virtual RPs are generated by the iterative interpolation method, as Figure 4 shows, where the iterative interpolation is based on the four adjacent RPs. In the interpolation process, the first-generation virtual RPs are defined as the previously selected adjacent RPs, *D*<sup>1</sup><sub>*v*</sub> = *D<sub>a</sub>*. During the iterative interpolation process, the elements of *D*<sup>*t*</sup><sub>*v*</sub> are refreshed by:

$$\begin{cases} \mathbf{S}_{n}^{t+1} = \omega_{n}^{t}\mathbf{S}_{n}^{t} + \omega_{n+1}^{t}\mathbf{S}_{n+1}^{t}, & n < N \\ \mathbf{S}_{n}^{t+1} = \omega_{n}^{t}\mathbf{S}_{n}^{t} + \omega_{1}^{t}\mathbf{S}_{1}^{t}, & n = N \end{cases} \tag{5}$$

where **S**<sup>*t*</sup><sub>*n*</sub> is the *n*th element of the *t*th generation *D*<sup>*t*</sup><sub>*v*</sub>, and *ω*<sup>*t*</sup><sub>*n*</sub> is the corresponding weight, calculated by the IDW method from the feature-space Euclidean distance between the current-generation virtual RPs and the target point to be located.

**Figure 4.** Virtual RPs generation process.

As Algorithm 2 shows, the online positioning process based on iteration continues until the iteration count exceeds the maximum value *T* or the difference between two successive estimates satisfies |*l* − *l*′| ≤ *ε*, where *ε* is the end threshold of the iterative process.


**Algorithm 2:** Online positioning based on iterative interpolation.

**Input:** adjacent RPs *D<sub>a</sub>*
**Output:** test point location estimate *l*
**1** iteration time *t* = 1;
**2** first-generation virtual RPs *D*<sup>1</sup><sub>*v*</sub> = *D<sub>a</sub>*;
**3** **while** *t* < *T* **and** |*l* − *l*′| > *ε* **do**
**4** &nbsp;&nbsp;*t* = *t* + 1;
**5** &nbsp;&nbsp;*l*′ = *l*;
**6** &nbsp;&nbsp;generate new-generation virtual RPs *D*<sup>*t*+1</sup><sub>*v*</sub> by Equation (5);
**7** &nbsp;&nbsp;compute the location estimate *l* with the WKNN algorithm;
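The loop above can be sketched in Python. This is a hedged illustration only: the array layout, the pairwise normalization of the Equation (5) weights, and all function names are our assumptions, not the authors' code.

```python
import numpy as np

def idw_weights(d, eps=1e-9):
    """Inverse-distance weights, normalized to sum to 1."""
    w = 1.0 / (d + eps)
    return w / w.sum()

def refresh(S, tp_feat):
    """One generation of Equation (5): blend each virtual RP with its circular
    successor (n = N wraps to 1), weighting the pair by inverse feature-space
    distance to the target point. S is (N, d_feat + 2): fingerprint columns
    followed by the (x, y) coordinates, so locations are interpolated too."""
    d = np.linalg.norm(S[:, :-2] - tp_feat, axis=1)
    S_next = np.roll(S, -1, axis=0)          # S_{n+1}, circular at n = N
    w, w_next = 1.0 / (d + 1e-9), 1.0 / (np.roll(d, -1) + 1e-9)
    a = (w / (w + w_next))[:, None]          # pairwise-normalized omega_n^t
    return a * S + (1.0 - a) * S_next

def wknn(S, tp_feat, k=4):
    """Weighted K-nearest-neighbour location estimate."""
    d = np.linalg.norm(S[:, :-2] - tp_feat, axis=1)
    idx = np.argsort(d)[:k]
    return idw_weights(d[idx]) @ S[idx, -2:]

def locate(D_a, tp_feat, T=10, eps=1e-4):
    """Algorithm 2: iterate Equation (5) until T generations or convergence."""
    S = np.asarray(D_a, dtype=float)         # D_v^1 = D_a
    l = wknn(S, tp_feat)
    for _ in range(T):
        l_prev = l
        S = refresh(S, tp_feat)              # Equation (5)
        l = wknn(S, tp_feat)
        if np.linalg.norm(l - l_prev) <= eps:
            break
    return l
```

Because each refresh is a convex combination of neighbouring RPs, the virtual RPs contract toward the region closest to the target in feature space, which is why only a handful of iterations are typically needed.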

#### **4. Experimental Validation**

To demonstrate the performance of the proposed acoustic localization approach based on cluster analysis and iterative interpolation, real-world experiments were carried out in a practical room. The room measures 9.64 × 7.04 × 2.95 m³; the ambient noise level is about 40 dB and the walls are not acoustically insulated. The scene and equipment of the experiments are shown in Figure 5. The target area is a rectangular plane about 6 m long and 5 m wide. The four-channel microphone array is composed of MPA201 microphones produced by BSWA Technology Co., Ltd., Beijing, China. The microphones are installed at the four vertices of the positioning area at a height of about 1.35 m above the floor. The acquisition card is an NI9215A from National Instruments, Austin, TX, USA. The sampling frequency is set to 100 kHz, and the sampling period is 1 s. The sound source is a Bluetooth speaker at the same height as the microphone array. A system-provided text tone called "Popcorn" on an iPhone 6 was selected as the localization sound signal.

**Figure 5.** The fingerprint-based acoustic localization system and experiment scene.

In the sampling process, the sample coordinates are uniformly distributed over the location area by grid division, with a distance of 0.593 m between adjacent samples. The total number of samples prepared for database construction is 72, and 13 test points are used for target point estimation.

#### *4.1. Analysis of the Two-Level RPs Matching Method*

According to the complexity analysis above, the more subsets the positioning database is partitioned into, the greater the online efficiency gain from the two-level RPs matching method. However, as with the coordinate space partition method, once the number of subsets reaches a certain value, the distinction between the sub-databases produced by the feature clustering partition method is no longer obvious. The adjacent RPs may then be divided into different sub-databases, which causes RP matching errors.

To investigate the effect of the division number on location accuracy, we explored the localization results with division numbers from 1 to 6, where 1 means matching without partition, that is, global linear matching localization.

As shown in Figure 6, when the division number increased from 2 to 4, positioning accuracy improved slightly compared with global linear matching. This is mainly because, according to the clustering results, outlier points with large measurement errors can be eliminated from each sub-database by an outlier test. However, when the number of sub-databases increased to 5, the localization results began to deteriorate significantly. When the division number reached 6, the average error exceeded 0.18 m, the maximum error reached 0.2780 m, and 61.5% of the test points failed to meet the 0.20 m positioning requirement.

**Figure 6.** The effect of sub-database number on positioning.

To compare the positioning performance of the coordinate space partition method and the feature clustering partition method, the average matching amount, the average matching time, and the average positioning error were examined for a division number of 4. As shown in Table 1, the two partition methods are basically the same in average matching amount and average matching time in the RPs matching process, and both greatly improve the online positioning efficiency compared with the global linear matching method: the feature clustering partition method reduces the average matching amount and the average matching time to 30.13% and 29.89% of those of global linear matching, respectively, while the coordinate space partition method reaches 30.97% and 30.13%. In terms of accuracy, the positioning error of 0.0813 m obtained with the feature clustering partition method is significantly better than the 0.1214 m obtained with the coordinate space partition method, and the positioning accuracy is improved by 13.97% compared with the traditional linear matching method.

**Table 1.** Comparison of the influence on positioning effect between database partition methods. A-amount, average matching amount; A-time, average matching time; A-error, average positioning error.


#### *4.2. Analysis of the Iterative Interpolation Method*

In this work, the four adjacent RPs selected by the global linear matching method are used for target location estimation. In the virtual RPs generation process, the maximum number of iterations is set to *T<sub>max</sub>* = 10. In addition, the iteration ends when the difference between the results of two adjacent interpolation positioning steps is less than *ε* = 0.0001 m.

The global positioning results were evaluated by examining the average and maximum errors over the 13 test points. As shown in Figure 7, the iterative interpolation method reduces the average error from 0.0945 m to 0.0406 m and the maximum error from 0.2290 m to 0.0818 m. The improvement in location accuracy is most pronounced in the first six iterations; as the iterative process continues and the accuracy improves, the gain gradually weakens. The same trend is observed for the maximum error.

**Figure 7.** Changes of the mean error and maximum error of the location estimation during the iterative interpolation process.

As Table 2 shows, six cases were considered to compare the positioning performance of different interpolation methods, with the maximum and average errors as evaluation indicators. For the iterative interpolation methods, V-RPs is defined as the average number of virtual RPs generated during the localization of the test points.


**Table 2.** The location results comparison of different fingerprinting acoustic localization methods. I-RPs, initial RPs; V-RPs, virtual RPs; M-error, maximum positioning error; A-error, average positioning error.

According to Table 2, in fingerprinting acoustic localization without interpolation, 72 RPs provide apparently higher accuracy than 20 RPs. This confirms that increasing the RP density directly improves the positioning accuracy.

In the cases using the global interpolation method, interpolation further improves the positioning accuracy. The initial RPs ratio also affects the location results: when the total number of RPs is the same, more initial RPs means better positioning accuracy, although the influence of the initial RPs ratio is weaker than that of the total number of RPs.

In the case of the iterative interpolation method, it is easy to see that, when the number of initial RPs is 20, it needs only 12.3% of the virtual RPs of the global interpolation method for similarly precise location results. When the number of initial RPs is 72, the iterative interpolation method needs only 10.5% of the virtual RPs of the global interpolation method, at a slightly lower positioning precision.

#### *4.3. Analysis of the Novel Method*

The fingerprinting acoustic localization approach based on iterative interpolation and cluster analysis is presented in this work. The positioning database, consisting of 72 initial RPs, is divided into four sub-databases by the K-Means clustering algorithm, and the four adjacent RPs selected by the two-stage matching method are used for the location estimation of the 13 test points based on iterative interpolation.

As Figure 8 shows, the estimated positions of all 13 target points are in good agreement with the true positions. Meanwhile, the interpolation process at most target points ended within five iterations. Take Test Point 3, for instance: its location error decreased throughout the iterative interpolation process, which ended at the seventh iteration.

To analyze the influence at different test points, the positioning accuracy of the novel method and the original method was compared at each test point. As Figure 9 shows, the novel method brought a significant improvement in positioning accuracy for 11 of the 13 test points. With the novel method, the errors of Test Points 3, 7, 9, 11, and 12 decreased by more than 50% compared with the original location method, while Test Points 2, 4, 5, 6, 10, and 13 were not sensitive to the interpolation process because they already had relatively high positioning accuracy. It must be pointed out that the location results of Test Points 1 and 8 worsened, which left the maximum error without apparent improvement. This is because these points are located at the boundary of two sub-databases, so their adjacent RPs were assigned to different clusters by the feature clustering partition method. The causes of location error are complex and varied; to further decrease it, improvements of the other links in the fingerprinting acoustic localization process are also needed.

**Figure 8.** The positioning results of the fingerprinting acoustic localization based on iterative interpolation method. The green, pink, blue and yellow dots are the RPs clustered to different clusters, red dots denote the test points, and the black dots are the estimation results of each interpolation.

**Figure 9.** The positioning results comparison of the acoustic localization without interpolation process and acoustic localization base on iterative interpolation process.

#### **5. Conclusions**

In this paper, an iterative interpolation method and a cluster analysis method have been presented to improve the positioning efficiency of indoor fingerprinting acoustic localization. In the fingerprinting acoustic localization process, the calibration effort in the offline phase can be reduced thanks to sparse sampling, while satisfactory positioning accuracy is guaranteed by the virtual RPs generated by the iterative interpolation method. Meanwhile, the K-Means cluster analysis method was adopted for database partition, and a two-level RPs matching method was used to speed up the online positioning phase. The results show that the fingerprinting acoustic localization method can achieve satisfactory accuracy with few initial RPs sampled in the offline phase and a faster RPs matching process in the online phase thanks to iterative interpolation and cluster analysis. As future work, an extension of the clustering method to reduce the deterioration of location results at frontier points, and various types of complex tasks for further verification of the novel method, are being considered.

**Author Contributions:** S.W. conceived and designed the experiments and wrote the paper. P.Y. and H.S. contributed to project research scheme formulation. All authors contributed to the final version.

**Funding:** This research was funded by the National Natural Science Foundation of China (No. 61373017), Natural Science Foundation of Hebei Province (No. F2014202121), and Graduate Student Innovation Funding Project of Hebei Province (No. 220056).

**Acknowledgments:** The authors thank all the reviewers and editors for their valuable comments and work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Spatial Information on Voice Generation from a Multi-Channel Electroglottograph**

#### **Lamberto Tronchin 1,\*, Malte Kob <sup>2</sup> and Claudio Guarnaccia <sup>3</sup>**


Received: 13 August 2018; Accepted: 3 September 2018; Published: 5 September 2018

**Abstract:** In the acoustics of the human voice, an important role is reserved for the study of larynx movements. One of the most important aspects of the physical behavior of the larynx is the proper description and simulation of swallowing and singing register changes, which require complex laryngeal *manoeuvres*. In order to describe (and, in some cases, solve) these actions, it is fundamental to analyze the accurate synchronization of vocal fold adduction/abduction and the change of the larynx position. In the case of dysfunction, which often occurs for professional singers, this synchronization can be disturbed. The simultaneous assessment of glottal dynamics (typically the electroglottograph, EGG, signal) and larynx position might be useful for the diagnosis of disordered voice and swallowing. Currently, it is very difficult to gather this information instantaneously because of technological problems. In this work, we implemented a time-multiplexed measurement approach of space-resolved transfer impedances through the larynx (Multi-Channel electroglottograph, MC-EGG). For this purpose, we developed specific software (a Labview code) for the visualization of the main waveforms in the study of the EGG signals. Moreover, the data acquired by the Labview code have been used to create a theoretical algorithm for deriving the position of the larynx inside the neck. Finally, we verified the results of the algorithm for the 3D larynx movement by comparing the acquired data with the values described in the literature. The paths of the larynx and the displacements on the sagittal and transverse planes matched the ones known for the emission of low/high notes and for swallowing. In addition, we introduced the possibility to study the movement on the coronal (x) plane (so far unexplored), which might be a starting point for further analysis.

**Keywords:** voice generation; multichannel electroglottograph; larynx acoustics

#### **1. Introduction**

The study of musical acoustics includes several aspects about the physics of musical instruments, and the main purpose consists of describing their sound [1], including the development of new physical parameters [2]. One of the most important applications of these studies is to emulate their sound by means of the proper description of their behavior, by means of convolution between the music piece played by the musician and impulse responses of the instrument [3]. However, sound production in humans is a complex process depending on different singing styles, which involves several anatomic structures [4]. This process is responsible for the generation of formant frequencies [5]. For these reasons, it is necessary to properly describe their movements, also including nonlinear aspects, in order to emulate nonlinearities using novel approaches [6,7]. Considering these aspects, it would be feasible to obtain a proper reconstruction of the diffuseness of musical signals for subjective evaluations [8].

Nevertheless, the interest in the description and modelling of the phonetic act includes researchers working in medicine and singing teaching. This interest has grown in the last few years and is continuing to grow even more.

Scientific studies of the human voice started with Helmholtz, who gave a detailed explanation of this phenomenon in 1863: the voice is produced by a steady flow of air from the lungs, segmented at the laryngeal level into a series of air puffs at a fundamental frequency (f0) that generates higher harmonics in the cavity of the upper airway. The supra-laryngeal cavity plays the role of a resonator, filtering only some frequencies, and finally the mouth and nose cavities modify the air flux, generating sound [9].

Mechanically, the phenomenon can be compared with the pressure exerted by a piston. The air pressure forces the vocal folds to open. As the suction produced by the drop in pressure in the region of the folds, plus static tissue forces, begins to counterbalance the subglottic pressure in the region of the lungs, the folds begin to move inward, and the narrowing channel increases the suction until the folds snap shut. Once the vocal fold cycle is completed, the folds return to the starting position.

Complex laryngeal manoeuvres occur during swallowing and singing register changes. These actions require an accurate synchronization of vocal fold adduction or abduction and the change of the larynx position. The simultaneous assessment of glottal dynamics and larynx position could be beneficial for several reasons: it might be an important instrument for the diagnosis of disordered voice or speech production and swallowing, it might be useful in the research of effective correlations between the control of the speech frequency f0 and the position of the larynx, and it can also be an instrument for the mechanic evaluation of singing techniques. Currently, the existing tools normally available do not allow this simultaneous assessment because of their features (e.g., the incompatibility between MRI and other electric devices) or low resolution (e.g., CT).

For the aforementioned reasons, there is interest in a device which might be capable of making both the measurements at the same time. This is the reason why a prototype of MC-EGG (Multi-Channel Electroglottograph) was realized. This new device differs from a standard EGG in that more electrodes are rapidly switched to give information about the larynx position inside the neck [10].

#### **2. Multi-Channel Electroglottograph**

There are several different devices that might be used for the evaluation of the glottal dynamic. One of the most important is the EGG, which was utilized in this research [11].

This device evaluates the TEC (Transverse Electrical Conductance) between two electrodes placed on the sides of the neck. The first electrode sends a low intensity-high frequency current stimulus that is received by the second electrode.

The typical EGG signal appears as in Figure 1; the maximum conductance is at the maximum contact point of vocal folds and the minimum is at the maximum opening point. A standard EGG has two electrodes (one sender and one receiver), while the MC-EGG uses two six-electrode arrays (Figures 2 and 3).

**Figure 1.** Phases of the idealized EGG waveform related to the vibratory cycle of the folds: 1: closing phase; 2: maximum contact; 3: opening phase; 4: open, no contact.

**Figure 2.** A standard EGG.

**Figure 3.** An example of MC-EGG, used for the experiments.

For each switch of an electrode in the transmitter array, all the electrodes of the receiving array are switched rapidly (every 25 ms). In this way, all 36 possible current paths inside the neck can be obtained [10]. In other words, an MC-EGG simultaneously provides much more information than a normal EGG, which increases the resolution with which the possible movements of the larynx can be tracked. Further information on the behavior of the MC-EGG can be found in [10].
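As an illustration only (the acquisition details below are our assumptions, not taken from [10]), unpacking the time-multiplexed MC-EGG stream into one column per sender/receiver current path could look like:

```python
import numpy as np

def demultiplex(samples, n_paths=36):
    """Reshape a time-multiplexed MC-EGG sample stream into one column per
    sender/receiver current path, assuming the 6 x 6 electrode pairs are
    scanned in a fixed repeating order; trailing partial scans are dropped."""
    s = np.asarray(samples, dtype=float)
    usable = len(s) // n_paths * n_paths
    return s[:usable].reshape(-1, n_paths)
```

Each row of the resulting matrix then corresponds to one complete scan of all 36 paths, matching the 36-column output matrix described in Section 3.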

#### **3. Methodologies**

In order to describe the laryngeal *manoeuvres*, it was very important to focus on acquiring, visualizing, and saving data on a computer from the MC-EGG. Moreover, an algorithm for the evaluation of the larynx position inside the neck has been developed.

For the acquisition, we used a DAQ 6035E at 38.5 kHz, and we developed a Labview (National Instruments) tool to interface the device with a laptop. This tool included a user interface (Front Panel) and a code interface (Block Diagram), following the numerical description of the phenomenon [12].

The Front Panel consisted of a macro-box with three folders (Figure 4): the first was used for the electrodes' positioning; the second, called "EGG", was developed to acquire data and to evaluate the larynx position inside the neck. The same box also includes a graph that shows the real-time dynamic of one (user-defined) channel, which represents the TEC variation in time (the typical EGG signal that evaluates the glottal dynamic).

**Figure 4.** Labview Main Panel for acquiring data: EGG page.

The second folder included another box that allowed the user to set the simulation time or to manually stop it.

The third (last) folder, called "Setting", enclosed all the settable parameters. The acquired-data matrix is also visualized in that folder. This might be saved as a text file, which is useful for a Matlab post-processing, in a spreadsheet (.xls) file, or both [13].

The output matrix contained 36 columns, each representing a possible current path inside the neck between a sender and a receiver electrode; the number of rows depends on the simulation time. The algorithm for the evaluation of the larynx position has been developed in a "light" version, in terms of computational cost, in order to work online with Labview. In Matlab, the algorithm is more complex and more precise because it can work offline. The EGG signal is an AM (Amplitude Modulated) signal; its value is larger when the current flows through the vocal folds' plane, and smaller when that plane is only partially crossed or not crossed at all. We approximated the field between two electrodes as a cylindrical shape and therefore used the information given by the EGG signal to obtain the distance between the axis of each cylinder and the vocal folds' plane.

When the distance between electrodes is known, we can calculate all the 36 possible cylinders representing the 36 current paths (Figure 5).

The mathematical equation that should be solved for each cylinder is:

$$\left[ (X - X_{i0})^2 + (Y - Y_{i0})^2 + (Z - Z_{i0})^2 \right] - \left[ (X - X_{i0})\,v_{ijX} + (Y - Y_{i0})\,v_{ijY} + (Z - Z_{i0})\,v_{ijZ} \right]^2 - R_{ij}^2 = C_{ij} \tag{1}$$

where i = 1, ..., 6 is the index of the sending electrodes; j = 1, ..., 6 that of the receiving ones; and X, Y, Z represent the larynx position coordinates, starting from position (*X*i0, *Y*i0, *Z*i0). They may potentially assume any value inside the volume mapped by the two electrode arrays.

In Formula (1), C represents the "cost function". For each candidate set of X, Y, Z we therefore obtain 36 cost values, and their sum over all 36 paths gives the global cost function. The most probable larynx position is the set of X, Y, Z that minimizes the global cost function.
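A minimal sketch of this grid search follows, under the assumption that the per-path costs are combined by absolute value before summation (the text does not specify this detail, but each C should be near zero at the true position):

```python
import numpy as np

def path_cost(grid, p0, v, R):
    """Equation (1) at every grid point: squared distance from the point to
    the cylinder axis (origin p0, unit direction v), minus R^2; the result
    is zero when the point lies exactly on the cylinder surface."""
    w = grid - p0                                  # (M, 3) offsets from axis origin
    return (w * w).sum(axis=1) - (w @ v) ** 2 - R ** 2

def locate_larynx(grid, paths):
    """Grid point minimizing the global cost summed over all current paths.
    grid  : (M, 3) candidate (X, Y, Z) positions (the finite-element grid)
    paths : iterable of (p0, v, R) cylinders, one per sender/receiver pair
    """
    total = sum(np.abs(path_cost(grid, p0, v, R)) for p0, v, R in paths)
    return grid[np.argmin(total)]
```

In the full problem the grid would first be pruned by the two logic trees described below, so that only a small set of candidate points has to be evaluated.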

**Figure 5.** Cylindrical field between two electrodes.

Since a volume contains infinitely many possible values of X, Y, and Z, it was necessary to discretize them into finite elements in each direction; otherwise, the problem could not be solved in a continuous medium.

In order to obtain a finite number of values for X, Y, and Z in a specific volume, we divided the global volume with three grids along the main axes, evaluating only the intersection points. Nevertheless, even in this case, reaching a good spatial resolution (i.e., millimeters) would have required an excessive number of points. Figure 6 reports the position of the MC-EGG on humans.

**Figure 6.** Application of MC-EGG on humans.

This problem has been solved by using the EGG values, acquired at the beginning of each measurement cycle, to restrict the number of possible values.

In order to reduce the number of tested points, we introduced two logic trees, one for the Y coordinates and one for the Z coordinates. Through logical operations on the 36 EGG signals, these trees exclude the zones of the mapped volume that cannot contain the larynx, thereby reducing the number of tested points. The full algorithm has been implemented in Matlab to obtain an accurate solution.

The lateral displacement of the larynx (along the X axis) has never been studied and there is no literature material about it; nevertheless, this algorithm gives the user the possibility to also set a displacement range on the X axis.

Since the Labview code worked online, implementing the whole algorithm there was not possible; for this reason, we built another, lighter algorithm that uses just the two logic trees to define a range of possible values on Y and Z and considers the midpoint of that range as the most probable larynx position.

#### **4. Comparison between Software and Experiments**

The newly developed software has been tested by studying the larynx displacement during two well-known vocal acts: the alternate emission of the vowel /**a**/, first with a low note and then with a high note; and swallowing.

In order to guarantee a correct synchronization between the physical (measured) signal and the acquisition (samples acquired), the acquisition chain in Labview should be set to read data in Finite Mode. The sampling rate that allows the best synchronization was estimated to be around 38.5 kHz: using this sampling rate, we could acquire 36 samples every 3.5 ms. It is also important to remember that EGG signals are normally studied during the emission of a low note characterized by a fundamental frequency of *f*<sub>0</sub> ≈ 100 Hz, and that during the alternate emission of a low and a high note, the larynx has a marked displacement, in the range between 18 and 22 mm [14].

Moreover, the high level of background noise caused some difficulties of accuracy during the acquisition of the experimental data, since the EGG signal has a magnitude of around 1 mV, which is comparable with the background noise. However, this background noise, which represents the main issue during these experiments with an EGG device, is often discussed in scientific literature [14,15].

In order to reduce the background noise, the first attempt consisted of using a proper contact gel to increase the EGG signal by improving the contact between the electrode and the skin surface. The second attempt consisted of using a higher voltage range, on the order of tens of mV.

The system was tested first with the Labview code and then with the Matlab code. As expected, Labview allowed us to visualize the larynx in a downward position during the emission of the low note and an upward displacement during the emission of the same note at a higher frequency. There is no back-forward displacement of the larynx during this phonetic act: the range of displacement was 0 mm on the sagittal plane and 18.2 mm on the vertical plane, as described in the scientific literature [15].

The data acquired by the Labview code was processed with the Matlab algorithm. In this way, the larynx movement was graphically visualized. The resulting movement was similar to the Labview one, and the resulting displacement was 0 mm for the sagittal plane and 18.4 mm for the vertical one. Besides the graphs, a video of the displacement was also obtained. Figure 7 reports some frames of the video.

**Figure 7.** Frames from the created Matlab movie for the larynx movement.

During swallowing, the path of the larynx inside the neck is more complex: the larynx first rises to push down the bolus, the epiglottis then moves backward to prevent the bolus from penetrating the respiratory airways and, once the bolus has passed, the larynx returns to its original position.

The results obtained from the evaluation made through Labview and Matlab confirmed this path [13]. We recorded a 19.7 mm vertical displacement (both in Labview and Matlab), while the sagittal movement was 16.65 mm with Labview and 16.75 mm in Matlab. All these values are inside the range described by scientific literature [15]. Figure 8 reports the swallowing displacement as elaborated by Matlab.

**Figure 8.** Trajectory of the larynx evaluated by the Matlab code for swallowing.

#### **5. Conclusions**

The purpose of this work was to develop a tool able to visualize the glottal dynamic and the displacement of the larynx inside the neck during phonetic acts. This was achieved by means of a fast EGG data acquisition, properly designed and configured, and by developing a specific algorithm to process the acquired data.

The Matlab code also allowed us to study the larynx displacement on the coronal plane. Currently, there is not enough knowledge about this kind of movement and this research could be a starting point for further analysis.

Moreover, there is the prospect of extending the numerical code to exploit a larger number of electrodes. In this way, it might be possible to study the behavior of the ventricular (or false) vocal folds. These are not vocal folds in the strict sense, because they are made of different tissues and do not display muscular activity. The false vocal folds are not normally used in phonation, but they can be used in some singing styles, and they take the place of the true vocal folds in some voice diseases. So far, the ventricular (false) vocal folds have been little investigated, but interest in them is growing. This research could contribute, for example, to detecting voice disorders in a non-invasive way.

**Author Contributions:** L.T., M.K. and C.G. contributed equally for writing original draft, control, review and editing, for setting up the experiments and the codes, for formal analysis and funding.

**Funding:** This research received no external funding.

**Acknowledgments:** The Authors wish to thank Andrea Casadei for having collaborated with the measurements.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Enhancing Target Speech Based on Nonlinear Soft Masking Using a Single Acoustic Vector Sensor**

#### **Yuexian Zou 1,\*, Zhaoyi Liu <sup>1</sup> and Christian H. Ritz <sup>2</sup>**


Received: 15 June 2018; Accepted: 25 July 2018; Published: 23 August 2018

**Abstract:** Enhancing speech captured by distant microphones is a challenging task. In this study, we investigate the multichannel signal properties of the single acoustic vector sensor (AVS) to obtain the inter-sensor data ratio (ISDR) model in the time-frequency (TF) domain. Then, the monotone functions describing the relationship between the ISDRs and the direction of arrival (DOA) of the target speaker are derived. For the target speech enhancement (SE) task, the DOA of the target speaker is given, and the ISDRs are calculated. Hence, the TF components dominated by the target speech are extracted with high probability using the established monotone functions, and then, a nonlinear soft mask of the target speech is generated. As a result, a masking-based speech enhancement method is developed, which is termed the AVS-SMASK method. Extensive experiments with simulated data and recorded data have been carried out to validate the effectiveness of our proposed AVS-SMASK method in terms of suppressing spatial speech interferences and reducing the adverse impact of the additive background noise while maintaining less speech distortion. Moreover, our AVS-SMASK method is computationally inexpensive, and the AVS is of a small physical size. These merits are favorable to many applications, such as robot auditory systems.

**Keywords:** Direction of Arrival (DOA); time-frequency (TF) mask; speech sparsity; speech enhancement (SE); acoustic vector sensor (AVS); intelligent service robot

#### **1. Introduction**

With the development of information technology, intelligent service robots will play an important role in smart home systems. Auditory perception is one of the key technologies of intelligent service robots [1]. Research has shown that special attention is currently being given to human–robot interaction [2], and to speech interaction in particular [3,4]. Service robots typically work in noisy environments, where directional spatial interferences may be present, such as competing speakers at different locations, air conditioners, and so on. As a result, additive background noise and spatial interferences significantly deteriorate the quality and intelligibility of the target speech, and speech enhancement (SE) is considered the most important preprocessing technique for speech applications such as automatic speech recognition [5].

Single-channel and two-channel SE techniques have been studied for a long time, but practical applications impose a number of constraints, such as limited physical space for installing large microphone arrays. The well-known single-channel SE methods, including spectral subtraction, Wiener filtering, and their variations, are successful at suppressing additive background noise, but they are not able to suppress spatial interferences effectively [6]. Besides, mask-based SE methods have been widely applied in many SE and speech separation applications [7]. The key idea behind mask-based SE methods is to estimate a spectrographic binary or soft mask to suppress the unwanted spectrogram components [7–11]. For binary mask-based SE methods, the spectrographic masks are "hard binary masks": a spectral component is either set to 1 for a target speech component or set to 0 for a non-target component. Experimental results have shown that the performance of binary mask SE methods degrades as the signal-to-noise ratio (SNR) decreases, and the harsh black-or-white binary decision may cause the loss of speech components [7,8]. To overcome this disadvantage, soft mask-based SE methods have been developed [8]. In soft mask-based SE methods, each time-frequency component is assigned a probability of belonging to the target speech. Compared to binary mask SE methods, soft-mask SE methods have shown a better capability to suppress noise with the aid of some prior information. However, the prior information may vary with time, and obtaining it is not an easy task.

By further analyzing the mask-based SE algorithms, we have the following observations. (1) It is a challenging task to estimate a good binary spectrographic mask. When noise and competing speakers (speech interferences) exist, the speech enhanced by the estimated mask often suffers from the phenomenon of "musical noise". (2) The direction of arrival (DOA) of the target speech is considered a known parameter for the target SE task. (3) A binaural microphone and an acoustic vector sensor (AVS) are considered the most attractive front ends for speech applications due to their small physical size. The AVS occupies only about 1–2 cm<sup>3</sup> and also has merits such as signal time alignment and a trigonometric relationship between signal amplitudes [12–16]. A high-resolution DOA estimation algorithm with a single AVS has been proposed by our team [12–16]. Some effort has also been made on the target SE task with one or two AVS sensors [17–21]. For example, with the minimum variance distortionless response (MVDR) criterion, Lockwood et al. developed a beamforming method using the AVS [17]. Their experimental results showed that their algorithm achieves good performance in suppressing noise, but introduces a certain distortion of the target speech.

As discussed above, in this study, we focus on developing the target speech enhancement algorithm with a single AVS from a new technical perspective in which both the ambient noise and non-target spatial speech interferences can be suppressed effectively and simultaneously. The problem formulation is presented in Section 2. Section 3 shows the derivation of the proposed SE algorithm. The experimental results are given in Section 4, and conclusions are drawn in Section 5.

#### **2. Problem Formulation**

In this section, the sparsity of speech in the time-frequency (TF) domain is discussed first. Then, the AVS data model and the corresponding inter-sensor data ratio (ISDR) models, developed by our team in previous work [13], are presented for completeness. After that, the derivation of the monotone functions between ISDRs and the DOA is given. Finally, the nonlinear soft TF mask estimation algorithm is derived.

#### *2.1. Time-Frequency Sparsity of Speech*

In the research of speech signal processing, the TF sparsity of speech is a widely accepted assumption. More specifically, when there is more than one speaker in the same spatial space, the speech TF sparsity implies the following [5]. (1) It is likely that only one speaker is active during certain time slots. (2) For the same time slot, if more than one speaker is active, it is probable that the different TF points are dominated by different speakers. Hence, the TF sparsity of speech can be modeled as:

$$S_m(\tau,\omega)\,S_n(\tau,\omega) = 0, \quad m \neq n \tag{1}$$

where *Sm*(*τ,ω*) and *Sn*(*τ,ω*) are the speech spectral at (*τ,ω*) for the *m*th speaker and *n*th speaker, respectively. (3) In practice, at a specific TF point (*τ,ω*), it is most probably true that only one speech source with the highest energy dominates, and the contributions from the other sources can be negligible.

#### *2.2. AVS Data Model*

An AVS unit generally consists of *J* co-located constituent sensors, including one omnidirectional sensor (denoted as the *o*-sensor) and *J*−1 orthogonally oriented directional sensors. Figure 1 shows the data capture setup with a single AVS. The left bottom plot in Figure 1 shows a 3D-AVS unit implemented by our team, which consists of one *o*-sensor with three orthogonally oriented directional sensors, depicted as the *u*-sensor, *v*-sensor, and *w*-sensor, respectively. In theory, the directional response of an oriented directional sensor has dipole characteristics, as shown in Figure 2a, while the omnidirectional sensor has the same response in all directions, as shown in Figure 2b. In this study, one target speaker is considered. As shown in Figure 1, the target speech *s*(*t*) impinges from (*θs*,*φs*); meanwhile, the interferences *si*(*t*) impinge from (*θi*,*φi*), where *φs*, *φi* ∈ (0°,360°) are the azimuth angles, and *θs*, *θi* ∈ (0°,180°) are the elevation angles.

**Figure 1.** Illustration of a single acoustic vector sensor (AVS) for data capturing.

**Figure 2.** (**a**) The directional response of oriented directional sensor; (**b**) The directional response of omnidirectional sensor.

For simplifying the derivation, without considering room reverberation, the received data of the AVS can be modeled as [13]:

$$\mathbf{x}_{\mathrm{avs}}(t) = \mathbf{a}(\theta_s,\phi_s)\,s(t) + \sum_{i=1}^{M_i} \mathbf{a}(\theta_i,\phi_i)\,s_i(t) + \mathbf{n}_{\mathrm{avs}}(t) \tag{2}$$

where *xavs*(*t*), *navs*(*t*), *a*(*θs*,*φs*), and *a*(*θi*,*φi*) are defined respectively as:

$$\mathbf{x}_{\mathrm{avs}}(t) = [x_u(t), x_v(t), x_w(t), x_o(t)]^T \tag{3}$$

$$\mathbf{n}_{\mathrm{avs}}(t) = [n_u(t), n_v(t), n_w(t), n_o(t)]^T \tag{4}$$

$$\mathbf{a}(\theta_s,\phi_s) = [u_s, v_s, w_s, 1]^T = [\sin\theta_s\cos\phi_s,\ \sin\theta_s\sin\phi_s,\ \cos\theta_s,\ 1]^T \tag{5}$$

$$\mathbf{a}(\theta_i,\phi_i) = [u_i, v_i, w_i, 1]^T = [\sin\theta_i\cos\phi_i,\ \sin\theta_i\sin\phi_i,\ \cos\theta_i,\ 1]^T \tag{6}$$

In Equation (3), *xu*(*t*), *xv*(*t*), *xw*(*t*), and *xo*(*t*) are the received data of the *u*-sensor, *v*-sensor, *w*-sensor, and *o*-sensor, respectively; *nu*(*t*), *nv*(*t*), *nw*(*t*), and *no*(*t*) are assumed to be the additive zero-mean white Gaussian noise captured at the *u*-sensor, *v*-sensor, *w*-sensor, and *o*-sensor, respectively; *s*(*t*) is the target speech; *si*(*t*) is the *i*th interfering speech; the number of interferences is *Mi*; *a*(*θs*,*φs*) and *a*(*θi*,*φi*) are the steering vectors of *s*(*t*) and *si*(*t*), respectively. [.]*<sup>T</sup>* denotes the vector/matrix transposition.
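As an illustration of the data model above, the steering vector of Equations (5) and (6) is a direct trigonometric computation; a minimal sketch in Python (NumPy), with the function name chosen here for illustration:

```python
import numpy as np

def steering_vector(theta, phi):
    """AVS steering vector a(theta, phi) of Eq. (5):
    [u, v, w, 1] = [sin(theta)cos(phi), sin(theta)sin(phi), cos(theta), 1],
    with theta the elevation and phi the azimuth (radians)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta),
                     1.0])
```

Note that the directional components always satisfy *us*² + *vs*² + *ws*² = 1, which is what later makes the spatial matched filter of Section 3.1 distortionless toward the target direction.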

From the AVS data model given in Equation (2), taking the short-time Fourier transform (STFT), for a specific TF point (*τ*,*ω*), we have:

$$\mathbf{X}_{\mathrm{avs}}(\tau,\omega) = \mathbf{a}(\theta_s,\phi_s)S(\tau,\omega) + \sum_{i=1}^{M_i} \mathbf{a}(\theta_i,\phi_i)S_i(\tau,\omega) + \mathbf{N}_{\mathrm{avs}}(\tau,\omega) \tag{7}$$

where *Xavs*(*τ,ω*)=[*Xu*(*τ,ω*), *Xv*(*τ,ω*), *Xw*(*τ,ω*), *Xo*(*τ,ω*)]*T*; *Xu*(*τ,ω*), *Xv*(*τ,ω*), *Xw*(*τ,ω*), and *Xo*(*τ,ω*) are the STFT of *xu*(*t*), *xv*(*t*), *xw*(*t*), and *xo*(*t*), respectively. Meanwhile, *Navs*(*τ,ω*)=[*Nu*(*τ,ω*), *Nv*(*τ,ω*), *Nw*(*τ,ω*), *No*(*τ,ω*)]*T*; *Nu*(*τ,ω*), *Nv*(*τ,ω*), *Nw*(*τ,ω*), and *No*(*τ,ω*) are the STFT of *nu*(*t*), *nv*(*t*), *nw*(*t*), and *no*(*t*), respectively. Since the target speech spectral is *S*(*τ,ω*), let us define a quantity as follows:

$$\mathbf{N}\_{total}(\tau,\omega) = \sum\_{i=1}^{M\_i} \mathbf{a}(\theta\_i, \phi\_i) \mathbf{S}\_i(\tau, \omega) + \mathbf{N}\_{\text{avs}}(\tau, \omega) \tag{8}$$

where we define *Ntotal*(*τ,ω*)=[*Ntu*(*τ,ω*), *Ntv*(*τ,ω*), *Ntw*(*τ,ω*), *Nto*(*τ,ω*)]*<sup>T</sup>* to represent the mixture of the interferences and additive noise. Therefore, from Equations (7) and (8), we have the following expressions:

$$X_u(\tau,\omega) = u_s S(\tau,\omega) + N_{tu}(\tau,\omega) \tag{9}$$

$$X\_v(\tau,\omega) = \upsilon\_s S(\tau,\omega) + N\_{tv}(\tau,\omega) \tag{10}$$

$$X\_w(\tau,\omega) = w\_s S(\tau,\omega) + N\_{tw}(\tau,\omega) \tag{11}$$

$$X\_o(\tau,\omega) = S(\tau,\omega) + \mathcal{N}\_{to}(\tau,\omega) \tag{12}$$

In this study, we make the following assumptions. (1) *s*(*t*) and *si*(*t*) are uncorrelated and are considered as far-field speech sources; (2) *nu*(*t*), *nv*(*t*), *nw*(*t*) and *no*(*t*) are uncorrelated. (3) The DOA of the target speaker is given as (*θs*,*φs*); the task of target speech enhancement is essentially to estimate *S*(*τ*,*ω*) from *Xavs*(*τ*,*ω*).

#### *2.3. Monotone Functions between ISDRs and the DOA*

Definition and some discussions on the inter-sensor data ratio (ISDR) of the AVS are presented in our previous work [13]. In this subsection, we briefly introduce the definition of ISDR first, and then present the derivation of the monotone functions between the ISDRs and the DOA of the target speaker.

The ISDRs between each channel of the AVS are defined as:

$$I_{ij}(\tau,\omega) = X_i(\tau,\omega)/X_j(\tau,\omega), \quad i \neq j \tag{13}$$

where *i* and *j* are channel indices referring to *u*, *v*, *w,* and *o*. Obviously, there are 12 different computable ISDRs, which are shown in Table 1. In the following, we carefully evaluate *Iij*, and it turns out that only three ISDRs (*Iuv*, *Ivu* and *Iwo*) hold an approximate monotone relationship with the DOA of the target speaker.

**Table 1.** Twelve computable inter-sensor data ratios (ISDRs).

| **Sensor** | *u* | *v* | *w* | *o* |
|:---:|:---:|:---:|:---:|:---:|
| *u* | NULL | *Ivu* | *Iwu* | *Iou* |
| *v* | *Iuv* | NULL | *Iwv* | *Iov* |
| *w* | *Iuw* | *Ivw* | NULL | *Iow* |
| *o* | *Iuo* | *Ivo* | *Iwo* | NULL |

According to the definition of ISDRs given in Equation (13), we look at *Iuv*, *Ivu* and *Iwo* first. Specifically, we have:

$$I\_{uv}(\tau,\omega) = X\_u(\tau,\omega) / X\_v(\tau,\omega) \tag{14}$$

$$I\_{vu}(\tau,\omega) = X\_v(\tau,\omega) / X\_u(\tau,\omega) \tag{15}$$

$$I_{wo}(\tau,\omega) = X_w(\tau,\omega)/X_o(\tau,\omega) \tag{16}$$

Substituting Equations (9) and (10) into Equation (14) gives:

$$I_{uv}(\tau,\omega) = \frac{u_s S(\tau,\omega) + N_{tu}(\tau,\omega)}{v_s S(\tau,\omega) + N_{tv}(\tau,\omega)} = \frac{u_s + N_{tu}(\tau,\omega)/S(\tau,\omega)}{v_s + N_{tv}(\tau,\omega)/S(\tau,\omega)} = \frac{u_s + \varepsilon_{tus}(\tau,\omega)}{v_s + \varepsilon_{tvs}(\tau,\omega)} \tag{17}$$

where *εtus*(*τ*,*ω*) = *Ntu*(*τ*,*ω*)/*S*(*τ*,*ω*), and *εtvs*(*τ*,*ω*) = *Ntv*(*τ*,*ω*)/*S*(*τ*,*ω*).

Similarly, we get *Ivu* and *Iwo*:

$$I_{vu}(\tau,\omega) = \frac{v_s S(\tau,\omega) + N_{tv}(\tau,\omega)}{u_s S(\tau,\omega) + N_{tu}(\tau,\omega)} = \frac{v_s + N_{tv}(\tau,\omega)/S(\tau,\omega)}{u_s + N_{tu}(\tau,\omega)/S(\tau,\omega)} = \frac{v_s + \varepsilon_{tvs}(\tau,\omega)}{u_s + \varepsilon_{tus}(\tau,\omega)} \tag{18}$$

$$I_{wo}(\tau,\omega) = \frac{w_s S(\tau,\omega) + N_{tw}(\tau,\omega)}{S(\tau,\omega) + N_{to}(\tau,\omega)} = \frac{w_s + N_{tw}(\tau,\omega)/S(\tau,\omega)}{1 + N_{to}(\tau,\omega)/S(\tau,\omega)} = \frac{w_s + \varepsilon_{tws}(\tau,\omega)}{1 + \varepsilon_{tos}(\tau,\omega)} \tag{19}$$

In Equation (19), *εtws*(*τ,ω*) = *Ntw*(*τ,ω*)/*S*(*τ,ω*) and *εtos*(*τ,ω*) = *Nto*(*τ,ω*)/*S* (*τ,ω*).

Based on the assumption of TF sparsity of speech shown in Section 2.1, we can see that if the TF points (*τ,ω*) are dominated by the target speech from (*θs,φs*), the energy of the target speech is high, and the value of *εtus*(*τ,ω*), *εtvs*(*τ,ω*), *εtws*(*τ,ω*) and *εtos*(*τ,ω*) tends to be small. Then, Equations (17)–(19) can be accordingly approximated as:

$$I_{uv}(\tau,\omega) \approx u_s/v_s + \varepsilon_1(\tau,\omega) \tag{20}$$

$$I_{vu}(\tau,\omega) \approx v_s/u_s + \varepsilon_2(\tau,\omega) \tag{21}$$

$$I_{wo}(\tau,\omega) \approx w_s + \varepsilon_3(\tau,\omega) \tag{22}$$

where *ε*1, *ε*2, and *ε*3 can be viewed as zero-mean ISDR modeling errors introduced by the interferences and background noise. Moreover, *εi*(*τ,ω*) (*i* = 1, 2, 3) is inversely proportional to the local SNR at (*τ,ω*).

Furthermore, from Equation (5), we have *us* = sin*θs*·cos*φs*, *vs* = sin*θs*·sin*φ<sup>s</sup>* and *ws* = cos*θs*. Then, substituting Equation (5) into Equations (20)–(22), we obtain the following equations:

$$I_{uv}(\tau,\omega) \approx \frac{\sin\theta_s\cos\phi_s}{\sin\theta_s\sin\phi_s} + \varepsilon_1(\tau,\omega) = \cot\phi_s + \varepsilon_1(\tau,\omega) \tag{23}$$

$$I_{vu}(\tau,\omega) \approx \frac{\sin\theta_s\sin\phi_s}{\sin\theta_s\cos\phi_s} + \varepsilon_2(\tau,\omega) = \tan\phi_s + \varepsilon_2(\tau,\omega) \tag{24}$$

$$I_{wo}(\tau,\omega) \approx w_s + \varepsilon_3(\tau,\omega) = \cos\theta_s + \varepsilon_3(\tau,\omega) \tag{25}$$

From Equations (23)–(25), we see that approximate monotone functions between *Iuv*, *Ivu*, and *Iwo* and the DOA (*θs* or *φs*) of the target speaker have been obtained, since arccot, arctan, and arccos are all monotone functions.
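The monotone relationships in Equations (23)–(25) are easy to verify numerically in the noise-free case, where the channel spectra reduce to Equations (9)–(12) without the noise terms. A small sketch (the angles and toy spectra are arbitrary examples):

```python
import numpy as np

# Example DOA and the resulting directional gains of Eq. (5).
theta_s, phi_s = np.deg2rad(60.0), np.deg2rad(30.0)
u_s = np.sin(theta_s) * np.cos(phi_s)
v_s = np.sin(theta_s) * np.sin(phi_s)
w_s = np.cos(theta_s)

# Toy target spectra at a few TF points; noise-free Eqs. (9)-(12).
S = np.array([1.0 + 0.5j, -0.3 + 0.2j, 0.8 - 1.1j])
X_u, X_v, X_w, X_o = u_s * S, v_s * S, w_s * S, S

I_uv = X_u / X_v   # Eq. (23): -> cot(phi_s) at every TF point
I_vu = X_v / X_u   # Eq. (24): -> tan(phi_s)
I_wo = X_w / X_o   # Eq. (25): -> cos(theta_s)
```

In the noise-free case the three ratios are constant across TF points and depend only on the DOA, which is exactly the property the mask estimation exploits.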

However, except for *Iuv*, *Ivu*, and *Iwo*, other ISDRs do not hold such a property. Let's take *Iuw* as an example. From the definition in Equation (13), we can get:

$$I_{uw}(\tau,\omega) = \frac{u_s S(\tau,\omega) + N_{tu}(\tau,\omega)}{w_s S(\tau,\omega) + N_{tw}(\tau,\omega)} = \frac{u_s + N_{tu}(\tau,\omega)/S(\tau,\omega)}{w_s + N_{tw}(\tau,\omega)/S(\tau,\omega)} = \frac{u_s + \varepsilon_{tus}(\tau,\omega)}{w_s + \varepsilon_{tws}(\tau,\omega)} \approx \frac{u_s}{w_s} + \varepsilon_4(\tau,\omega) \tag{26}$$

where *ε*<sup>4</sup> can be viewed as the ISDR modeling error with zero-mean introduced by unwanted noise. Obviously, Equation (26) is valid when *ws* is not equal to zero. Substituting Equation (5) into Equation (26) yields:

$$I_{uw}(\tau,\omega) \approx \frac{\sin\theta_s\cos\phi_s}{\cos\theta_s} + \varepsilon_4(\tau,\omega) = \tan\theta_s\cos\phi_s + \varepsilon_4(\tau,\omega) \tag{27}$$

From Equation (27), we can see that *Iuw* is a function of both *θ<sup>s</sup>* and *φs*.

In summary, after analyzing all of the ISDRs, we find that only the three ISDRs in Equations (23)–(25) hold the desired monotone relationship with *θs* or *φs*. It is noted that Equations (23)–(25) are derived under the assumption that *vs*, *us*, and *ws* are not equal to zero. Therefore, we need to find out where *vs*, *us*, and *ws* are equal to zero. For presentation clarity, let's define an ISDR vector *I*isdr = [*Iuv*, *Ivu*, *Iwo*].

From Equation (5), it is clear that when the target speaker is at angles of 0°, 90°, 180°, or 270°, one of *vs*, *us*, and *ws* becomes zero, which means that *I*isdr is not fully available. Specifically, we need to consider the following cases:

**Case 1**: the elevation angle *θs* is about 0° or 180°. In this case, *us* = sin*θs*·cos*φs* and *vs* = sin*θs*·sin*φs* are close to zero. Then, the denominators in Equations (20) and (21) tend to zero, so we cannot obtain *Iuv* and *Ivu*, but we can still get *Iwo*.

**Case 2**: *θs* is away from 0° or 180°. In this condition, we need to look at *φs* carefully.


To visualize the discussions above, a decision tree of handling the special angles in computing *I*isdr is plotted in Figure 3.

**Figure 3.** The decision tree of handling the special angles in computing *I*isdr.

When *I*isdr = [*Iuv*, *Ivu*, *Iwo*] has been computed properly, with simple manipulation from Equations (23)–(25), we get:

$$\phi_s(\tau,\omega) = \operatorname{arccot}(I_{uv}(\tau,\omega) - \varepsilon_1(\tau,\omega)) \tag{28}$$

$$\phi\_s(\tau,\omega) = \arctan(I\_{vu}(\tau,\omega) - \varepsilon\_2(\tau,\omega))\tag{29}$$

$$\theta\_s(\tau,\omega) = \arccos(I\_{\text{wo}}(\tau,\omega) - \varepsilon\_3(\tau,\omega))\tag{30}$$

From Equations (28)–(30), we can see that arccot, arctan, and arccos are all monotone functions, which is what we expected. Besides, we also note that (*θs*,*φs*) is given, and *Iuv*, *Ivu* and *Iwo* can be computed by Equations (14)–(16). However, *ε*1, *ε*2, and *ε*3 are unknown; they reflect the impact of noise and interferences. According to the assumptions made in Section 2.1, if we are able to select the TF points (*τ,ω*) dominated by the target speech, the local SNR at these points is high, and *ε*1, *ε*2, and *ε*3 can be ignored, since they approach zero at these (*τ,ω*) points. In such conditions, we obtain the desired formulas to compute (*θs*,*φs*):

$$\phi_s(\tau,\omega) \approx \operatorname{arccot}(I_{uv}(\tau,\omega)), \quad \phi_s(\tau,\omega) \approx \arctan(I_{vu}(\tau,\omega)) \quad \text{and} \quad \theta_s(\tau,\omega) \approx \arccos(I_{wo}(\tau,\omega)) \tag{31}$$
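Conversely, the noise-free ISDRs let Equation (31) recover the DOA. A sketch of this inverse step (the function name is illustrative; `np.arctan2` implements arccot here, which sidesteps part of the principal-value ambiguity discussed in the next subsection for azimuths below 180°):

```python
import numpy as np

def estimate_doa(I_uv, I_vu, I_wo):
    """DOA estimate from the three ISDRs, in the spirit of Eqs. (31)-(35):
    two azimuth estimates are averaged; the elevation comes from I_wo."""
    phi_1 = np.arctan2(1.0, np.real(I_uv))    # arccot(I_uv)
    phi_2 = np.arctan(np.real(I_vu))          # arctan(I_vu)
    phi_hat = 0.5 * (phi_1 + phi_2)           # mean of the two, cf. Eq. (34)
    # clip guards against |I_wo| slightly exceeding 1 due to noise
    theta_hat = np.arccos(np.clip(np.real(I_wo), -1.0, 1.0))
    return theta_hat, phi_hat
```

For example, feeding in the exact ratios cot 30°, tan 30°, and cos 60° returns the original (60°, 30°) DOA.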

#### *2.4. Nonlinear Soft Time-Frequency (TF) Mask Estimation*

As discussed above, Equation (31) is valid when the (*τ,ω*) points are dominated by target speech with a high local SNR. Besides, we have three equations to solve for the two variables *θs* and *φs*. In this study, from Equation (31), we estimate *θs* and *φs* in the following way:

$$\hat{\phi}_{s1}(\tau,\omega) = \operatorname{arccot} I_{uv}(\tau,\omega) + \Delta\eta_1 \tag{32}$$

$$\hat{\phi}_{s2}(\tau,\omega) = \arctan I_{vu}(\tau,\omega) + \Delta\eta_2 \tag{33}$$

$$\hat{\phi}_s(\tau,\omega) = \operatorname{mean}\big(\hat{\phi}_{s1}, \hat{\phi}_{s2}\big) \tag{34}$$

$$\hat{\theta}_s(\tau,\omega) = \arccos I_{wo}(\tau,\omega) + \Delta\eta_3 \tag{35}$$

where Δ*η*1, Δ*η*2, and Δ*η*3 are estimation errors. Comparing Equation (31) with Equations (32)–(35), we can see that if the estimated DOA values ($\hat{\phi}_s(\tau,\omega)$, $\hat{\theta}_s(\tau,\omega)$) approximate the real DOA values (*θs*,*φs*), then Δ*η*1, Δ*η*2, and Δ*η*3 should be small. Therefore, for the TF points (*τ,ω*) dominated by the target speech, we can derive the following inequalities:

$$\left|\hat{\phi}_s(\tau,\omega) - \phi_s\right| < \delta_1 \tag{36}$$

$$\left|\hat{\theta}_s(\tau,\omega) - \theta_s\right| \leqslant \delta_2 \tag{37}$$

where $\hat{\phi}_s(\tau,\omega)$ and $\hat{\theta}_s(\tau,\omega)$ are the target speaker's DOA estimated by Equations (34) and (35), respectively, and *θs* and *φs* are the given DOA of the target speech for the SE task. The parameters *δ*1 and *δ*2 are predefined permissible tolerances (referring to an angle value). Following the derivation up to now, if Equations (36) and (37) are met at a TF point (*τ,ω*), we can infer that this point is dominated by the target speech with high probability. Therefore, using Equations (36) and (37), such TF points can be extracted, and a mask associated with the (*τ,ω*) points dominated by the target speech can be designed accordingly. In addition, we need to take the following facts into account. (1) The value of *φs* belongs to (0,2*π*]. (2) The principal value interval of the arccot function is (0,*π*), and that of the arctan function is (−*π*/2,*π*/2). (3) The value range of *θs* is (0,*π*]. (4) The principal value interval of the arccos function is [0,*π*]. (5) To make the principal values of the inverse trigonometric functions match the values of *θs* and *φs*, we need to add *Lπ* to avoid ambiguity. As a result, a binary TF mask for preserving the target speech is designed as follows:

$$\text{mask}(\tau,\omega) = \begin{cases} 1, & \text{if } \Delta\phi(\tau,\omega) = \left|\hat{\phi}_s(\tau,\omega) - \phi_s + L\pi\right| < \delta_1 \ \text{and} \ \Delta\theta(\tau,\omega) = \left|\hat{\theta}_s(\tau,\omega) - \theta_s + L\pi\right| < \delta_2 \\ 0, & \text{else} \end{cases} \tag{38}$$

where *L* = 0, ± 1. (Δ*φ*(*τ,ω*), Δ*θ*(*τ,ω*)) is the estimation difference between the estimated DOA and the real DOA of the target speaker at TF point (*τ,ω*). Obviously, the smaller the value of (Δ*φ*(*τ,ω*), Δ*θ*(*τ,ω*)), the more probable it is that the TF point (*τ,ω*) is dominated by the target speech. To further improve the estimation accuracy and suppress the impact of the outliers, we propose a nonlinear soft TF mask as:

$$\text{mask}(\tau,\omega) = \begin{cases} \dfrac{1}{1 + e^{-\xi\left(1 - \left(\Delta\phi(\tau,\omega)/\delta_1 + \Delta\theta(\tau,\omega)/\delta_2\right)/2\right)}}, & \Delta\phi < \delta_1 \ \text{and} \ \Delta\theta < \delta_2 \\ \rho, & \text{else} \end{cases} \tag{39}$$

where *ξ* is a positive parameter and *ρ* (0 ≤ *ρ* < 1) is a small positive parameter tending to zero, which reflects the noise-suppression effect. The parameters *δ*1 and *δ*2 control the permissible degree of the estimation differences (Δ*φ*(*τ,ω*), Δ*θ*(*τ,ω*)). When the parameters *δ*1, *δ*2, and *ρ* become larger, the capability of suppressing noise and interferences degrades, and so does the probability that the retained (*τ,ω*) points are dominated by the target speech. Hence, selecting the values of *ρ*, *δ*1, and *δ*2 is important. In our study, these parameters were determined through experiments. Future work could focus on selecting these parameters based on models of human auditory perception. Finally, we emphasize that the mask designed in Equation (39) is able to suppress the adverse effects of the interferences and background noise while preserving the target speech.
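For one TF point, the nonlinear soft mask of Equation (39) is a logistic function of the normalized estimation differences. A sketch using the parameter values reported later in Section 4 (*δ*1 = *δ*2 = 25°, *ρ* = 0.07, *ξ* = 3) as defaults:

```python
import numpy as np

def soft_mask(d_phi, d_theta, delta1=np.deg2rad(25), delta2=np.deg2rad(25),
              xi=3.0, rho=0.07):
    """Nonlinear soft TF mask of Eq. (39); d_phi and d_theta are the
    absolute DOA estimation differences (radians) at one TF point."""
    if d_phi < delta1 and d_theta < delta2:
        # z = 1 for a perfect DOA match, decreasing toward 0 at the tolerance
        z = 1.0 - (d_phi / delta1 + d_theta / delta2) / 2.0
        return 1.0 / (1.0 + np.exp(-xi * z))   # logistic gain close to 1
    return rho                                  # strong attenuation otherwise
```

A perfect DOA match yields the maximum gain 1/(1 + e^(−ξ)), the gain decays smoothly as the estimation difference grows, and points outside the tolerance region are attenuated to *ρ*.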

#### **3. Proposed Target Speech Enhancement Method**

The diagram of the proposed speech enhancement method (termed AVS-SMASK) is shown in Figure 4; the processing is carried out in the time-frequency domain. The details of each block in Figure 4 are addressed in the following subsections.

**Figure 4.** Block diagram of our proposed AVS-SMASK algorithm (STFT: Short-Time Fourier Transform; FBF: a fixed beamformer; ISTFT: inverse STFT; y(n): enhanced target speech).

#### *3.1. The FBF Spatial Filter*

As shown in Figure 4, the input signals to the FBF spatial filter are the data captured by the *u*, *v*, and *w*-sensor of the AVS. With the given DOA (*θs*,*φs*), the spatial matched filter (SMF) is employed as the FBF spatial filter, and its output can be described as:

$$Y_m(\tau,\omega) = \mathbf{w}_m^H \mathbf{X}_{\mathrm{avs}}(\tau,\omega) \tag{40}$$

where $\mathbf{w}_m^H = \mathbf{a}^H(\theta_s,\phi_s)/\|\mathbf{a}(\theta_s,\phi_s)\|^2$ is the weight vector of the SMF, and **a**(*θs*,*φs*) is given in Equation (5). [.]*<sup>H</sup>* denotes the vector/matrix conjugate transposition. Substituting the expressions in Equations (3), (5), and (9)–(11) into Equation (40) yields:

$$\begin{aligned} Y_m(\tau,\omega) &= u_s X_u(\tau,\omega) + v_s X_v(\tau,\omega) + w_s X_w(\tau,\omega) \\ &= u_s^2 S(\tau,\omega) + u_s N_{tu}(\tau,\omega) + v_s^2 S(\tau,\omega) + v_s N_{tv}(\tau,\omega) + w_s^2 S(\tau,\omega) + w_s N_{tw}(\tau,\omega) \\ &= (u_s^2 + v_s^2 + w_s^2) S(\tau,\omega) + N_{tuvw}(\tau,\omega) \\ &= S(\tau,\omega) + N_{tuvw}(\tau,\omega) \end{aligned} \tag{41}$$

where *Ntuvw*(*τ,ω*) is the total noise component given as:

$$\begin{aligned} N_{tuvw}(\tau,\omega) &= u_s N_{tu}(\tau,\omega) + v_s N_{tv}(\tau,\omega) + w_s N_{tw}(\tau,\omega) \\ &= u_s\Big(\sum_{i=1}^{M_i} u_i S_i(\tau,\omega) + N_u(\tau,\omega)\Big) + v_s\Big(\sum_{i=1}^{M_i} v_i S_i(\tau,\omega) + N_v(\tau,\omega)\Big) + w_s\Big(\sum_{i=1}^{M_i} w_i S_i(\tau,\omega) + N_w(\tau,\omega)\Big) \\ &= \sum_{i=1}^{M_i}(u_s u_i + v_s v_i + w_s w_i) S_i(\tau,\omega) + u_s N_u(\tau,\omega) + v_s N_v(\tau,\omega) + w_s N_w(\tau,\omega) \end{aligned} \tag{42}$$

It can be seen that *Ntuvw*(*τ,ω*) in Equation (42) consists of the interferences and background noise captured by the directional sensors, while *Ym*(*τ,ω*) in Equation (41) is the mix of the desired speech source *S*(*τ,ω*) and the unwanted component *Ntuvw*(*τ,ω*).
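Following Equations (40) and (41), the directional part of the steering vector has unit norm, so the spatial matched filter is distortionless toward the target direction. A minimal sketch for one TF point (the function name is illustrative):

```python
import numpy as np

def smf_output(X_u, X_v, X_w, theta_s, phi_s):
    """Spatial matched filter output Y_m of Eq. (41) at one TF point.
    Since u_s^2 + v_s^2 + w_s^2 = 1, the weighted sum
    u_s*X_u + v_s*X_v + w_s*X_w passes the target speech undistorted."""
    u_s = np.sin(theta_s) * np.cos(phi_s)
    v_s = np.sin(theta_s) * np.sin(phi_s)
    w_s = np.cos(theta_s)
    return u_s * X_u + v_s * X_v + w_s * X_w
```

In the noise-free case, feeding in channel spectra generated from the target direction returns the target spectrum *S*(*τ,ω*) exactly, matching the last line of Equation (41).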

#### *3.2. Enhancing Target Speech Using Estimated Mask*

With the estimated mask in Equation (39) and the output of the FBF spatial filter *Ym*(*τ,ω*) in Equation (41), it is straightforward to compute the enhanced target speech as follows:

$$Y\_{\mathfrak{s}}(\tau,\omega) = Y\_{\mathfrak{m}}(\tau,\omega) \times mask(\tau,\omega) \tag{43}$$

where *Ys*(*τ,ω*) is then the spectra of the enhanced speech or an approximation of the target speech.

For presentation completeness, our proposed speech enhancement algorithm, termed the AVS-SMASK algorithm, is summarized in Table 2.

**Table 2.** The pseudo-code of our proposed AVS-SMASK algorithm.


<sup>(1)</sup> Segment the output data captured by the *u*-sensor, *v*-sensor, *w*-sensor, and *o*-sensor of the AVS unit by the N-length Hamming window;
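The subsequent steps follow Figure 4: STFT analysis, per-TF-point DOA estimation from the ISDRs, soft-mask estimation, the FBF spatial filter, masking, and ISTFT synthesis. A compact end-to-end sketch under simplifying assumptions (non-overlapping frames instead of 50% overlap-add; `np.arctan2` in place of arccot; the function name is illustrative):

```python
import numpy as np

def avs_smask(x_u, x_v, x_w, x_o, theta_s, phi_s, n_fft=1024,
              delta=np.deg2rad(25), xi=3.0, rho=0.07):
    """Sketch of the AVS-SMASK pipeline (Figure 4). Frames are
    non-overlapping Hamming-windowed blocks for brevity."""
    win = np.hamming(n_fft)
    frame = lambda x: np.fft.rfft(
        x[:len(x) // n_fft * n_fft].reshape(-1, n_fft) * win, axis=1)
    X_u, X_v, X_w, X_o = map(frame, (x_u, x_v, x_w, x_o))

    # DOA estimates per TF point from the ISDRs, Eqs. (32)-(35).
    phi_1 = np.arctan2(1.0, np.real(X_u / X_v))      # arccot(I_uv)
    phi_2 = np.arctan(np.real(X_v / X_u))            # arctan(I_vu)
    phi_hat = 0.5 * (phi_1 + phi_2)
    theta_hat = np.arccos(np.clip(np.real(X_w / X_o), -1.0, 1.0))

    # Nonlinear soft mask, Eq. (39).
    d_phi, d_theta = np.abs(phi_hat - phi_s), np.abs(theta_hat - theta_s)
    in_range = (d_phi < delta) & (d_theta < delta)
    z = 1.0 - (d_phi / delta + d_theta / delta) / 2.0
    mask = np.where(in_range, 1.0 / (1.0 + np.exp(-xi * z)), rho)

    # FBF spatial filter (Eq. (41)), masking (Eq. (43)), and ISTFT.
    u_s = np.sin(theta_s) * np.cos(phi_s)
    v_s = np.sin(theta_s) * np.sin(phi_s)
    w_s = np.cos(theta_s)
    Y_s = (u_s * X_u + v_s * X_v + w_s * X_w) * mask
    return np.fft.irfft(Y_s, n_fft, axis=1).ravel()
```

On noise-free channels generated from the target direction, every TF point passes the DOA test, so the output is simply the windowed target signal scaled by the maximum mask gain 1/(1 + e^(−ξ)).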

#### **4. Experiments and Results**

The performance evaluation of our proposed AVS-SMASK algorithm has been carried out with simulated data and recorded data. Five commonly used performance metrics—SNR, the signal-to-interference ratio (SIR), the signal-to-interference plus noise ratio (SINR), log spectral deviation (LSD), and the perceptual evaluation of speech quality (PESQ)—have been adopted. The definitions are given as follows for presentation completeness.

(1) Signal-to-Noise Ratio (SNR):

$$SNR = 10\log\left(\left\|s(t)\right\|^2/\left\|n(t)\right\|^2\right) \tag{44}$$

(2) Signal-to-Interference Ratio (SIR)

$$SIR = 10\log\left(\left\|s(t)\right\|^2/\left\|s\_i(t)\right\|^2\right) \tag{45}$$

(3) Signal-to-Interference plus Noise Ratio (SINR):

$$SINR = 10\log\left(||s(t)||^2/||x(t) - s(t)||^2\right) \tag{46}$$

where *s*(*t*) is the target speech, *n*(*t*) is the additive noise, *si*(*t*) is the *i*th interference, and *x*(*t*) = *s*(*t*) + *si*(*t*) + *n*(*t*) is the received signal of the *o*-sensor. The metrics are calculated by averaging over frames to get more accurate measurements [22].

(4) Log Spectral Deviation (LSD), which is used to measure the speech distortion [22]:

$$LSD = \left\| \ln \left( \psi\_{ss}(f) / \psi\_{yy}(f) \right) \right\|\tag{47}$$

where *ψss*(*f*) is the power spectral density (PSD) of the target speech, and *ψ*yy(*f*) is the PSD of the enhanced speech. It is clear that smaller LSD values indicate less speech distortion.

(5) Perceptual Evaluation of Speech Quality (PESQ). To evaluate the perceptual enhancement performance of the speech enhancement algorithms, the ITU-PESQ software [23] is utilized.
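Assuming the base-10 logarithm in Equations (44)–(46) (as is standard for dB values) and a Euclidean norm in Equation (47) (the equation does not fix the norm), the SINR and LSD metrics can be sketched as:

```python
import numpy as np

def sinr_db(s, x):
    """SINR of Eq. (46): target energy over everything else in the
    received o-sensor signal x(t) = s(t) + interference + noise."""
    return 10.0 * np.log10(np.sum(s**2) / np.sum((x - s)**2))

def lsd(psd_target, psd_enhanced):
    """Log spectral deviation of Eq. (47): norm of the log PSD ratio;
    identical spectra give 0, i.e. no speech distortion."""
    return np.linalg.norm(np.log(psd_target / psd_enhanced))
```

For example, a residual (interference plus noise) at one tenth of the target amplitude yields an SINR of 20 dB, and an enhanced PSD equal to the target PSD yields an LSD of exactly zero.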

In this study, the performance comparison is carried out against the AVS-FMV algorithm [17] under the same conditions. We do not take other SE methods into account since they use different transducers for signal acquisition. One set of waveform examples used in our experiments is shown in Figure 5, where *s*(*t*) is the target speech, *si*(*t*) is the *i*th interference speech, *n*(*t*) is the additive noise, and *y*(*t*) is the enhanced speech.

**Figure 5.** Waveform examples: *s*(*t*) is the target speech, *si*(*t*) is the interference speech, *n*(*t*) is the additive noise, and *y*(*t*) is the enhanced speech signal.

#### *4.1. Experiments on Simulated Data*

In this section, three experiments have been carried out. Simulated data of about five seconds' duration were generated, where the target speech *s*(*t*) is male speech, and the two speech interferences *si*(*t*) are male and female speech, respectively. Moreover, the AURORA2 database [24] was used, which includes subway, babble, car, exhibition noise, etc. Without loss of generality, all of the speech sources are placed one meter away from the AVS.

#### 4.1.1. Experiment 1: The Output SINR Performance under Different Noise Conditions

In this experiment, we have carried out 12 trials (numbered as trial 1 to trial 12) to evaluate the performance of the algorithms under different spatial and additive noise conditions following the experimental protocols in Ref. [25]. The details are given below:

(1) The DOAs of target speech, the first speech interference (male speech) and the second speech interference (female speech) are at (*θs,φs*) = (45◦,45◦), (*θ*1*,φ*1) = (90◦,135◦), and (*θ*2*,φ*2) = (45◦,120◦), respectively. The background noise is chosen as babble noise *n*(*t*);

(2) We evaluate the performance under three different conditions: (a) there exists only additive background noise: *n*(*t*) ≠ 0 and *si*(*t*) = 0; (b) there exist only speech interferences: *n*(*t*) = 0 and *si*(*t*) ≠ 0; (c) there exist both background noise and speech interferences: *n*(*t*) ≠ 0 and *si*(*t*) ≠ 0;

(3) The input SINR (denoted as SINR-input) is set as −5 dB, 0 dB, 5 dB, and 10 dB, respectively. Following the setting above, 12 different datasets are generated for this experiment.
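Generating a dataset at a prescribed input SINR amounts to scaling the interference-plus-noise term against the target. The helper below is an illustrative sketch, not code from the paper; `mix_at_sinr` and its gain formula are assumptions.

```python
import numpy as np

def mix_at_sinr(target, interference, noise, sinr_db):
    """Scale the interference-plus-noise term so that the mixture has the
    requested input SINR (target power over interference-plus-noise power).
    Illustrative helper only."""
    disturbance = interference + noise
    p_s = np.mean(target ** 2)
    p_d = np.mean(disturbance ** 2)
    gain = np.sqrt(p_s / (p_d * 10.0 ** (sinr_db / 10.0)))
    return target + gain * disturbance

rng = np.random.default_rng(1)
s, i, n = rng.standard_normal((3, 16000))
mixtures = {db: mix_at_sinr(s, i, n, db) for db in (-5, 0, 5, 10)}
```

Sweeping the four SINR-input settings (−5, 0, 5, and 10 dB) over the three noise conditions yields the 12 trial datasets described above.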

In addition, the parameters of the algorithms are set as follows. (1) The sampling rate is 16 kHz; a 1024-point FFT (Fast Fourier Transform) and a 1024-point Hamming window with 50% overlap are used. (2) For our proposed AVS-SMASK algorithm, we set *δ*1 = *δ*2 = 25◦, *ρ* = 0.07, and *ξ* = 3. (3) For the comparison algorithm AVS-FMV, F = 32 and M = 1.001, following Ref. [17]. The experimental results are given in Table 3.
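The analysis front end implied by setting (1) can be sketched as a simple STFT framer. This is a minimal sketch of the stated settings (16 kHz, 1024-point FFT, 1024-point Hamming window, 50% overlap), not the authors' implementation.

```python
import numpy as np

FS, NFFT = 16000, 1024
HOP = NFFT // 2                        # 50% overlap
WIN = np.hamming(NFFT)                 # 1024-point Hamming window

def stft(x):
    """Frame the signal, window each frame, and take a 1024-point FFT."""
    n_frames = 1 + (len(x) - NFFT) // HOP
    frames = np.stack([x[k * HOP : k * HOP + NFFT] * WIN
                       for k in range(n_frames)])
    return np.fft.rfft(frames, n=NFFT, axis=1)

x = np.random.default_rng(2).standard_normal(FS)   # one second of noise
X = stft(x)
print(X.shape)                         # (30, 513)
```

One second of audio yields 30 half-overlapped frames of 513 one-sided frequency bins, the time-frequency grid on which the soft mask operates.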


**Table 3.** Output signal-to-interference plus noise ratio (SINR) under different noise conditions.

As shown in Table 3, for all of the noise conditions (Trial 1 to Trial 12), our proposed AVS-SMASK algorithm outperforms AVS-FMV [17], giving improvements of about 3.26 dB, 4.14 dB, and 2.25 dB over AVS-FMV under the three experimental settings, respectively. We can conclude that our proposed AVS-SMASK is effective in suppressing the spatial interferences and background noise.

#### 4.1.2. Experiment 2: The Performance versus Angle Difference

This experiment evaluates the performance of the SE methods versus the angle difference between the target and interference speakers. Let us define the angle differences as Δ*φ* = *φ<sup>s</sup>* − *φ<sup>i</sup>* and Δ*θ* = *θ<sup>s</sup>* − *θ<sup>i</sup>* (here, the subscripts *s* and *i* refer to the target speaker and the interference speaker, respectively). Obviously, the closer the interference speaker is to the target speaker, the more limited the speech enhancement becomes. The experimental settings are as follows. (1) PESQ and LSD are used as metrics. (2) The parameters of the algorithms are set the same as those used in *Experiment 1*. (3) Without loss of generality, the SIR-input is set to 0 dB, while the SNR-input is set to 10 dB. (4) We consider two cases.


**Figure 6.** (Experiment 2) The performance versus Δ*φ*. (**a**) Perceptual evaluation of speech quality (PESQ) results and (**b**) Log Spectral Deviation (LSD) results (Case 1: *φs* of the target speaker changes from 0◦ to 180◦).

In summary, from the experimental results, it is clear that our proposed AVS-SMASK algorithm is able to enhance the target speech and suppress the interferences when the angle difference between the target speaker and the interference is larger than 20◦.

**Figure 7.** (Experiment 2) The performance versus Δ*θ*. (**a**) PESQ results and (**b**) LSD results (Case 2: *θs* of the target speaker changes from 0◦ to 160◦).

#### 4.1.3. Experiment 3: The Performance versus DOA Mismatch

In practice, the DOA estimation of the target speaker may be inaccurate, or the target speaker may make a small movement, which causes the DOA mismatch problem. Hence, this experiment evaluates the impact of DOA mismatch on the performance of our proposed speech enhancement algorithm. The experimental settings are as follows. (1) The parameters of the algorithms are set the same as in *Experiment 1*. (2) (*θs,φs*) = (45◦,45◦) and (*θ*1*,φ*1) = (90◦,135◦). (3) The SIR-input is set to 0 dB, while the SNR-input is set to 10 dB; the performance measurement metrics are chosen as SINR and LSD. (4) We consider two cases:

Case 1: Only *φ<sup>s</sup>* is mismatched, and the mismatch (*∂φs*) ranges from 0◦ to 30◦ with 5◦ increments. Case 2: Only *θ<sup>s</sup>* is mismatched, and the mismatch (*∂θs*) ranges from 0◦ to 30◦ with 5◦ increments.

Experimental results are given in Figures 8 and 9 for Case 1 and Case 2, respectively. From these results, we can clearly see that when the DOA mismatch is less than 20◦, our proposed AVS-SMASK algorithm is not sensitive to DOA mismatch. Besides, our AVS-SMASK algorithm outperforms the AVS-FMV algorithm under all of the conditions. However, when the DOA mismatch is larger than 20◦, the performance of our proposed AVS-SMASK algorithm drops significantly. Fortunately, it is easy to achieve 20◦ DOA estimation accuracy.

**Figure 8.** (Experiment 3) The performance versus the *∂φs*. (**a**) SINR results and (**b**) LSD results (Case 1).

**Figure 9.** (Experiment 3, Case 2) The performance versus the *∂θs*. (**a**) SINR results and (**b**) LSD results (Case 2).

#### *4.2. Experiments on Recorded Data in an Anechoic Chamber*

In this section, two experiments were carried out with recorded data captured by an AVS in an anechoic chamber [25]. Each recording lasts about six seconds and was made with the target speech source and the interference source broadcasting at the same time, along with the background noise, as shown in Figure 1. The speech sources, taken from the Institute of Electrical and Electronic Engineers (IEEE) speech corpus [26], were placed in front of the AVS at a distance of one meter; the SIR-input was set to 0 dB and the SNR-input to 10 dB. The sampling rate was 48 kHz, and the recordings were then down-sampled to 16 kHz for processing.
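Down-sampling the 48 kHz recordings to 16 kHz requires an anti-aliasing low-pass before decimation by 3. The windowed-sinc filter below is an assumed, minimal sketch; the paper does not specify the filter the authors used.

```python
import numpy as np

def downsample_48k_to_16k(x, ntaps=121):
    """48 kHz -> 16 kHz: windowed-sinc low-pass at 8 kHz, then keep every
    third sample. The filter design here is an illustrative assumption."""
    fc = 1.0 / 6.0                              # 8 kHz / 48 kHz, cycles per sample
    n = np.arange(ntaps) - (ntaps - 1) / 2.0
    h = 2.0 * fc * np.sinc(2.0 * fc * n) * np.hamming(ntaps)
    h /= h.sum()                                # unity gain at DC
    y = np.convolve(x, h, mode="same")          # anti-aliasing low-pass
    return y[::3]                               # decimate by 3

x = np.ones(48000)                              # one second at 48 kHz
y = downsample_48k_to_16k(x)
print(len(y))                                   # 16000
```

A constant (DC) input passes through unchanged in the interior of the signal, confirming the unity-gain normalisation.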

#### 4.2.1. Experiment 4: The Performance versus Angle Difference with Recorded Data

In this experiment, the performance of our proposed method has been evaluated versus the angle difference between the target and interference speakers (Δ*φ* = *φ<sup>s</sup>* − *φ<sup>i</sup>* and Δ*θ* = *θ<sup>s</sup>* − *θ<sup>i</sup>*). The experimental settings are as follows. (1) PESQ is taken as the performance measurement metric. (2) The parameters of the algorithms are set the same as in *Experiment 1*. (3) Owing to page limitations, we only consider varying the azimuth angle *φ<sup>s</sup>* while *θ<sup>s</sup>* = 90◦. The interfering speaker *s*1(*t*) is at (*θ*1*,φ*1) = (90◦,45◦), and *φ<sup>s</sup>* varies from 0◦ to 180◦ with 20◦ increments, giving 13 recorded datasets. The experimental results are shown in Figure 10, where the *x*-axis represents the azimuth angle *φs*. It is clear that the overall performance of our proposed AVS-SMASK algorithm is superior to that of the comparison algorithm. Specifically, when *φ<sup>s</sup>* approaches *φ*<sup>1</sup> = 45◦, the PESQ degrades quickly for both algorithms. When the angle difference Δ*φ* is larger than 30◦ (*φ<sup>s</sup>* smaller than 15◦ or larger than 75◦), the PESQ of our proposed AVS-SMASK algorithm goes up quickly and is not sensitive to the angle difference.

**Figure 10.** (Experiment 4) The performance versus *φs*. (**a**) PESQ results and (**b**) LSD results.

#### 4.2.2. Experiment 5: Performance versus DOA Mismatch with Recorded Data

This experiment is carried out to evaluate the performance of the speech enhancement algorithms when there are DOA mismatches. The experimental settings are as follows. (1) PESQ and LSD are taken as the performance measurement metrics. (2) The parameters of the algorithms are set the same as those of *Experiment 1*. (3) The target speaker is at (*θs,φs*) = (45◦,45◦), and the interference speaker is at (*θ*1*,φ*1) = (90◦,135◦). The azimuth angle *φ<sup>s</sup>* is assumed to be mismatched; we consider a mismatch of *φ<sup>s</sup>* (denoted as *∂φs*) varying from 0◦ to 30◦ with 5◦ increments. The experimental results are shown in Figure 11, where the x-axis is the mismatch *∂φs* of the azimuth angle. It is noted that our proposed AVS-SMASK is superior to the compared algorithm under all conditions, and it is clear that our proposed algorithm is not sensitive to DOA mismatch when the mismatch is smaller than 23◦.

**Figure 11.** (Experiment 5) The performance versus the *φs* mismatch *∂φs*. (**a**) PESQ results and (**b**) LSD results.

These results encourage us to conclude that our proposed algorithm will offer good speech enhancement performance in practical applications, even when the DOA may not be accurately estimated.

#### **5. Conclusions**

In this paper, aiming at the hearing technology of service robots, a novel target speech enhancement method using a single AVS has been proposed to suppress multiple spatial interferences and additive background noise simultaneously. By exploiting the AVS signal model and its inter-sensor data ratio (ISDR) model, monotone functions relating the ISDR to the DOA of the target speaker are derived. Accordingly, a nonlinear soft mask has been designed by making use of speech time-frequency (TF) sparsity with the known DOA of the target speaker. As a result, a single-AVS-based speech enhancement method (named AVS-SMASK) has been formulated and evaluated. Compared with the existing AVS-FMV algorithm, extensive experimental results on simulated and recorded data validate the effectiveness of our AVS-SMASK algorithm in suppressing spatial interferences and additive background noise. It is encouraging to see that our AVS-SMASK algorithm also maintains less speech distortion. Due to page limitations, we did not show the derivation of the algorithm under reverberation; the signal model and ISDR model under reverberant conditions will be presented in our paper [27]. Our preliminary experimental results show that the PESQ of our proposed AVS-SMASK degrades gradually as the room reverberation becomes stronger (RT60 > 400 ms), whereas the LSD is not sensitive to room reverberation. It may be argued that learning-based SE methods achieve the state of the art; in terms of SNR, PESQ, and LSD, this is true. However, learning-based SE methods require large amounts of training data, a much larger memory size, and a high computational cost. In contrast, the application scenarios of this research differ from those of learning-based SE methods, and our solution is more suitable for low-cost embedded systems. A real demonstration system was established in our lab, and the trials conducted with it further confirmed the effectiveness of our method when room reverberation is moderate (RT60 < 400 ms). We are confident that, with only four sensor channels and without any additional training data, the subjective and objective performance of our proposed AVS-SMASK is impressive. Our future work will investigate deep learning-based SE with a single AVS to improve its generalisation and its capability to handle different noise and interference conditions.

**Author Contributions:** Original draft preparation and writing, Y.Z. and Z.L.; review and editing, C.H.R. Y.Z. and Z.L. carried out the studies of DOA estimation and speech enhancement with the Acoustic Vector Sensor (AVS), participated in algorithm development, carried out the experiments, and drafted the manuscript. C.H.R. contributed to the design of the experiments, analysed the experimental results, and helped to review and edit the manuscript. All authors read and approved the final manuscript.

**Funding:** This research was funded by National Natural Science Foundation of China (No: 61271309), Shenzhen Key Lab for Intelligent MM and VR (ZDSYS201703031405467) and the Shenzhen Science & Technology Fundamental Research Program (JCYJ20170817160058246).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **The Accuracy of Predicted Acoustical Parameters in Ancient Open-Air Theatres: A Case Study in Syracusae**

**Elena Bo 1,\*, Louena Shtrepi <sup>1</sup>, David Pelegrín Garcia 2, Giulio Barbato 3, Francesco Aletta <sup>4</sup> and Arianna Astolfi <sup>1</sup>**


Received: 20 June 2018; Accepted: 10 August 2018; Published: 17 August 2018

**Featured Application: The work aims to give more insights into the relation between the sensitivity of the simulated objective parameters and the software input parameters for open-air ancient theatres. It is meant to raise awareness on the use of predictive acoustic software for unconventional outdoor environments in order to validate the possibility of re-using them as performance spaces.**

**Abstract:** Nowadays, ancient open-air theatres are often re-adapted as performance spaces for the additional historical value they can offer to the spectators' experience. Therefore, there has been increasing interest in the modelling and simulation of the acoustics of such spaces. These open-air performance facilities pose several methodological challenges to researchers and practitioners when it comes to precisely measuring and predicting acoustical parameters. Therefore, this work investigates the accuracy of predicted acoustical parameters, that is, the Reverberation Time (T20), Clarity (C80), and Sound Strength (G), taking the ancient Syracusae open-air theatre in Italy as a case study. These parameters were derived from both measured and simulated Impulse Responses (IR). The accuracy of the acoustic parameters predicted with two different types of acoustic software, due to the input variability of the absorption and scattering coefficients, was assessed. All simulated and measured parameters were in good agreement, within the range of one "just noticeable difference" (JND), for the tested coefficient combinations.

**Keywords:** open-air theatres; acoustical measurements; prediction models; historical acoustics

#### **1. Introduction**

The recent interest in the design of ancient theatres and in their acoustical characteristics has drawn attention to the lack of methodologies in metrology for historical acoustics [1]. The ISO 3382-1 standard [2] was used in the European ERATO project [3] to evaluate the acoustical apparatus of ancient theatres through room acoustic parameters, such as the Early Decay Time (EDT), Reverberation Time (RT), Clarity (C80), and Sound Strength (G). However, ISO 3382-1 basically refers to indoor environments and temporal decay parameters seem to be less suitable for open-air conditions [4–8]. Farnetani et al. [4] reported that EDT is not a robust predictor of the acoustic quality of open-air theatres. The lack of robustness in EDT is due to a marked and intrinsic variability of this parameter, according to the source position, which defines the delay and incidence direction of the first reflections to the receivers. The same study asserted that RT behaviour in an open-air theatre is clearly different from that dealt with in the classical reverberation theory, which refers to a reference room volume. However, this parameter showed a limited variability. Chourmouziadou et al. [5] also suggested the use of RT when comparative studies are performed. However, it should be utilised with caution since it is usually used to evaluate enclosed spaces. Mo et al. [6] conducted a listening test with monaural and binaural auralisations of an open-air space. They stated that the perceived reverberance in an unroofed space is not only affected by the temporal characteristics during the decay process, but also by the spatial characteristics, due to the distribution of the reflections. The results showed that the conventional RT described in ISO 3382-1, which only deals with the sound energy decay rate, is not suitable for evaluating the reverberance of an unroofed space. 
Thus, more insight is needed into the adoption of an indoor acoustic measurements standard for the investigation of the acoustic conditions of open-air theatres. These sites represent particular environments that have their own specific sound field, which is rather different from the ideal diffuse field.

Besides the doubts about the applicability of the aforementioned indoor standard to outdoor case studies, other specific problems could arise when conducting measurements in ancient theatres. In fact, archaeological field measurements are also clearly influenced by the current conditions of the architecture of the theatres. Most ancient theatres have undergone damage of anthropologic and atmospheric nature. It was attested in Farnetani et al. [4] that the measured values of RT, G, and C80 in ancient theatres are affected to a great extent by the state of conservation of the theatres themselves, with particular reference to the completeness of the architectural elements. Therefore, it is currently difficult to design acoustical correction guidelines for their contemporary reuse as performance spaces. Moreover, particular attention should be paid to the outdoor environmental conditions, such as temperature (t), relative humidity (RH), and air velocity, which could affect the variability of the measurement results, in the same way as for indoor measurements [9,10].

The topic of acoustical characterisation has already been examined in detail for indoor spaces, through statistical analysis, in order to investigate the reproducibility of measurements, the accuracy of the parameter calculation, the influence of source-receiver position displacement, and the measurement chains of different systems [9,11,12].

An alternative to the experimental acoustical characterisation is the virtual reconstruction of the theatre, using room acoustics simulation software. Since they were introduced, geometrical acoustic (GA) software applications have been used as the standard room acoustics models [13]. In order to enable a better acoustic design of existing buildings, the simulations first need to replicate the real acoustical conditions of the examined environment through three important steps: (1) appropriate geometry modelling; (2) material properties; and (3) simulation settings. This procedure, namely, the calibration of the model, is even more complicated for open-air theatres as the acoustic scattering and diffraction phenomena are more relevant than in closed theatres [14]. An appropriate calculation method and a geometrically detailed model are of fundamental importance to achieve accurate predicted results [15].

The reliability of simulations is an on-going matter of discussion and interest, as testified by the Round Robin comparisons of room acoustic modelling tools [16–18], and the more recent overview of the uncertainties of input data in simulations [13]. In the latter overview, it was reported that the specific uncertainties that characterise the absorption coefficient (αw) and scattering coefficient (s) of materials [19,20] can ultimately affect the estimation accuracy of room acoustic parameters. Such parameters are derived from simulated Impulse Responses (IR) or from energy reflectograms, depending on which analysis algorithm of the room acoustics software is being used. In situ and scale measurements [4] have revealed that the IRs of ancient theatres are composed of the direct sound and of two major reflections, which come from the orchestra floor and the scaenae frons (the ancient stage building), respectively, when these parts of the theatres still exist. Therefore, in the case of open-air theatres, the IR should be modelled with a limited number of specular reflections and a high number of scattered reflections, because of the irregularities in the steps of the cavea [21]. This configuration is difficult to handle using geometrical acoustics-based (GA) software, such as Odeon and CATT-Acoustic [22,23]. Yet, most researchers still rely on such tools for open-air theatres in everyday practice; thus, special attention should be given to properly controlling the boundary conditions. In fact, open-air theatres represent a special case, which creates a challenge for these prediction algorithms. The absence of a roof, and therefore of a reverberant field, calls for high reliability in the prediction of the early reflections. Moreover, the concave shape of these theatres is responsible for the creation of "shadow zones" of the mirroring surfaces in large lateral areas of the cavea [14]. This affects the deterministic Image Source method, which is used by the GA software to build the early part of the IRs.

The aim of this work is to assess the performance of predictive software in calculating a set of acoustic parameters for ancient theatres, a particular type of open-air space, taking as a case study the ancient theatre of Syracusae (SR). The objective is to give more insight into the sensitivity of the simulated results to the input parameters, and thereby to raise awareness on the use of this kind of software for unconventional outdoor environments. The theatre is located in Sicily, an island in the South of Italy, a region where ancient Greek culture historically had a great deal of influence. The simulation accuracy of two kinds of software, Odeon and CATT-Acoustic, is considered. This theatre was selected because it was relatively easy to model, due to the lack of contemporary additional elements; in this manner, the virtual model of SR can be considered a valid archetype model. The paper is organised as follows:


#### **2. Case Study**

The theatre of Syracusae (SR) was chosen as a case study for a measurement campaign carried out by the Department of Energy at the Politecnico di Torino, from the 5th to the 7th of September 2015. SR (Figure 1) has Greek origins, dating back to the 5th century BC, but it was later modified by the Romans. Apart from a few ruins, nothing is visible of the original scaenae frons, but the surviving part of the rock-cut cavea has a diameter of 105 m.

Several studies that refer to the acoustics of SR have been retrieved from literature. These studies refer to measurements on a scale model of the ancient theatre and its contemporary use [4,24], to acoustic and lighting simulations [25], and to in situ acoustical characterisations with temporary scenery [26]. Measurements had only been carried out in empty conditions at one point of the orchestra area, as a pilot study in which different techniques were used [27].

This ancient open-air theatre is intensively used during cyclic summer season festivals in its current (deteriorated) condition, and acoustic measurements are also made for conservation purposes. Therefore, this study concerns the "historical acoustics" research field, which is the study of the auditory and acoustic environment of historic sites and monuments [1], with a valorisation purpose. The empty condition was chosen for obvious practical reasons, as with the public present it is very difficult to carry out reliable measurements, due to high background noise levels and unsteady boundary conditions [28]. Moreover, in order to correctly simulate the presence of the public or the placement of an acoustic shell for renovation purposes, the reliability of the simulated data must be verified, starting from the calibration of the acoustic model.

**Figure 1.** Present conditions of the ancient theatre of Syracusae (**a**) and measurement set-up (**b**). S1 and S2 represent the source positions. R1 to R10 indicate the receiver positions.

#### **3. In Situ Measurement Methods**

Standard measurements have been performed in unoccupied conditions, with omnidirectional sound sources and receivers, as stated in ISO 3382-1 [2]. Different considerations on ancient theatre measurements, defined during the European ERATO Project [3], were taken into account. The measurement results for SR have been used in Section 4 for the calibration of the simulation model and as references for the acoustic parameters predicted through computer simulations. The source and receiver positions for the theatre are shown in Figure 1b.

Receivers were positioned on three radial axes of the cavea in the theatre, 1.2 m above the floor, corresponding to the ear height of a seated person. An omnidirectional microphone (Schoeps CMC 5-U, Durlach, Germany) was used to record the IRs. Ten receiver positions were considered. There was only a single microphone, meaning that all position measurements were carried out sequentially.

Measurements were repeated two or three times for each source position for most of the receivers, in order to evaluate the repeatability of the results. Two source positions were investigated: S1 was shifted horizontally by 1 m from the centre of the orchestra, in order to avoid any acoustical focus [29]; S2 was located behind S1, closer to the ancient scaenae frons position. The S1–S2 distance was equal to 7.6 m. Firecrackers ("Raudo Manna New Ma1b", Napoli, Italy, and "Perfetto C00015 Raudo New", Napoli, Italy) were used as impulsive sources, and the IRs were measured directly by recording the impulse produced by the firecracker blast. Firecrackers were used in S1 and S2 in order to overcome the problem of a low Signal-to-Noise Ratio (SNR): they maximise the SNR, which constitutes a significant advantage in outdoor measurements; on the other hand, caution should be used, as they are also more likely to be influenced by random effects (e.g., atmospheric conditions and random directivity). According to San Martin et al. [30], the impulse generated by firecrackers is nearly omnidirectional: its directivity index is, on average, around 1 dB for the octave bands between 125 Hz and 16 kHz. In addition, both its time curve and spectral power are highly repeatable, resulting in levels above 115 dB (reference 1 pW) within the aforementioned range.

The Background Noise Level (BNL) was measured as an equivalent continuous A-weighted sound pressure level (LAeq) over a period of 10 min, before the measurement sessions. The measured BNL was 45 dB (A), in unoccupied conditions.
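The equivalent continuous level underlying the reported BNL follows directly from its definition. The sketch below computes an unweighted Leq from a pressure signal in pascals; the A-weighting filter used for the reported dB(A) figure is omitted and would be applied to the signal beforehand.

```python
import numpy as np

P0 = 20e-6   # reference sound pressure, 20 micropascals

def leq_db(p):
    """Equivalent continuous level of a pressure signal p (Pa):
    Leq = 10 log10(mean(p^2) / p0^2). For LAeq, A-weight p first."""
    return 10.0 * np.log10(np.mean(p ** 2) / P0 ** 2)

print(leq_db(np.full(48000, P0)))       # ~ 0.0 dB at the reference pressure
print(leq_db(np.full(48000, 10 * P0)))  # ~ 20.0 dB for 10x the pressure
```

Each tenfold increase in RMS pressure adds 20 dB, which is why outdoor background levels of 45 dB(A) can still leave ample headroom below a firecracker impulse.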

The sound source was positioned at a height of 1.5 m from the floor, and a custom-made tripod was used to hold the firecrackers in a fixed position. Aurora (version 4.4, Parma, Italy) was used as acquisition software.

The air temperature and relative humidity were monitored during the whole measurement campaign, using a thermometer/hygrometer, Testo 608-H1 (Croydon South, VIC, Australia). The wind speed was measured by means of an anemometer, Testo 450-V1 (Croydon South, VIC, Australia). The environmental parameters acquired during the measurements campaign were t = 33 ◦C, RH = 65%, wind speed = 0.30 m/s. These did not change significantly during the measurement campaign.

In order to characterise the acoustical conditions of a performance space, the ISO 3382-1 standard lists a series of parameters that can be obtained from the IRs measured at each receiver position. Although open-air theatres cannot be considered typical performance spaces, like closed theatres or concert-halls, the ISO 3382-1 standard was used as the reference for the acoustical characterisation. In particular, the following room acoustical parameters were measured, as these are considered the most relevant parameters for the acoustical characterisation of open-air theatres [4]:


• Clarity, C80, (dB): the logarithmic ratio of the early (0–80 ms) to the late (beyond 80 ms) sound energy of the impulse response, as expressed in the following equation:

$$C\_{80} = 10\ \log \frac{\int\_0^{80\,\mathrm{ms}} p^2(t)\,dt}{\int\_{80\,\mathrm{ms}}^{\infty} p^2(t)\,dt} \tag{1}$$

where p(t) is the instantaneous sound pressure of the impulse response measured at the measurement point.

• Sound Strength, G, (dB): the logarithmic ratio of the measured sound energy (i.e., the squared and integrated sound pressure) to the sound energy that would arise in a free field at a distance of 10 m from a calibrated omnidirectional sound source, as expressed in the following equations:

$$G = 10\ \log \frac{\int\_0^\infty p^2(t)\,dt}{\int\_0^\infty p\_{10}^2(t)\,dt} = L\_{pE} - L\_{pE,10} \tag{2}$$

in which

$$L\_{pE} = 10\ \log \left[ \frac{1}{T\_0} \int\_0^\infty \frac{p^2(t)}{p\_0^2}\,dt \right] \tag{3}$$

and

$$L\_{pE,10} = 10\ \log \left[ \frac{1}{T\_0} \int\_0^\infty \frac{p\_{10}^2(t)}{p\_0^2}\,dt \right] \tag{4}$$

where:

p(t) is the instantaneous sound pressure of the impulse response measured at the measurement point;

p10(t) is the instantaneous sound pressure of the impulse response measured at a distance of 10 m in a free field;

LpE (dB) is the sound exposure level of p(t);

LpE,10 (dB) is the sound exposure level of p10(t);

p0 is the reference sound pressure of 20 μPa;

T0 is the reference time interval of 1 s.

In the above equations, t = 0 corresponds to the arrival of the direct sound at the receiver, and ∞ should correspond to a time greater than or equal to the point at which the decay curve has decreased by 30 dB [2].
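With t = 0 fixed at the direct sound, the C80 of Eq. (1) reduces to an energy ratio over a discrete impulse response. The toy IR below (a direct sound plus one weaker reflection, loosely echoing the two-reflection structure discussed above) is illustrative only.

```python
import numpy as np

def clarity_c80(ir, fs):
    """C80 per Eq. (1): early (0-80 ms) to late (80 ms onward) energy ratio
    of the squared impulse response, with t = 0 at the direct sound."""
    n80 = int(round(0.080 * fs))
    early = np.sum(ir[:n80] ** 2)
    late = np.sum(ir[n80:] ** 2)
    return 10.0 * np.log10(early / late)

# toy IR: unit direct sound plus a half-amplitude reflection after 100 ms
fs = 16000
ir = np.zeros(fs)
ir[0] = 1.0
ir[int(0.100 * fs)] = 0.5
print(clarity_c80(ir, fs))   # 10*log10(1 / 0.25) ~ 6.02 dB
```

Because the single reflection falls after the 80 ms boundary, all of its energy counts as "late", giving a clarity of about 6 dB for this toy case.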

G requires a calibration procedure for the sound power of the source. Different procedures have been described previously [2]. LpE,10 can be calculated from the sound pressure pd(t) measured at a source-to-receiver distance d (≥3 m) according to the following equation:

$$L\_{pE,10} = L\_{pE,d} + 20\ \log \left( \frac{d}{10} \right) \tag{5}$$

where:

LpE,d (dB) is the sound exposure level of pd(t), obtained from (3) (using pd instead of p10).
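Putting Eqs. (2), (3), and (5) together gives a compact G calculation. This is a minimal sketch assuming discretised pressure signals in pascals, with the integrals approximated by sample sums; the helper names are illustrative.

```python
import numpy as np

P0 = 20e-6   # reference sound pressure (Pa)
T0 = 1.0     # reference time interval (s)

def lpe(p, fs):
    """Sound exposure level per Eq. (3), approximating the integral as a
    sample sum divided by the sampling rate."""
    return 10.0 * np.log10(np.sum(p ** 2) / fs / (T0 * P0 ** 2))

def strength_g(p_receiver, p_direct, d, fs):
    """Sound Strength G per Eqs. (2) and (5): the 10 m free-field reference
    level is derived from the direct sound recorded at distance d (>= 3 m)."""
    lpe_10 = lpe(p_direct, fs) + 20.0 * np.log10(d / 10.0)
    return lpe(p_receiver, fs) - lpe_10
```

By construction, a receiver signal identical to the calibration signal measured at d = 10 m yields G = 0 dB, and halving the calibration distance shifts the reference by 6 dB, matching the inverse-distance law in Eq. (5).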

The Aurora plugin was used for the calculation of G with the firecrackers [31]. According to this procedure, the anechoic segment (direct sound) of each IR is used for calibration, providing the source-to-receiver distance that allows for the estimation of LpE,10; it is recommended to keep an IR length of at least 1 s and to silence the signal just after the end of the direct sound. In this way, the smearing out in time caused by the octave filtering does not push the energy outside the time window, even at low frequencies, and the correct value of the signal level can be computed. A calibration file was obtained in situ from each analysed IR and was used to calculate the G value for that measurement path, with the knowledge of the exact source-to-receiver distance.

The resulting dataset is composed of the octave-band values from 125 Hz to 8 kHz of the acoustic parameters calculated by the Aurora software (v. 4.4) [31] from the measured IRs**.**

#### *Measurements Results*

The measurement results at receiver positions R1–R10 are reported in Table 1, expressed as the T20, C80, and G acoustical parameters obtained with firecrackers at source positions S1 and S2. All the values are averages of two or three repetitions at each receiver position and of the central 500 Hz and 1 kHz octave bands, as indicated in ISO 3382-1 [2]. In accordance with ISO 3382-1, spatial averages for each row are also reported in Table 1; it was assumed that each row can be considered a homogeneous area, as in open-air theatres the direct sound and the distance from the source play a predominant role in the acoustic response. The Impulse Response-to-Noise Ratio, INR (dB), is also reported as a parameter for judging the validity of the measurements, in order to establish the reliability of the outdoor acoustical measurements [31]. According to ISO 3382-1, the source level should be at least 35 dB above the background noise level in the corresponding frequency band for the case of T20. All the measurements considered in this study had INR values well above 35 dB, and up to 60 dB for the octave bands from 250 Hz to 8 kHz. It is important to underline that the larger standard deviation of the T20 values is due to the presence of only one strong reflection from the orchestra after the direct sound (as shown in Figure 2), which determines an irregular course of the decay curve and a greater variability in its slope.



**Figure 2.** Measured Impulse Response (IR) in Syracusae (SR) for the S1-R6 measurement path, for the firecracker source. Δt is the time interval between the direct sound (D) and the first reflection (R) from the orchestra floor.

#### **4. Uncertainty of the Geometrical Acoustic Prediction Models**

In the acoustic domain, it is important to recall that the parameters aim to evaluate the perception of the acoustic signal, namely the average capability of a "conventional" listener to notice sound variations. An important factor that correlates the subjective field with objective measures is the Just Noticeable Difference (JND), that is, the smallest perceivable change in a given acoustical parameter; JNDs are specified for information in Annex A of ISO 3382-1 [2] for the central frequencies (500 Hz and 1 kHz), but are also considered acceptable for lower and higher frequencies [32–35]. This issue will be further discussed when analysing the accuracy of the acoustic prediction models.

The uncertainty contribution of the input data, propagated to the results obtained from two different types of room acoustic software, Odeon version 13.02 and CATT-Acoustic version 9, was assessed and compared with the measurement values.

Odeon version 13.02 [22] is based on a hybrid calculation method. Early reflections are calculated through a mixture of the Image Source Method and the Ray-Tracing Method (RTM), by means of a stochastic scattering process that uses secondary sources. Late reflections are calculated by means of a special RTM, where the secondary sources radiate energy locally from the surfaces and are assigned a frequency-dependent directionality, namely the reflection-based scattering coefficient. The secondary sources may have a Lambert, Lambert oblique, or Uniform directivity: this directivity depends on the properties of the reflections as well as on the calculation settings.

CATT-Acoustic version 9 [23] is made up of two modules: CATT-A, the main programme, which handles the modelling, surface properties, and directivity libraries, and TUCT (The Universal Cone Tracer), the main prediction and auralisation programme. TUCT can use three alternative cone-tracing algorithms: the first is based on stochastic diffuse rays, while the second and third are based on the split-up of the actual diffuse rays. The difference between these two algorithms is that the latter handles two orders of diffuse split-up reflections in a deterministic way, thus resulting in lower random run-to-run variations.

Both CATT-Acoustic and Odeon base their scattering algorithms on two main implementations, which are described in detail in a previous paper [36]: the Hybrid Reflectance Model (HRM) and Vector Mixing (VM). The HRM complies with the definition of the scattering coefficient in ISO 17497-1 [20], which defines it quantitatively as the fraction of the non-specularly reflected energy. In the HRM, a random number between 0 and 1 is drawn and compared with the scattering coefficient (s) assigned to the surface: the reflection is treated as scattered when the number falls below s, and as specular otherwise. The scattered energy is assumed to be distributed according to Lambert's law, i.e., the intensity of the reflected ray is independent of the angle of incidence but proportional to the cosine of the angle of reflection. This is the basic concept implemented in CATT-Acoustic [23] and in Odeon for the Uniform and Lambert scattering directivities [22]. VM, by contrast, is based on the linear interpolation of the specular and diffuse reflections [37]: the direction of a reflected ray is calculated by adding the specular vector, scaled by a factor (1 − s), to a scattered vector of a certain direction, scaled by a factor s. This is the basic concept implemented in Odeon [22,38,39] under the name "vector-based scattering", where the scattered vector follows a random direction generated according to the Lambert distribution (the oblique Lambert directivity).
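The two scattering implementations can be sketched in a few lines of Python. This is a simplified illustration, not the vendors' actual code: the function names are ours, and the Lambert-distributed scattered direction is assumed to have been drawn beforehand.

```python
import math
import random

def hrm_reflect(spec, diff, s, rng):
    """Hybrid Reflectance Model: a random draw against the scattering
    coefficient s decides whether the ray is reflected specularly or
    scattered (here toward a precomputed Lambert-distributed direction)."""
    return diff if rng.random() < s else spec

def vm_reflect(spec, diff, s):
    """Vector Mixing: add the specular direction scaled by (1 - s) to the
    scattered direction scaled by s, then renormalise the result."""
    mixed = [(1.0 - s) * a + s * b for a, b in zip(spec, diff)]
    norm = math.sqrt(sum(c * c for c in mixed))
    return [c / norm for c in mixed]
```

With s = 0 both functions return the specular direction; with s = 1 the HRM always scatters, while VM returns the scattered vector itself. Over many rays, the HRM scatters a fraction s of the energy, matching the ISO 17497-1 definition.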

#### *4.1. General Procedure for the Implementation of the Models*

In order to compare the two software packages and to obtain the best match with the measurement results, it was necessary to perform simulations with the same geometric model and source/receiver positions as in the measurements. To the best of the authors' knowledge, this preliminary benchmark procedure has never been performed before on ancient open-air theatres, although many studies on indoor environments have been conducted [13–15]. Both types of software used for the simulation, that is, Odeon and CATT-Acoustic, have been validated in Round Robin tests. One of the main findings of these tests was that precise knowledge of the characteristics of the surface materials is an important prerequisite for a reliable room simulation. Thus, a more detailed analysis of absorption and scattering coefficient changes was proposed.

A preliminary benchmark test study was carried out on SR, whose model had previously been used in different investigations, e.g., simulations concerning its ancient conditions during the European ERATO project [3], and in investigations on its contemporary use [25]. Figure 3a shows the 3D model configuration of SR.

**Figure 3.** 3D model and source-receiver simulation set-up of SR (**a**) and scheme of the characteristics of the material chosen for the cavea (**b**).

The procedure applied for the comparison of the simulation tools was focused on solving the following issues:


maximi), which was considered as an aperture (α<sup>w</sup> = 0.9; s = 0), and the floor, which includes the ruins of the scaenae frons (α<sup>w</sup> = 0.8; s = 0.8) and the better conserved orchestra area (α<sup>w</sup> = 0.1; s = 0.2). Both Odeon and CATT-Acoustic allow for frequency-dependent absorption coefficients, and the same absorption coefficients were used in both. As input for the scattering coefficient, Odeon takes a single value, the average between 500 and 1000 Hz, and derives a frequency-dependent scattering coefficient using the default interpolation curves shown in its manual. The same curves were used in CATT-Acoustic, i.e., a frequency-dependent scattering coefficient was obtained by inserting the corresponding value for each octave band. The values given in Figure 4b refer to the mean values at 500 and 1000 Hz.



**Table 2.** Scattering and diffraction set-up in Odeon and CATT-Acoustic.

As reported in Vorländer [13], the level of detail in the model and the approximation of curved surfaces are considered systematic sources of uncertainty, while the number of rays employed in the simulation and the absorption and scattering coefficients are considered random sources of uncertainty. Both kinds of software use a ray-tracing method to build the late part of the IR. Since this method is based on a stochastic calculation, which depends on the general input set-up data, it can affect the uncertainty of the resulting parameters when a run-to-run analysis is considered. All the aforementioned random sources of uncertainty were analysed for both the Odeon and CATT-Acoustic software, which, for the sake of an easier presentation of the results, will hereafter be referred to as O and C, respectively; the results are shown in the following sub-sections.

#### *4.2. Run-to-Run Variation*

The run-to-run variations of the applied algorithms are due to the stochastic implementation of the ray-tracing algorithm in the GA software. In order to test this effect, ten repeated simulations were performed with the GA model of SR, using both kinds of software. An analysis based on the assessment of the Normalized Error [46] was performed on the T20, C80, and G results, considering a confidence level of 95%. The results for each receiver position and octave-band frequency were all within the upper and lower limits of the respective limit range. This confirms the results obtained in analogous analyses conducted on an enclosed space [47].
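As an illustration, one common form of the Normalized Error criterion (the exact formulation used in [46] may differ slightly) compares the difference between two results with their combined expanded uncertainties:

```python
import math

def normalized_error(value_a, value_b, u_a, u_b):
    """Normalized Error E_n between two results with expanded
    uncertainties u_a and u_b; |E_n| <= 1 means the results agree
    within their combined uncertainty (95% confidence level)."""
    return abs(value_a - value_b) / math.sqrt(u_a ** 2 + u_b ** 2)

# Example: two repeated T20 runs that agree within uncertainty
assert normalized_error(1.20, 1.25, 0.05, 0.05) <= 1.0
```

Applying this check to each receiver position and octave band, as done above, verifies that the run-to-run spread stays within the stated limits.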

#### *4.3. Number of Rays*

GA software usually distinguishes between deterministic and stochastic ray-tracing, depending on which algorithm is applied: the first is used to detect the image sources, while the second is used to estimate the reverberant tail. In O, the numbers of early and late rays can be selected separately: early rays are used in the deterministic ray-tracing, while late rays determine the ray density in the late part of the IR. In C, the number of rays/cones refers only to the stochastic ray-tracing, that is, the construction of the late part of the IR. It therefore becomes important to investigate the variation in the results due to stochastic ray-tracing, which is a random source of uncertainty in GA.

Stochastic ray-tracing was investigated here by comparing simulations with different numbers of rays (4000, 40,000, 400,000, and 4 million). A Normalized Error analysis revealed that the results for each receiver position and octave-band frequency were all within the upper and lower limits of the respective range. This investigation was performed in order to verify the stochastic fluctuations, i.e., the numerical errors that may appear in the results when the number of rays is low; this has been extensively studied and validated in systematic experiments [48]. The number of rays is strictly related to the systematic uncertainty in the final values of the parameters and, independently of the ray-tracing method used, the fluctuations can be reduced by increasing the number of rays or by averaging repeated simulations. The choice of the number of rays becomes important when large environments with an uneven distribution of absorption are considered; a compromise must therefore be found, since a very large number of rays significantly increases the computation time. The reverberant field in a simulated open-air theatre is in fact spatially uneven: the absorbing area is concentrated on the ceiling of the boundary box (in the case of O), while the theatre itself is mostly reflective. Thus, despite the longer computation time, a number of rays above 1 million is preferable for the correct estimation of the reverberation tails at the different receiver positions [22]. It is assumed that at least one ray is received at the longest source-to-receiver distance, which in this case is about 40 m (R10). The receiver is considered a sphere with a radius rd of about 0.06 m; thus, the area of the visibility cone per ray, A(ray), was 0.01 m<sup>2</sup>. Considering that the total surface covered by the emitted rays is a sphere of radius 40 m, whose surface A(sph) is equal to 20,096 m<sup>2</sup>, it is possible to calculate the minimum required number of rays, Nmin(rays), by means of Equation (6), as also indicated in Vorländer [13]:

$$\text{N}\_{\text{min}}(\text{rays}) = \frac{\text{A(sph)}}{\text{A(ray)}} = \frac{4\text{(ct)}^2}{\text{r}\_{\text{d}}^2} \tag{6}$$

where c and t are the speed of sound in air and the max arrival time counted from source excitation, respectively.

Nmin(rays) is equal to 2 million rays. Thus, 4 million rays are necessary to ensure that at least two rays (instead of one) arrive at the receiver at a distance of 40 m from the source.
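For reference, plugging the numbers above into Equation (6) is straightforward (ct ≈ 40 m and rd = 0.06 m, as stated in the text):

```python
# Minimum number of rays per Equation (6): the emitted rays must cover a
# sphere of radius ct (the longest propagation path, here ~40 m) densely
# enough that at least one ray intersects the spherical receiver.
ct = 40.0   # c x t: longest source-to-receiver distance (m)
r_d = 0.06  # receiver radius (m)

n_min = 4.0 * ct ** 2 / r_d ** 2
print(f"N_min(rays) = {n_min:,.0f}")  # about 1.8 million rays
```

The result, about 1.8 million rays, is rounded to 2 million in the text; doubling it to 4 million guarantees at least two rays at the farthest receiver.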

#### *4.4. Absorption and Scattering Coefficients*

The predictive software considers α<sup>w</sup> and s as input variables that have to be assigned to the surfaces of the model. It is therefore important to evaluate the uncertainty (U) of the calculated values due to the uncertainty of the absorption (Uαw) and scattering (Us) variables. These uncertainties were estimated to be higher than 0.05 and 0.15, respectively, on the basis of the users' experience reported in Vorländer [13] and Shtrepi et al. [49]. This case study considered only a few materials, in particular stone and grass. This allowed the variations due to different α<sup>w</sup> and s combinations for the cavea stone, which is the main surface in the model, to be investigated. To this aim, as shown in Figure 4b, twenty alternative materials were considered in both kinds of software, with α<sup>w</sup> equal to 0.05, 0.10, 0.15, and 0.20, and with s equal to 0.25, 0.40, 0.55, 0.70, and 0.85. These values account for different degrees of damage on the steps of the cavea: in the case of a scattering coefficient of 0.85 [41], a perfectly preserved periodic triangular section with an angle of 45° was considered, whereas a scattering coefficient of 0.25 represented a heavily damaged cavea.

As suggested previously [50], the sensitivity coefficients were calculated in order to evaluate the uncertainty propagation. This evaluation was conducted considering the average simulation results of the 500 Hz and 1 kHz octave bands [2]. The variability of each simulated receiver was calculated, and no systematic effects were detected. Thus, the sensitivity coefficients were calculated considering the normalized values, with respect to the relevant average value. An appropriate mathematical model, based on linear regression, was defined so as to relate the simulated values of each acoustical parameter to the absorption and scattering coefficients [50,51]. The expanded uncertainty was obtained as 2σ, where σ is the standard deviation of the model [50].
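The regression-based uncertainty evaluation can be sketched as follows. This is a minimal illustration with a plain least-squares plane fit; the actual model in [50,51] may include further terms.

```python
import math

def fit_plane(alphas, scats, values):
    """Least-squares fit of values ~ b0 + b1*alpha_w + b2*s via the normal
    equations; returns the coefficients and the expanded uncertainty
    U = 2*sigma, where sigma is the residual standard deviation."""
    rows = [[1.0, a, s] for a, s in zip(alphas, scats)]
    # Normal equations: (A^T A) b = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting
    m = [ata[i] + [aty[i]] for i in range(3)]
    for i in range(3):
        p = max(range(i, 3), key=lambda k: abs(m[k][i]))
        m[i], m[p] = m[p], m[i]
        for k in range(i + 1, 3):
            f = m[k][i] / m[i][i]
            m[k] = [mk - f * mi for mk, mi in zip(m[k], m[i])]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        b[i] = (m[i][3] - sum(m[i][j] * b[j] for j in range(i + 1, 3))) / m[i][i]
    resid = [y - (b[0] + b[1] * a + b[2] * s)
             for a, s, y in zip(alphas, scats, values)]
    dof = max(len(values) - 3, 1)
    sigma = math.sqrt(sum(r * r for r in resid) / dof)
    return b, 2.0 * sigma
```

Fitting the 20 simulated values of a parameter against the (α<sup>w</sup>, s) grid and taking twice the residual standard deviation yields an expanded uncertainty directly comparable with the JND.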

The expanded uncertainties for the O and C simulation software (UO and UC) are shown in Table 3. The uncertainty due to the input variability of α<sup>w</sup> and s is lower than the JND for all the parameters, except for T20 and C80 when the C software is used. The lower uncertainty values obtained with O are due to its algorithm, which is less sensitive to variations in α<sup>w</sup> and s.

**Table 3.** Just Noticeable Difference (JND) of the T20, C80, and G acoustical parameters, the expanded uncertainty due to the variability of the input values of α<sup>w</sup> and s for the simulation software O and C (UO and UC). Values higher than the JNDs are reported in bold.


#### **5. Discussion**

This work aimed at providing an overview of the many methodological challenges that must be faced when dealing with the acoustics of open-air ancient theatres, both in the case of measured (i.e., for the acoustical characterisation of the current state) and predicted (i.e., for the simulation of a no longer/not yet existing state) room acoustics parameters. Measurement and simulation are strictly interconnected, also considering that the former is often required to validate the latter; the rationale for addressing both aspects within the framework of this paper is that this is particularly true for open-air ancient theatres. Indeed, measurements of such unroofed spaces have been shown to be problematic under the current standards, and achieving reliable acoustical measurements is important in order to provide calibration data for the simulation software. In the context of cultural heritage research, and specifically for archaeological or historical acoustics, simulation becomes crucial because of the need to investigate, in most cases, physical conditions that no longer exist (the acoustics of the past) due to, among other aspects, the deterioration of the architectural elements. For these reasons, although measurements and simulations involve different uncertainty issues, it was decided to compare the measured and calculated parameters (Section 5.1) and to discuss the overall limitations of the considered protocols (Section 5.2).

#### *5.1. Comparison of the Measured and Simulated Results*

The aim of acoustical simulations is to obtain predictions that would closely match measured data. A well-calibrated model should minimise the perceivable differences between simulation and measurements for any considered acoustic parameter.

The subsequent considerations were also based on the α<sup>w</sup> and s values of the cavea surface and their variations. The differences between the measured and simulated results are shown in Figure 4, which reports the acoustical behaviour during the calibration of both kinds of software, considering the variations due to the 20 alternative combinations (5 scattering coefficients × 4 absorption coefficients), for all the receivers, and the average between the 500 Hz and 1 kHz octave bands. The isolevel curves shown in Figure 4 have been obtained by a two-dimensional data interpolation using the MATLAB function "interp2" with the "spline" method, which was chosen in order to have smooth first and second derivatives throughout the curves. Figure 4a,b, which pertain to O and C, respectively, refer to the parameter T20, while Figure 4c,d refer to C80 and Figure 4e,f to G. The light yellow colour in the graphs shows the α<sup>w</sup> and s combinations for which the simulated values were closest to the measured ones. These isolevel curves were based on the SAD, i.e., the Sum of the Absolute Differences between the simulated values, sn, and the measured ones, mn, over the receiver positions, as expressed by Equation (7) [52]:

$$\text{SAD} = \sum\_{n=1}^{N} |\mathbf{s}\_{\mathbf{n}} - \mathbf{m}\_{\mathbf{n}}| \tag{7}$$

The results show that, depending on which parameter is considered, the best agreement between the simulated and measured values could not be obtained for the same combination of α<sup>w</sup> and s. From the isolevel curves layout it is observable that, apart from T20, Odeon software is more sensitive to variations of α<sup>w</sup> than of s, while the opposite occurs for CATT-Acoustic. For T20, lower differences between the simulated and measured values are detectable for both high and low absorption and scattering values in the case of Odeon software, while mainly for high scattering values over the whole range of absorption values in the case of CATT-Acoustic. For C80, a good matching between measured and simulated values occurs with high absorption values over the whole range of scattering coefficients in the case of Odeon software, while it occurs with low scattering coefficients over the whole range of absorption coefficients in the case of CATT-Acoustic. For G, the best matching between measured and simulated values occurs with low absorption values over the whole range of scattering coefficients in the case of Odeon software, while it occurs with a medium scattering coefficient over the whole range of absorption coefficients in the case of CATT-Acoustic. Only in the case of G do both kinds of software show an agreement that is obtained in a range around the values of α<sup>w</sup> = 0.10 and s = 0.55. Thus, this combination was considered for the calibration of the model.
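Selecting the best (α<sup>w</sup>, s) combination via Equation (7) reduces to a few lines of code; `results` and `grid` below are hypothetical names standing for the simulation outputs and the set of tested combinations:

```python
def sad(simulated, measured):
    """Sum of Absolute Differences (Equation (7)) between the simulated
    and measured parameter values over all receiver positions."""
    return sum(abs(s - m) for s, m in zip(simulated, measured))

# Calibration then amounts to minimising SAD over the grid of tested
# (alpha_w, s) combinations, e.g.:
#   best = min(grid, key=lambda combo: sad(results[combo], measured))
```

Computing one SAD value per tested combination yields exactly the surfaces interpolated in the isolevel plots of Figure 4.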

Table 4 shows all the simulation results of the calibrated model of SR, expressed as the T20, C80, and G acoustical parameters, considering both O and C. All the values are averaged over the central 500 Hz and 1 kHz octave-band frequencies, and spatial averages have been added for each row. In this way, the results can be compared directly with those of the corresponding measurements. A good agreement has been shown between the results obtained with the two different types of software, as can also be seen from the graph in Figure 5, where the average G for each row is represented along the average distance from the source, for the measurements and for the simulations with Odeon and CATT-Acoustic.

In particular, the average values for each row obtained from the two software packages are always within or at the limit of the JND for each parameter, except for C80 in the first row. The differences between the simulated and measured results, in terms of the average values for each row, are within two to seven times the JND for T20, without any systematic behaviour related to the row. In the case of C80, the differences between the simulated and measured values are highest for the first row, where the average simulated values deviate by three and five times the JND with Odeon and CATT-Acoustic, respectively; they are within 2 JND for the second and third rows and within the JND for the last row, for both software packages. For G, the average simulated values for each row are always within or quite close to the JND compared to the measured values for both software packages, with a slightly worse behaviour for Odeon. Figure 5 shows that both software packages correctly simulated the reduction of G with the distance from the source, with slopes per distance doubling of 6.6 dB/dd and 6.3 dB/dd for Odeon and CATT-Acoustic, respectively, compared to 6.3 dB/dd for the measurements.
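The dB-per-distance-doubling slopes quoted above can be obtained from the per-row G values by a linear regression against the base-2 logarithm of the distance. The sketch below uses synthetic data, not the measured values; the quoted slopes describe a decrease, i.e., their sign is negative in this convention.

```python
import math

def slope_per_distance_doubling(distances, g_values):
    """Least-squares slope of G versus log2(distance): the coefficient is
    the level change in dB per distance doubling (dB/dd)."""
    xs = [math.log2(d) for d in distances]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(g_values) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, g_values))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic example: G dropping exactly 6 dB per distance doubling
dists = [5.0, 10.0, 20.0, 40.0]
gs = [20.0, 14.0, 8.0, 2.0]
```

Feeding the per-row average G values and average distances into this function reproduces the kind of slope comparison shown in Figure 5.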

**Figure 4.** Sum of Absolute Differences (SAD) between the measurements and simulations over all the receivers, for T20, C80, and G in Odeon (**a**,**c**,**e**) and in CATT-Acoustic (**b**,**d**,**f**). Light yellow indicates very similar values between simulations and measurements.



**Figure 5.** G values averaged over the central 500 Hz and 1 kHz octave-band frequencies, and for each row, represented along the average distance from the source, derived from the measurements and from the simulations with Odeon and CATT-Acoustic.

#### *5.2. Limitations of the Study*

Given the complexity of the task, there are, of course, a number of limitations in the methodological approach implemented in the present study. Most of these shortcomings are related, as previously mentioned, to the actual applicability of ISO 3382-1, intended for roofed performance spaces, to open-air environments.

Section 7 of ISO 3382-1 deals with "Measurement uncertainty" and specifies that, for a practical evaluation, the measurement uncertainty of the reverberation time obtained using the integrated impulse response method can be considered of the same order of magnitude as that obtained from an average of n = 10 measurements per position with the interrupted noise method; no additional averaging is necessary to increase the statistical accuracy at each position. However, considering the variability due to the atmospheric conditions, more than one repetition is needed. On the other hand, anyone who has performed measurements in ancient open-air theatres knows that a large number of repetitions is rarely feasible, for practical reasons related to the stability of the boundary conditions; thus, the scope of this study was to assess the reliability of the protocols with fewer measurements.

Table 5 summarises the most salient aspects and recommendations provided in the different sections of ISO 3382-1, indicates whether such requirements were met, and briefly reports on each circumstance ("notes" column).

Moreover, another limitation of the work derives from the use of GA software. The differences between simulations and measurements are mainly related to the approximations of GA with respect to real wave effects, which are important in an open-air environment, where the number of surfaces is limited and the generation of a diffuse field becomes critical. The GA principles are valid above the Schroeder frequency, which is not easy to estimate for an ancient theatre. The limits of GA are related to large rooms, low absorption coefficients, and broadband signals [48]; furthermore, GA methods neglect phase. As shown in different Round Robin tests [16,17], GA-based software packages differ from each other even when the same absorption coefficients are given as input for the surfaces. Therefore, the major drawback of state-of-the-art modelling software is that the different simulation tools require different input data [53]. In practice, the absorption and scattering coefficient values are calibrated, i.e., varied within the range of their measurement uncertainty, in order to match the simulation results to the measured values; this may result in different values of these coefficients for the different software packages.



#### **6. Concluding Remarks**

This work deals with the accuracy of acoustical measurements and prediction models related to the ancient open-air theatre of Syracusae. Measurements based on ISO 3382-1 were conducted in unoccupied conditions. Firecrackers were used because of the relatively high background noise level. The acoustical parameters described in the ISO 3382-1 standard, that is, Reverberation Time (T20), Clarity (C80), and Sound Strength (G), were obtained from the IRs measured at each receiver position. The uncertainty contributions due to the input values of the sound absorption and scattering coefficients, α<sup>w</sup> and s, were calculated with two simulation tools, Odeon, version 13.02, and CATT-Acoustic, version 9. The models were calibrated on the basis of the best match between the simulated and measured parameter values. Other sources of uncertainty, that is, the run-to-run variations and the number of rays, were also analysed, and the results were all found to be under or at the commonly accepted limit values of the Just Noticeable Differences (JNDs). The variability of the results is related to the algorithms used to approximate the physical phenomena of absorption and scattering: this kind of software is based on geometric acoustics principles, which rely on a statistical approach to include diffuse sound scattering and to predict the reverberant tail of an impulse response [22,23].

The following main results have been found from the uncertainty analysis that was conducted on the simulations of the Syracusae theatre:


Future studies will be conducted on a larger number of case studies, considering the influence of the architectural state of conservation, completeness, and dimensions on the acoustic field. Moreover, more suitable parameters for the acoustical characterisation of the open-air theatres than those described in ISO 3382-1 standard are the subject of continuous research [49].

**Author Contributions:** E.B. and A.A. conceived and designed the data collection campaigns and simulations; E.B. collected data on site and performed the simulations with Odeon together with L.S., D.P.G. performed the simulations with CATT. L.S. and G.B. collaborated for the statistical analysis of the uncertainties. F.A. offered support on the applicability of the ISO standard. All authors wrote and revised the paper.

**Funding:** This research was funded through a Ph.D. scholarship awarded to the first author by the Politecnico di Torino (Turin, Italy).

**Acknowledgments:** The authors are grateful to Fabrizio Bronuzzi, Rocco Costantino, and Maurizio Bressan from the Acoustics Laboratory of Politecnico di Torino for their technical contribution to this project, as well as to George Koutsouris and Claus Lynge Christensen from the ODEON support team. The authors would also like to thank Soprintendenza BB.CC.AA di Siracusa, Istituto Nazionale Dramma Antico, and Andrea Tanasi for their support during the measurements in the theatre. Finally, the authors are grateful to Nicola Prodi and Andrea Farnetani from the University of Ferrara, Angelo Farina from the University of Parma, and Monika Rychtarikova from KU Leuven, for their suggestions and advice.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Acoustic Localization for a Moving Source Based on Cross Array Azimuth**

#### **Junhui Yin 1, Chao Xiong <sup>1</sup> and Wenjie Wang 2,\***


Received: 5 June 2018; Accepted: 28 July 2018; Published: 1 August 2018

#### **Featured Application: The work introduced in this paper can be used for wildlife conservation, health protection, and other engineering applications.**

**Abstract:** Acoustic localization of a moving source plays a key role in engineering applications, such as wildlife conservation and health protection. Acoustic detection methods provide an alternative to traditional radar and infrared detection methods. Here, an acoustic localization method based on array signal processing, in which the azimuth lines of two arrays are intersected, is introduced. The localization algorithm and the precision simulation of a single array show that such an array has good azimuth precision but poor range estimation. Once a second array of the same type is added, the moving acoustic source can be located precisely by intersecting the azimuth lines. A low-speed vehicle is used as the moving source in the localization experiments, in which the length selection of the short-time correlation and the compensation of the moving path are studied. All results show that the proposed method locates the moving sound source with high precision (error < 5%) while requiring fewer instruments than current methods.

**Keywords:** acoustic localization; cross array; moving sound source; discrete sampling; error analysis

#### **1. Introduction**

The localization of moving sources represents a major issue in engineering applications. Like other detection technologies, acoustic-localization methods have developed rapidly over the years. Meanwhile, the noise generated by low-speed vehicles (LPVs) is a key issue, especially in connection with noise mitigation: noise pollution continues to be a major health problem, with a whole host of health effects, such as sleep disorders with awakenings [1], learning impairment [2,3], hypertension and ischemic heart disease [4], and especially annoyance [5], a widely used indicator to study the effect of different noise sources on wellbeing. In this context, the main effort has been devoted to mitigating the main sources of noise: road traffic [6–8], railway traffic [9,10], airports [11,12], and industry [13]. Regarding road noise specifically, besides the engine noise of the LPV, road/tire interaction [14,15] is an important noise-producing mechanism, as is aerodynamic noise for high-speed vehicles. Furthermore, a relatively new noise source is impacting modern society in areas where the background noise is low: wind farms are being installed every year to meet energy demand, but people are affected by their noise, which is more disturbing than that of other sources [16,17], and the scientific community is moving towards its assessment [18].

In this paper, the LPV is the research object. As for all moving vehicles, the exhaust system and the chain tracks are the main noise sources of an LPV, with the exhaust system representing the dominant factor; therefore, the exhaust system can be taken as the moving noise source. The most common localization methods for noise sources are Nearfield Acoustic Holography (NAH), beamforming, and array signal processing [19]. The sound field of a moving vehicle can be effectively measured with NAH using a moving acoustic plane [20] and coordinate compensation [21,22]. Far-field measurements of a moving source can be achieved by short-time beamforming, but this requires extensive computational resources to process the acquired data for the acoustic plane at every moment, and false noise sources (ghost images) are easily generated [23,24]. For array signal processing, the required computations are fast and can be performed to high precision [25,26], since the necessary calculations on the signals are only one-dimensional and, therefore, substantially less demanding than those for a whole acoustic plane.

In the new method described here, a moving sound source is localized by intersecting the azimuth lines of two cross arrays. Initial testing of the localization algorithm and the data analysis were performed for a single array and revealed good azimuth performance; therefore, a second array was added so that the azimuth lines could be intersected. The localization experiments were conducted with the engine noise of an LPV as the moving noise source, and the selection of the data length for the short-time correlation, as well as the compensation of the moving path, are also introduced. The new method achieved effective localization of moving vehicles while requiring less expensive instrumentation than existing methods. Moreover, it was found to perform properly even under adverse ambient conditions, such as bad weather or the low light levels at night.

#### **2. Localization Analysis of Single Array**

#### *2.1. Localization Algorithm*

The LPV used for the current study travelled on level ground, so its height remained constant relative to the array sensors. The height of the vehicle was about 2.0 m, approximately 1.5 m higher than the arrays themselves. Compared to the range of about 100 m or more, this constant height difference had little influence on the localization performance and accuracy. Therefore, localization was performed in the *x-y* coordinate system and the height difference was ignored. The five-element cross array was taken as the basic array pattern, as illustrated in Figure 1.

**Figure 1.** Model of single array.

The coordinates are defined within the plane of the array. The center acoustic sensor is located at *O* (0, 0), while the remaining four are *M*<sub>2</sub> (D, 0), *M*<sub>3</sub> (0, D), *M*<sub>4</sub> (−D, 0), and *M*<sub>5</sub> (0, −D), where D represents the distance from *M<sub>i</sub>* to *O*. The noise source is assumed to be located at *T* (*x*, *y*), with an angle *ϕ* between *OT* and the *x* axis, as indicated in Figure 1. The time delays between the arrival time of the noise at the center sensor and at the other four sensors are referred to as *τ*<sub>1*i*</sub>. Similarly, *d*<sub>1*i*</sub> (*i* = 2, 3, 4, 5) represents the path difference between the center sensor and sensor *M<sub>i</sub>*, such that *d*<sub>1*i*</sub> = *c* × *τ*<sub>1*i*</sub> (*c* is the current sound velocity). *R* is the distance from *O* to *T*.

From the simple geometry in Figure 1, the distances can be expressed as:

$$\begin{cases} \begin{aligned} \mathbf{x}^2 + \mathbf{y}^2 &= R^2\\ \left(\mathbf{x} - D\right)^2 + \mathbf{y}^2 &= \left(R + d\_{12}\right)^2\\ \mathbf{x}^2 + \left(\mathbf{y} + D\right)^2 &= \left(R + d\_{13}\right)^2\\ \left(\mathbf{x} + D\right)^2 + \mathbf{y}^2 &= \left(R + d\_{14}\right)^2\\ \mathbf{x}^2 + \left(\mathbf{y} - D\right)^2 &= \left(R + d\_{15}\right)^2 \end{aligned} \tag{1}$$

The solution of Equation (1) is

$$\begin{cases} \text{ x = } \frac{2R(d\_{14} - d\_{12}) + d\_{14}^2 - d\_{12}^2}{2D} \\ \text{ y = } \frac{2R(d\_{13} - d\_{15}) + d\_{13}^2 - d\_{15}^2}{2D} \end{cases} \tag{2}$$

$$\begin{cases} \tan\varphi = \frac{y}{x} = \frac{(\tau\_{15} - \tau\_{13})[2R - c(\tau\_{15} - \tau\_{13})]}{(\tau\_{14} - \tau\_{12})[2R - c(\tau\_{14} - \tau\_{12})]}\\ R = \sqrt{x^2 + y^2} = \frac{4D^2 - d\_{12}^2 - d\_{13}^2 - d\_{14}^2 - d\_{15}^2}{2(d\_{12} + d\_{13} + d\_{14} + d\_{15})} \end{cases} \tag{3}$$

when *R* >> *c* × *τ*1*i*,

$$\tan\varphi \approx \frac{\tau\_{15} - \tau\_{13}}{\tau\_{14} - \tau\_{12}} \tag{4}$$

then Equation (3) can be simplified:

$$\begin{cases} \; \varphi = \arctan \frac{(\tau\_{15} - \tau\_{13})}{(\tau\_{14} - \tau\_{12})}\\\; R = (4D^2 - c^2 \sum\_{i=2}^5 \tau\_{1i} ^2) / 2c \sum\_{i=2}^5 \tau\_{1i} \end{cases} \tag{5}$$

The location of the noise source is given by Equations (2) and (5) and, with reference to their derivation, it is evident that the localization algorithm is based on the time delays between the arrival times of noise at the sensors in the array.
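To make the algorithm concrete, the following sketch (not from the paper; source position, array size, and tolerances are illustrative assumptions) simulates a source, derives the exact inter-sensor delays from the geometry of Figure 1, and recovers azimuth and range with Equation (5):

```python
import numpy as np

c = 343.0   # sound velocity (m/s)
D = 2.0     # array size (m)

# Sensor positions: center O plus M2..M5 on the axes (Figure 1)
sensors = np.array([[0.0, 0.0], [D, 0.0], [0.0, D], [-D, 0.0], [0.0, -D]])

# Assumed test source at azimuth 30 degrees, range 100 m
phi_true, R_true = np.radians(30.0), 100.0
T = R_true * np.array([np.cos(phi_true), np.sin(phi_true)])

# Time delays tau_1i between arrival at O and at each outer sensor
dists = np.linalg.norm(sensors - T, axis=1)
tau = (dists[1:] - dists[0]) / c          # tau_12, tau_13, tau_14, tau_15

# Equation (5): azimuth from delay differences, range from their sum
t12, t13, t14, t15 = tau
phi_est = np.degrees(np.arctan2(t15 - t13, t14 - t12))
R_est = (4 * D**2 - c**2 * np.sum(tau**2)) / (2 * c * np.sum(tau))
```

With these values the recovered azimuth agrees to within a few thousandths of a degree, while the range is recovered far less robustly in the presence of delay errors, which is exactly what the precision analysis of Section 2.2 quantifies.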

#### *2.2. Precision Analysis for Localization*

The algorithm for localizing the noise source, as described by Equations (2) and (5) in Section 2.1, depends on the sound velocity *c*, the array size *D*, and, in particular, the error *στ* involved in estimating the time delays. Since *D* and *c* remain constant for any particular array and measurement environment, the dominant factor affecting the precision of the proposed method is the time-delay error. Due to the symmetric arrangement of the sensors around the central sensor, the standard errors for the time delays of all sensors were assumed to be equal, such that *στ* = *στ*<sub>1*i*</sub>.

In Equation (2), quadratic terms appear in the expressions for the coordinates (*x*, *y*), which makes it difficult to calculate the error transmission directly. Therefore, the precision analysis was transferred to angular coordinates, with the localization described by the azimuth *ϕ* and range *R* of Equation (5).

#### *2.3. Azimuth Precision*

According to Equation (5), the azimuth *ϕ* is a function of the time delays *τ*:

$$\varphi = F(\tau) = F(\tau\_{12},\ \tau\_{13},\ \tau\_{14},\ \tau\_{15}) \tag{6}$$

The transmission form of the azimuth error *σϕ* can be expressed as:

$$\sigma\_{\varphi}^{2} = \left(\frac{\partial\varphi}{\partial\tau\_{12}}\sigma\_{\tau}\right)^{2} + \left(\frac{\partial\varphi}{\partial\tau\_{13}}\sigma\_{\tau}\right)^{2} + \left(\frac{\partial\varphi}{\partial\tau\_{14}}\sigma\_{\tau}\right)^{2} + \left(\frac{\partial\varphi}{\partial\tau\_{15}}\sigma\_{\tau}\right)^{2} \tag{7}$$

Taking the partial derivatives with respect to *τ* in Equation (5):

$$\begin{cases} \frac{\partial \varphi}{\partial \tau\_{12}} = -\frac{\partial \varphi}{\partial \tau\_{14}} = \frac{1}{1 + \tan^2 \varphi} \cdot \frac{\tau\_{15} - \tau\_{13}}{\left(\tau\_{14} - \tau\_{12}\right)^2} \\ \frac{\partial \varphi}{\partial \tau\_{13}} = -\frac{\partial \varphi}{\partial \tau\_{15}} = -\frac{1}{1 + \tan^2 \varphi} \cdot \frac{1}{\tau\_{14} - \tau\_{12}} \end{cases} \tag{8}$$

So the expression of azimuth error *σϕ* is

$$\sigma\_{\varphi} = \frac{\sigma \tau}{1 + \tan^2 \varphi} \sqrt{\frac{2(\tau\_{14} - \tau\_{12})^2 + 2(\tau\_{15} - \tau\_{13})^2}{(\tau\_{14} - \tau\_{12})^4}} \tag{9}$$

Solving Equations (2) and (5):

$$\begin{cases} \left(\tau\_{14} - \tau\_{12}\right)^2 + \left(\tau\_{15} - \tau\_{13}\right)^2 = \frac{D^2}{c^2} \\ \left(\tau\_{14} - \tau\_{12}\right)^2 = \frac{D^2}{c^2(1 + \tan^2\varphi)} \end{cases} \tag{10}$$

Substituting Equation (10) into Equation (9):

$$\sigma\_{\varphi} = \frac{\sqrt{2}c}{D} \sigma\_{\tau} \tag{11}$$

Thus, the azimuth error is determined by *c*, *D*, and *στ*. Assuming a sound velocity of *c* = 343 m/s and a sampling rate of 5000 Hz, the sampling interval is 200 μs; the distribution of the azimuth error is shown in Figure 2.

**Figure 2.** Distribution of azimuth error of *c* and *στ*.

Figure 2 shows that *σϕ* varies linearly with both *c* and *στ*, whereas its relationship with *D* is inverse. For *D* ≥ 2 m, *σϕ* stays at an optimal level: about 0.03° in Figure 2a, and about 0.1° in Figure 2b when *στ* = 100 μs and *c* was set to 343 m/s.
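As a quick numerical check of Equation (11) (illustrative, not from the paper), the azimuth error for D = 2 m and a delay error assumed equal to the 2 μs post-interpolation resolution mentioned in Section 4.4 is:

```python
import numpy as np

c = 343.0          # sound velocity (m/s)
D = 2.0            # array size (m)
sigma_tau = 2e-6   # time-delay error (s); assumed equal to the 2 us
                   # resolution reached after interpolation (Section 4.4)

# Equation (11): sigma_phi = sqrt(2) * c * sigma_tau / D  (radians)
sigma_phi_deg = np.degrees(np.sqrt(2) * c * sigma_tau / D)
```

This gives roughly 0.03°, consistent with the level reported for Figure 2a.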

#### *2.4. Range Precision*

The range is also a function of time delay *τ*, and the transmission error is:

$$\sigma\_R^2 = \left(\frac{\partial R}{\partial \tau\_{12}} \sigma\_{\tau}\right)^2 + \left(\frac{\partial R}{\partial \tau\_{13}} \sigma\_{\tau}\right)^2 + \left(\frac{\partial R}{\partial \tau\_{14}} \sigma\_{\tau}\right)^2 + \left(\frac{\partial R}{\partial \tau\_{15}} \sigma\_{\tau}\right)^2 \tag{12}$$

*Appl. Sci.* **2018**, *8*, 1281

Evaluating the partial derivatives of *τ* in Equation (5) gives:

$$\frac{\partial R}{\partial \tau\_{1i}} = [2c^2 \tau\_{1i} \sum\_{j=2}^{5} \tau\_{1j} - (c^2 \sum\_{j=2}^{5} \tau\_{1j}^2 - 4D^2)] / 2c(\sum\_{j=2}^{5} \tau\_{1j}) \tag{13}$$

Substituting Equation (5) into Equation (13) yields:

$$\frac{\partial R}{\partial \tau\_{1i}} = (c\tau\_{1i} - R) / \sum\_{j=2}^{5} \tau\_{1j} \tag{14}$$

According to the geometric relation of array and target:

$$\begin{array}{ll} \tau\_{1i} &= \frac{1}{c}\{R - \sqrt{R^2 + D^2 - 2RD\cos[\varphi - (i - 1)\frac{\pi}{2}]}\} \\ &= \frac{R}{c} - \frac{R}{c}\sqrt{1 + (\frac{D}{R})^2 - 2(\frac{D}{R})\cos[\varphi - (i - 1)\frac{\pi}{2}]} \end{array} \tag{15}$$

then the Taylor expansion of Equation (15) is:

$$\begin{array}{ll} \tau\_{1i} &\approx \frac{R}{c} - \frac{R}{c}\{1 + \frac{1}{2}[(\frac{D}{R})^2 - 2(\frac{D}{R})\cos[\varphi - (i - 1)\frac{\pi}{2}]] \\ &\quad - \frac{1}{8}[(\frac{D}{R})^2 - 2(\frac{D}{R})\cos[\varphi - (i - 1)\frac{\pi}{2}]]^2\} \\ &\approx -\frac{R}{c}\{\frac{1}{2}(\frac{D}{R})^2 - (\frac{D}{R})\cos[\varphi - (i - 1)\frac{\pi}{2}] - \frac{1}{2}(\frac{D}{R})^2\cos^2[\varphi - (i - 1)\frac{\pi}{2}]\} \end{array} \tag{16}$$

$$\begin{aligned} \sum\_{i=2}^{5} \tau\_{1i} &= -\frac{2D^2}{Rc} + \frac{D}{c} \sum\_{i=2}^{5} \cos[\varphi - (i-1)\frac{\pi}{2}] \\ &+ \frac{D^2}{2Rc} \sum\_{i=2}^{5} \cos^2[\varphi - (i-1)\frac{\pi}{2}] \end{aligned} \tag{17}$$

$$\text{Substituting} \quad
\begin{cases}
\sum\_{i=2}^{5}\cos[\varphi - (i-1)\frac{\pi}{2}] = 0 \\ \sum\_{i=2}^{5}\cos^2[\varphi - (i-1)\frac{\pi}{2}] = 1
\end{cases}
\quad \text{into Equation (17):}
$$

$$\sum\_{i=2}^5 \tau\_{1i} \approx -\frac{3D^2}{2Rc}\tag{18}$$

and then substituting Equation (18) into Equation (14) yields:

$$\frac{\partial R}{\partial \tau\_{1i}} \approx -\frac{2Rc(c\tau\_{1i} - R)}{3D^2} \tag{19}$$

Substituting Equation (19) into Equation (12), the range error becomes

$$
\sigma\_R = \frac{4Rc\sqrt{D^2 + R^2}}{3D^2} \sigma\_\tau \tag{20}
$$

Equation (20) reveals that the range error *σ<sub>R</sub>* is determined by the range *R*, the array size *D*, the sound velocity *c*, and the time-delay error *στ*. Compared to the azimuth error, the range *R* has an additional influence on the error. Assuming a 5 kHz sampling rate, *R* = 100 m, and an array size varying from 1 m to 5 m, the distribution of the range error is shown in Figure 3.

**Figure 3.** Distributions of the range error of *R* and *στ*.

In Figure 3, the range error is quite large. In both cases, the error decreases with increasing array size and increases with both the range *R* and the time-delay error *στ*. Overall, however, the error remains at a fairly high level. In Figure 3a, the relative error is almost 40% even under the optimal conditions, and the largest error reaches 950%. The error distribution in Figure 3b is similar to that in Figure 3a but at a higher level: the optimal error is 50%, and the largest is about 2000%.
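Equation (20) can be checked numerically. With illustrative values chosen here (the largest array size D = 5 m, and στ assumed equal to the 200 μs sampling interval), the relative range error lands near the "almost 40%" level quoted for Figure 3a:

```python
import numpy as np

c, R = 343.0, 100.0   # sound velocity (m/s), range (m)
D = 5.0               # largest array size considered (m)
sigma_tau = 200e-6    # delay error (s); assumed equal to the sampling interval

# Equation (20): sigma_R = 4 R c sqrt(D^2 + R^2) / (3 D^2) * sigma_tau
sigma_R = 4 * R * c * np.sqrt(D**2 + R**2) / (3 * D**2) * sigma_tau
rel_error = sigma_R / R   # relative range error
```

Even under these favorable conditions the relative error is around 37%, which illustrates why a single array cannot range reliably.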

In summary, a single five-element cross array has good directional ability: the azimuth error can stay below 0.1° under reasonable conditions. However, its ranging ability is poor; the error is nearly 40% even under the best conditions, which makes satisfactory sound-source localization impossible with a single array.

#### **3. Localization Analysis of Double Arrays**

#### *3.1. Localization Principle*

Although the single array has poor range-detection ability, its good directional ability ensures that the direction of the sound source is accurately determined. In order to improve the range-detection ability, a second array was added to the setup by means of intersecting the azimuth lines.

The array in Figure 1 remained positioned as shown and is referred to as Array 1. A second array with identical characteristics was added to the *x-y* plane as Array 2, with its center located at *O*<sub>1</sub> (*L*, 0). The angle between the line *OT* (from origin *O* to sound source *T*) and the *x* axis is referred to as *ϕ*<sub>1</sub>, while the angle between *O*<sub>1</sub>*T* and the *x* axis is *ϕ*<sub>2</sub>. The time delays when the sound signal reaches the sensors of Array 2 are *τ*′<sub>1*i*</sub>, and the corresponding range differences are *d*′<sub>1*i*</sub> (*i* = 2, 3, 4, 5), so *d*′<sub>1*i*</sub> = *c* × *τ*′<sub>1*i*</sub>. The geometry of the double-array setup is shown in Figure 4.

**Figure 4.** Model of double array.

Since the structure of both arrays is the same, the form of the azimuth formula is the same and the relevant expressions follow from Equation (3) as:

$$\begin{cases} \tan \varphi\_1 \approx \frac{(\tau\_{15} - \tau\_{13})}{(\tau\_{14} - \tau\_{12})} \\ \tan \varphi\_2 \approx \frac{(\tau'\_{15} - \tau'\_{13})}{(\tau'\_{14} - \tau'\_{12})} \end{cases} \tag{21}$$

From the geometric relationship,

$$\begin{cases} \begin{array}{c} k\_1 = \frac{y}{x} = \tan \varphi\_1\\ k\_2 = \frac{y}{x-L} = \tan \varphi\_2 \end{array} \end{cases} \tag{22}$$

The simplification of Equation (22) is:

$$\begin{cases} \begin{array}{c} \text{x} = \frac{Lk\_2}{k\_2 - k\_1} \\ \text{y} = \frac{Lk\_1k\_2}{k\_2 - k\_1} \end{array} \end{cases} \tag{23}$$

$$\begin{cases} \ \varphi = \arctan(k\_1) \\ \ R = \frac{Lk\_2\sqrt{1+k\_1^2}}{k\_2-k\_1} \end{cases} \tag{24}$$

Equations (23) and (24) represent two alternative expressions for the localization of the sound source. In these expressions, the variables are the array distance *L* and the slopes *k*<sub>1</sub> and *k*<sub>2</sub>. The slopes can be inferred from the time delays at each of the two arrays by means of Equation (21). In the experiments, localization was therefore obtained from the time delays *τ*<sub>1*i*</sub> and *τ*′<sub>1*i*</sub> and the array distance *L*.
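A minimal sketch of the azimuth-line intersection of Equation (23) follows; the source position is an illustrative assumption, while L matches the 10 m used later in the experiments:

```python
L = 10.0                       # distance between array centers (m)
x_true, y_true = 20.0, 97.92   # assumed source position (m)

# Azimuth slopes seen from Array 1 (at O) and Array 2 (at O1 = (L, 0));
# in practice these come from the measured delays via Equation (21)
k1 = y_true / x_true           # tan(phi_1)
k2 = y_true / (x_true - L)     # tan(phi_2)

# Equation (23): intersect the two azimuth lines
x = L * k2 / (k2 - k1)
y = L * k1 * k2 / (k2 - k1)
```

With exact slopes the intersection recovers the source position exactly; the precision analysis below quantifies how slope (i.e., delay) errors propagate into the range.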

#### *3.2. Precision Analysis for Localization*

Compared to the single array, the variable *L* has been added to the localization expression for the double arrays. However, the time delay remains the key variable. As the structure and sensors of the two arrays are identical, the standard time-delay errors of both are equal (*στ* = *στ*<sub>12</sub> = *στ*<sub>13</sub> = *στ*<sub>14</sub> = *στ*<sub>15</sub> = *σ*′*τ*<sub>12</sub> = *σ*′*τ*<sub>13</sub> = *σ*′*τ*<sub>14</sub> = *σ*′*τ*<sub>15</sub>).

As the direction is determined by Array 1, the azimuth error is the same as for the single array. Meanwhile, according to Equation (24), the range error *σ<sub>R</sub>* is influenced by the azimuth errors and the array distance *L*; range precision is therefore determined by azimuth precision and can be expressed through error transmission as

$$\sigma\_R = \frac{\partial R}{\partial \tau\_i} = \frac{\partial R}{\partial \varphi} \frac{\partial \varphi}{\partial \tau\_i} = \left(\frac{\partial R}{\partial \varphi\_1} + \frac{\partial R}{\partial \varphi\_2}\right) \frac{\partial \varphi}{\partial \tau\_i} \tag{25}$$

$$
\sigma\_R = \frac{\sqrt{2}cr(\sin(\varphi\_1 + \varphi\_2) + \sec\varphi\_1 \cos\varphi\_2)}{\sin(\varphi\_1 - \varphi\_2)}\sigma\_\tau \tag{26}
$$

Applying the sine theorem to Δ*TOO*<sub>1</sub> gives:

$$\frac{\sin\angle TO\_1O}{R} = \frac{\sin\angle OTO\_1}{L} \tag{27}$$

$$\frac{\sin \varphi\_2}{R} = \frac{\sin(\varphi\_2 - \varphi\_1)}{L} \tag{28}$$

$$R = \frac{L\sin\varphi\_2}{\sin(\varphi\_2 - \varphi\_1)}\tag{29}$$

Taking the partial derivatives of Equation (29) and combining them with the azimuth error of Equation (11) yields:

$$\sigma\_R = \frac{\sqrt{2}cL\cos\varphi\_2}{D\sin(\varphi\_2 - \varphi\_1)}\sigma\_{\tau}\tag{30}$$

Substituting Equation (29) into Equation (30) gives

$$
\sigma\_R = \frac{\sqrt{2}cR}{D \tan \varphi\_2} \sigma\_\tau \tag{31}
$$

In Equation (31), range precision is determined by sound velocity *c*, array size *D*, azimuth *ϕ*<sup>2</sup> of Array 2, error of time delay *στ*, and range *R*. The distribution of the range error is shown in Figure 5. In Figure 5a, *R* = 100 m, *c* = 343 m/s, *στ* = 100 μs. In Figure 5b, *D* = 3 m, *R* = 100 m, *c* = 343 m/s.

Figure 5 reveals that *ϕ*<sub>2</sub> affects the range precision substantially. In the intervals (1°, 20°) and (160°, 179°), the range error remains very high, while in (20°, 160°) it is much lower and acceptable. In Figure 5a, the array size *D* significantly affects precision when *D* < 3 m. The error was 10.78% when *ϕ*<sub>2</sub> was 8.12°; it decreased as *ϕ*<sub>2</sub> increased, reaching 5.65% at *ϕ*<sub>2</sub> = 15.24° and 3.78% at *ϕ*<sub>2</sub> = 22.36°. The distribution in Figure 5b is similar to that in Figure 5a, and the error stays below 5% when *ϕ*<sub>2</sub> is above 20°.

**Figure 5.** Distributions of the range error of *R* and *D*.
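Per Equation (31), the relative range error is √2·*c*·*στ*/(*D* tan *ϕ*<sub>2</sub>); a quick check with the Figure 5b geometry (D = 3 m; στ = 100 μs is an assumed value) reproduces the sub-5% level just above *ϕ*<sub>2</sub> = 20°:

```python
import numpy as np

c = 343.0            # sound velocity (m/s)
D = 3.0              # array size (m), as in Figure 5b
sigma_tau = 100e-6   # delay error (s), assumed value

phi2 = np.radians(20.0)
# Equation (31): sigma_R / R = sqrt(2) * c * sigma_tau / (D * tan(phi2))
rel_error = np.sqrt(2) * c * sigma_tau / (D * np.tan(phi2))
```

The result is about 4.4%, in line with the "below 5% above 20°" observation.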

Since the array distance *L* is independent of the time delays, the error contribution of *L* is:

$$\sigma\_R^L = \frac{\partial R}{\partial L} \tag{32}$$

Taking the partial derivative with respect to *L* in Equation (24):

$$\sigma\_R^L = \frac{k\_2 \sqrt{1 + k\_1^2}}{k\_2 - k\_1} = \frac{k\_1 k\_2 \sqrt{1 + \frac{1}{k\_1^2}}}{k\_2 - k\_1}\tag{33}$$

Substituting *y* = *R* sin *ϕ*<sub>1</sub> and *k*<sub>1</sub> = tan *ϕ*<sub>1</sub> into Equation (33) gives

$$
\sigma\_R^L = \frac{R}{L} \tag{34}
$$

Equation (34) expresses the range error associated with *L*: the error is affected by the range *R* and the array distance *L*. The relative error is 1/*L*, so it theoretically stays constant once *L* is fixed; it is below 6.67% when *L* ≥ 15 m. The distribution is shown in Figure 6.

**Figure 6.** Distributions of the range error of *L*.

In summary, the locating ability of the double array is good: the range error stays below 5 m over most of the area at a range of 100 m, and the azimuth error remains below 0.2° in all conditions. Considering the environmental factor (*c*) and the calculated factor (*στ*), it is advisable to choose a large array size to improve localization precision; however, larger array sizes result in higher cost and increased system complexity.

#### **4. Experiments**

#### *4.1. Experiment Setting*

The locating experiments were conducted in a natural environment. The test area was open, with a size of 150 × 150 m², and there were no tall reflectors along the boundary of the measurement domain. According to the empirical sound-speed formula, the sound propagation velocity was 343 m/s at an air temperature of 21 °C. During the experiments, the wind speed was very low and the localization range was about 100 m, so the influence of wind can be assumed negligible. Since it is difficult to keep an LPV travelling straight at a constant speed, a smaller simulated sound source was used to replace the vehicle noise.

The simulated source consisted of a 0.1 kW loudspeaker and a power amplifier. The biggest noise sources of the LPV are the exhaust system and the track system. The track noise is random, whereas the exhaust noise is periodic; therefore, the exhaust noise, recorded while the vehicle engine was running at a rotating speed of *r* = 1200 rpm, was used as the source signal. The sound signal is shown in Figure 7.

**Figure 7.** Sound signal for moving source.

A moving sound source travelling at constant speed was achieved by dragging the loudspeaker along a straight line with a fixed pulley rotating at constant speed. By monitoring the distance travelled as a function of time, the speed of the source was obtained.

In Figure 8, two five-element cross arrays were set up according to Figure 4. The distance *L* between the two central sensors was 10 m, and the array size *D* was 2 m. A NI-PXI system with 10 channels and a sampling rate of 248 kS/s was used as the test instrument, and the array microphones were G.R.A.S. units with a sensitivity of 40 mV/Pa.

**Figure 8.** Microphone array.

The sound source started moving from point A (−28.8, 97.92) to B (28.8, 97.92) in the *x-y* plane, and then returned to point A.

#### *4.2. Data Length for Correlation*

Since the collected signal was not stationary while the source moved, it is necessary to periodically extract part of the whole signal for short-time correlation. To ensure efficiency, each extraction must cover at least one whole period of the signal.

As the sound source was driven by a pulley, its speed was relatively slow, less than 5 m/s.

$$f' = \left(\frac{v\_0 \pm v\_t}{v\_0 \mp v\_s}\right)f\tag{35}$$

According to the Doppler effect formula, the difference between the Doppler-shifted frequency *f*′ and the original frequency *f* was about 0.01 *f*. Here, the sound velocity *v*<sub>0</sub> was 343 m/s, the observer velocity *v<sub>t</sub>* was zero, and the source velocity *v<sub>s</sub>* took its maximum value of 5 m/s. Therefore, the data bias resulting from the Doppler effect can be ignored.
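The worst-case shift can be checked directly from Equation (35), taking a source approaching a stationary observer at the maximum 5 m/s (the 600 Hz fundamental used here is the engine tone reported in Section 4.4):

```python
v0 = 343.0   # sound velocity (m/s)
vt = 0.0     # observer (microphone) velocity (m/s)
vs = 5.0     # maximum source velocity (m/s)

f = 600.0                               # engine fundamental (Hz)
f_shifted = (v0 + vt) / (v0 - vs) * f   # approaching source, Equation (35)
rel_shift = (f_shifted - f) / f         # about 1.5e-2, i.e., on the order of 0.01 f
```

The relative shift stays well below 2%, which supports neglecting the Doppler bias in the correlations.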

The longest distance that one acoustic wave travels within a single array is 2*D*, so the maximum travel time is 2*D*/*c*. The sampling interval is *T<sub>N</sub>* = 1/*F*, where *F* is the sampling frequency. The data length *n* for short-time correlation, i.e., the theoretical length of each correlation, follows from Equation (36):

$$n \ge \frac{1/f + 2D/c}{1/F} \tag{36}$$
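With the experimental values used later in Section 4 (f = 600 Hz fundamental, D = 2 m, F = 5000 Hz; taken here as illustrative inputs), Equation (36) gives the minimum data length:

```python
import math

f = 600.0    # signal fundamental frequency (Hz)
D = 2.0      # array size (m)
c = 343.0    # sound velocity (m/s)
F = 5000.0   # sampling frequency (Hz)

# Equation (36): n >= (1/f + 2D/c) / (1/F)
n_min = math.ceil((1 / f + 2 * D / c) * F)
```

This evaluates to 67 samples, close to the n ≥ 65 found to give stable delays in Section 4.4.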

#### *4.3. Time Compensation during Signal Transmission*

Since the acoustic signal travels a long distance before it is detected by the test system, there is a time delay between the detected signal and the instant when it was emitted by the noise source. Let *x(t)* be the position of the moving source at instant *t*. The signal used for the data analysis was generated at point *x(t*<sub>0</sub>*)*, located at a distance *r* from *x(t)*, as illustrated in Figure 9. Thus, the location identified at instant *t* is in fact the position of the noise source at the earlier instant *t*<sub>0</sub>, and it is essential to compensate for this difference.

**Figure 9.** Path of motion of the sound source.

This configuration, as graphically illustrated in Figure 9, can be expressed as:

$$\begin{cases} R\_0 = \sqrt{\mathbf{x} \left(t\_0\right)^2 + \mathbf{y} \left(t\_0\right)^2} \\ R = \sqrt{\mathbf{x} \left(t\right)^2 + \mathbf{y} \left(t\right)^2} \\ r = \left|\mathbf{x}(t\_0) - \mathbf{x}(t)\right| = vR\_0/c \end{cases} \tag{37}$$

To locate the noise source under actual conditions, the velocity *v* of the source and the sound velocity *c* are known, so the compensation *r* can be calculated and must be taken into account in the actual localization procedure.
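An illustrative compensation per Equation (37), with the source speed inferred from the Section 4.4 figures (S = 57.6 m travelled in 45.6 s) and an assumed emission-time range of R0 = 100 m:

```python
c = 343.0         # sound velocity (m/s)
v = 57.6 / 45.6   # source speed (m/s), from distance/time in Section 4.4
R0 = 100.0        # assumed source range at emission time (m)

# Equation (37): the source moves r = v * R0 / c while the sound propagates
r = v * R0 / c
```

The offset is a few tenths of a metre: small, but systematic along the whole path, and therefore worth compensating.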

#### *4.4. Experiment Results*

Additional environmental noise cannot be avoided either. This superposed noise negatively affects the correlation of the signals, so the measured signals require preprocessing before the correlations are computed. Since there were no other obvious sound sources and the superposed noise is of high frequency, wavelet filtering was chosen to remove the unwanted noise, with "db10" (No. 10 of the Daubechies series [27]) as the wavelet basis. As the signal was relatively simple, it was decomposed into three layers, and the low-frequency part was taken for the short-time correlation. The filtering process for one channel is shown in Figure 10.

**Figure 10.** Filter process of signal.

Figure 10 reveals the noise contained in the test signal. There were two kinds of noise: high-frequency noise with an amplitude of about 1/20 of the test signal, and impulses with about 1/3 of its amplitude. Both detrimentally affect the correlations of the array signals. Hence, the denoised signal, shown in the lower left part of Figure 10, was used for the correlations.

During the path A→B, the travelled distance was *S* = 57.6 m, with an associated travel time *t* = 45.6 s and sampling length *N* = 228,800. The distance *S* was divided into 11 segments, and the travel time likewise. The localization of the moving source was achieved by locating the central point of each of the 11 segments. The results obtained are shown in Table 1.


**Table 1.** Point information of path A→B.

In Figure 11, it is obvious that the fundamental frequency was 600 Hz, with some harmonic components; the period of the noise was therefore *T* = 1/*f* ≈ 1667 μs. The maximum travel time of a single wave between the array sensors is 2*D*/*c* ≈ 11.66 ms, so the length of the signal extracted for one correlation must exceed 11.66 ms. As the sampling interval was 200 μs, the minimum signal extraction length was 58 samples.

**Figure 11.** Frequency spectrum of the signal.

The localization at the first central point was used to study the principle for choosing the extraction length used in one short-time correlation. Based on the signals of Sensor 1 and Sensor 2 in Array 1, the first central point was set at sample 20,727, and the extraction length was assigned values of 30, 60, 65, 70, 100, 200, and 300. After extraction of the signal and 100-fold interpolation, the correlations for the different lengths were calculated, as shown in Figure 12a.

The sampling interval decreased to 2 μs after the 100-fold interpolation. For an extraction length of *n* = 30, the maximum value was located at *N* = 3223 while the central point of the correlation was *N* = 3000, giving a time delay of *τ<sub>n=30</sub>* = (3223 − 3000) × 2 = 446 μs; the delays for the other lengths were calculated in the same way, as shown in Figure 12b.

Figure 12 illustrates that correct time delays cannot be obtained when *n* ≤ 60, as the estimated delay varies randomly. When *n* ≥ 65, the delay value varies only marginally and remains steady at about 520 μs, which represents the correct time delay.

**Figure 12.** Correlation performance of different lengths.
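The delay-estimation procedure (extract, interpolate 100-fold, correlate, read the delay off the peak offset) can be sketched as follows; the tone frequency and delay are illustrative stand-ins for the measured signals, not the paper's data:

```python
import numpy as np

fs = 5000             # sampling rate (Hz), as in the experiment
up = 100              # interpolation factor -> 2 us resolution
true_delay = 520e-6   # assumed delay to recover (s), near the ~520 us in the text

t = np.arange(0, 0.02, 1 / fs)
s1 = np.sin(2 * np.pi * 600 * t)                  # 600 Hz tone (engine fundamental)
s2 = np.sin(2 * np.pi * 600 * (t - true_delay))   # delayed copy at a second sensor

# Upsample both channels by linear interpolation (100-fold)
tf = np.arange(0, t[-1], 1 / (fs * up))
u1 = np.interp(tf, t, s1)
u2 = np.interp(tf, t, s2)

# Cross-correlate and read the delay from the peak offset
corr = np.correlate(u2, u1, mode='full')
lag = np.argmax(corr) - (len(u1) - 1)
est_delay = lag / (fs * up)   # should be close to 520 us
```

Note that with a narrowband source the correlation peaks repeat every signal period, which is why the extraction must cover at least one full period, as argued in Section 4.2.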

The locating results of different signal lengths are shown in Table 2.

**Table 2.** Locating result of different lengths.


Table 2 illustrates that localization failed, with rather high error, when the length of the signal involved in the correlation was *n* ≤ 60. When *n* ≥ 65, the test point was located close to the actual point. Therefore, the signal extraction length was set to *n* = 65, and this length was used in the calculations after interpolation.

All the locating results are summarized in Table 3, and the moving paths are shown in Figure 13.


**Table 3.** Locating result of moving source.

Both Table 3 and Figure 13 show that the discrete points along the moving path were obtained accurately in the localization experiment, with only a small associated error.

**Figure 13.** Actual and tested path of source.

After the path compensation of Section 4.3 was applied, the relative locating error distributions before and after compensation are as shown in Figure 14.

**Figure 14.** Distribution of relative error.

The solid lines in Figure 14 represent the original localization error, while the dotted lines show the error after compensation. The locating result improved to some extent. In summary, accurate localization of a moving acoustic source was achieved, with the error staying below 5%. In further applications, the complete moving path could be obtained by increasing the number of segments.

#### **5. Conclusions**

The precision of a method for locating a moving sound source based on intersecting azimuth lines was studied in this paper. The analysis showed that adding a second array markedly improves precision and reduces error. Experiments were conducted outdoors after establishing the principle for choosing the signal length used in the correlation. Accurate localization of a moving source was achieved, with the locating error staying below 5%.

The work in this paper suggests applications to low-speed noise sources, such as wildlife conservation, health protection, wind turbine noise, and other engineering applications in the field. However, changes in the properties of the acoustic signal were ignored, given the low source velocity and the point-source assumption used for the simulated source; further research is required for actual applications. Future developments should focus on improvements in array size and shape. Meanwhile, the localization of high-speed and long-distance sources is not a straightforward extension of this research.

**Author Contributions:** Conceptualization, J.Y. and C.X.; Methodology, C.X. and W.W.; Writing-Review & Editing, W.W.; Supervision, Project Administration and Funding Acquisition, C.X. and W.W.

**Funding:** This paper is supported by the China Scholarship Council (No. 201809110025).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Automatic Bowel Motility Evaluation Technique for Noncontact Sound Recordings**

#### **Ryunosuke Sato 1,\*, Takahiro Emoto 2,\*, Yuki Gojima <sup>1</sup> and Masatake Akutagawa <sup>2</sup>**


Received: 9 May 2018; Accepted: 13 June 2018; Published: 19 June 2018

**Abstract:** Information on bowel motility can be obtained via magnetic resonance imaging (MRI) and X-ray imaging. However, these approaches require expensive medical instruments and are unsuitable for frequent monitoring. Bowel sounds (BS) can be conveniently obtained using electronic stethoscopes and have recently been employed for the evaluation of bowel motility. More recently, our group proposed a novel method to evaluate bowel motility on the basis of BS acquired using a noncontact microphone. However, the method required manually detecting BS in the sound recordings, and manual segmentation is inconvenient and time consuming. To address this issue, herein, we propose a new method to automatically evaluate bowel motility for noncontact sound recordings. Using the sound recordings obtained from 20 human participants, we showed that the proposed method achieves an accuracy of approximately 90% in automatic bowel sound detection when acoustic feature power-normalized cepstral coefficients are used as inputs to artificial neural networks. Furthermore, we showed that bowel motility can be evaluated based on the three acoustic features in the time domain extracted by our method: BS per minute, signal-to-noise ratio, and sound-to-sound interval. The proposed method has the potential to contribute towards the development of noncontact evaluation methods for bowel motility.

**Keywords:** bowel sound; bowel motility; automatic detection/evaluation; power-normalized cepstral coefficients; noncontact instrumentation

#### **1. Introduction**

The decrease in or loss of bowel motility is a problem that seriously affects quality of life (QOL) and daily eating habits of patients; examples of this include functional gastrointestinal disorders (FGID), in which patients experience bloating and pain when bowel motility is impaired due to stress or other factors. Such bowel disorders are diagnosed by evaluating the bowel motility. Bowel motility is currently measured using X-ray imaging or endoscopy techniques; however, these methods require complex testing equipment and place immense mental, physical, and financial burdens on patients, which make these methods unsuitable for repeated monitoring.

In recent years, acoustic features obtained from bowel sounds (BS) have been used to evaluate bowel motility. BS are created when gas and digestive contents are transported through the digestive tract by peristaltic movement [1], and they can easily be recorded by applying an electronic stethoscope to the surface of the body. Methods have accordingly been developed for evaluating bowel motility by automatically extracting BS from audio data recorded with electronic stethoscopes [2–7]. In quiet conditions, BS can be perceived at a slight distance without a stethoscope; our recent research demonstrated that even when data are acquired with a noncontact microphone, bowel motility can be evaluated from BS in the same manner as with an electronic stethoscope [8]. However, in that study, BS had to be manually extracted from the audio data recorded with the noncontact microphones, and a large amount of time was spent carefully labeling the sounds. The sound pressure of BS recorded with noncontact microphones is lower than that of BS recorded with electronic stethoscopes placed directly on the body surface, and sounds other than BS may be mixed in at higher relative volumes than in stethoscope recordings. A BS extraction system that is robust against such extraneous noise must therefore be developed to reduce the time- and labor-intensive work of BS labeling.

To resolve these issues, this study proposes a new system for evaluating bowel motility on the basis of results obtained by automatically extracting BS from the audio data recorded with a noncontact microphone. The proposed method is primarily made up of the following four steps: (1) segment detection using the short-term energy (STE) method; (2) automatic extraction of two acoustic features—mel-frequency cepstral coefficients (MFCC) [9,10] and power-normalized cepstral coefficients (PNCC) [11–14]—from segments; (3) automatic classification of segments as BS/non-BS based on an artificial neural network (ANN); and (4) evaluation of bowel motility on the basis of the acoustic features in the time domain of the BS that were automatically extracted. On the basis of audio data recorded from 20 human participants before and after they consumed carbonated water, we verified (i) the validity of automatic BS extraction by the proposed method and (ii) the validity of bowel motility evaluation based on acoustic features in the time domain.

#### **2. Materials and Methods**

#### *2.1. Subject Database*

This study was conducted with the approval of the research ethics committee of the Institute of Technology and Science at Tokushima University in Japan. A carbonated water tolerance test was performed with 20 male participants (age: 22.9 ± 3.4 years; body mass index (BMI): 22.7 ± 3.8) who had given their consent to the research content and their participation. The test was conducted after the participants had fasted for 12 or more hours, over a 25-min period comprising 10 min of rest before consuming carbonated water and 15 min of rest after consuming it. During the test, sound data was recorded using a noncontact microphone (NT55 manufactured by RODE), an electronic stethoscope (E-Scope2 manufactured by Cardionics), and a multitrack recorder (R16 manufactured by ZOOM). The primary frequency components of BS have generally been reported to lie between 100 Hz and 500 Hz [15]. Based on these reports, sound data was stored at a sampling frequency of 4000 Hz with a digital resolution of 16 bits. Furthermore, the sound data was filtered by a third-order Butterworth bandpass filter with cutoff frequencies of 100 and 1500 Hz. The participants were in a supine position during testing, with the electronic stethoscope positioned 9 cm to the right of the navel and the microphone 20 cm above the navel [8].
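The bandpass preprocessing described above can be sketched with SciPy. This is a minimal illustration under our own naming (`bandpass_bs` is not from the study), not the authors' implementation:

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 4000  # sampling frequency (Hz), as used in the study

def bandpass_bs(x, low=100.0, high=1500.0, order=3, fs=FS):
    """Third-order Butterworth bandpass (100-1500 Hz) applied to raw audio."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, x)

# Filter one second of synthetic audio
x = np.random.randn(FS)
y = bandpass_bs(x)
```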

BS present in the sound data obtained using the noncontact microphone were also present in the sound data obtained using the electronic stethoscope. Based on this, as in our previous studies, we used audio playback software to listen carefully to both types of sound recordings, and classified as a BS episode any episode that was 20 ms or more in duration and could be distinguished by the ear at the same time position in both recordings [7].

For the analysis, we divided the sound data into sub-segments with a window range of 256 samples and a shift range of 64 samples. The STE method was used to calculate the power of each window range, making it possible to detect sub-segments above a certain signal-to-noise ratio (SNR). SNR, as used in this study, is defined as follows:

$$\mathrm{SNR} = 10 \log_{10} \frac{P_S}{P_N} \tag{1}$$

Here, *P*<sub>S</sub> represents the signal power and *P*<sub>N</sub> represents the noise power. *P*<sub>N</sub> is calculated as the time-averaged power over a one-second interval of silence identified during the abovementioned listening process. Sub-segments detected successively using the STE method are treated as a single segment, also called a sound episode (SE). If a detected segment corresponds to a BS episode, it is defined as a BS segment; otherwise, it is defined as a non-BS segment.
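The STE-based segmentation can be sketched under the stated parameters (window of 256 samples, shift of 64). The merging of consecutive supra-threshold sub-segments into sound episodes follows the description above; the helper names are our own:

```python
import numpy as np

def short_term_energy(x, win=256, shift=64):
    """Power of each analysis window (STE)."""
    n = (len(x) - win) // shift + 1
    return np.array([np.mean(x[i*shift:i*shift+win]**2) for i in range(n)])

def detect_segments(x, p_noise, snr_thresh_db=0.0, win=256, shift=64):
    """Merge consecutive supra-threshold windows into sound episodes (SEs).

    Returns (start, end) sample indices. `p_noise` is the time-averaged
    power of a hand-picked one-second silent interval."""
    ste = short_term_energy(x, win, shift)
    snr = 10 * np.log10(ste / p_noise)
    active = snr > snr_thresh_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * shift, (i - 1) * shift + win))
            start = None
    if start is not None:
        segments.append((start * shift, (len(active) - 1) * shift + win))
    return segments
```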

#### *2.2. Automatic BS Extraction on the Basis of Acoustic Features*

The acoustic feature presented to the ANN is either MFCC or PNCC. MFCC is widely used in fields such as speech recognition and in the analysis of biological sounds such as lung or heart sounds [9,16–18]. MFCC is calculated by applying a discrete cosine transformation to the log output of triangular filter banks evenly spaced along a logarithmic frequency axis; this axis is referred to as the mel scale, and it approximates the human auditory frequency response. PNCC is a feature developed to improve the robustness of voice recognition systems in noisy environments [11–14]. Because BS captured using noncontact microphones are generally low in volume and have a degraded SNR, PNCC can be expected to be effective: it modifies the MFCC calculation to model human auditory physiology more closely. PNCC differs from MFCC primarily in the following three ways. First, instead of the triangular filter banks used in MFCC, PNCC uses gammatone filter banks based on an equivalent rectangular bandwidth to imitate the workings of the cochlea. Second, it applies bias subtraction, based on the ratio of the arithmetic mean to the geometric mean (AM-to-GM ratio), to the intermediate signal, which is not done in the MFCC calculation. Third, it replaces the logarithmic nonlinearity used in MFCC with a power nonlinearity. Owing to these differences, PNCC is expected to provide sound processing with excellent resistance to noise. For BS extraction in this work, a SE is divided into frames with a frame size of 200 samples and a shift size of 100 samples. Considering the number of dimensions often used in the field of voice recognition, we use 13-dimensional MFCC and PNCC obtained from 24-channel filter banks, averaged over all the frames in each episode.
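For illustration, a minimal MFCC extraction along the lines described (triangular mel filter banks, log compression, DCT; 13 coefficients from 24 channels, frames of 200 samples with a 100-sample shift, averaged per episode). This is a textbook sketch rather than the authors' exact pipeline, and the FFT length of 256 is our assumption:

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filt=24, n_fft=256, fs=4000):
    """Triangular filters evenly spaced on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def episode_mfcc(x, n_coef=13, frame=200, shift=100, n_fft=256, fs=4000):
    """13-dim MFCC per frame, averaged over all frames of a sound episode."""
    frames = [x[i:i + frame] for i in range(0, len(x) - frame + 1, shift)]
    power = np.array([np.abs(np.fft.rfft(f, n_fft)) ** 2 for f in frames])
    fb_energy = power @ mel_filterbank(fs=fs).T   # (n_frames, 24)
    log_e = np.log(fb_energy + 1e-10)
    mfcc = dct(log_e, type=2, axis=1, norm="ortho")[:, :n_coef]
    return mfcc.mean(axis=0)
```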

On the basis of these acoustic features, an artificial neural network (ANN) is used as a classifier to categorize the segments detected with the STE method into BS segments and non-BS segments. The ANN is a hierarchical neural network made up of three layers: the input, intermediate, and output layers, with 13, 25, and 1 units, respectively. The output function of the intermediate-layer units is a hyperbolic tangent, and the transfer function of the output-layer unit is linear. As the target signal, a value of 1 is assigned to analysis sections containing BS, and 0 is assigned to sections containing non-BS sounds. The ANN learns this categorization using an error back-propagation algorithm based on the Levenberg–Marquardt method [19,20]. To calculate sensitivity and specificity from the post-training ANN output, a receiver operating characteristic (ROC) curve can be drawn. Through analysis of the ROC curve, an optimum threshold (Th) is estimated for use when classifying the testing data sets; the optimum threshold is the one at the shortest Euclidean distance from the point on the ROC curve at which sensitivity = 1 and specificity = 1 [21]. Applying this threshold to the ANN test output *b̂*, it is possible to calculate the classification accuracy in terms of sensitivity (Sen), specificity (Spe), positive predictive value (PPV), negative predictive value (NPV), and accuracy (Acc).
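The ROC-based threshold selection can be sketched as a simple exhaustive search; `scores` would be the ANN outputs *b̂* on a validation set and `labels` the BS/non-BS targets (names are illustrative):

```python
import numpy as np

def optimal_threshold(scores, labels):
    """Pick the threshold closest (Euclidean distance) to the ideal ROC
    corner where sensitivity = 1 and specificity = 1."""
    best_th, best_dist = None, np.inf
    for th in np.unique(scores):
        pred = scores >= th
        tp = np.sum(pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        sen = tp / max(np.sum(labels == 1), 1)
        spe = tn / max(np.sum(labels == 0), 1)
        dist = np.hypot(1 - sen, 1 - spe)
        if dist < best_dist:
            best_th, best_dist = th, dist
    return best_th
```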

As shown in Figure 1, the automatic BS extraction performance of this ANN-based method is evaluated by dividing the BS and non-BS segments obtained from the 20-person sound database at a ratio of 3:1, using them as training and testing data, respectively. The average classification accuracy was calculated over multiple trials of ANN training and testing in which (1) the initial values of the connection weights were randomly assigned, or (2) the test data were randomly assigned.

**Figure 1.** Block diagram showing the proposed method for automatic BS extraction based on acoustic features. SE: sound episode; MFCC: Mel Frequency Cepstral Coefficients; PNCC: Power Normalized Cepstral Coefficients; ANN: artificial neural network; ROC: receiver operating characteristic; BS: bowel sound; *b̂*: ANN test output; Th: threshold obtained via ROC analysis.

#### *2.3. Evaluation of Bowel Motility Based on Automatically Extracted BS*

Our past research demonstrated significant differences in the following time-domain acoustic features extracted before and after the participants consumed carbonated water: BS detected per minute, SNR, length of BS, and interval between BS (sound-to-sound (SS) interval). These differences suggest that bowel motility can be evaluated on the basis of these acoustic features [8]. This study therefore examines whether bowel motility can be evaluated automatically from these features. To evaluate bowel motility for one participant, the time-domain acoustic features were extracted from the multiple BS automatically extracted by the proposed method under leave-one-out cross validation. As in past studies, the differences in the abovementioned acoustic features before and after the participant consumed carbonated water were evaluated using a Wilcoxon signed-rank test. The block diagram in Figure 2 shows the process leading up to the evaluation of bowel motility.
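The per-participant statistical comparison can be reproduced with SciPy's paired Wilcoxon signed-rank test; the feature values below are hypothetical, for illustration only:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-participant feature values (BS detected per minute)
before = np.array([3.1, 2.4, 4.0, 1.8, 2.9, 3.3, 2.2, 2.7])
after = np.array([5.2, 4.1, 6.3, 3.0, 4.8, 5.5, 3.9, 4.4])

# Paired, non-parametric comparison of before/after carbonated water
stat, p = wilcoxon(before, after)
significant = p < 0.05
```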

**Figure 2.** Block diagram showing the proposed method for automatic evaluation of bowel motility.

#### **3. Results**

To investigate the effect of the SNR threshold used in the STE method on automatic extraction performance and on the evaluation of bowel motility, experiments were performed with SNR thresholds of 0, 0.5, 1, and 2 dB.

#### *3.1. Automatic Bowel Sound Detection*

Table 1 lists the number and length of BS and non-BS segments obtained at each SNR threshold used in the STE method.

**Table 1.** Number and length of BS and non-BS segments obtained at each SNR threshold used in the STE method.


Table 1 reveals the following pattern both before and after the participants consumed carbonated water: as the SNR threshold decreases, the numbers of both BS and non-BS segments increase until a certain threshold, after which they decrease. Additionally, the values in the table confirm that the lengths of both types of segments increase as the SNR threshold decreases. Both the number and length of segments were larger after consumption of carbonated water than before, and BS segments were longer than non-BS segments.

To evaluate the automatic extraction performance of the proposed method, the segments were divided in a ratio of 3:1 into training and testing data. Tables 2 and 3 present the average classification accuracies over 100 trials of the ANN-based approach using MFCC and PNCC, respectively, as acoustic features.

Table 2 reveals that, before consumption of carbonated water, accuracy degraded slightly as the SNR threshold decreased, whereas after consumption, accuracy increased as the threshold decreased. Table 3 demonstrates that when PNCC is used, classification accuracy increases as the SNR threshold decreases, both before and after consumption of carbonated water; the highest accuracy is obtained at an SNR threshold of 0 dB. Figure 3 shows a comparative analysis of extraction accuracy before and after consumption of carbonated water using MFCC and PNCC, respectively. Table 3 shows that PNCC is more accurate than MFCC at all SNR thresholds. At an SNR threshold of 0 dB before consumption of carbonated water, the average accuracy with PNCC is substantially higher than that with MFCC. In general, BS with lower sound pressure occur before consumption of carbonated water rather than after, which suggests that PNCC is effective in classifying such sounds. On the basis of these observations, the subsequent automatic evaluation of bowel motility was conducted using the ANN-based approach with PNCC.


**Table 2.** Results of automatic BS extraction using an ANN-based approach based on MFCC (using performance evaluation through random sampling). Sen: sensitivity; Spe: specificity; PPV: positive predictive value; NPV: negative predictive value; Acc: accuracy.

**Table 3.** Results of automatic BS extraction using an ANN-based approach based on PNCC (using performance evaluation through random sampling).


**Figure 3.** Comparison of accuracies of ANN-based approaches based on MFCC and PNCC, respectively.

#### *3.2. Bowel Motility Evaluation*

In this study, leave-one-out cross validation was performed for each participant, and the classification accuracy of the ANN-based approach using PNCC was verified. Table 4 presents the average classification accuracies, taking for each participant the highest accuracy obtained over 50 repetitions of leave-one-out cross validation.



As noted in a prior study [8], Table 5 shows that the acoustic features of BS detected per minute, SNR, and SS interval can capture the differences in bowel motility before and after a participant consumes carbonated water, even as the SNR threshold decreases toward 0 dB. Note that these results are related to the accuracy of automatic BS extraction. However, unlike in the prior study [8], no significant difference in BS length before and after consumption of carbonated water was found. This suggests that when the SNR threshold is reduced to 0 dB, the features of BS detected per minute, SNR, and SS interval can still evaluate bowel motility without being affected by the reduction in threshold.

**Table 5.** Results of automatic bowel motility evaluation using acoustic features in four time domains: BS detected/min, SNR (dB), length of BS (s), and SS interval (s).


#### **4. Discussion and Conclusions**

This study proposes a system for the automatic evaluation of bowel motility on the basis of time-domain acoustic features of BS automatically extracted from sound data recorded using a noncontact microphone. Although studies of bowel motility using BS have been conducted previously [2–7], they used electronic stethoscopes applied to the surface of the body. Our recent research demonstrated that bowel motility can be evaluated from sound data recorded using a noncontact microphone in the same way as from data recorded with a stethoscope [8]. However, in that study, BS were extracted from the sound data by manual labeling. The sound pressure of BS recorded using noncontact microphones is lower than that of BS recorded using electronic stethoscopes applied to the body surface, and fewer BS are perceptible. As such, using sound data recorded without contact requires an automatic BS extraction method that is resistant to extraneous noise. The results suggest that the system proposed herein, which uses PNCC and has excellent noise resistance, can automatically extract BS with approximately 90% accuracy at an SNR threshold of 0 dB. Furthermore, even at this threshold, the results suggest that bowel motility can be evaluated using the acoustic features other than BS length, namely BS detected per minute, SNR, and SS interval.

Decreasing the SNR threshold used in the STE method allows the proposed method to capture more sound and extends the segment length, increasing the information available for BS/non-BS differentiation; we believe this contributed to the improved automatic extraction performance. However, it also means that the true BS length cannot be recovered, because the decrease in the SNR threshold artificially extends the BS segments.

Compared to the results of the performance evaluation based on random sampling, the results based on leave-one-out cross validation tended to have a larger standard deviation and lower sensitivity, particularly before the consumption of carbonated water. The likely cause is the small number of participants, which meant that insufficient BS segments were available for leave-one-out cross validation; we therefore expect an improvement as the number of subjects increases. To further improve system performance, a combination of the following two measures would likely be useful: (1) replacing the STE method with another method for detecting segments containing sound; and (2) selecting acoustic features with excellent resistance to extraneous noise.

In this study, we have provided new knowledge for noncontact automatic evaluation of bowel motility. It is hoped that the foundations of the system developed in this study can assist in the further development of the evaluation of bowel motility using noncontact microphones and research related to diagnostic support for bowel disorders.

**Author Contributions:** T.E., R.S., and Y.G. conceived and designed the experiments; R.S. and Y.G. performed the experiments; R.S. analyzed the data; R.S. and M.A. contributed materials/analysis tools; T.E. and R.S. wrote the paper.

**Acknowledgments:** This study was partly supported by the Ono Charitable Trust for acoustics.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Multi-Frame PCA-Based Stereo Audio Coding Method**

#### **Jing Wang \*, Xiaohan Zhao, Xiang Xie and Jingming Kuang**

School of Information and Electronics, Beijing Institute of Technology, 100081 Beijing, China; jonestorrons@gmail.com (X.Z.); xiexiang@bit.edu.cn (X.X.); jmkuang@bit.edu.cn (J.K.)

**\*** Correspondence: wangjing@bit.edu.cn; Tel.: +86-138-1015-0086

Received: 18 April 2018; Accepted: 9 June 2018; Published: 12 June 2018

**Abstract:** With the increasing demand for high-quality audio, stereo audio coding has become more and more important. In this paper, a multi-frame coding method based on Principal Component Analysis (PCA) is proposed for the compression of audio signals, including both mono and stereo signals. The PCA-based method projects the input audio spectral coefficients onto the eigenvectors of their covariance matrix and reduces the coding bitrate by grouping these eigenvectors into a smaller number of vectors. The multi-frame joint technique makes the PCA-based method more efficient and feasible. This paper also proposes a quantization method that utilizes Pyramid Vector Quantization (PVQ) to quantize the PCA matrices with few bits. Parametric coding algorithms are also employed with PCA to ensure the high efficiency of the proposed audio codec. Subjective listening tests with Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) have shown that the proposed PCA-based coding method is efficient at processing stereo audio.

**Keywords:** stereo audio coding; Principal Component Analysis (PCA); multi-frame; Pyramid Vector Quantization (PVQ)

#### **1. Introduction**

The goal of audio coding is to represent audio in digital form with as few bits as possible while maintaining the intelligibility and quality required for particular applications [1]. In audio coding, it is very important to handle stereo signals efficiently, which offers a better experience in applications such as mobile communication and live audio broadcasting. Over the years, a variety of techniques for stereo signal processing have been proposed [2,3], including M/S stereo, intensity stereo, joint stereo, and parametric stereo.

M/S stereo coding transforms the left and right channels into a mid channel and a side channel. Intensity stereo works on the principle of sound localization [4]: humans have a less keen sense of the direction of certain audio frequencies. By exploiting this characteristic, intensity stereo coding can reduce the bitrate with little or no perceived change in apparent quality; therefore, at very low bitrates, this type of coding usually yields a gain in perceived audio quality. Intensity stereo is supported by many audio compression formats such as Advanced Audio Coding (AAC) [5,6], which is used for the transfer of relatively low-bitrate, acceptable-quality audio with modest internet access speeds. Encoders with joint stereo, such as Moving Picture Experts Group (MPEG) Audio Layer III (MP3) and Ogg Vorbis [7], use different algorithms to determine when to switch and how much space should be allocated to each channel (the quality can suffer if the switching is too frequent or if the side channel does not get enough bits). Based on the principles of human hearing [8,9], Parametric Stereo (PS) performs sparse coding in the spatial domain. The idea behind parametric stereo coding is to maximize the compression of a stereo signal by transmitting parameters describing the spatial image. For stereo input signals, the compression process basically follows one idea: synthesizing one signal from the two input channels and extracting parameters to be encoded and transmitted in order to add spatial cues to the synthesized stereo at the receiver's end. The parameter estimation is made in the frequency domain [10,11]. AAC with Spectral Band Replication (SBR) and parametric stereo is defined as High-Efficiency Advanced Audio Coding version 2 (HE-AACv2). On the basis of the stereo algorithms mentioned above, other improved algorithms have been proposed [12]; one of these uses Max Coherent Rotation (MCR) to enhance the correlation between the left and right channels and uses the MCR angle to substitute for the spatial parameters. This kind of method with MCR reduces the bitrate of the spatial parameters and increases the performance of some spatial audio coding, but it has not been widely used.

Audio codecs usually use transforms such as the Discrete Cosine Transform (DCT) [13], Fast Fourier Transform (FFT) [14], and Wavelet Transform [15] to transfer the audio signal from the time domain to the frequency domain in suitably windowed time frames. The Modified Discrete Cosine Transform (MDCT) is a lapped transform based on the type-IV Discrete Cosine Transform (DCT-IV). Compared to other Fourier-related transforms, it has half as many outputs as inputs, and it has been widely used in audio coding. These transforms are general transformations; therefore, the energy aggregation can be further enhanced through an additional transformation like PCA [16,17], which is one of the optimal orthogonal transformations based on statistical properties. The orthogonal transformation can be understood as a coordinate transformation: with PCA, fewer new bases can be selected to construct a low-dimensional space that describes the data in the original high-dimensional space, which means the compressibility is higher. Some work has been done on audio coding combined with PCA from different perspectives. Paper [18] proposed a novel method to match different subbands of the left and right channels based on PCA, through which the redundancy of the two channels can be further reduced. Paper [19] mainly focused on multichannel processing and the application of PCA in the subband, and it discussed several details of PCA, such as the energy of each eigenvector and the signal waveform after PCA; it introduced a rotation angle via the Karhunen-Loève Transform (KLT) instead of the rotation matrix and reduced-dimensional matrix used in our paper. Paper [20] mainly focused on multichannel localization based on PCA, with which the original audio is separated into primary and ambient components. These different components are then used to analyze spatial perception, respectively, in order to improve the robustness of multichannel audio coding.

In this paper, a multi-frame, PCA-based coding method for audio compression is proposed, which makes use of the properties of the orthogonal transformation and explores the feasibility of further increasing the compression rate after the time-frequency transform. Compared to previous work, this paper proposes a different method of applying PCA in audio coding. The main contributions of this paper include a new matrix construction method, a matrix quantization method based on PVQ, a method combining PCA with parametric stereo, and a multi-frame technique combined with PCA. In this method, the encoder transmits the matrices generated by PCA instead of the coefficients of the frequency spectrum. The proposed PCA-based coding method can handle both mono signals and, combined with parametric stereo, stereo signals. With the application of the multi-frame technique, the bitrate can be further reduced with a small impact on quality. To reduce the bitrate of the matrices, a method of matrix quantization based on PVQ [21] is put forward in this paper.

The rest of the paper is organized as follows: Section 2 describes the multi-frame, PCA-based coding method for mono signals. Section 3 presents the proposed design of the matrix quantization. In Section 4, the PCA-based coding method for the mono signal is extended to stereo signals combined with improved parametric stereo. The experimental results, discussion, and conclusion are presented in Sections 5–7, respectively.

#### **2. Multi-Frame PCA-Based Coding Method**

#### *2.1. Framework of PCA-Based Coding Method*

The encoding process can be described as follows: after a time-frequency transformation such as the MDCT, the frequency coefficients are passed to the PCA module, which includes the multi-frame technique. The matrices generated by PCA are quantized and encoded into the bitstream. The decoder is the mirror image of the encoder: after decoding and de-quantization, the matrices are used to generate frequency-domain signals by inverse PCA (iPCA). Finally, after the frequency-time transformation, the decoder can output audio. Flowcharts of the encoder and decoder for mono signals are shown in Figures 1 and 2. The MDCT is used to concentrate the signal energy in the low band of the frequency domain, which benefits the process of matrix construction (details are given in Section 2.4). Some informal listening experiments have been carried out on the performance of PCA applied without the MDCT. The results show that without the MDCT, the performance of PCA is slightly reduced, meaning that more bits are needed to achieve the same output quality as the scheme with the MDCT. Thus, in this paper the MDCT is used to enhance the performance of the PCA, although it brings more computational complexity.

**Figure 1.** Flowchart of mono encoder. (TF, Time-to-Frequency; PCA, Principle Component Analysis).

**Figure 2.** Flowchart of mono decoder. (iPCA, inverse Principle Component Analysis; FT, Frequency-to-Time).

#### *2.2. Principle of PCA*

The mathematical principle of PCA is as follows: after a coordinate transformation, the original high-dimensional samples with certain correlations can be transferred to a new set of low-dimensional samples that are uncorrelated with each other. These new samples carry most of the information of the original data and can replace the original samples in follow-up analysis.

There are several criteria for choosing new samples, or selecting new bases, in PCA. The typical method is to use the variance of a new sample *F*<sub>*i*</sub> (i.e., the variance of the original samples mapped onto the new coordinates). The larger Var(*F*<sub>*i*</sub>) is, the more information *F*<sub>*i*</sub> contains. So the first principal component *F*<sub>1</sub> should have the largest variance. If the first principal component *F*<sub>1</sub> is not sufficient to replace the original sample, then the second principal component *F*<sub>2</sub> should be considered. *F*<sub>2</sub> is the principal component with the largest variance except for *F*<sub>1</sub>, and *F*<sub>2</sub> is uncorrelated with *F*<sub>1</sub>, that is, Cov(*F*<sub>1</sub>, *F*<sub>2</sub>) = 0. This means that the bases of *F*<sub>1</sub> and *F*<sub>2</sub> are orthogonal to each other, which effectively reduces the data redundancy between the new samples (or principal components). The third, fourth, and *p*-th principal components can be constructed similarly. The variances of these principal components are in descending order, and each corresponding base in the new space is uncorrelated with the other new bases. For *m* *n*-dimensional data points, the PCA procedure is shown in Table 1.

**Table 1.** PCA algorithm.


The contribution rate of a principal component reflects the proportion of the total data variance that it accounts for after the coordinate transformation, which effectively solves the problem of dimension selection after dimensionality reduction. In PCA applications, the cumulative contribution rate is often used as the basis for principal component selection. The cumulative contribution rate *M<sub>k</sub>* of the first *k* principal components is

$$M_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \tag{1}$$

If the cumulative contribution rate of the first *k* principal components meets the specific requirement (which varies by application), the first *k* principal components can be used to describe the original data, achieving the purpose of dimensionality reduction.
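Choosing the number of principal components from the cumulative contribution rate in Equation (1) can be sketched as follows; `n_components` is our own helper name, and the default target is illustrative:

```python
import numpy as np

def n_components(X, target=0.9997):
    """Smallest k whose cumulative contribution rate M_k reaches `target`.
    X holds one sample per row; eigenvalues come from the covariance matrix."""
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending eigenvalues
    m = np.cumsum(eig) / eig.sum()                 # M_1, M_2, ..., M_n
    return int(np.searchsorted(m, target) + 1)
```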

PCA is a good transformation due to its properties, as follows:


It is worth noting that PCA does not simply delete the data of little importance. After the PCA transformation, the dimension-reduced data can be transformed back to restore most of the high-dimensional original data, which is a good property for data compression. In this paper, as shown in Figure 3, the spectral coefficients of the input signal are divided into multiple samples according to specific rules; these samples are then used to construct the original matrix *X*. After principal component analysis, matrix *X* is decomposed into a reduced-dimensional matrix *Y* and a rotation matrix *P*; the process of calculating *Y* and *P* is shown in Table 1. The matrices *Y* and *P* are transmitted to the decoder after quantization and coding. In the decoder, the original matrix can be restored by multiplying the reduced-dimensional matrix by the transposed rotation matrix. There is some data loss during dimension reduction, but the loss is small enough to ignore; for example, 99.97% of the information can be recovered through a 6-dimensional matrix when the autocorrelation matrix is 15-dimensional. Ideally, the original matrix *X* can be restored from the reduced-dimensional matrix *Y* and the rotation matrix *P* with

$$X \approx X_{\text{restore}} = Y \times P^T \tag{2}$$

in which *X*<sub>restore</sub> is the matrix restored in the decoder and *P<sup>T</sup>* is the transpose of matrix *P*. Then, *X*<sub>restore</sub> is reconstructed into spectral coefficients.
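Equation (2) can be illustrated with a toy PCA round trip. For brevity, this sketch takes the bases from the scatter matrix *X*<sup>T</sup>*X* without mean-centering, a simplification of the full PCA procedure:

```python
import numpy as np

def pca_reduce(X, k):
    """Decompose X into reduced-dimensional matrix Y and rotation matrix P,
    so that X can be restored as Y @ P.T (Equation (2))."""
    _, vecs = np.linalg.eigh(X.T @ X)
    P = vecs[:, ::-1][:, :k]   # rotation matrix: top-k eigenvectors as columns
    Y = X @ P                  # reduced-dimensional matrix
    return Y, P

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 15))  # rank-3 toy frame
Y, P = pca_reduce(X, k=6)
X_restore = Y @ P.T            # decoder side: Equation (2)
```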

**Figure 3.** Scheme of PCA-based coding method. (PCA, Principle Component Analysis).

#### *2.3. Format of Each Matrix*

In the encoder, when the sampling rate is 48 kHz, each frame has 240 spectral coefficients after the MDCT (in this paper, the MDCT frame size is 5 ms with 50% overlap). There are many possible matrix formats, such as 6 × 40, 12 × 20, and 20 × 12; each format brings a different compression rate. In a simple test, several formats of the original matrix were constructed, and a subjective test was devised using the resulting rotation matrices of different dimensions. Ten listeners recorded the number of dimensions at which the restored audio had acceptable quality, and the compression rate was then calculated from that number of dimensions. As shown in Figure 4, the matrix has the largest compression rate when it has 16 rows. So, the matrix *X*<sub>16×15</sub>, with 16 rows and 15 columns, is selected in this paper. That means a 240-coefficient frequency-domain frame is divided into 16 samples, each sample having 15 dimensions.

**Figure 4.** Compression rate for different format of matrix.

#### *2.4. Way of Matrix Construction*

An appropriate way to obtain the 16 samples from the frequency-domain coefficients is needed. This paper proposes the following method: suppose the frequency-domain coefficients of one frame are *a*1, *a*2, ..., *a*240. *a*1 is placed in the first row of the first column *X*[1, 1], *a*2 in the second row of the first column *X*[2, 1], and so on until *a*16 fills the 16th row of the first column *X*[16, 1]. Then, *a*17 is placed in the first row of the second column *X*[1, 2], *a*18 in the second row of the second column *X*[2, 2], and so on, until all the coefficients have been filled into the original matrix *X*[16 × 15]; that is,

$$X\_{[16\ 15]} = \begin{bmatrix} a\_{1} & \cdots & a\_{225} \\ \vdots & \ddots & \vdots \\ a\_{16} & \cdots & a\_{240} \end{bmatrix} \tag{3}$$
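The fill order just described is exactly a column-major reshape of the 240 coefficients into a 16 × 15 matrix; a small sketch with stand-in values:

```python
import numpy as np

# The fill order above (a1..a16 down column 1, a17..a32 down column 2, ...)
# is a column-major ("Fortran-order") reshape of the 240 MDCT coefficients.
coeffs = np.arange(1, 241)                 # stand-ins for a_1 ... a_240
X = coeffs.reshape((16, 15), order="F")    # 16 rows (samples), 15 columns (dimensions)
```

With this layout, each column holds 16 consecutive MDCT bins, i.e., one contiguous frequency band.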

This method has two obvious advantages, which can be seen in Figure 5:

**Figure 5.** Example for matrix construction ("value" means the value of cells in original matrix, "column" means the column of original matrix, and "row" means the row of original matrix).

(i) This method takes advantage of the short-time stationarity of the signal in the frequency domain. The differences between rows within the same column of a matrix constructed this way are therefore small. In other words, the same dimension varies little across samples, and the dimensions share similar linear relationships, which is very conducive to dimensionality reduction.

(ii) This method keeps the signal energy concentrated in the low-dimensional region of the new space. The energy of the frequency-domain signal is concentrated in the low-frequency region; after PCA, the leading columns of the reduced-dimensional matrix still carry most of the signal energy. Thus, after dimensionality reduction, the coding can still focus on the low-dimensional region.

#### *2.5. Multi-Frame Joint PCA*

In the experiments, it was observed that the rotation matrices of adjacent frames are highly similar. It is therefore possible to perform joint PCA over multiple frames to generate a single rotation matrix, i.e., to let multiple frames share the same rotation matrix. The codec can then transmit fewer rotation matrices, and the bitrate is reduced.

One way to perform joint PCA with least error is as follows. First, the frequency-domain coefficients of *n* sub-frames are constructed as *n* original matrices *X*1[16 × 15], *X*2[16 × 15], ..., *Xn*[16 × 15]. These sub-frame matrices are then stacked to form one original matrix *X*[16*n* × 15], which is used to obtain one rotation matrix and *n* reduced-dimensional matrices.
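Under the sizes used in this paper (16 × 15 sub-frame matrices, 6 retained bases), the stacking can be sketched as follows; the helper name is ours:

```python
import numpy as np

# Joint PCA over n sub-frames: stack the sub-frame matrices into one
# (16n x 15) matrix, compute a single shared rotation matrix from it,
# then give each sub-frame its own reduced-dimensional matrix.
def joint_pca(subframes, k=6):
    X = np.vstack(subframes)                        # (16n) x 15 joint matrix
    C = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # one shared rotation matrix
    return [Xi @ P for Xi in subframes], P          # n reduced matrices + P

rng = np.random.default_rng(1)
frames = [rng.standard_normal((16, 15)) for _ in range(8)]   # 8 sub-frames, as in the paper
Ys, P = joint_pca(frames)
```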

If too many frames are analyzed at once, the codec delay becomes high, which is unacceptable for real-time communication. Moreover, the average quality of the restored audio decreases as the number of frames increases. The bitrate reduction and the real-time requirement must therefore be balanced. A subjective listening test was designed to find the relationship between the number of frames and the quality of the restored signal: 10 audio excerpts from the European Broadcasting Union (EBU) test materials were coded with multi-frame PCA using different numbers of frames, and 10 listeners recorded the Mean Opinion Score (MOS) [22] of the restored music. The statistical results are shown in Figure 6.

**Figure 6.** Subjective test results for different number of frames.

As shown in Figure 6, when the number of frames is at most 6 to 8, the decrease in audio quality is not obvious, so a number of frames in this range is suitable for joint PCA. Overall, when 8 sub-frames are analyzed at once, both the bitrate and the encoder delay are acceptable: for every 40 ms of signal, 8 sub-frame reduced-dimensional (Rd) matrices and one rotation matrix are transferred. The main functions of the mono encoder and decoder combined with multi-frame joint PCA are shown in Figures 7 and 8. In the encoder, 40 ms of signal produces 8 Rd matrices and 1 rotation matrix. In the decoder, after receiving the 8 Rd matrices and the rotation matrix, 8 frames are restored to regenerate the 40 ms of signal.

**Figure 7.** Multi-frame in encoder. (PCA, Principle Component Analysis; Rd, reduced-dimensional).

**Figure 8.** Multi-frame in decoder. (iPCA, inverse Principle Component Analysis; Rd, reduced-dimensional).

#### **3. Quantization Design Based On PVQ**

According to the properties of matrix multiplication, a large error at a single point of matrix *Y* or *P* can produce a large error in the restored signal. Uniform quantization therefore cannot keep the error of every point in the matrix within an acceptable range under the bitrate constraint, and a new set of quantization rules based on the properties of the reduced-dimensional matrix and the rotation matrix is necessary. The audio signal is assumed to follow a Laplace distribution [23], and both the PCA and the MDCT used in this paper are orthogonal transformations, so the distribution of the matrix coefficients remains Laplacian. This is consistent with our observations of the values in the reduced-dimensional matrix and the rotation matrix: most cell values are close to 0, and the larger the absolute value, the smaller its probability. Based on these two observations, the coefficients of both matrices can be regarded as Laplace-distributed. Lattice vector quantization (LVQ) is widely used in codecs because of its low computational complexity, and PVQ is an LVQ method that is well suited to the Laplace distribution. This section therefore presents a quantization design for the reduced-dimensional matrix and the rotation matrix based on PVQ.
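For background, the size of a PVQ codebook — the number of integer vectors of length L whose absolute values sum to K — follows a standard recurrence, from which the bits per quantized vector can be derived. This is generic PVQ bookkeeping, not the codec's exact quantizer:

```python
from functools import lru_cache
from math import ceil, log2

# N(L, K): number of PVQ codepoints, i.e., integer vectors of length L
# with sum of absolute values equal to K.  The standard recurrence splits
# on whether the first entry is zero, +/-1 pulse, or part of a longer run.
@lru_cache(maxsize=None)
def pvq_count(L, K):
    if K == 0:
        return 1          # only the all-zero vector
    if L == 0:
        return 0          # no room for the remaining pulses
    return pvq_count(L - 1, K) + pvq_count(L, K - 1) + pvq_count(L - 1, K - 1)

def pvq_bits(L, K):
    """Bits needed to index one PVQ codepoint."""
    return ceil(log2(pvq_count(L, K)))
```

For example, length-2 vectors with 2 pulses give the 8 codepoints (±2, 0), (0, ±2), (±1, ±1), hence 3 bits.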

#### *3.1. Quantization Design of the Reduced-Dimensional Matrix*

In the reduced-dimensional matrix, the first column is the first principal component, the second column is the second principal component, and so on. By the properties of PCA, the first principal component carries the most important information of the original signal, and the information carried by subsequent principal components becomes progressively less important. In fact, more than 95% of the original signal energy (which can also be regarded as information) is restored by the first principal component alone. Consequently, if the quantization error of the first principal component is large, the restored signal will also deviate substantially from the original. The first principal component therefore needs to be allocated the most bits, with the allocation decreasing for subsequent components. For some kinds of audio, 4 principal components are enough to obtain acceptable quality, while other kinds may need 5. We choose 6 principal components because they satisfy almost all kinds of audio. In practice, the fifth and sixth principal components play only a small role in the restored spectrum, so little quantization accuracy is needed for them.

Based on the above, the reduced-dimensional matrix can be divided into regions, as shown in Figure 9. Different regions receive different bit allocations: a darker color means more bits are needed.

**Figure 9.** Bits allocation for reduced-dimensional matrix (darker color means more bits needed).

A PVQ quantizer was used to quantize each principal component of the reduced-dimensional matrix with different bit allocations. Several subjective listening tests were carried out, and the bit-assignment policy was determined from the quality of the restored audio under the different assignments. The number of bits allocated to each principal component was thereby fixed; Table 2 gives the number of bits required for each non-zero principal component of the reduced-dimensional matrix under the PVQ quantizer.


**Table 2.** Quantization bit for reduced-dimensional matrix.

#### *3.2. Quantization Design of the Rotation Matrix*

According to *Y* = *XP* in the encoder and *Xrestore* = *YP<sup>T</sup>* in the decoder, some properties of the rotation matrix can be found:


multiplies with the second column (second principal component) of the reduced-dimensional matrix, and so on. According to these properties of the rotation matrix, the quantization distribution becomes clearer: the larger the row and column indices are, the fewer bits are allocated.

In addition to the above two properties of the rotation matrix, there is another important one: the data in the first four rows around the diagonal are generally larger than the rest. The reasoning is as follows: common audio concentrates its energy in the low band of the frequency domain, and the matrix-construction method of Section 2.4 keeps the low-band coefficients in the leading columns. Thus, the first diagonal value, computed from the first column, must be the largest value in the rotation matrix (or autocorrelation matrix); the second diagonal value is very likely the second largest, and so on. These data are therefore more important for the decoder, and the quantization accuracy of the regions with larger absolute values determines the error between the restored and original signals. Accordingly, the data around the diagonal need to be allocated more bits. Figure 10 shows the element-wise average rotation matrix of a piece of audio as an example of this property.

**Figure 10.** An example rotation matrix ("value" means the average value of cells in rotation matrices, "column" means the column of rotation matrix, and "row" means the row of rotation matrix).

The rotation matrix also has the following quantization criterion:


According to the above quantization criteria, the rotation matrix is divided into the regions shown in Figure 11 according to bit allocation. The darker the color, the more bits are allocated.

**Figure 11.** Bit allocation for rotation matrix (darker color means more bits needed; white color means no bits).

The same test method as for the reduced-dimensional matrix was used to determine the number of bits needed in each region of the rotation matrix.

In Table 3, the first region corresponds to the darkest region in Figure 11, the second to the second-darkest, and so on. White means no bits are allocated to that area.


**Table 3.** Quantization bits for rotation matrix.

#### *3.3. Design of the Low-Pass Filter*

The noise generated by quantization and the matrix calculations is white noise. There are two ways to reduce it: introducing noise shaping to make the noise less objectionable to human hearing, or introducing a filter in the decoder.

For most signals, the energy is concentrated in the low-frequency region, so low-frequency noise is not audible because of simultaneous masking. In the high-frequency part, however, if the original signal has no high-frequency components, the noise is not masked and can be heard. A low-pass filter can therefore be used to remove the high-frequency noise without affecting the original signal. The key point of the filter design is determining the cut-off frequency.

Given the original matrix *X*[16 × 15] of Equation (3), there are 15 subbands in *X*, in which the first subband is the first column, the second subband is the second column, and so on. When $C = \frac{1}{m}X^T X$ is calculated in the PCA, the first value *e*1 on the diagonal *e*1, *e*2, ..., *e*15 is calculated by

$$\begin{aligned} e\_1 &= \left( (a\_1 - \overline{a})^2 + (a\_2 - \overline{a})^2 + \dots + (a\_{16} - \overline{a})^2 \right) / 16 \\ &= \left( a\_1^2 + a\_2^2 + \dots + a\_{16}^2 + 16\overline{a}^2 - 2\overline{a}(a\_1 + a\_2 + \dots + a\_{16}) \right) / 16 \\ &= \left( a\_1^2 + a\_2^2 + \dots + a\_{16}^2 - 16\overline{a}^2 \right) / 16 \end{aligned} \tag{4}$$

in which $a\_1^2 + a\_2^2 + \dots + a\_{16}^2$ is equal to the energy *E*1 of the first subband, and $\overline{a}$ is the average value of the first subband. Therefore, the relationship between *E*1 and *e*1 is

$$E\_1 = 16(e\_1 + \overline{a}^2) \tag{5}$$

In practice, the value of $\overline{a}^2$ is far smaller than *e*1, so *E*1 is approximately 16*e*1; the relationships between *E*2, ..., *E*15 and *e*2, ..., *e*15 follow by analogy. Through the PCA, the energy of each subband is thus obtained, and the filter can be determined from these band energies. Considering the proportion of accumulated energy, *Ak* is

$$A\_k = \frac{\sum\_{i=1}^{k} e\_i}{\sum\_{i=1}^{15} e\_i} \tag{6}$$

According to our experiments, the smallest *k* for which *Ak* ≥ 99.6% is a proper cut-off band. When the signal passes through the resulting filter, the noise is filtered out while the signal itself is not noticeably damaged.
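The cut-off selection of Equations (4)–(6) amounts to reading the subband variances off the diagonal of *C* and accumulating them until the 99.6% threshold is reached; a minimal sketch (function name and toy input are ours):

```python
import numpy as np

# Cut-off band selection from the PCA covariance diagonal: E_i ~ 16 * e_i
# (Equation (5) with the small mean term dropped), and the cut-off k is the
# first band where the accumulated energy ratio A_k reaches the threshold.
def cutoff_band(X, threshold=0.996):
    C = X.T @ X / X.shape[0]          # 15 x 15, as in C = (1/m) X^T X
    e = np.diag(C)                    # per-subband variances e_1 ... e_15
    A = np.cumsum(e) / np.sum(e)      # accumulated energy ratio A_k
    return int(np.searchsorted(A, threshold)) + 1   # 1-based band index

# Hypothetical low-pass signal: energy only in the first 3 subbands (columns).
X = np.zeros((16, 15))
X[:, :3] = 1.0
k = cutoff_band(X)
```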

Considering the frequency characteristics of audio signals, the cut-off is never set very low, and content above 20,000 Hz is ignored by default, so not all of the 15 bands are candidates for the cut-off. In particular, *e*1, *e*2, *e*3 and *e*12, ..., *e*15 need not be considered, and the index of the remaining 8 candidate bands is quantized with 3 bits, giving a bitrate of 75 bps for the cut-off band.

#### **4. PCA-Based Parametric Stereo**

The stereo coding method proposed in this paper, an extension of the mono coding method described above, is shown in Figures 12 and 13. The stereo encoder and decoder use the same PCA and quantization modules as the mono codec; the differences between mono and stereo coding are elaborated in the following sections. In the encoder, the two channels' signals are transformed by the MDCT, and their coefficients are gathered into one original matrix for the PCA; an improved parametric-stereo module then downmixes the signal and computes the high-band parameters, and finally a PVQ-based module quantizes the matrix coefficients. In the decoder, the coefficients of the downmixed reduced-dimensional matrix and the rotation matrix are used to generate the mid channel; the spatial parameters and other side information are then used to restore the stereo signal. After the inverse MDCT (iMDCT) and filtering, the result is the output signal.

**Figure 12.** Flowchart of stereo encoder. (MDCT, Modified Discrete Cosine Transform; PCA, Principle Component Analysis; IC, Interaural Coherence; ILD, Interaural Level Difference).

**Figure 13.** Flowchart of stereo decoder. (MDCT, Modified Discrete Cosine Transform; iPCA, inverse Principle Component Analysis; IC, Interaural Coherence; ILD, Interaural Level Difference).

#### *4.1. Processing of Stereo Signals*

The two channels of a stereo signal tend to be highly correlated, so the left- and right-channel signals can be combined into one original matrix. First, the coefficients of the left and right channels are used to construct the original matrices *Xl*[*m* × *n*] and *Xr*[*m* × *n*], respectively. These are then stacked to form a new matrix $X = \begin{bmatrix} X\_{l[m\ n]} \\ X\_{r[m\ n]} \end{bmatrix}$, from which one rotation matrix *P*[*n* × *k*] is obtained by PCA; this single *P*[*n* × *k*] can handle both channels. That is,

$$Y\_{l[m\ k]} = X\_{l[m\ n]} \times P\_{[n\ k]}\tag{7}$$

$$Y\_{r[m\ k]} = X\_{r[m\ n]} \times P\_{[n\ k]} \tag{8}$$

If the first six principal components are preserved, most mono audio signals can be restored well, so we keep the first six bases of the principal-component matrix and obtain the rotation matrix *P*[15 × 6]. The reduced-dimensional matrices of the sub-frames are *Y*1[16 × 6], *Y*2[16 × 6], ..., *Y*8[16 × 6]. Experiments were run to verify this design for stereo signals: 10 normal audio files and 5 artificially synthesized files (in which the left and right channels have low correlation) were chosen as test materials. The results of the subjective listening experiments are shown in Figures 14 and 15. For most stereo signals, in which the two channels are highly correlated, the proposed method performs as well for stereo as for mono.
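The shared-rotation scheme of Equations (7) and (8) can be sketched as follows (the helper name and test signals are ours):

```python
import numpy as np

# One rotation matrix P for both channels: stack Xl and Xr, run PCA once,
# then project each channel with the shared P (Equations (7) and (8)).
def stereo_pca(Xl, Xr, k=6):
    X = np.vstack([Xl, Xr])               # joint (2m x n) original matrix
    C = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xl @ P, Xr @ P, P              # Y_l, Y_r, and the shared P

# Hypothetical highly correlated channels with low-dimension-heavy energy.
rng = np.random.default_rng(2)
Xl = rng.standard_normal((16, 15)) * (2.0 ** -np.arange(15))
Xr = Xl + 0.01 * rng.standard_normal((16, 15))
Yl, Yr, P = stereo_pca(Xl, Xr)
```

When the channels are highly correlated, the joint subspace fits both, so a single *P* costs little quality; for uncorrelated synthetic channels the shared subspace is a compromise, matching the drop seen in Figure 15.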

**Figure 14.** Subjective MOS of high-relation stereo signal. (MOS, Mean Opinion Score).

**Figure 15.** Subjective MOS of low-relation stereo signal.

#### *4.2. Parameters in Parametric Stereo*

In parametric stereo, the Interaural Level Difference (ILD), Interaural Time Difference (ITD), and Interaural Coherence (IC) are used to describe the difference between the two channels' signals. In the MDCT domain, these parameters for subband *b* are calculated by:

$$\text{ILD}[b] = 10 \log\_{10} \frac{\sum\_{k=A\_{b-1}}^{A\_b-1} X\_l(k)X\_l(k)}{\sum\_{k=A\_{b-1}}^{A\_b-1} X\_r(k)X\_r(k)} \tag{9}$$

$$\text{IC}[\mathbf{b}] = \mathbb{R}(X\_{bl}(k), X\_{br}(k)) = \frac{\langle X\_{bl}(k), X\_{br}(k) \rangle}{|X\_{bl}(k)||X\_{br}(k)|} \tag{10}$$
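Equations (9) and (10) transcribe directly into code for a single subband (function names are ours; `xl` and `xr` stand for that subband's MDCT coefficients):

```python
import numpy as np

# ILD (level difference in dB) and IC (normalized coherence) of one subband,
# following Equations (9) and (10).
def ild_db(xl, xr):
    return 10.0 * np.log10(np.dot(xl, xl) / np.dot(xr, xr))

def ic(xl, xr):
    return np.dot(xl, xr) / (np.linalg.norm(xl) * np.linalg.norm(xr))

# Hypothetical subband where the left channel is twice the right: the ILD is
# 10*log10(4) ~ 6.02 dB and the channels are fully coherent (IC = 1).
x = np.array([1.0, -2.0, 0.5, 3.0])
left, right = 2.0 * x, x
```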

In the MDCT domain, however, the ITD cannot be computed directly; instead, the Modified Discrete Sine Transform (MDST) is introduced so that the Interaural Phase Difference (IPD) can be calculated in place of the ITD, where the MDST is:

$$Y(k) = \sum\_{n=0}^{N-1} x(n)w(n)\sin\left[\frac{2\pi}{N}\left(n + \frac{1}{2} + \frac{N}{4}\right)\left(k + \frac{1}{2}\right)\right], k = 0, 1, 2\ldots, \frac{N}{2} - 1\tag{11}$$

in which *Y*(*k*) are the spectral coefficients, *x*(*n*) is the input signal in the time domain, and *w*(*n*) is the window function. A new transform, the MDFT, is then introduced as *Z*(*k*) = *X*(*k*) + *jY*(*k*), in which *X*(*k*) are the MDCT spectral coefficients and *Y*(*k*) are the MDST spectral coefficients; the IPD can then be calculated by

$$\text{IPD}[\mathbf{b}] = \angle \left( \sum\_{k=A\_{b-1}}^{A\_b-1} Z\_l(k) Z\_r \, ^\*(k) \right) \tag{12}$$

A traditional decoder uses these parameters and a downmix signal to restore the left- and right-channel signals. Comparing Formulas (9) and (10) with the PCA processing, when the method of Section 4.1 is used for stereo signals, the sums $\sum\_{k=A\_{b-1}}^{A\_b-1} X\_l(k)X\_l(k)$ and $\sum\_{k=A\_{b-1}}^{A\_b-1} X\_r(k)X\_r(k)$ are already computed during the PCA, so parametric stereo and PCA combine naturally. After the PCA, the ILD and IC can be obtained simply from *Xbl*(*k*) and *Xbr*(*k*). The IPD would still have to be calculated by Formula (12); however, introducing the MDST adds computational complexity, and the ITD/IPD mainly matters for signals below 1.6 kHz while playing a smaller role in the high-frequency domain. Some improvements to the parametric stereo can therefore be made based on the nature of the PCA.

#### *4.3. PCA-Based Parametric Stereo*

Given that the original matrix is $X = \begin{bmatrix} a\_1 & \cdots & a\_{225} \\ \vdots & \ddots & \vdots \\ a\_{16} & \cdots & a\_{240} \end{bmatrix}$ and the rotation matrix is $P = \begin{bmatrix} p\_1 & \cdots & p\_{76} \\ \vdots & \ddots & \vdots \\ p\_{15} & \cdots & p\_{90} \end{bmatrix}$, the reduced-dimensional matrix is $Y = XP = \begin{bmatrix} b\_1 & \cdots & b\_{49} \\ \vdots & \ddots & \vdots \\ b\_{16} & \cdots & b\_{64} \end{bmatrix}$. For the coefficients in the reduced-dimensional matrix *Y*,

$$b\_1 = a\_1 p\_1 + a\_{17} p\_2 + \dots + a\_{225} p\_{15} \tag{13}$$

$$b\_2 = a\_2 p\_1 + a\_{18} p\_2 + \dots + a\_{226} p\_{15} \tag{14}$$

$$b\_{16} = a\_{16} p\_1 + a\_{32} p\_2 + \dots + a\_{240} p\_{15} \tag{15}$$

The first column of *Y* is related only to the first column of *P* (the first base). As Figure 10 shows, the energy of the first base of the rotation matrix is concentrated almost entirely in the first element of the first column. Therefore, the first column of *Y* can be approximated as

$$b\_1 = a\_1 p\_1 \tag{16}$$

$$b\_2 = a\_2 p\_1 \tag{17}$$

$$b\_{16} = a\_{16} p\_1 \tag{18}$$

Moreover, *p*1 in matrix *P* is approximately equal to 1, so the first column of *Y* is approximately equal to the first column of the original matrix *X*. When the sampling rate is 48 kHz, the first column of *X* contains the coefficients from 0 to 1.6 kHz, which means that the portion of the spectrum below 1.6 kHz is, in effect, the first principal component. The first principal component can therefore be used to restore the signal below 1.6 kHz directly, instead of introducing the MDST and estimating binaural cues. In the decoder, the spectrum of the left and right channels above 1.6 kHz is restored from the downmixed reduced-dimensional matrix, the rotation matrix, and the spatial parameters, while the spectrum below 1.6 kHz is restored from the first principal component and the downmixed reduced-dimensional matrix.

#### *4.4. Subbands and Bitrate*

The spectrum of the signal is divided into several segments based on the Equivalent Rectangular Bandwidth (ERB) model. The subbands are shown in Table 4.


**Table 4.** Subband division.

The quantization of the spatial parameters uses ordinary vector quantization. The codebooks for the different parameters were designed based on the sensitivity of the human ear and the range of parameter fluctuation in the experimental corpus. The codebooks for ILD and IC are shown in Tables 5 and 6, respectively.

**Table 5.** Codebook for ILD. (ILD, Interaural Level Difference).


**Table 6.** Codebook for IC. (IC, Interaural Coherence).


According to the above codebooks, the ILD parameter of each subband is quantized with 4 bits and the IC parameter with 3 bits. With the above subband division, the subbands above 1.6 kHz account for half of the total, i.e., 13 subbands, so each frame's spatial parameters need 13 × 7 = 91 bits. For the frequencies above 1.6 kHz, the parameter rate is therefore about 4.5 kbps. Below 1.6 kHz, the first principal component describes the signal directly; its transmission rate is around 10 kbps, so the parameter rate of the PCA-based parametric stereo is around 15 kbps in total. In traditional parametric stereo [24], the IPD of each subband is additionally quantized with 3 bits, so its parameter rate is about (4 + 3 + 3 + 3) × 25 × 50 = 16.25 kbps. Compared with traditional parametric stereo, the rate of the PCA-based parametric stereo is thus slightly reduced.
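The rate bookkeeping above can be checked directly; the 50 parameter-frames-per-second figure is an assumption consistent with the quoted 16.25 kbps for traditional parametric stereo:

```python
# Parameter-rate bookkeeping for the spatial side information.  The 50
# parameter-frames-per-second (20 ms) figure is an assumption inferred from
# the quoted 16.25 kbps for traditional parametric stereo.
FRAMES_PER_SECOND = 50
SUBBANDS_ABOVE_1K6 = 13
BITS_PER_SUBBAND = 4 + 3                     # ILD (4 bits) + IC (3 bits)

bits_per_frame = SUBBANDS_ABOVE_1K6 * BITS_PER_SUBBAND        # 91 bits/frame
pca_ps_rate = bits_per_frame * FRAMES_PER_SECOND              # ~4.5 kbps
traditional_rate = (4 + 3 + 3 + 3) * 25 * FRAMES_PER_SECOND   # 16.25 kbps
```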

Figure 16 shows the results of a 0–1 test of spatial sense, for which 12 stereo music excerpts from the EBU test materials were chosen. A score of 0 means the sound localization is stable, and a score of 1 means the test material contains some unstable sound localization. The ratio in Figure 16 is calculated from the number of unstable-localization judgments, so a lower ratio means better spatial quality. The experiments show that, compared with the traditional parametric-stereo encoding method, the spatial sense of the audio source is clearly improved by the PCA-based parametric stereo. Through the use of PCA, almost half of the parameter estimation can be eliminated, although the overall computational complexity still rises because of the added complexity of the PCA.

**Figure 16.** Test results for spatial sense.

#### **5. Test and Results**

The method proposed in this paper performs notably better with stereo signals than with mono signals, so this section presents results only for stereo signals. In order to assess the encoding and decoding performance of the PCA-based stereo coding method itself, optimized modules such as DTX, noise shaping, and other efficient coding tools in the codec were not used in the tests.

#### *5.1. Design of Test Based on MUSHRA*

The key points of the MUSHRA [25] test are as follows:

#### 5.1.1. Test Material

(i) Several typical EBU test sequences were selected: piano, trombone, percussion, vocals, rock songs, multi-source backgrounds, mixed voice, and so on.

(ii) Comparison conditions: the PCA-based codec transmitting the two channels separately, the PCA-based codec with traditional parametric stereo, the PCA-based codec with the improved parametric stereo, the G.719 codec with traditional parametric stereo [24], the HE-AACv2 codec, an anchor signal, and the original signal. In the proposed algorithm, the relationship between the quality of the restored signal and the bitrate is not linear, as shown in Figure 17 (based on a simple subjective test with different bitrate allocations); the test therefore uses an operating point at which both the restored quality and the bitrate are acceptable.

**Figure 17.** Relationship between quality and bitrate.

The bit allocation of each module in the PCA-based codec for stereo signals is shown in Table 7.


**Table 7.** Bitrate allocation in encoder.

(iii) In order to eliminate psychological effects, the order and names of the test materials in each group were randomized. Each listener had to identify the original signal among the test signals and score it 100 points; the remaining signals were scored 0–100 according to overall quality, including sound quality and spatial fidelity.

#### 5.1.2. Listeners

Ten listeners with prior listening-test experience were selected, 5 male and 5 female, all with normal hearing.

#### 5.1.3. Auditory Environment

All 10 listeners used headphones connected to a laptop in a quiet environment.

#### *5.2. Test Results*

After the test, the average value and the 95% confidence interval of the listeners' scores were calculated for each codec. The 95% confidence intervals of the test codecs are [77.2, 87.0], [74.4, 84.2], [70.5, 80.7], [65.8, 76.6], [56.1, 66.9], and [78.6, 86.2]. After removing three outliers (scores outside the confidence intervals), the MUSHRA results are as shown in Figures 18 and 19.

**Figure 18.** Results of MUSHRA test. (PCA\_2 represents the PCA-based codec signal that is transmitted over two channels separately (75 kbps), PCA\_PS+ represents PCA-based codec signal with improved parametric stereo (55 kbps), PCA\_PS represents PCA-based codec signal with traditional parametric stereo (56 kbps), G.719 represents G.719 codec signal with traditional parametric stereo (56 kbps), anchor represents anchor signal, HE\_AACv2 represents HE-AACv2 signal (55 kbps), and reference represents hidden reference signal).

**Figure 19.** MUSHRA score of per item test. (PCA\_2 represents the PCA-based codec signal that is transmitted over two channels separately (75 kbps), PCA\_PS+ represents PCA-based codec signal with improved parametric stereo (55 kbps), PCA\_PS represents PCA-based codec signal with traditional parametric stereo (56 kbps), G.719 represents G.719 codec signal with traditional parametric stereo (56 kbps), anchor represents anchor signal; HE-AACv2 represents HE-AACv2 signal (55 kbps), and hidden reference material has been removed. 1–6 represents different test materials).

Compared with traditional parametric stereo, the PCA-based parametric stereo has a lower bitrate, higher quality, and a better spatial sense. Compared with G.719 plus traditional parametric stereo at the same bitrate, the PCA-based codec produces better quality. Compared with HE-AACv2, the average score of the PCA-based parametric stereo is slightly lower; however, HE-AACv2 is a mature codec that uses several quality-enhancing techniques, including the Quadrature Mirror Filter (QMF) bank, Spectral Band Replication (SBR), and noise shaping, and the complexity of the PCA is lower than that of the 32-band QMF alone. Considering the maturity and high complexity of HE-AACv2, the results are encouraging. We conclude that the PCA-based coding method performs well, especially for stereo signals, for which both the audio quality and the spatial sense are recovered well.

#### *5.3. Complexity Analysis*

The principal component analysis module can be regarded as part of a singular value decomposition (SVD): it computes the right singular matrix and the singular values of the original matrix *X*[*m* × *n*], so its complexity is O(*n*³). By the properties of the SVD, when *n* < *m*, computing only the right singular matrix takes about half the computation of a full SVD of *X*[*m* × *n*], so the complexity and delay of the PCA are far less than those of the SVD. On an Intel i5-5200U processor (2.2 GHz, 4 GB of memory), one PCA pass takes 20 ms. Given the time saved in the parametric stereo, the delay of the PCA-based codec algorithm is within an acceptable range. In the multi-frame joint PCA, forming the original matrix takes 40 ms; the matrix construction begins as soon as the first frame finishes its MDCT. Moreover, the PCA runs in a different thread from the matrix construction, and the MDCT windowing also belongs to the computation thread. If the MDCT of the first frame takes *t*1, the whole delay is therefore around 40 + *t*1 ms, i.e., about 50 ms. The delay of the proposed algorithm still has room for improvement, and the trade-off between delay and bitrate could be improved in the future by adapting the number of joint frames with a more intelligent strategy.

#### **6. Discussion**

This paper presents a preliminary algorithm, and there is still much room for improvement in real applications. One question worth further study is how to eliminate the noise. In the experiments, when the number of bits or the number of principal components is too small, the noise spectrum shows a special structure, as Figures 20–22 show. The signal in Figure 20 is restored from three components; compared with the signals in Figures 21 and 22, its high-frequency noise spectrum shows obvious repetition, recurring once every 1.6 kHz. The low-pass filter of Section 3.3 is therefore not the best way to remove this noise, as some damage to the original signal is unavoidable. Ideally, an adaptive notch filter could remove the noise spectrum cleanly without damaging the original signal; the design of such a filter remains for future work.

**Figure 20.** The spectrogram of the signal restored by three components.

**Figure 21.** The spectrogram of the signal restored by four components.

**Figure 22.** The spectrogram of the signal restored by five components.

#### **7. Conclusions**

The framework of the proposed multi-frame PCA-based audio coding method differs in several ways from other codecs, so there were many barriers to designing an optimal algorithm; this paper proposed several ways to remove them. For mono signals, the PCA-based design in this paper, including the multi-frame signal processing, matrix design, and quantization design, handles the signal efficiently. For stereo signals, the PCA combines naturally with parametric stereo, which makes the PCA-based parametric stereo both feasible and worthwhile. The experimental results show satisfactory performance of the multi-frame PCA-based stereo audio coding method compared with traditional audio codecs.

In summary, the multi-frame PCA-based codec, for both mono and stereo, is a meaningful line of research that needs further improvement. The method performs well on different kinds of audio signals, but further studies are still needed before it can be widely applied.

**Author Contributions:** J.W. conceived the method and modified the paper, X.Z. performed the experiments and wrote the paper, X.X. and J.K. contributed suggestions, and J.W. supervised all aspects of the research.

**Funding:** National Natural Science Foundation of China (No. 61571044).

**Acknowledgments:** The authors would like to thank the reviewers for their helpful suggestions. The work in this paper is supported by the cooperation between BIT and Ericsson.

**Conflicts of Interest:** The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

