1. Introduction
Environmental noise is a growing health hazard worldwide. In the European Union, around 100 million people are exposed to road traffic noise above 55 dB $L_{den}$ (day-evening-night equivalent level) [1]. In China, approximately 26% of monitoring points exceed the night-time noise limits of their environmental noise function zones [2]. Environmental noise can cause a range of health problems, such as sleep disturbance [3,4], learning impairment [5,6,7], and hypertension and ischemic heart disease [8,9,10].
Annoyance is a widely used indicator for studying the effects of different noise sources on well-being [11]. Harris’ research showed that annoyance caused by road traffic noise influenced health-related quality of life [12]. Licitra surveyed the dose-effect relationship between the percentage of highly annoyed people (%HA) and the $L_{den}$ of railway noise in Pisa, Italy [13]. Recently, researchers have paid more attention to the combined effect of different noise sources on annoyance [14,15,16].
Several approaches are available for evaluating noise annoyance. Zwicker [17] put forward the psychoacoustic annoyance (PA) model in 1999. Di [18] then improved the PA model by taking the tonality of noise into account. With this model, the relative annoyance degrees of different noises can be calculated directly from acoustical parameters through Equations (1)–(4):
where PA is the psychoacoustic annoyance; $N_5$ is the percentile loudness in sone; $w_S$ describes the effect of sharpness S (acum); $w_{FR}$ describes the influence of fluctuation strength F (vacil) and roughness R (asper); and $w_T$ describes the effect of tonality T (tu).
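As a rough illustration of how a model of this kind maps acoustical parameters to an annoyance value, the sketch below implements the original Zwicker formulation as it is commonly cited in the psychoacoustics literature. It is not the tonality-extended model described here: the tonality term $w_T$ is left out because its coefficients are not given in this text. The function name and the input values are hypothetical.

```python
import math

def zwicker_pa(n5, sharpness, fluctuation, roughness):
    """Psychoacoustic annoyance, original Zwicker form (assumed; no tonality term).

    n5: percentile loudness in sone (must be > 0)
    sharpness in acum, fluctuation strength in vacil, roughness in asper
    """
    # Sharpness contributes only above 1.75 acum.
    if sharpness > 1.75:
        w_s = 0.25 * (sharpness - 1.75) * math.log10(n5 + 10.0)
    else:
        w_s = 0.0
    # Combined contribution of fluctuation strength and roughness.
    w_fr = 2.18 / n5 ** 0.4 * (0.4 * fluctuation + 0.6 * roughness)
    return n5 * (1.0 + math.sqrt(w_s ** 2 + w_fr ** 2))

# Hypothetical input values for two sounds that differ only in loudness.
quiet = zwicker_pa(10.0, 2.0, 0.1, 0.2)
loud = zwicker_pa(40.0, 2.0, 0.1, 0.2)
print(quiet < loud)  # True
```

Because the result scales with $N_5$, the value keeps growing as loudness grows, so a PA value of this form is not confined to a fixed rating scale.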
Environmental noise annoyance is in fact influenced by both acoustical and non-acoustical factors [19]. Acoustical factors, such as environmental noise levels, explain only part of the variance in environmental noise annoyance. PA is an objective quantity calculated from acoustical parameters and therefore ignores the influence of non-acoustical factors. Moreover, the annoyance value calculated by the PA model has no upper bound and increases without limit as acoustical parameters such as loudness increase. Hence, researchers more often use field surveys and listening experiments to obtain noise annoyance. Considering exposure time and context, the annoyance ratings obtained in field surveys (a long-term response to environmental noise in its real context) may be more valid than those obtained in the laboratory (a short-term response to recorded noise under laboratory conditions). However, when the effect of a particular noise source is studied, field surveys are usually disturbed by background noise [20]. Listening experiments are therefore usually used where field surveys cannot be carried out or where the research focuses on the effect of a single noise source. In a listening experiment, a stimulus comprising several noise samples is recorded in advance. The stimulus (experimental sample set) is then played to participants, who are asked to give an annoyance rating after listening to each noise sample. After the experiment, the average of the ratings given by all participants for each noise sample (i.e., the mean annoyance, MA) is calculated.
MA is widely used in research on environmental noise [21,22,23]. However, MA values from different studies are poorly comparable. For instance, for two similar transformer noises at about 55 dB(A), participants gave a high rating (MA > 8) in an experiment whose noise sample set ranged from 30 dB(A) to 57 dB(A) [24], whereas a much lower rating (MA < 4) was obtained in another experiment whose noise sample set ranged from 50 dB(A) to 75 dB(A) [23]. This indicates that MA values obtained in listening experiments can only express the relative annoyance degrees of noise samples within the same experimental sample set. To compare the relative annoyance degrees of any other noise samples, even those that have already been evaluated in different experimental sample sets, an additional listening experiment has to be conducted. This poor comparability makes it difficult for researchers to use the experimental data of published studies in further research.
The poor comparability may be related to the lack of reference sound samples in the experimental sample sets. Without reference sound samples, participants rated each noise sample only according to its annoyance relative to the other samples in the same experimental sample set. To determine the annoyance ratings of noise samples in each experimental sample set more reliably, the relative magnitude estimation method, which gives participants a reference sound sample with a known annoyance rating as an anchor, was developed and used [25,26,27]. If the same reference sound sample is used, the annoyance ratings of noise samples from different experimental sample sets can be compared well. However, it is almost impossible to find a reference sound sample that is suitable for all listening experiments.
Nilsson previously worked on improving the comparability of annoyance ratings from different studies [28]. He put forward the concept of the pink noise equivalent sound level ($PNE_{annoy}$), which uses the sound level of an equally annoying pink noise to represent the annoyance rating of a noise sample. Because the annoyance ratings of all noise samples are expressed as $PNE_{annoy}$, ratings from different experimental sample sets can be compared directly. However, the annoyance magnitude of a noise sample cannot be seen directly when $PNE_{annoy}$ is used as the indicator of annoyance rating. It would be better to further transform $PNE_{annoy}$ into a traditional MA value as the annoyance rating of a noise sample.
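To make the idea of an equally annoying pink noise concrete, the sketch below interpolates the sound level at which a set of reference pink noises would receive the same rating as a target sample. It is only an illustration under stated assumptions: the function name, the linear interpolation between neighbouring reference points, and all numbers are hypothetical, and it is not claimed to reproduce the exact procedure of the cited work.

```python
def pink_noise_equivalent_level(target_rating, ref_levels, ref_ratings):
    """Interpolate the pink-noise level whose rating equals target_rating.

    ref_levels and ref_ratings must be ordered so that both increase.
    Returns None if the target lies outside the range of the references.
    """
    points = list(zip(ref_levels, ref_ratings))
    for (l0, r0), (l1, r1) in zip(points, points[1:]):
        if r0 <= target_rating <= r1:
            if r1 == r0:
                return l0
            # Linear interpolation between the two neighbouring references.
            return l0 + (target_rating - r0) * (l1 - l0) / (r1 - r0)
    return None

# Hypothetical reference data: pink-noise levels in dB(A) and their mean ratings.
levels = [40.0, 50.0, 60.0, 70.0]
ratings = [2.0, 4.0, 6.0, 8.0]
print(pink_noise_equivalent_level(5.0, levels, ratings))  # 55.0
```

Two samples rated in different experiments can then be compared through their equivalent pink-noise levels, provided each experiment included pink-noise references.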
This study proposes a method that improves the comparability of annoyance ratings of noise samples from different studies (experimental sample sets). As a case study, listening experiments were conducted with this method on several noise sample sets of different types to examine its applicability.
3. Case Study
3.1. Stimuli
The loudness range of the noises used in listening experiments may vary. To assess whether our calibration method was effective under such conditions, six sets of noise samples (sample sets 1–6) with different loudness ranges were selected from a large database of recordings made with the Artificial Head Measurement System HMS IV.0 (HEAD acoustics GmbH, Herzogenrath, Germany). Each sample set contained 12 five-second noise samples. Half of the sets (sample sets 1–3) were homogeneous (transformer noise) and the others (sample sets 4–6) were heterogeneous (each set was composed of several kinds of noise).
The difference between annoyance ratings of an identical sample in different experimental sample sets is a good indicator to judge the comparability of experimental results; the smaller the difference, the better the comparability. Hence, several identical samples (samples A–E) were put into different noise sample sets (the identical samples were included in the 12 noise samples of each sample set).
Table 1 shows the sources, loudness levels, and energy distribution over different frequency ranges of the six identical noise samples. The energy distribution was calculated by Equation (8) for the low-frequency range (20–200 Hz), the middle-frequency range (200 Hz–2 kHz), and the high-frequency range (2–20 kHz) [29]:
$$\eta_k = \frac{E_k}{E} = \frac{p_k^2}{p^2} = 10^{(L_k - L)/10} \qquad (8)$$
where $\eta_k$ is the proportion of the total sound energy in the low-, mid-, or high-frequency range; $E_k$, $p_k$, and $L_k$ are the sound energy, sound pressure, and sound pressure level of the corresponding frequency range, respectively; and E, p, and L are the total sound energy, total sound pressure, and total sound pressure level of the noise sample, respectively.
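The energy proportion of a band follows from the definitions above as the ratio of band energy to total energy, which in terms of levels is $10^{(L_k - L)/10}$. A minimal sketch of that calculation is given below, assuming the three bands do not overlap and together cover the whole spectrum, so that the total energy is the sum of the band energies. The function name and the band levels are hypothetical.

```python
def band_energy_proportions(band_levels_db):
    """Return each band's share of the total sound energy.

    band_levels_db maps a band name to its sound pressure level in dB.
    The total is obtained by energetic summation of the bands, so the
    reference pressure cancels and the shares sum to 1.
    """
    energies = {k: 10.0 ** (level / 10.0) for k, level in band_levels_db.items()}
    total = sum(energies.values())
    return {k: e / total for k, e in energies.items()}

# Hypothetical band levels in dB for a low-frequency-dominated sample.
shares = band_energy_proportions({"low": 60.0, "mid": 50.0, "high": 40.0})
dominant = max(shares, key=shares.get)
print(dominant)  # low
```

A sample can then be labelled by the band with the largest share, which is the kind of classification by dominant frequency range used for the samples in Table 1.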
As presented in Table 1, transformer noise and boiler noise are low-frequency noises, heat pump noise is mid-frequency noise, and the noise recorded in a workshop is high-frequency noise, according to the frequency range in which their sound energy is dominant [30].
Additionally, seven pink noise samples were added to each sample set (sample sets 1–6) as reference sound samples. In each sample set, the loudness-level range of the added pink noise samples was slightly wider than that of the 12 noise samples, and adjacent reference sound samples were separated by equal loudness-level intervals. Considering that the auditory discrimination threshold for intensity is about 0.4 dB [31], the minimum interval between two adjacent reference sound samples was set to 0.5 phon. Thus, the calibration method could still work when the loudness levels of the noise samples were identical or spanned only a narrow range. Table 2 gives a detailed description of sample sets 1–6.
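The selection of reference loudness levels described above can be sketched as follows. This is an illustration only: the function name, the size of the margin by which the reference range exceeds the sample range, and the centring of the grid when the minimum interval applies are assumptions, since the text only states that the range was slightly wider and that the interval was at least 0.5 phon.

```python
def reference_loudness_levels(sample_levels, n_refs=7, margin=1.0, min_step=0.5):
    """Equally spaced reference loudness levels (phon) covering the samples.

    margin (phon) widens the range on both sides and is an assumed value;
    min_step enforces the minimum spacing between adjacent references.
    """
    low = min(sample_levels) - margin
    high = max(sample_levels) + margin
    step = max((high - low) / (n_refs - 1), min_step)
    # Centre the grid so it still covers the samples when min_step applies.
    centre = (low + high) / 2.0
    start = centre - step * (n_refs - 1) / 2.0
    return [start + i * step for i in range(n_refs)]

# Hypothetical sample sets: one with a 10 phon range, one with identical levels.
wide = reference_loudness_levels([60.0, 64.0, 70.0])
narrow = reference_loudness_levels([65.0, 65.0], margin=0.0)
print(wide)
print(narrow)
```

In the second call all samples share one loudness level, so the 0.5 phon minimum interval determines the spacing of the seven references.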
Another sample set (sample set 7) was composed of nine pink noise samples whose loudness level $L_N$ ranged from 55 phon to 95 phon (A-weighted equivalent sound pressure level from 38 dB(A) to 78 dB(A)), with an interval of 5 phon between adjacent samples. This sample set was used to establish the standard curve in this study. All pink noise samples described above were generated with the ArtemiS 10.00 analysis software (HEAD acoustics GmbH, Herzogenrath, Germany).
In each sample set, the noise samples were arranged in random order with an interval of 5 s between consecutive samples, forming an evaluation sequence. Three evaluation sequences with different orders were combined into one experiment stimulus. Thus, seven experiment stimuli were formed in total.
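The construction of an experiment stimulus from three differently ordered evaluation sequences could be sketched as below. The function name, the sample identifiers, and the seed are hypothetical, and the sketch assumes that every sample has a unique identifier.

```python
import math
import random

def build_stimulus(sample_ids, n_sequences=3, seed=None):
    """Return n_sequences distinct random orderings of sample_ids."""
    ids = list(sample_ids)
    if math.factorial(len(ids)) < n_sequences:
        raise ValueError("not enough samples for that many distinct orders")
    rng = random.Random(seed)
    sequences = []
    while len(sequences) < n_sequences:
        order = ids[:]
        rng.shuffle(order)
        if order not in sequences:  # keep only orders not used before
            sequences.append(order)
    return sequences

# Hypothetical identifiers: 12 noise samples plus 7 pink-noise references.
ids = [f"noise_{i}" for i in range(1, 13)] + [f"pink_{i}" for i in range(1, 8)]
stimulus = build_stimulus(ids, seed=1)
print(len(stimulus), len(stimulus[0]))  # 3 19
```

Each participant thus hears every sample three times, once per sequence, which provides the three scores per sample used in the later validity check.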
3.2. Apparatus and Setting
The binaural audio playback system consisted of a digital equalizer (Head Acoustics PEQ V, HEAD acoustics GmbH, Herzogenrath, Germany), a distribution amplifier (Head Acoustics HDA IV. 1, HEAD acoustics GmbH, Herzogenrath, Germany), and four sets of headphones (Sennheiser HD 600, Sennheiser electronic GmbH & Co. KG, Wedmark, Germany), which had already been calibrated at the calibration laboratory of HEAD acoustics GmbH. All experiments were conducted in a soundproof room (3 m × 2 m × 3 m) in which the background noise was lower than 25 dB(A).
3.3. Procedure of Listening Experiments
The listening experiments were conducted separately for the seven sample sets. In each experiment, 60 college students (22 males, 38 females; mean age 24 years) with normal hearing were recruited randomly as participants. Because the binaural audio playback system had four sets of headphones, at most four participants could be exposed to the noise at the same time. Before the experiment, participants were asked to sit calmly on a chair, put on the headphones, and prepare for noise exposure. After the participants had previewed several pink noise samples, the corresponding experiment stimulus was played back. In the interval after each noise sample, participants rated its annoyance on an 11-point numerical scale with consecutive, equally spaced labels from 0 (“not annoying at all”) to 10 (“extremely annoying”). Since all participants were Chinese college students with good English competence, the evaluation sheet was printed in both English and Chinese.
3.4. Statistical Analysis
Misjudgments are inevitable when participants decide on evaluation scores, so the validity of the data had to be examined. In this study, each participant gave three evaluation scores for each noise sample. The three scores were accepted if the difference between any two of them did not exceed two; otherwise, all three scores were discarded.
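The examination rule can be expressed compactly, since requiring every pairwise difference to be within two is equivalent to requiring the largest score minus the smallest to be at most two. The sketch below assumes that "within two" includes a difference of exactly two; the function name and the scores are hypothetical.

```python
def valid_scores(scores, max_spread=2):
    """Return the scores if every pairwise difference is within max_spread, else []."""
    # The largest pairwise difference equals max(scores) - min(scores).
    if max(scores) - min(scores) <= max_spread:
        return list(scores)
    return []

# Hypothetical triples of scores given by one participant for two samples.
print(valid_scores([5, 6, 7]))  # [5, 6, 7]
print(valid_scores([3, 6, 4]))  # []
```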
Based on the valid evaluation scores, the mean annoyance (MA), as an indicator of annoyance response, was calculated for each sound sample by Equation (9):
$$MA = \frac{\sum_{j=0}^{10} j \, n_j}{\sum_{j=0}^{10} n_j} \qquad (9)$$
where j is an annoyance rating (0–10) on the numerical scale and $n_j$ is the number of times rating j was chosen.
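Mean annoyance is the count-weighted average of the ratings: each rating j is weighted by the number of times $n_j$ it was chosen. A minimal sketch follows, with a hypothetical function name and hypothetical ratings.

```python
from collections import Counter

def mean_annoyance(ratings):
    """Count-weighted mean of ratings 0..10, with n_j the count of rating j."""
    counts = Counter(ratings)  # maps rating j to n_j
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid ratings")
    return sum(j * n_j for j, n_j in counts.items()) / total

# Hypothetical valid ratings collected for one noise sample.
print(mean_annoyance([4, 5, 5, 6]))  # 5.0
```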
Psychoacoustic annoyance (PA) estimates the relative annoyance ratings of noise samples well [17,18], so the relative magnitude of PA between noise samples should be consistent with the relative magnitude of MA obtained from listening experiments (i.e., PA and MA are highly consistent). For this reason, the PA of all noise samples was calculated by Equations (1)–(4) in this study. A linear fit between PA and MA was then performed, before and after calibration, for both individual sets (sample sets 1–6) and mixed sets (sets pooling the experimental results of several individual sets; e.g., the set pooling the results of sample sets 1–3 is called mixed sets 1–3 for simplicity). The coefficient of determination ($R^2$) was used as the index for judging the comparability of the MA of noise samples: if $R^2$ increases after calibration, the comparability of MA has improved; conversely, if it decreases, the comparability has been reduced.
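The comparison of $R^2$ before and after calibration can be sketched with an ordinary least-squares fit of MA on PA. The function name and all data values below are hypothetical and serve only to show the calculation; they are not results of this study.

```python
def linear_fit_r2(x, y):
    """Least-squares line y = a*x + b and its coefficient of determination."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical pooled data: PA values and MA before and after calibration.
pa = [10.0, 20.0, 30.0, 40.0]
ma_before = [6.0, 3.0, 8.0, 5.0]
ma_after = [2.1, 3.9, 6.2, 7.8]
_, _, r2_before = linear_fit_r2(pa, ma_before)
_, _, r2_after = linear_fit_r2(pa, ma_after)
print(r2_after > r2_before)  # True
```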