1. Introduction
Speech audiometry is one of the basic examination methods used in hearing diagnosis. From the point of view of the test routine [1], the sound stimuli play a key role. They can take different forms (nonsense syllables [2,3], numbers or digits [4,5], words [6,7], or sentences [8,9]), as do the presentation level, the type of interfering noise, the range of the signal-to-noise ratio, and other details of the test procedure [1]. Adaptive matrix tests are justified not only in audiometry, which primarily deals with measuring the hearing ability of individuals; the methodology of their development and the evaluation of their results also find application in various scientific fields (perceptual phonetics, comprehension testing, communication systems, and psychoacoustics) where accurate and repeated measurements are required. Tests created on the basis of an adaptive matrix contain a phonetically balanced, simple, and frequently used vocabulary and are further characterized by the same level of difficulty, so it does not matter which test list is used. Of course, the test can be adapted to the specific needs of a particular patient. These key features also make such tests an important supportive diagnostic tool in the long-term therapy and monitoring of patients with hearing impairment. Adaptive test matrices have their origins in the work of Hagerman [10].
Matrix tests are characterized by relevant vocabulary, fast and reliable measurement, and the possibility of practically unlimited testing using a set of 50 words. The predictability of the test content is minimal compared to a test containing “everyday life” sentences. This type of test is suitable for use with any degree of hearing loss and, unlike tone audiometry, reflects the real ability of speech perception. Noise plays a significant role in testing, as normal communication rarely takes place in ideal acoustic conditions (without distracting background noise). The presence of noise considerably complicates speech understanding, especially for hearing-impaired people, so diagnosis and subsequent rehabilitation should include hearing measurement in the presence of noise. Matrix tests were first compiled by Hagerman for the Swedish language [10]. Later, they were created for other languages, e.g., German [11,12], Danish [13], Mandarin Chinese [14], English [15], Polish [16], Spanish [17], French [18], Dutch [19], Finnish [20], Italian [21,22], Russian [23], and Turkish [24]. The international matrix tests are now available in 19 different languages (see [25] for a complete list), covering over 60% of the world’s population.
The stable syntactic structure of five-word sentences and the method of creating and evaluating tests make it easier to compare the achieved results with results in other languages. Such tests can be beneficial in the development and testing of compensatory aids because a uniform test type is available in multiple language versions.
The process of introducing new tests into practice is usually preceded by testing on healthy listeners in order to obtain reference data (about how an average healthy person hears), and then the measurement is carried out on hearing-impaired people.
Our motivation behind this work is to evaluate the tests based on the Slovak adaptive matrix and find out if there are tests with suitable statistical properties, thanks to which they could find practical application in hearing examination for Slovak-speaking patients.
2. Slovak Adaptive Matrix Test
The main difference between audiometric testing using a large database of test sentences (e.g., in the range of 600 sentences) and using a matrix composed of 50 words is that, in the first method of testing, a high recognition accuracy is achieved and the wide-spectrum properties of the language are consistently respected. However, these positive factors can be negated by the limited vocabulary of the patient and his/her intellect. In the second method, i.e., in matrix tests, the measurement is more focused on speech recognition and the special characteristics of the test subject are suppressed [17]. Matrix tests are particularly suitable for cross-language comparisons in audiometric and clinical research [26].
The matrix presented in Table 1 contains 10 proper names, 10 verbs in the present tense, 10 numerals, 10 adjectives, and 10 objects. By choosing one word from each category, a total of 100,000 unique sentences can be constructed. The category of proper names consists of five female and five male names that are very frequent in Slovak according to the Slovak National Corpus [27]. In the study [28] dedicated to the proposal of an adaptive matrix in the Slovak language, the verbs included “waits”, “holds”, and “takes”; in this new version of the matrix, more suitable alternatives have been found in the form of the words “buys”, “finds”, and “doesn’t have”. In the category of numerals, there was no change compared to the original version of the matrix. In the category of adjectives, the original word “bad” was replaced by the word “cheap” as we wanted to eliminate the expressive undertone of this word. In the last category of words, there were several replacements. The words originally used, “bridges”, “rooms”, “lamps”, and “buildings”, were replaced by new ones: “apartments”, “bowls”, “newspapers”, and “gifts”. The new objects correspond better with the selected verbs and thus make it possible to create more plausible sentence constructions. In the process of matrix creation, we tried as much as possible to apply the criterion of logic so that confusing or surprising sentence constructions, which the listener could evaluate as false or strange, would not arise. In such cases, the listener might prefer not to mark the content heard. Thus, the new words suit the content of the sentences better, they are more neutral in meaning, and, at the same time, they have a satisfactory acoustic content from the point of view of phoneme distribution.
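The size of the sentence space follows directly from the matrix layout. The short Python sketch below illustrates the counting and the drawing of one random five-word sentence; the word tokens are placeholders (hypothetical), not the Slovak vocabulary of Table 1, and the function names are chosen here for illustration only.

```python
import random

# Fixed syntactic order of the five word categories.
CATEGORIES = ["name", "verb", "numeral", "adjective", "object"]

# Placeholder matrix: 10 hypothetical tokens per category
# (NOT the actual Slovak words from Table 1).
MATRIX = {cat: [f"{cat}{i}" for i in range(10)] for cat in CATEGORIES}


def count_sentences(matrix):
    """Number of unique sentences = product of the category sizes."""
    total = 1
    for words in matrix.values():
        total *= len(words)
    return total


def random_sentence(matrix, rng):
    """Pick one word per category, keeping the fixed word order."""
    return [rng.choice(matrix[cat]) for cat in CATEGORIES]


if __name__ == "__main__":
    print(count_sentences(MATRIX))
    print(" ".join(random_sentence(MATRIX, random.Random(0))))
```

With 10 words in each of the 5 categories, the count is 10^5 = 100,000, which matches the number stated above.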
Due to the nature of the matrix test, some deviations from the phoneme distribution of the language were expected, but they are easily justified. In the Slovak language, there are five short vowel phonemes /i, e, a, o, u/, five long vowel phonemes /i:, e:, a:, o:, u:/, four diphthongs /ia, ie, iu, uo/, and 27 consonant phonemes /p, b, m, f, v, t, d, n, l, r, s, z, ts, dz, c, ɟ, ɲ, ʎ, ʃ, ʒ, ʧ, ʤ, j, k, g, x, h/ [28]. The frequency of occurrence of the Slovak vowel and consonant phonemes is depicted in Figure 1.
Figure 1 indicates a higher occurrence of the phonemes /i:/, /v/, and /x/ in the matrix sentences. This discrepancy can be explained by the structure of the sentences: each adjective is in the plural accusative form ending with the suffix -ý(í)ch /i:x/ and all four masculine nouns are in the plural genitive form ending in /ov/ (“darov”, “bytov”, “nožov”, and “domov”). This contributed to the overuse of the /i:/, /v/, and /x/ phonemes.
Recording and Editing of the Speech Material
When constructing the basic set of sentences, we focused on the need to record all combinations of inter-word transients to ensure the fluency and naturalness of the constructed sentences. Hence, a basic set consisting of 100 sentences was recorded. Each word appears exactly 10 times in this set of sentences. The recording was carried out in the recording studio of LICOLAB (Faculty of Arts, Pavol Jozef Šafárik University in Košice, Slovakia) using professional recording equipment. The speaker was a 27-year-old woman whose native language is Slovak. Her speech was characterized by naturalness, without any abnormalities that would affect the resulting quality of the recorded signal. The recording lasted about 5 h, during which there was time for the speaker to rest and also to listen to the emerging recordings of the sentences and to re-record, if needed, in case of suspicion or the occurrence of an unwanted phenomenon (e.g., a change in speech tempo, vocal timbre, non-observance of quantity, imprecise pronunciation, etc.). The recording was performed at a sampling frequency of 48 kHz, a resolution of 16 bits per sample, and in single-channel mode.
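One way to see that 100 sentences are enough to cover every inter-word transition while using each word exactly 10 times is the following sketch. It is only an illustrative construction over word indices 0 to 9; the text does not state how the recorded sentences were actually selected, and the function names are hypothetical.

```python
from collections import Counter

N_WORDS = 10      # words per category
N_POSITIONS = 5   # name, verb, numeral, adjective, object


def base_set_indices():
    """Build 100 sentences as tuples of word indices.

    For indices (a, b) the sentence is (a, b, (a + b) % 10, a, b), so
    every adjacent word pair at every boundary occurs exactly once.
    """
    sentences = []
    for a in range(N_WORDS):
        for b in range(N_WORDS):
            sentences.append((a, b, (a + b) % N_WORDS, a, b))
    return sentences


def check_base_set(sentences):
    """Verify word counts and coverage of all adjacent word pairs."""
    for pos in range(N_POSITIONS):
        counts = Counter(s[pos] for s in sentences)
        if any(counts[w] != 10 for w in range(N_WORDS)):
            return False
    for pos in range(N_POSITIONS - 1):
        pairs = {(s[pos], s[pos + 1]) for s in sentences}
        if len(pairs) != N_WORDS * N_WORDS:
            return False
    return True


if __name__ == "__main__":
    print(check_base_set(base_set_indices()))
```

Each of the four word boundaries has 10 × 10 = 100 possible transitions, so 100 sentences is the minimum that can contain all of them.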
The recorded set of one hundred sentences was subsequently cut into individual words in such a way that the inter-word transient was not violated and thus became part of the word cut. All the words were normalized to the same volume. Subsequently, a set of 300 new sentences was created from these words by concatenation, which maintained fluency and naturalness thanks to the intact inter-word transients. During the construction of these sentences, the volume was also adjusted at the onset or offset of words so that the sentences sounded as natural as possible. Editing and management of the audio content was carried out in Adobe Audition software (see Figure 2).
The resulting set of three hundred sentences was then evaluated by three independent quality assessors who followed the MOS (Mean Opinion Score) quality assessment methodology. The threshold for editing a sentence recording was set to 4; i.e., sentences with an MOS less than or equal to 4 were additionally edited to remove unwanted phenomena.
Noise plays an important role in the process of testing auditory competence. Its level significantly affects the accuracy of word recognition. In our experiments, we used one of the most challenging noises for speech perception [1,21], referred to as babble noise [16]. It is created by mixing several speech signals, thus achieving a spectrum that is similar to that of the useful speech signal. In our case, a total of 90 sentences were used to create the noise. They were divided into 30 tracks with different time shifts. By mixing them into one final recording and then removing its beginning and end, we obtained a sufficiently long recording of babble noise.
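The mixing procedure can be sketched as follows. This is a minimal illustration, assuming three sentences per track, random per-track time shifts, and simple averaging of the tracks; the actual shifts, gains, and trimming used for the recordings are not specified in the text, and all names and default values here are hypothetical. Signals are plain lists of samples to keep the example dependency-free.

```python
import random


def make_babble(sentences, n_tracks=30, max_shift=2000, seed=0):
    """Mix sentences into babble noise (illustrative sketch).

    sentences: list of sample lists (e.g., 90 sentences).
    Each track concatenates len(sentences) // n_tracks sentences and is
    delayed by a random shift (assumed; the real shifts are not given).
    """
    rng = random.Random(seed)
    per_track = len(sentences) // n_tracks
    tracks, shifts = [], []
    for t in range(n_tracks):
        shift = rng.randrange(max_shift)
        samples = [0.0] * shift  # leading silence = time shift
        for sentence in sentences[t * per_track:(t + 1) * per_track]:
            samples.extend(sentence)
        tracks.append(samples)
        shifts.append(shift)
    # Keep only the region in which all tracks are active, i.e., remove
    # the beginning and the end of the mixed recording.
    start = max(shifts)
    end = min(len(track) for track in tracks)
    if end <= start:
        raise ValueError("tracks are too short for the chosen shifts")
    return [sum(track[i] for track in tracks) / n_tracks
            for i in range(start, end)]
```

Trimming to the region where all 30 tracks overlap avoids a gradual build-up of talkers at the start and a thinning-out at the end of the noise.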
The resulting stereo recordings contained a sentence recording on one channel and a noise recording on the other (see Figure 3). The noise, with a gradual fade-in and fade-out, exceeded the length of the useful recording at the beginning and end by 500 ms.
3. Experimental Setup
3.1. Participants
A total of 68 young people aged from 21 to 32, with a mean age of 23, took part in the test. Prior to testing, they were asked to fill out a questionnaire in which we ascertained their general state of health, taking into account any current or previous diseases or hearing impairments. We also asked about their current state of health in order to obtain information about discomfort that could affect the results of the tests (headache, flu, etc.). Next, we focused on the listeners’ environment: we investigated whether they had been exposed to excessive noise on the day of the test, whether they spent time in a noisy environment, for example at work or during leisure activities, and whether they played sports that could affect the test results (e.g., swimming). We also asked for their subjective assessment of their hearing, i.e., whether there are situations in which they prefer a higher volume of sounds compared to other people. Finally, we collected additional information that could help to explain possible unexpected results, e.g., their preferences when listening to music, whether they use headphones and what volume they prefer, whether they play a musical instrument, and whether they think they have a musical sense. These questions allowed us to identify one participant who was expected to achieve worse results in the perception tests. She reported using headphones at increased volume and preferring louder sounds when watching TV or listening to the radio, but at the same time she denied having a diagnosed hearing disorder. This assumption was eventually confirmed, and her results were excluded from the evaluation in order to avoid distorting the resulting data.
3.2. Equipment Used
The testing room was equipped with 15 identical computers (Win 10, 64-bit, Intel Core i7-7700 CPU, 16 GB RAM) with an external sound card (Creative Sound Blaster X-Fi HD) and closed headphones (AKG K77). The sound chain was calibrated by G.R.A.S. 90AB (artificial ear type IEC 60318-2, connected with microphone type 1″ 40EN, and preamplifier type 26AB). G.R.A.S. Audiometer Calibration Analyzer HW1001 was connected by G.R.A.S. AA0008 cable to the artificial ear. The mentioned setup ensured that all participants were in the same conditions.
4. Optimization Phase
The set of 300 sentences was divided into 10 tests, so-called triplets, each containing 3 × 10 sentences. The purpose of the optimization process was to find the SNR [dB] values that correspond to the word recognition score (WRS) at the levels of 20%, 50%, and 80%. In the optimization phase, an SNR ranging from −20 dB to 4 dB (in 1 dB steps) was used, while, in each triplet, three different SNR levels were tested at a constant noise level of 70 dB SPL. A total of 30 native speakers, all students (aged from 21 to 32, with a mean age of 24), took part in the optimization phase. Due to laboratory capacity constraints, the test was carried out over two afternoons, with 15 students each day. The testing room is located in a quiet part of the building, with a controlled entrance.
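The relation between the SNR, the fixed noise level, and the speech level can be made explicit with a few lines of Python. This is only a sketch of the standard decibel arithmetic; the function names are hypothetical and the snippet does not describe the calibration chain actually used.

```python
NOISE_LEVEL_DB_SPL = 70.0  # constant noise level used in the tests


def snr_grid(start_db=-20, stop_db=4, step_db=1):
    """All SNR values from start to stop (inclusive) in 1 dB steps."""
    return list(range(start_db, stop_db + 1, step_db))


def speech_level_db_spl(snr_db, noise_level_db_spl=NOISE_LEVEL_DB_SPL):
    """Speech level that yields the requested SNR at a fixed noise level."""
    return noise_level_db_spl + snr_db


def amplitude_ratio(snr_db):
    """Linear speech-to-noise amplitude ratio for an SNR given in dB."""
    return 10.0 ** (snr_db / 20.0)


if __name__ == "__main__":
    for snr in snr_grid():
        print(snr, speech_level_db_spl(snr), round(amplitude_ratio(snr), 3))
```

The grid from −20 dB to 4 dB contains 25 SNR values; at an SNR of −20 dB, the speech amplitude is one tenth of the noise amplitude.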
Prior to testing, the participants were instructed about the method of testing and the organization of the whole process. As part of the testing, two breaks were organized, the first after Triplet 3 and the second after Triplet 7. During these breaks, the participants had the opportunity to rest and refresh themselves. The breaks lasted approximately 40 min. In addition, the participants were told that they could take a break whenever they deemed it necessary, but they used this possibility to a minimal extent.
The testing was carried out through a designed interface, which was used to play prepared recordings (play button), select the words heard, and record answers (confirm button). Throughout the test, the participants had the opportunity to see all the words, clearly arranged in the form of a matrix (see Figure 4). The interface (GUI) was created in MATLAB R2021b software.
Results
The results recorded by the GUI were then evaluated and processed using an intelligibility curve with a typical shape similar to the letter S (see Figure 5). This curve, marked in red, is sometimes also referred to as an S-curve.
From the results of the perception tests, it is possible to identify the SNR values for the Word Recognition Score (WRS) equal to 20%, 50%, and 80%. Based on the sigmoid psychometric function [20], the threshold values of −8.3 dB, −6.3 dB, and −4.3 dB were determined, which will be used as part of testing in the evaluation phase.
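How SNR thresholds are read off a sigmoid can be illustrated with a logistic psychometric function. The parameterization below, with the 50% point SRT50 and the slope s at that point, is a common choice for matrix tests but is an assumption here; the text only states that a sigmoid function was used. The function names are hypothetical.

```python
import math


def wrs(snr_db, srt50_db, slope_per_db):
    """Logistic psychometric function (assumed form).

    Returns the word recognition score (0..1); slope_per_db is the
    slope of the curve at the 50% point.
    """
    return 1.0 / (1.0 + math.exp(4.0 * slope_per_db * (srt50_db - snr_db)))


def snr_at(target_wrs, srt50_db, slope_per_db):
    """Inverse of wrs(): SNR at which the target score is reached."""
    logit = math.log(target_wrs / (1.0 - target_wrs))
    return srt50_db + logit / (4.0 * slope_per_db)


def slope_from_thresholds(snr20_db, snr80_db):
    """Slope at the 50% point implied by the 20% and 80% thresholds."""
    # logit(0.8) - logit(0.2) = 2 * ln(4)
    return 2.0 * math.log(4.0) / (4.0 * (snr80_db - snr20_db))


if __name__ == "__main__":
    s = slope_from_thresholds(-8.3, -4.3)
    print(s, snr_at(0.2, -6.3, s), snr_at(0.8, -6.3, s))
```

Under this assumed form, thresholds that lie symmetrically 2 dB below and above the 50% point, as in −8.3 dB, −6.3 dB, and −4.3 dB, correspond to a slope of roughly 0.17 per dB (about 17%/dB) at the 50% point.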
As part of this optimization phase, in addition to the WRS at the 20%, 50%, and 80% levels, we also obtained an overview of the recognition of each individual word and the average recognition of all words. Based on the obtained values, we adjusted the level of the individual words within a range of at most ±3 dB [1,20]: for easy-to-recognize words, the intensity was reduced, and for harder-to-recognize words, it was increased, so that the recognition of each word moved towards the average WRS value (see Figure 6).
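The level adjustment can be sketched as a clipped correction per word. The conversion from a WRS deviation to decibels through the slope of the psychometric curve is an assumption made for illustration; the text only states that easy words were attenuated, hard words amplified, and that the change was limited to ±3 dB. The function name and the example numbers are hypothetical.

```python
MAX_CORRECTION_DB = 3.0  # limit stated in the text


def level_correction_db(word_wrs_pct, mean_wrs_pct, slope_pct_per_db):
    """Level change (dB) that moves a word towards the average WRS.

    Assumption: a deviation of d percentage points is compensated by
    d / slope dB. Positive values amplify, negative values attenuate.
    """
    correction = (mean_wrs_pct - word_wrs_pct) / slope_pct_per_db
    return max(-MAX_CORRECTION_DB, min(MAX_CORRECTION_DB, correction))


if __name__ == "__main__":
    # Hypothetical words: one easy, one very hard, one average.
    for word_wrs in (80.0, 10.0, 60.0):
        print(word_wrs, level_correction_db(word_wrs, 60.0, 15.0))
```

Clipping keeps the corrected words from sounding unnaturally loud or quiet within a sentence, at the cost of leaving the most extreme words only partially equalized.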
After adjusting the volume of individual words, new sentences were constructed from these words, which were used in the next phase, i.e., the evaluation phase, which is discussed in detail in the next section.
5. Evaluation Phase
In the optimization process, the WRS was equalized by adjusting the volume, which should result in an equalization of the perceptual difficulty. Testing was performed at three threshold levels (20%, 50%, and 80%), which correspond to SNR values equal to −8.3 dB, −6.3 dB, and −4.3 dB, respectively. In each triplet, the mentioned three SNR levels were used.
In the evaluation phase, a total of 63 individuals participated in the testing, while one person was excluded due to suspicion of hearing impairment. Her results were not included in the evaluation analyses. A total of 40 individuals participated in the GUI testing (30 new, 10 experienced). The minimum age of the participants was 20, the maximum age was 25, and the mean age was 22 years. Twenty-two individuals participated in the written test (eight new, fourteen experienced). The testing was carried out in the same way (briefing, questionnaire, and breaks), in the same premises, and on the same equipment, including calibration performed prior to each test. In the case of using the GUI, the content heard was recorded directly by selecting words from the matrix and confirming the chosen words.
Results
The test results are displayed using a boxplot, where it is possible to observe the minimum, maximum, median, first and third quartile, as well as outliers for each triplet made up of three tests (SNR = −8.3 dB, −6.3 dB, and −4.3 dB); see Figure 7.
A phenomenon called the training effect can be observed at the beginning of testing, when the performed tests achieve a relatively low score compared to other tests. The statistical values of individual test scores gradually become stable as a result of mastering of the test routine, eliminating random errors due to inattention or insufficient initial concentration. In order to eliminate this phenomenon, which is otherwise normally present in perceptual testing, the results of the first triplet were not taken into account.
The results presented in the previous figure can also be represented using the curve of the psychometric function with the estimated and actual measured values marked (see Figure 8).
As a result of optimization, there was an offset in the psychometric curve to the left and word recognition threshold values were changed from −8.3 dB to −9.2 dB, from −6.3 dB to −6.9 dB, and from −4.3 dB to −4.6 dB for WRS = 20%, 50%, and 80%, respectively. In the mentioned evaluation, 9 triplets were taken into account, i.e., 27 tests.
6. Comparison of Test Routine
The test scores were also compared from the point of view of the test routine, i.e., the test format (GUI or written form). The first way of testing was based on using an interface (GUI) with a depicted matrix of words. Participants played the recording with the stimulus and then selected the words heard directly through the testing interface. The second form of the tests consisted of writing the answer on answer sheets without a visual hint, i.e., without the depicted word matrix. The task of the participants was to write down the content heard (the whole sentence or individual words) on the prepared answer sheets immediately after listening to each recording. The pace at which the recordings were played was controlled by each participant. The other aspects of the testing conditions remained unchanged (identical tests, same equipment, calibration process, and breaks). After completing all the tests, the participants were asked to fill in a short feedback form, which ascertained their opinions on the course of the testing, the content of the sentences heard, and the frequency of occurrence of certain words. The results of 40 listeners (30 new, 10 experienced) who performed the tests with visual support (GUI) and 22 listeners (8 new, 14 experienced) who performed the tests in the written form were included in the comparison of the test formats.
An overview of the achieved results in relation to the test routine used, for each triplet and the corresponding SNR value of the given test, is depicted in Figure 9.
In the first triplet, the format with written answers on the answer sheet clearly dominates. This form of testing is easy and straightforward, which is probably the reason for the more accurate results in the first triplet. In tests with significant noise (SNR = −8.3 dB), with WRS at around 20%, better results were achieved for the routine with written answers. This finding probably indicates better concentration on listening due to the absence of visual cues and of the effort to quickly mark the word heard. On the other hand, in tests in which the noise level did not significantly affect the quality of the information heard, the test format with a depicted matrix appears to be the more suitable method of testing. At the same time, it can be observed that growing familiarity with the test routine (from the first to the last triplet) helps both forms to achieve comparable results. The mean scores from Triplet 2 to Triplet 10 are 56.66% for the GUI format and 56.43% for the written format (see Figure 9, right), i.e., slightly in favor of the GUI form of testing (with visual support).
7. Equivalence of Tests
Ensuring the same difficulty of the tests is crucial for their use in practice. In this section, we will therefore statistically evaluate the test parameters from nine triplets, a total of twenty-seven tests (nine tests for each monitored SNR level, see Table 2), and identify unsatisfactory tests. For the remaining tests, we will use analysis of variance (ANOVA) to evaluate whether their mean scores differ significantly.
When determining unsatisfactory tests, a data cleaning and outlier identification technique was applied (MATLAB). The algorithm used a detection method based on the mean value with a threshold of 0.75. The following figures show the result of the procedure used and, at the same time, make it possible to identify the promising test sets within the triplets tested. These are the tests whose scores fell within the area defined by the red threshold lines (see Figure 10). A total of 17 tests were identified, for which we used analysis of variance to determine whether there were significant differences between them.
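A mean-based outlier rule of this kind can be sketched as follows. The interpretation that a test is flagged when its mean score lies more than 0.75 sample standard deviations from the mean of all test scores is an assumption; it mirrors how a mean-based detector with a threshold factor typically works, but the exact MATLAB call and its inputs are not given in the text. The scores in the example are invented for illustration.

```python
import statistics

THRESHOLD_FACTOR = 0.75  # threshold stated in the text


def flag_outliers(scores, factor=THRESHOLD_FACTOR):
    """Flag values farther than factor * std from the mean.

    Uses the sample standard deviation (n - 1 in the denominator);
    this choice is an assumption.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return [abs(x - mean) > factor * std for x in scores]


if __name__ == "__main__":
    # Hypothetical mean test scores (percent), not data from the study.
    scores = [50.0, 52.0, 48.0, 51.0, 49.0, 70.0]
    print(flag_outliers(scores))
```

A factor below 1 is deliberately strict: it keeps only tests whose scores lie in a narrow band around the mean, which suits the goal of selecting lists of near-identical difficulty.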
Based on the applied cleaning routine, the tests T2, T4, and T10 were omitted and the analysis of variance (ANOVA) was performed only on the selected tests T3, T5, T6, T7, T8, and T9 (see Figure 10a). The ANOVA showed that there were significant differences between the means of the individual tests [F(5, 234) = 3.605, p = 0.004]. Pairwise comparisons (t-tests with Bonferroni correction) showed statistically significant differences for T3 vs. T7, T8, and T9. For this reason, we excluded the T3 test. The other tests (T5, T6, T7, T8, and T9) demonstrated the desired statistical features.
Similarly, ANOVA was applied to the second group of tests, i.e., T3, T4, T5, T6, T7, and T8 (see Figure 10b), with the result indicating significant differences within this group of tests [F(5, 234) = 2.691, p = 0.0219]. Pairwise comparisons (with Bonferroni correction) showed one significant difference, between T3 and T6. We excluded the T3 test from further processing and kept the remaining tests, T4, T5, T6, T7, and T8.
In the last group of tests, i.e., T4, T5, T6, T7, and T10 (see Figure 10c), the ANOVA did not reveal statistically significant differences between the mentioned tests [F(4, 195) = 1.863, p = 0.118].
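For readers less familiar with the notation F(df1, df2), the following sketch computes the one-way ANOVA F statistic and its degrees of freedom from per-test score lists. It is a textbook formulation in plain Python, not the analysis script used in the study; obtaining the p value would additionally require the F distribution, which is omitted here. The function name is hypothetical.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    print(one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]))
```

The reported degrees of freedom are consistent with 40 listeners per test: six tests give 6 − 1 = 5 and 6 × 40 − 6 = 234, and five tests give 5 − 1 = 4 and 5 × 40 − 5 = 195.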
Results
After performing the ANOVA, 15 out of a total of 17 tests remained, which means that we have 15 test sets available (each set containing 10 sentences). These tests are equivalent and have the desired statistical features. The corresponding psychometric curve is shown in Figure 11.
The blue curve shows the approximation of the testing progress using the psychometric curve in the range from −20 to 4 dB before optimization (the results of the first triplet were not included); the red curve shows the progress after optimization, which is the same for all 27 tests; and the green curve shows the final psychometric curve for the selected set of 15 tests, which show appropriate statistical features identified by ANOVA. The recognition threshold values with WRS at the level of 20% (SRT20) changed from −8.3 dB to −9.2 dB and −9.43 dB due to the adjustments made in the optimization phase and selection of appropriate tests. The value of WRS at the level of 50% or SRT50 changed from −6.3 dB to −6.9 dB and subsequently to −7.03 dB. For WRS of 80% (SRT80), it changed from −4.3 dB to −4.6 dB and −4.62 dB.
8. Comparison of the Effect of New vs. Experienced Participants
To determine the effect of knowing the test routine (GUI), we performed an ANOVA that took into account the results of the first triplet, which were omitted in the rest of this study. We expected the results of the first triplet to depend strongly on whether the given participant had already completed such a test before or was taking it for the first time. The ANOVA showed that previous knowledge of, or unfamiliarity with, the test routine had a statistically significant effect on test scores, not only for the first triplet [F(1, 78) = 379.6907, p = ] but also for the other triplets.
9. Analysis of Word Recognition from Adaptive Matrix
The best score was achieved by words from the category of names, which appear in the first position within the sentence. From the point of view of the prosodic composition of the sentence, its onset is realized with a stronger stress accent and often with emphasis. The most accurately recognized names included “Mária” and “Peter”, whereas the names “Jana” and “Ján” achieved the lowest scores. Their phonetic proximity probably caused confusion between the two forms. The mean success rate for the name category is 72.8%, STD = 12.7%.
In the second sentence position, there were verbs with a mean success rate of 52.6%, STD = 15.1%. The most accurately recognized verbs included the words “knows (pozná)” and “has (má)”, and, on the contrary, the words “buys (kúpi)” and “finds (nájde)” achieved the lowest score.
In the third sentence position, there were numerals with a mean success rate of 58.9% and STD = 15.3%. The numerals “four (štvoro)” and “hundred (sto)” dominated, and, on the contrary, the numerals “many (mnoho)” and “three hundred (tristo)” were the least correctly recognized numerals.
In the fourth position in the sentence, there were adjectives with a mean success rate of 56.2%, STD = 14.1%. The most accurately recognized words included the adjectives “other (ďalších)” and “nice (pekných)”, and, on the contrary, the least correctly recognized adjectives were the adjectives “small (malých)” and “good (dobrých)”.
In the last position in the sentence, there were objects with a mean recognition success rate of 52.5%, STD = 18.1%. The highest scores were achieved by the words “spoons (lyžíc)” and “buckets (vedier)”, and, on the contrary, the least correct recognitions and thus the lowest scores were achieved by the words “houses (domov)” and “bowls (mís)”.
The boxplot illustrates the variability in all word categories present within the proposed adaptive matrix. It shows the distribution of the data within each category, identifies outliers (marked with a circle), and provides information on the symmetry of the data for the given word category (see Figure 12).
From the point of view of the evaluation of word categories, the category of names dominated, followed by numerals and adjectives. On the other hand, verbs and objects, which occupy the second and last positions within a sentence, appeared to be the most problematic.
10. Discussion
In this section, the achieved results of the Slovak adaptive matrix will be compared with the results of other studies dealing with the evaluation of adaptive test matrices in other languages.
The average slope of the psychometric curves of the test participants is 13.13 ± 1.60%/dB, while the corresponding SRT50 value reaches a mean value of −7.03 ± 0.79 dB.
Comparable results were also achieved for tests in other languages; for example, the SRT is at the level of −8.4 ± 1.1 dB and the slope of the psychometric curve is 17.2%/dB for German [12]; SRT = −8.43 ± 1.75 dB and a slope of 13.2%/dB for Danish [13]; SRT = −10.1 ± 0.1 dB with a slope of 16.7 ± 1.2%/dB for the Finnish matrix test [20]; SRT = −9.3 ± 0.8 dB and a slope of 11.2 ± 1.2%/dB for the Mandarin Chinese matrix test [14]; SRT = −9.6 dB with a slope of 17.1%/dB for the Polish matrix test [16]; and SRT = −8.3 ± 0.2 dB and a slope of 14.1 ± 1.0%/dB for the Turkish matrix sentence test [24]. The results closest to those of the Slovak matrix were reported for French, i.e., SRT = −6.0 ± 0.6 dB and a slope of 14 ± 1.6%/dB, and for the Italian matrix tests, with the reference SRT = −7.3 ± 0.2 dB and a slope of 13.3 ± 1.2%/dB [21,22]. The number of tests (lists with 10 sentences) that meet the necessary criteria is, e.g., 12 for Italian [21], 16 for Russian [23], and 25 for Swedish [30].
11. Conclusions
Adaptive matrix tests in audiometry represent a specific method that enables accurate and efficient measurement of an individual’s hearing abilities. In this study, we present test results obtained using an adaptive test matrix. To our knowledge, this is the first work of this kind for the Slovak language. The set of 15 tests shows suitable statistical features for repeated measurements. The equivalence of the constructed tests was verified by ANOVA. The tests are characterized by an SRT50 recognition threshold of −7.03 dB (STD = 0.79 dB) and a slope of the psychometric curve approximating the test results of healthy listeners of 13.13%/dB (STD = 1.60%/dB). In this study, we also focused on the impact of the testing format on the final test scores. The two testing formats achieved comparable results, slightly in favor of the visually supported form of the test (GUI). We also investigated whether previous experience or, on the contrary, inexperience with this type of test influenced the results of tests in the GUI format. The results show that experienced participants have a higher chance of achieving a better result on the test. The presented tests in the Slovak language represent the first step towards standardizing hearing assessment using adaptive test matrices in Slovakia. The data also enable a comparison of the achieved results with those of similar tests in other languages.