**Advances in Management of Voice and Swallowing Disorders**

Editor

**Renée Speyer**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Renée Speyer, University of Oslo, Norway

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Journal of Clinical Medicine* (ISSN 2077-0383) (available at: https://www.mdpi.com/journal/jcm/special_issues/Voice_Swallowing_Disorders).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-4083-2 (Hbk) ISBN 978-3-0365-4084-9 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


#### **Renée Speyer, Anna-Liisa Sutt, Liza Bergström, Shaheen Hamdy, Bas Joris Heijnen, Lianne Remijn, Sarah Wilkes-Gillan and Reinie Cordier**


### *Editorial* **Advances in Management of Voice and Swallowing Disorders**

**Renée Speyer <sup>1,2,3</sup>**


Dysphagia (swallowing disorders) and dysphonia (voice disorders) are both common disorders within the area of laryngology. Recent research has focused on instrument development and psychometrics, and the development of methods with robust measurement properties (i.e., validity, reliability and responsiveness). In addition, newly developed interventions are waiting to be evaluated to objectify treatment effects. The outcomes of both instrument development and intervention studies will support evidence-based clinical practice and research [1]. This current Special Issue of the *Journal of Clinical Medicine* (*JCM*) describes both ongoing instrument development and intervention studies targeting people with dysphagia and dysphonia.

A reliability study by Kim et al. [2] confirmed that computer analysis using a deep learning model could detect laryngeal penetration or aspiration in recordings of videofluoroscopic swallowing studies (VFSS) as reliably as human examiners. These results provide further evidence to support the clinical application of deep learning technology in addition to the visuoperceptual evaluation of videofluoroscopic and possibly endoscopic recordings of swallowing. A second study on VFSS by Swan et al. [3] reported on the development of the Visuoperceptual Measure for Videofluoroscopic swallow studies (VMV). The authors piloted their newly developed measure to determine its validity and reliability using classical test theory analysis, informed by the consensus-based standards for the selection of health measurement instruments (COSMIN) guidelines [4]. The results are promising and validation will be continued using larger sample sizes and an item response theory paradigm approach.

Two studies refer to assessment in dysphonia. The study by Caffier et al. [5] determined the test–retest reliability of the nine-item Voice Handicap Index (VHI-9i), a self-reported questionnaire on the subjective impact of voice disorders on patients' daily lives. The authors found high reliability and, as presented here, revised the VHI-9i severity levels based on receiver operating characteristic (ROC) curve analysis. The second study, by Nguyen et al. [6], used pitch discrimination as a key index of auditory perception, to discriminate between people with and without a voice disorder. The authors advocate the use of pitch discrimination testing during comprehensive voice assessment.

Three studies report on behavioural interventions in people with voice and swallowing problems. Madill et al. [7] describe the efficacy of active ingredients in the treatment of muscle-tension voice disorders, whereas Sinkiewicz et al. [8] present the results of a rehabilitation program for occupational voice disorders in teachers. A third study by Park et al. [9] on lingual strengthening training in older adults compares a new progressive resistance exercise with a conventional isometric tongue strengthening exercise. Two other intervention studies by Song et al. [10] and Novakovic et al. [11] report on CO2 laser microsurgery in patients with unilateral vocal fold cancer [10] and supraglottic botulinum toxin injection in laryngeal sensory dysfunction [11], respectively. All five of these intervention studies contribute to evidence-based clinical practice by objectifying the effects of distinct interventions in laryngology.

**Citation:** Speyer, R. Advances in Management of Voice and Swallowing Disorders. *J. Clin. Med.* **2022**, *11*, 2308. https://doi.org/10.3390/jcm11092308

Received: 15 April 2022 Accepted: 20 April 2022 Published: 21 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This Special Issue includes three more studies by Speyer et al. [12–14]: three systematic reviews and meta-analyses of interventions in people with oropharyngeal dysphagia. All three reviews use the same study methods. The reviews follow the PRISMA guidelines [15,16], and include the highest level of evidence only, thus excluding any other study designs except for randomised controlled trials. Two reviews report on neurostimulation: (1) pharyngeal and neuromuscular electrical stimulation; and (2) brain neurostimulation. Although describing promising results, protocol heterogeneity, potential moderators and inconsistent reporting of the methodology resulted in conservative generalisations and interpretations of the meta-analyses. Both reviews confirmed the need for further randomised controlled trials with larger population sizes using standard protocols and reporting guidelines as achieved by international consensus. The third review reports on behavioural interventions. Again, although behavioural interventions show promising effects in people with oropharyngeal dysphagia, due to high heterogeneity between studies, generalisations of meta-analyses must be interpreted with care.

In summary, the studies included in this Special Issue contribute to instrument development and psychometrics, and to objectifying the effects of interventions in the area of laryngology. Future studies will continue to contribute to evidence-based clinical practice and research.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


## *Article* **A Visuoperceptual Measure for Videofluoroscopic Swallow Studies (VMV): A Pilot Study of Validity and Reliability in Adults with Dysphagia**

**Katina Swan <sup>1</sup>, Renée Speyer <sup>1,2,3</sup>, Martina Scharitzer <sup>4</sup>, Daniele Farneti <sup>5</sup>, Ted Brown <sup>6</sup> and Reinie Cordier <sup>1,7,\*</sup>**


**Abstract:** The visuoperceptual measure for videofluoroscopic swallow studies (VMV) is a new measure for analysing the recordings from videofluoroscopic swallow studies (VFSS). This study evaluated the reliability and validity of the pilot version of the VMV using classical test theory (CTT) analysis, informed by the consensus-based standards for the selection of health measurement instruments (COSMIN) guidelines. Forty participants, diagnosed with oropharyngeal dysphagia by fibreoptic endoscopic evaluation of swallowing, were recruited. The VFSS and administration of bolus textures and volumes were conducted according to a standardised protocol. Recordings of the VFSS were rated by three blinded raters: a speech-language pathologist, a radiologist and a phoniatrician. Inter- and intra-rater reliability was assessed with a weighted kappa and resulted in 0.889 and 0.944 overall, respectively. Structural validity was determined using exploratory factor analyses, which found four and five factor solutions. Internal consistency was evaluated with Cronbach's alpha coefficients, which found all but one factor scoring within an acceptable range (>0.70 and <0.95). Hypothesis testing for construct validity found the expected correlations between the severity of dysphagia and the VMV's performance, and found no impact of gender on measure performance. These results suggest that the VMV has potential as a reliable and valid measure for VFSS. Further validation with a larger sample is required, and validation using an item response theory paradigm approach is recommended.

**Keywords:** classic test theory; dysphagia; measure; psychometrics; videofluoroscopic swallow studies; VMV

#### **1. Introduction**

Oropharyngeal dysphagia (OD) is a disorder that disturbs the sensory and physical processes of swallowing [1]. As not all aspects of OD can be observed externally, investigation of OD often necessitates the use of specialised instrumental examination procedures. The videofluoroscopic swallow study (VFSS) is an instrumental exam that uses recordings of dynamic fluoroscopies in an assessment of swallowing physiology and kinematics. VFSS is recognised as a gold-standard instrumental swallowing assessment and is widely used in clinical and research settings around the world [2]. However, the video recordings require skilled analysis for meaningful interpretation. Clinicians typically examine the videos by visuoperceptual means to make judgments about impairments and to plan and trial interventions [3]. Measures suitable for visuoperceptual analysis of dynamic images with robust psychometric properties are therefore essential for the assessment and treatment of OD.

**Citation:** Swan, K.; Speyer, R.; Scharitzer, M.; Farneti, D.; Brown, T.; Cordier, R. A Visuoperceptual Measure for Videofluoroscopic Swallow Studies (VMV): A Pilot Study of Validity and Reliability in Adults with Dysphagia. *J. Clin. Med.* **2022**, *11*, 724. https://doi.org/10.3390/jcm11030724

Academic Editor: Michael Setzen

Received: 22 December 2021 Accepted: 26 January 2022 Published: 29 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Several visuoperceptual analysis measures have been developed for VFSS. Some target a single construct, such as aspiration, while others attempt to measure multiple constructs, such as lingual and pharyngeal movement, residue, cough and upper oesophageal sphincter (UES) function [4]. The constructs included in measures are just one facet the clinician must consider when choosing an appropriate tool for OD analysis. Measures must be reliable, valid and responsive, with key psychometric properties that describe whether a measure evaluates what it claims to assess and whether it does so in a consistent, repeatable manner [5].

Understanding the psychometric properties of OD measures is important given the complexity of OD as a clinical and diagnostic construct, where a phenomenon viewed on VFSS may be interpreted in multiple ways. For example, the presence of pharyngeal residue may be explained by any of the following: the contrast material preparation in the oral phase (weak tongue squeeze), poor pharyngeal constriction, anatomical abnormalities, surgeries obstructing bolus flow, impaired upper oesophageal sphincter functioning, and other dysfunctions [6]. Analysis of psychometric properties provides statistical evidence about the relationships between the items in the measure, the precision of the scale, and the association between the measure and the construct(s) of interest [7].

In recent years, the science of psychometric analyses applied to outcome measures has been scrutinised through the consensus-based standards for the selection of health measurement instruments (COSMIN) initiative [8]. The COSMIN initiative applied international multi-disciplinary expertise in psychometrics, research and measure development to formulate a methodology for evaluating outcome measures [5]. In a series of Delphi studies, consensus was reached on standardised definitions of psychometric properties, quality criteria for which properties should be reported, and recommended statistical methods to be used to investigate them. The COSMIN taxonomy encompasses nine psychometric properties, divided into three domains: reliability, validity and responsiveness [9–13]. The COSMIN checklist is an inventory of recommended criteria and statistical methods for studies on measurement properties [8].

The checklist was applied to VFSS visuoperceptual measures in a 2018 study, where psychometric properties were assessed in a combination of COSMIN ratings and quality criteria [4]. The authors found that visuoperceptual VFSS measures had overall indeterminate, limited or conflicting evidence of psychometric quality and concluded that there was insufficient evidence to recommend any of the VFSS measures reviewed [4]. Unclear or inadequate psychometric properties risk misapplication of the measure, while inaccurate measurement wastes resources and undermines the evidence base for clinical practice [14]. Thus, there is an urgent need for studies that focus on the development of VFSS measures that utilise sound statistical methods.

A new measure, the visuoperceptual measure for videofluoroscopic swallow studies (VMV), was created to address this gap. The process of developing a measure involves conceptualisation of the construct of interest, item/response scale generation (content validity), piloting the measure, preliminary evaluation, item refinement and reduction, and finally a large trial [15]. The VMV's content validity was established in an international Delphi study involving more than 50 experts in OD and VFSS from 27 countries. The constructs to be included in the VFSS analysis, the conversion of these constructs to items, and the operationalisation of these items were established via consensus across three Delphi rounds. The Delphi identified 32 constructs recommended for analysis, and between one and four items per construct [16]. These findings were used to create the pilot version of the VMV, which comprised 97 items. As a new measure, its psychometric properties are not established. Therefore, the aim of this study is to conduct a pilot evaluation of the VMV's psychometric quality. Specifically, the objectives are to evaluate the following psychometric properties:


#### **2. Methods**

#### *2.1. Participants*

This study was granted ethical approval by the Human Research Ethics Committees of The Medical University of Vienna and Curtin University (HRE2018-0151, April 2018 and March 2019). Adults with OD, diagnosed by fibreoptic endoscopic evaluation of swallowing (FEES) and referred for VFSS as part of their assessment plan, were recruited from the Medical University of Vienna between July 2019 and March 2020. As FEES and VFSS are complementary instrumental assessments, diagnosis of OD by FEES supported appropriate participant selection [17]. Informed consent was obtained from all participants.

In- and out-patients accessing services for OD were eligible if they satisfied the following inclusion criteria: (1) were adults (>18 years old) with a diagnosis of OD, (2) had been deemed by their treating clinician to be medically and cognitively appropriate for VFSS, and (3) required VFSS to assess or manage their OD. Participants who had radical surgery of the head and/or neck were excluded.

A total of 40 participants were recruited. One patient was excluded due to data loss on the medical archive imaging system. Of the 39 remaining, 64% (*n* = 25) of participants were men and 36% (*n* = 14) were women. Participants ranged in age from 21 to 91 years (mean = 63.0 years, SD ± 17.0 years). The onset of OD (defined as the date of first symptoms) ranged from six days to five years prior to the VFSS, with onset less than 1 month prior to VFSS for 33% (*n* = 13), 1–6 months prior for 36% (*n* = 14), 7–12 months prior for 13% (*n* = 5), and more than 13 months prior for 18% (*n* = 7). Medical diagnoses in the participant group, while heterogeneous, can be grouped into four categories: cancer (primarily of head and neck), neurological disorder, surgery, and anatomical abnormality (Table S1: Aetiology of Oropharyngeal Dysphagia). Participants were assigned to groups based on the diagnosis which appeared most strongly associated with their OD diagnosis, based on consensus of the three authors (KS, RS, RC). For example, a participant with a history of lung cancer 20 years prior to VFSS who had a stroke the month before the VFSS was classified as 'Neurological disorder'. The anatomical abnormality group comprised conditions in which physical changes to bodily structures adjacent to and involved in the process of swallowing were the most likely cause of the OD (e.g., cervical spine abnormalities).

#### *2.2. Equipment and Materials*

The VFSS were performed by a radiologist on a fluoroscopy unit (Axiom Sireskop S3 fluoroscopy system, Siemens Healthineers; Siemens AG, Erlangen, Germany) with a Siricon high-dynamic-range image intensifier, spot film device and analog/digital acquisition at an image rate of 30 pulses/s. Patients were placed upright in a sitting position. The oropharynx and the proximal oesophagus were viewed in lateral projection and anterior–posterior positions. The boluses consumed comprised a non-ionic low-osmolar contrast (Omnipaque™), thickener (Nutilis Clear®), water and a cracker (Mini Toast, Delhaize®; Delhaize Group SA, Brussels, Belgium).

#### *2.3. Protocol*

Participants had part of their VFSS conducted according to a standardised protocol (Supplementary Materials, Figure S1). The protocol was designed to ensure participant safety by starting with small volumes of each texture (5 mL) and including cessation points if severe aspiration or residue was observed by the radiologist. As swallowing is affected by volume, texture and verbal instructions [18,19], the protocol included four different textures. These were administered using a standardised order, method and set of verbal instructions to maximise the variety of swallowing behaviours related to textures/volumes elicited, while controlling for the influence of the administrator. Textures/volumes were as follows: three trials of Thick (L3 International Dysphagia Diet Standardisation Initiative (IDDSI)), three trials of Thin (L0 IDDSI), four trials of pudding (L4 IDDSI), and one cracker (L7 IDDSI) [20] in four different volumes (5 mL, 10 mL, 20 mL, bite sized cracker). The average number of trials completed by the participants was seven (range 1–11), with the most common trial completed being 5 mL thick (completed by all 39 participants; however, data were lost from one due to technical issues, resulting in data from 38 participants for this trial). The VFSS was conducted by an experienced radiologist (>10 years' experience). The non-ionic low-osmolality contrast was mixed with food and fluids according to a standard recipe (Supplementary Materials, Table S2: Administration protocol [21]) and each was kept at room temperature prior to the procedure.

In addition to the VFSS, assessment of the participants' self-reported and observed functional health status (FHS) and clinician-perceived symptoms on VFSS were completed. Measures of FHS assess severity of OD symptoms from the perspective of daily functioning and impacts on participation in daily activities [22]. Participants who were referred for VFSS completed the Deglutition Handicap Index, Symptom subscale (DHI-S) [23] (a self-report FHS measure), and clinicians scored the Functional Oral Intake Scale (FOIS) [24] (an observational FHS measure) along with a 5-point ordinal scale to indicate the radiologist's overall impression of OD severity based on viewing VFSS. To overcome the pragmatic limitation of having a single rater scoring OD severity, data were triangulated by using two separate measures of OD severity (Supplementary Materials, Table S3: Functional Health Status and Severity measures, and Table S4: Functional Health Status and Severity measures scoring).

#### *2.4. Manual*

A manual was constructed based on Delphi study results. This manual includes: detailed instructions on contrast preparation, administration, patient positioning, items with descriptions, response scales, instructions for rating items, and anchor images [16].

#### *2.5. Raters, Consensus Meetings, Training and Rating*

Three raters used the draft VMV, informed by the Delphi results [16]. One rater had qualifications in speech-language pathology; the others were physicians with qualifications in radiology and phoniatrics, respectively. All three raters had over 10 years of experience with OD and VFSS. In a consensus meeting, the raters scored one patient through a full protocol, including all 11 trials, working item by item as a group. Each draft VMV item was discussed, and the manual was regularly referenced by the raters on the first trial (5 mL Thick). As the raters progressed through the trials in the protocol, only new items were discussed in detail unless there was disagreement in scoring. Adjustments were made to items and the manual based on this feedback. These adjustments included removing ambiguous language, adding anchor images and expanding response options. After six hours, a 100% group consensus was reached for each item. The raters then scored three VFSS recordings independently and convened for an additional two-hour consensus meeting to discuss questions about measure use and resolve any differences in ratings. This consensus process led to the development of the pilot version of the VMV. An overview of measure development and versions of the VMV is depicted in Figure 1.

All of the VFSS recordings were deidentified. Ratings were completed on 100% of recordings using the pilot VMV on Qualtrics (www.qualtrics.com accessed on 3 July 2020). The raters referred to the manual as needed. At least two weeks after initial rating, repeated ratings were completed on an additional six (15%) randomly selected participants' recordings by all three raters.

**Figure 1.** Overview of measure development and versions of the VMV.

#### *2.6. Item Reduction*

The pilot version of the VMV included 97 items, derived from results of the international Delphi study on visuoperceptual analysis of VFSS [16] and informed by the results of the consensus meetings regarding the draft version. After completing the ratings, the raters and the authors met to review the pilot version of the VMV item by item and reach consensus on whether each item should be kept, modified or rejected in the next iteration of the measure.

Decisions to retain or remove items were first made based on the clinical relevance of the item, where items considered less clinically important by a two-thirds majority of authors were removed. Consideration was then given to feasibility (e.g., items that are excessively time-consuming or difficult to view), redundancy between items, and the potential for multiple items to be consolidated into one (e.g., posterior movement of base of tongue and posterior pharyngeal wall contact with base of tongue). Lastly, all items which existed solely for the purposes of skip logic within the Qualtrics version (i.e., items which directed raters to a point further in the VMV based on their response) were removed and each such item's response options were consolidated into a related scale (Figure 2). Skip logic questions contribute to survey structure by allowing only relevant questions to be shown to participants, but their content overlaps with constructs assessed by other items. Removal prevents this overlap from causing issues in statistical analysis. Reducing these items prior to statistical analysis simplified the analysis and allowed statistical assumptions to be met. For example, factor analysis has minimum sample size requirements (at least 100 observations and at least 5 cases per item) [25], meaning that factor analysis of 97 items would require a minimum of 485 cases, which was beyond the scope of the current pilot study.
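The sample size rule cited above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the rule is read as "both conditions must hold" (an absolute floor of 100 observations and 5 cases per item); the function name and its parameters are illustrative and do not come from the study.

```python
def minimum_cases_for_factor_analysis(n_items, cases_per_item=5, floor=100):
    """Smallest sample satisfying both the absolute floor and the per-item rule.

    Assumption: the two requirements are combined by taking the larger value.
    """
    return max(floor, cases_per_item * n_items)


# 97 items in the pilot VMV, as stated in the text
print(minimum_cases_for_factor_analysis(97))  # 485

# For a short measure, the floor of 100 observations dominates
print(minimum_cases_for_factor_analysis(10))  # 100
```

Under this reading, the per-item requirement drives the minimum for any measure with more than 20 items, which is why item reduction lowers the sample needed for factor analysis.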

**Figure 2.** Skip logic—original question structure vs. removed skip logic with retained concept.

Item reduction following rater feedback is summarised in Figure 3. Details of items removed and rationales behind these decisions are described in Supplementary Materials, Table S5: Items removed or altered following rater consensus.

Item reduction resulted in the retention of 56 items. One new item, 'clearing swallow efficacy', was created with data derived from items rating the volume of residue that remained after clearing swallow/s. An overview of items retained per domain is displayed in Supplementary Materials, Table S6: Included items per domain. The researchers then evaluated whether each of the items was clear (i.e., whether it was evident what the item was assessing, whether the manual clearly described what to examine and when to assess) and whether the response scale was adequate (i.e., whether there were too many/too few options in the response scale or ambiguous wording). Of the 57 items evaluated, one was considered unclear and 30 required revisions to their respective response scales.

#### *2.7. Psychometric Properties*

An analysis of psychometric properties was conducted using the COSMIN taxonomy guidelines [5]. The COSMIN initiative, formulated as a response to the differing terminology found in the literature, developed a unified taxonomy to describe the different measurement properties of instruments [8]. The COSMIN taxonomy was used within this study to define the properties from the domains of reliability and validity, and COSMIN recommendations for the statistical analysis of their quality were also applied [5,11–13].

**Figure 3.** Item reduction from rater feedback.

Psychometric properties were determined if the characteristics of the data were appropriate for the intended statistical analysis (i.e., if the assumptions for statistical processing could be met) or if the analyses were feasible for the scope of a pilot study. Psychometric properties included in these analyses were:


Psychometric properties were omitted from this analysis if analysis was not possible, relevant or appropriate for the scope of a pilot. Criterion validity (diagnostic performance) refers to the degree to which scores adequately reflect a gold-standard measure [8]. The gold standard for assessment of OD is generally considered to be instrumental assessment [2,26]; however, both FEES and VFSS require the use of visuoperceptual measures to analyse them. There is currently no measure with sufficient evidence of psychometric quality for it to be recommended as a gold standard for VFSS or FEES [4]. Therefore, criterion validity could not be determined. Cross-cultural validity describes how well a translated or culturally adapted measure replicates the original [8]. As this is a novel measure, developed only in English and tested in a single geographical location and cultural group, this property was irrelevant. Content validity is the degree to which the content of a measure is an accurate reflection of the construct of interest based on cognate literature and expert opinion [8]. Although not examined in this study, the VMV's content validity was developed via a Delphi study that is reported in a separate manuscript [16].

Responsiveness refers to a measure's ability to detect clinically important change over time [8]. Repeated VFSS procedures and assessment pre-post intervention were beyond the scope of the current pilot study. Systematic and random errors in scores that are due to rater or measure errors rather than a true representation of patient change are classed as measurement errors [8]. Statistical analysis requires a total score (summed score) to examine this property, which the pilot measure did not include. Finally, interpretability, the degree to which clinically meaningful connotations can be assigned to the numerical scores or to changes in scores [8], was excluded. Although this is not a psychometric property, its importance is recognised in the COSMIN taxonomy due to the clinical relevance of applying qualitative meaning to quantitative data [5]. This property was not included in this analysis due to the relatively small sample size and the preliminary form of the VMV.

#### *2.8. Statistical Analysis*

Reliability was analysed using quadratic weighted kappa. The quadratic weighted kappa assesses the degree of disagreement between raters (scale of difference between ordered scorings). Kappa was computed for each rater pair, then averaged to provide a single index of inter-rater reliability [27]. Cronbach's alpha coefficients were calculated to assess internal consistency for each factor individually as well as for the whole measure. A low Cronbach's alpha value (alpha < 0.70) indicates inadequate internal consistency, whereas a very high Cronbach's alpha value (alpha > 0.95) suggests redundancy of items in the factor, which could mean that there are too many items to assess the target construct [28].
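The two reliability statistics described above can be sketched in plain Python. This is an illustrative sketch only, not the authors' analysis code: the function names, the 0-based category coding and the example scores are all assumptions, and the ratings shown are invented for demonstration.

```python
from itertools import combinations
from statistics import mean, variance


def quadratic_weighted_kappa(a, b, n_categories):
    """Cohen's kappa with quadratic weights for two raters.

    Scores are assumed to be integers coded 0..n_categories-1, and the
    raters are assumed to use at least two categories between them.
    """
    n = len(a)
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for x, y in zip(a, b):
        observed[x][y] += 1 / n
    row = [sum(observed[i]) for i in range(n_categories)]
    col = [sum(observed[i][j] for i in range(n_categories))
           for j in range(n_categories)]
    num = den = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            # Disagreement weight grows with the squared distance between scores
            w = (i - j) ** 2 / (n_categories - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j]  # disagreement expected by chance
    return 1.0 - num / den


def mean_pairwise_kappa(ratings_by_rater, n_categories):
    """Compute kappa for every rater pair, then average into a single index."""
    pairs = combinations(ratings_by_rater, 2)
    return mean(quadratic_weighted_kappa(a, b, n_categories) for a, b in pairs)


def cronbach_alpha(items):
    """items: one list of scores per item, with cases in the same order."""
    k = len(items)
    totals = [sum(case) for case in zip(*items)]
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


def interpret_alpha(alpha):
    """Apply the thresholds quoted in the text (0.70 and 0.95)."""
    if alpha < 0.70:
        return "inadequate internal consistency"
    if alpha > 0.95:
        return "possible item redundancy"
    return "acceptable"


# Invented ratings from three hypothetical raters on a 4-point scale
rater_1 = [0, 1, 2, 3, 2, 1, 0, 3]
rater_2 = [0, 1, 2, 3, 2, 1, 1, 3]
rater_3 = [0, 2, 2, 3, 2, 1, 0, 2]
overall_kappa = mean_pairwise_kappa([rater_1, rater_2, rater_3], 4)

# Invented scores for three hypothetical items across five cases
example_items = [[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]]
alpha = cronbach_alpha(example_items)
print(round(overall_kappa, 3), round(alpha, 3), interpret_alpha(alpha))
```

Quadratic weights penalise a two-point disagreement four times as heavily as a one-point disagreement, which suits ordinal response scales where near misses matter less than large discrepancies.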

With the exception of the inter- and intra-rater reliability analyses, scores from all three raters for 5 mL Thick were used for all analyses as this volume/texture had the largest case numbers available, allowing for statistical assumptions to be met. In the case of inter- and intra-rater reliability, analyses were performed between and within all raters, but were grouped by texture group (i.e., all volumes of Thick were grouped together for analysis). The grouping allowed for comparison of reliability between textures, as swallow behaviours and kinematics may be altered by texture differences [18].

The normality of the dataset informed the use of parametric or nonparametric statistics. Structural validity was analysed via exploratory factor analysis (EFA) using principal component analysis. Factor analysis is a multivariate technique which identifies the strength of the relationships between items and the underlying latent constructs in the dataset [25]. These latent constructs are referred to as factors, or dimensions. For example, some items in the measure may demonstrate strong relationships with 'severity' while others appear related to 'aetiology'. In EFA, all items are tested for a relationship to every latent construct. A second analysis, known as confirmatory factor analysis (CFA), may be performed to assess whether the model's factor structure can be replicated. A CFA is only performed if a model of factors is an adequate representation of the theoretical constructs of interest. A CFA was not performed in this study due to the small sample size, which meant that statistical assumptions were not met [25].

Hypothesis testing for construct validity was conducted using Spearman's rho correlations and the Mann–Whitney U test to test the following hypotheses, respectively:

**Hypothesis 1 (H1).** *70% of factors will be significantly positively correlated with the FOIS and 5-point ordinal scale*.

**Hypothesis 2 (H2).** *No significant differences between genders are expected on any of the item scores*.
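The rank-based correlation used for Hypothesis 1 can be sketched as follows. This is a minimal pure-Python illustration, assuming paired scores per patient; the helper names and the example values are hypothetical and do not reproduce the study's data or statistical software.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical factor scores and reversed FOIS scores for six patients.
factor_scores = [4, 7, 5, 9, 6, 8]
fois_reversed = [2, 5, 2, 6, 3, 4]
print(round(spearman_rho(factor_scores, fois_reversed), 3))
```

Working on ranks rather than raw scores is what makes the statistic suitable for ordinal scales and non-normal data.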

#### **3. Results**

#### *3.1. Functional Health Status and Severity Scores*

DHI-S scores describe patient severity from self-rating of physical symptoms. Scores ranged from 13 to 43, with a median of 28.0 (SD ± 7.9, Q1 = 20.0, Q3 = 32.0). Five-point ordinal scale scores ranged from 1 to 5, with a median of 3.0 (SD ± 1.2, Q1 = 2.0, Q3 = 4.0). FOIS (reversed) scores ranged from 1 to 7, with a median of 3.0 (SD ± 1.9, Q1 = 2.0, Q3 = 5.0).

#### *3.2. Reliability (Intra- and Inter-Rater Reliability)*

A quadratic weighted Kappa assessed the degree to which raters produced consistency in the scores they applied between participants, and agreement within scores given to participants on repeated measures [29]. Weighted Kappas between pairs of raters ranged from 0.842 to 0.939, with minor differences between consistencies or views (Table 1. Interrater reliability—Weighted Kappa per Texture). The resulting overall average inter-rater weighted Kappa was in the 'strong' range, with an average weighted Kappa of 0.889 [27]. This indicates that raters had a high degree of agreement and suggests that the function or impairment of swallowing as measured by VMV items was coded similarly and consistently across the three raters.

**Table 1.** Interrater reliability—Weighted Kappa per Texture.


Total intra-rater weighted Kappa on repeated measures of six participants showed excellent intra-rater reliability, resulting in a Kappa of 0.944 (Rater One Mean = 0.948, Rater Two Mean = 0.962, Rater Three Mean = 0.921). Agreement was not calculated between textures due to small data sets. Overall, these results suggest that there is a high level of consistency between raters and that a minimal amount of error was introduced by the independent raters. Ratings were therefore deemed to be suitable to conduct hypothesis testing as outlined before.

#### *3.3. Structural Validity*

#### Exploratory Factor Analysis

Fifteen items of the 57 retained for the trial measure were excluded from the EFA, as they pertained solely to textures or views (e.g., solids or anterior/posterior) other than 5 mL Thick (lateral view). This resulted in 42 items being assessed in the EFA. The trial using 5 mL Thick was selected for EFA due to this being the trial with the highest number of cases (*n* = 114), being the first trial in the protocol, and thus being best suited to meeting statistical assumptions for EFA. EFA requires at least five cases per item [25]; given 114 cases for 5 mL Thick, the maximum number of items permitted in a single EFA was 22 (22 × 5 = 110). Therefore, the 42 items were divided into two groups.
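The arithmetic behind the split can be written out as a short sketch. It assumes only the rule of five cases per item cited above; the function names are illustrative.

```python
import math


def max_items_per_efa(n_cases, cases_per_item=5):
    """Largest number of items one EFA can hold under the cases-per-item rule."""
    return n_cases // cases_per_item


def min_item_groups(n_items, n_cases, cases_per_item=5):
    """Fewest groups needed so that every group respects the rule."""
    return math.ceil(n_items / max_items_per_efa(n_cases, cases_per_item))


print(max_items_per_efa(114))    # 22
print(min_item_groups(42, 114))  # 2
```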

Item groupings were initially constructed based on theory and clinical reasoning, with items pertaining to anatomically close regions (e.g., oral and oropharyngeal) and/or impairments or events which are closely related (e.g., aspiration and penetration) being grouped. Initial analysis revealed eight factors in both groups. New groupings were created by moving a single item at a time between groups. This process was informed by clinical reasoning, empirical literature and factor loadings (i.e., if a single item was creating a factor by itself, it was moved to another group to attempt to eliminate a one-item factor). The impact of moving single items was evaluated by examining changes in factor loading and total variance, and allowing the items to demonstrate relationships to other items. For example, items related to aspiration or penetration loaded on different factors when items related to the UES functioning were included in the factor analysis, whereas if UES items were excluded from the analyses, both aspiration and penetration items loaded on the same factor.

After the total number of factors was reduced as much as possible, clinical reasoning was used to allocate ambiguous items (i.e., those which loaded approximately equally on more than one factor) to a factor. During this process, three items, 'Piecemeal Deglutition', 'Volume Tracheal residue' and 'Coordination of the upper oesophageal sphincter', were removed due to erratic behaviour (creating weak, theoretically inconsistent single or two-item factors). Finally, the groupings with the combination of items which best represented the most concise and theoretically coherent factors of item loadings were retained.

This process resulted in two EFA models consisting of five and four factors, respectively (Tables 2 and 3, Exploratory Factor Analysis). In Group One, five factors explained 71.8% of the total variance, with most items loading on Factors One and Two. Factor One explained 18.6% of the variance, indicating multidimensionality, as Factor One accounted for <20% of the variability and the ratio of the variance from Factor One to Two was less than four [30]. The Group Two factor loadings reflected similar findings. A four-factor solution explained 77.4% of the total variance, with most items loading on Factors 6 and 7 (factors are numbered sequentially, continuing from Group One to Group Two), and a ratio of less than four in variance between the two factors. These findings also suggest multidimensionality.


**Table 2.** Exploratory factor analysis—factor loadings of model one.

The bold represent the proposed models for the loading on these factors.

#### *3.4. Internal Consistency*

Cronbach's alpha was calculated per factor and for the whole measure for 5 mL Thick data on the 39 items retained from the EFA results, following the removal of three items creating erratic behaviour (Table 4. Internal Consistency). Scores for all factors and overall were adequate (Cronbach's alpha >0.70 and <0.95), except for Factor Four (0.698) [28].
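For readers who wish to see how the statistic and its thresholds are applied, the following is a minimal sketch of Cronbach's alpha and of the interpretation bands stated in the Methods (alpha < 0.70 inadequate; alpha > 0.95 possible redundancy). The function names and the example scores are hypothetical; only the value 0.698 comes from the results above.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, each a list of per-case scores."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n_cases = len(item_scores[0])
    if any(len(item) != n_cases for item in item_scores):
        raise ValueError("every item needs a score for every case")
    # Total score per case, summing across items.
    totals = [sum(case) for case in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


def interpret_alpha(alpha):
    """Apply the bands used in this study [28]."""
    if alpha < 0.70:
        return "inadequate"
    if alpha > 0.95:
        return "possible redundancy"
    return "adequate"


# Hypothetical scores for a three-item factor rated in five cases.
factor_items = [
    [1, 2, 3, 4, 2],
    [1, 3, 3, 4, 1],
    [2, 2, 4, 4, 2],
]
alpha = cronbach_alpha(factor_items)
print(round(alpha, 3), interpret_alpha(alpha))
print(interpret_alpha(0.698))  # Factor Four: inadequate
```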


**Table 3.** Exploratory factor analysis—factor loadings model two.

The bold represent the proposed models for the loading on these factors.

**Table 4.** Internal Consistency.


#### *3.5. Hypothesis Testing for Construct Validity*

The data were not normally distributed; therefore, nonparametric correlations were calculated.

*Hypothesis One*, which stated that factor scores will be significantly positively correlated with FOIS and 5-point ordinal scale scores in 70% of factors, was partially supported (Table 5. Factors' correlation with FOIS and 5-point ordinal scale). Correlations were weak to moderate and positive (FOIS mean: 0.171, range = −0.157–0.415; 5-point ordinal scale mean: 0.199, range = −0.055–0.432); they were statistically significant in seven of nine (77%) factors, and positive but non-significant in one (Factor 8: Penetration) [31,32]. The factor containing UES Function items generated weak inverse correlations (−0.157 and −0.055).


**Table 5.** Factors' correlation with FOIS and 5-point ordinal scale.

\* Correlation is significant at the 0.05 level (2-tailed). \*\* Correlation is significant at the 0.01 level (2-tailed).

*Hypothesis Two*, which stated that there will be no significant difference in item scores between genders, was supported. A Mann–Whitney U test found no significant difference between the scores of male and female patients (male: mean rank = 2393.28, sum of ranks = 7,237,279.00; female: mean rank = 2396.59, sum of ranks = 4,227,587.00; *U* = 2,663,479.00; *p* = 0.932, two-tailed).
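The reported rank sums and *U* statistic can be cross-checked with the standard identity *U* = *R* − *n*(*n* + 1)/2, where *R* is a group's sum of ranks and *n* its number of observations. The sketch below is illustrative only: the group sizes are not stated in the text, so they are inferred here from sum of ranks divided by mean rank, which is an assumption of this example.

```python
def n_from_ranks(sum_of_ranks, mean_rank):
    """Infer the number of observations from a rank sum and a mean rank."""
    return round(sum_of_ranks / mean_rank)


def u_statistic(sum_of_ranks, n):
    """Mann-Whitney U for one group, from its rank sum and size."""
    return sum_of_ranks - n * (n + 1) / 2


# Values reported for the gender comparison.
male_sum, male_mean = 7_237_279.0, 2393.28
female_sum, female_mean = 4_227_587.0, 2396.59

n_male = n_from_ranks(male_sum, male_mean)        # 3024 (inferred)
n_female = n_from_ranks(female_sum, female_mean)  # 1764 (inferred)

u_male = u_statistic(male_sum, n_male)
u_female = u_statistic(female_sum, n_female)

# The two U values must sum to n_male * n_female; the smaller is reported.
assert u_male + u_female == n_male * n_female
print(min(u_male, u_female))  # 2663479.0
```

The smaller of the two values matches the reported *U*, so the rank sums, mean ranks and test statistic are internally consistent.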

#### **4. Discussion**

The psychometric properties of the pilot VMV were evaluated in this study. The analysis was conducted with reference to a classical test theory (CTT) psychometric paradigm and the COSMIN framework [9–13]. CTT is well-suited for initial investigations of psychometric properties [33] and is useful in measure development, as many constructs of interest are not directly observable in health practice. For example, laryngeal vestibule closure may be purported to be assessed by VFSS analysis; however, 'closure' is not directly measured. The clinician's perception of the proximity of pixels produced by digitisation of fluoroscopy is the observable data. Clinicians assign meaning to this 'proxy indicator' to measure the unobserved construct of 'closure'. CTT-informed analysis determines the success of the proxy indicator in measuring the unobservable phenomenon [34]. A key tenet of CTT is that the scores of each item are produced by a combination of the unobservable 'true' score, summed with the unavoidable errors and biases introduced by the use of a proxy indicator. Errors in CTT are assumed to be random and unique to each item [34]. The COSMIN framework was used to define the psychometric terms applied and to guide the statistical methodology used [11–13].

Statistical analysis found that the inter-rater reliability coefficients of the VMV were in the 'strong' range overall and included scores in the excellent range between Raters One and Two. This indicated that the target concepts were clearly and consistently understood between the three raters from different professions—speech-language pathology (SLP), radiology and phoniatrics. This was reflected in the item reduction process, where the majority of items selected for the next version of the measure were considered 'clear/unambiguous' by all raters. Intra-rater reliability was excellent, indicating that the pilot VMV supports a consistent internal schema within raters that is stable across time [27].

Structural validity analysis via EFA produced a 5-factor and a 4-factor solution. Group One contained variables primarily relating to swallowing events and kinematics occurring superiorly in the oropharyngeal tract and early in the swallowing process (e.g., hyoid movement). Group Two contained items pertaining to laryngeal, hypopharyngeal and late-stage events (e.g., residue post swallow). However, some items behaved erratically (e.g., 'piecemeal deglutition' caused a factor with a single item loading on it) and some items had ambiguous loadings (e.g., 'Oropharynx residue volume' loaded similarly on two factors). This is likely related to sample size. Items with ambiguous loadings were allocated to a group and factor based on theoretical consistency of the grouping (e.g., oropharynx residue volume was grouped with the factor containing 'oral residue volume' as opposed to the factor containing 'location of material at swallow initiation', as the pairing with another item measuring residue, rather than a temporal event, is more logically consistent). Three items, 'Piecemeal deglutition', 'Volume tracheal residue' and 'Coordination of the upper oesophageal sphincter', were removed as they created single-item factors or groupings which were illogical. Therefore, the groups represent preliminary proposals at this time; conclusive evidence of factor structure will require greater numbers of participants.

EFA indicated that the measure is multidimensional, meaning that the construct under assessment has two or more dimensions. In VFSS, a simple construct such as velum movement may be unidimensional (i.e., the underlying dimension of the construct is velum elevation). A multidimensional construct might be aspiration, where the dimensions contributing to the construct include volume of aspirate, time when aspiration occurs, and the patient's awareness of the event. In the context of the pilot VMV, this finding means that visuoperceptual examination of VFSS likely involves multiple underlying dimensions. However, this needs to be confirmed in a larger sample with an EFA that includes all items in a single analysis (as opposed to split into two groups) followed by a confirmatory factor analysis. Total percentage of variance explained was >70% for both models, indicating that random error was not excessive [35].

Internal consistency was good (alpha > 0.7 but < 0.95) for 8 of the 9 factors and overall, with only one factor (Factor Four, which contained items pertaining to premature spillage and swallow initiation) falling short of the lower threshold, by only 0.002 [28]. This indicates good content coverage, but item reduction may be possible to streamline the measure. Further analysis of the preliminary measure using the Rasch measurement model (RMM), a type of item response theory, would provide additional information about the dimensionality, differential item functioning, person-ability scores, and item difficulty scores. This would assist in identifying items that do not meet RMM person and item fit criteria and could subsequently be discarded [33].

Hypothesis testing for convergent validity tested two hypotheses. The first, an expected positive relationship between VMV and both FOIS and 5-point ordinal scale scores, was partially supported. All but one factor had a weak, positive statistically significant correlation. That is, as the degree of impairment increased (as measured by texture prescription) and the radiologist's perception of overall severity of OD increased, so did scores on the VMV. The factor containing the UES items was negatively correlated with both FOIS and the 5-point ordinal scale. It might be expected that the UES, as the terminal part of the pharynx, would reflect dysfunction from superior abnormalities of the oral cavity, pharyngeal shortening and constriction, cervical spine and hyolaryngeal function [36]. However, the inverse correlation indicates that this was not the case in this pilot. This finding may be related to the small sample size or the texture/volume analysed; 5 mL Thick may not be ideal to reveal UES deficits because the small volume is less likely to be problematic for passage through the UES, given that larger thick volumes produce greater durations of opening, amplitudes of relaxation and earlier opening onset (i.e., thick volumes induce greater challenges to the swallow system) [37]. The inverse correlation result may also be related to the construct itself. For instance, the UES items were the only items where 'opening' was measured, while other items assess contact with other structures, volumes of material and timing of kinematics. As this was a pilot study, explanation of this finding cannot be conclusive. Further analysis in a larger sample is required.

The second hypothesis, a lack of association between scores on VMV and gender, was supported. This result was expected given that OD severity as perceptually analysed on VFSS should have no association with gender [38]. These two findings indicate that it is likely that the VMV is measuring the target construct. Finally, a review by the authors of the feasibility, clinical relevance and redundancy of the items found that approximately half of the items could be removed. This is expected in measure construction, where multiple items may assess the same construct in the pilot and then the most suitable are retained following initial testing. Removal of items also assists in developing a measure's suitability for clinical use; the pilot iteration of the measure was excessively time consuming, taking over 40 min for analysis. A measure useful for practice must balance adequate content coverage with feasible administration time.

The pilot VMV exhibits evidence of content validity [16], intra- and inter-rater reliability, structural validity, internal consistency and construct validity (via hypothesis testing). In a psychometric review of current visuoperceptual VFSS measures, only nine measures were found where evidence of the scale's validity and reliability was reported. The quality of the reported psychometric properties was limited, primarily due to unclear reporting and methodological flaws [4]. The VMV represents the first visuoperceptual measure for VFSS that has been constructed with reference to the international best practice guidelines of the COSMIN initiative [10,39]. The VMV has evidence of its robust content validity, established through an extensive international Delphi process involving 50 experts from 27 countries [16]. In addition, this measure was piloted using raters from three different disciplines (SLP, radiology and phoniatrics) and their expertise informed measure refinement and item reduction. No other measure has utilised such comprehensive and robust methodology [4]. Similarly, initial evidence of the VMV's structural validity and dimensionality was provided through the EFA results.

#### *Limitations and Future Research*

Limitations of this study include the small sample size, which resulted in the reliability and EFA analyses being limited to 5 mL Thick to meet statistical assumptions. The study was conducted at a single site, and while the population was reasonably heterogeneous, the sample does not comprehensively reflect all possible aetiologies and comorbidities of the OD population. The analysis was conducted using only a CTT framework, which is known to have a number of limitations. For example, each item's score is comprised of its 'true' score and random error in CTT, and as the distribution of the error is random around a mean of zero, errors from different items will generally negate each other. This means that scales which include many items may yield disproportionately strong reliability [34]. However, the application of CTT represents a first step in the psychometric evaluation of the VMV. The combination of CTT with another theoretical framework, such as the RMM, would yield further valuable insights about the measurement properties of the VMV [12,40]. In addition, some psychometric properties (e.g., test re-test, measurement error) and interpretability were out of the scope of this study. Finally, this study reports on a pilot version of the VMV that is not yet ready for formal clinical use. It is anticipated that future studies involving larger patient populations will allow additional statistical analysis (e.g., EFA and RMM analysis including all items), investigation of additional psychometric properties, and investigation using psychometric paradigms that complement each other (i.e., CTT and IRT). Together, these will help create a refined version of the preliminary VMV which is suitable for clinical use.

#### **5. Conclusions**

The CTT analysis indicates that the initial psychometric properties of a pilot version of the VMV may be adequate for analysing VFSS in a valid and reliable manner. The VMV appears to have good inter- and intra-rater reliability. The VMV is multidimensional, based on EFA results, and exhibits good internal consistency. Hypothesis testing for construct validity indicates that the relationship between OD severity and population characteristics is as expected, with VMV severity scores increasing as functional severity on other measures increases. Future studies of the preliminary VMV with larger samples and additional statistical analysis using the RMM are recommended, as these will add to the psychometric evidence for the VMV. The VMV pilot study represents the first step in developing a robustly validated measure for visuoperceptual analysis of VFSS which is intended to be suitable for research and clinical purposes in its final version.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11030724/s1, Table S1: Aetiology of Oropharyngeal Dysphagia; Figure S1: VFSS protocol; Table S2: Administration protocol; Table S3: Functional Health Status and Severity measures; Table S4: Functional Health Status and Severity measures scoring; Table S5: Items removed or altered following rater consensus; Table S6: Included items per domain.

**Author Contributions:** Conceptualization, R.S. and R.C.; methodology, R.S., R.C.; formal analysis, K.S., R.S.; investigation, K.S., M.S., D.F.; resources, M.S.; data curation, K.S.; writing—original draft preparation, K.S.; writing—review and editing, K.S., R.S., T.B., R.C.; supervision, R.S., T.B., R.C.; project administration, K.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors wish to acknowledge Curtin University and the Australian Federal Government for the Curtin University Postgraduate Scholarship (CUPS) and the Australian Postgraduate Award (APA).

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Human Research Ethics Committees of The Medical University of Vienna and Curtin University (HRE2018-0151, 11/04/2018 and March 2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available on request from the first author. The data are not publicly available due to conditions of approval from the governing Human Research Ethics Committee.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Pitch Discrimination Testing in Patients with a Voice Disorder**

**Duy Duong Nguyen 1,2,\*, Antonia M. Chacon 1, Daniel Novakovic 1,3, Nicola J. Hodges 4, Paul N. Carding 5 and Catherine Madill 1**


**Abstract:** Auditory perception plays an important role in voice control. Pitch discrimination (PD) is a key index of auditory perception and is influenced by a variety of factors. Little is known about the potential effects of voice disorders on PD and whether PD testing can differentiate people with and without a voice disorder. We thus evaluated PD in a voice-disordered group (*n* = 71) and a non-voice-disordered control group (*n* = 80). The voice disorders included muscle tension dysphonia and neurological voice disorders and all participants underwent PD testing as part of a comprehensive voice assessment. Percentage of accurate responses and PD threshold were compared across groups. The PD percentage accuracy was significantly lower in the voice-disordered group than the control group, irrespective of musical background. Participants with voice disorders also required a larger PD threshold to correctly discriminate pitch differences. The mean PD threshold significantly discriminated the voice-disordered groups from the control group. These results have implications for the voice control and pathogenesis of voice disorders. They support the inclusion of PD testing during comprehensive voice assessment and throughout the treatment process for patients with voice disorders.

**Keywords:** auditory discrimination; voice control; voice assessment; voice disorders

#### **1. Introduction**

Laryngeal muscle control in voice production is affected by auditory feedback and sensorimotor reflexes [1]. There are overlapping anatomical pathways in the brain that encode similar acoustic information presented in both music and voice, such as waveform periodicity and amplitude envelope [2]. Coordination of laryngeal muscles in phonation depends upon motor planning, muscle activation, and feedback provided by auditory systems [1,3]. It has been demonstrated that disturbances in auditory perception/discrimination are related to problems within auditory motor reflexes governing effective laryngeal control. These perception problems lead to abnormal motor control patterns as observed in people with hyperfunctional dysphonia [4,5]. The impairment of temporal auditory function in patients with behavioral dysphonia may affect the success of voice therapy, suggesting the need for auditory processing assessment [6].

A disordered voice is defined as a voice that does not meet the occupational or social needs of the speaker and is inappropriate given the speaker's age, gender, or situation [7]. Voice disorders can be classified according to the aetiology of the voice dysfunction [8]. Functional voice disorders include muscle tension voice disorder (MTVD) and psychogenic voice disorders [9]. Functional voice disorders may result from poor detection of pitch, volume, and voice quality dimensions in the absence of any neurological motor and sensory deficit [8,10]. In contrast, in neurological voice disorders, there is damage to the motor and/or sensory pathways. Distinguishing between different types of voice disorders requires not only voice quality assessment but also perception assessment, which allows conclusions to be made about the dependence of patient's perception and vocal production upon specific sensory and motor pathways.

**Citation:** Nguyen, D.D.; Chacon, A.M.; Novakovic, D.; Hodges, N.J.; Carding, P.N.; Madill, C. Pitch Discrimination Testing in Patients with a Voice Disorder. *J. Clin. Med.* **2022**, *11*, 584. https://doi.org/10.3390/jcm11030584

Academic Editor: Renée Speyer

Received: 15 December 2021 Accepted: 18 January 2022 Published: 24 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Auditory perception function can be evaluated using pitch discrimination (PD) testing. Pitch is a perceptual attribute of sound that has important roles in the human voice and PD is the ability to correctly detect intervals/differences between pitches of pure or complex tones. This ability to perceive different pitches is reflected in the experience of both perceiving and producing sound. PD also reflects auditory discrimination function. Tonal language speakers show greater pitch perception accuracy than non-tonal language speakers [11] and people with a musical background discriminate pitches more accurately than those with a non-musical background [12,13].

The neural processing of pitch is complicated. It involves hierarchical responses and mainly occurs in the right hemisphere of the brain, including the superior temporal gyrus, lateral Heschl's gyrus, inferior frontal gyrus, insular cortex, and the inferior colliculus [14,15]. People possess variable pitch perception ability, with some apparently having more difficulties in PD than others, probably due to their use of sub-optimal brain regions (e.g., left hemisphere) for pitch processing [16]. It was also found that there is differential neural pitch processing in the left and right hemispheres that allows the auditory system to detect temporal and spectral changes in the auditory feedback necessary for voice control [17]. PD is impaired in some congenital and acquired neurological conditions that involve organic neurological dysfunction. Congenital amusia (prevalence of 1.5%) [18] results in impaired pitch processing due to abnormal deactivation of the right inferior frontal gyrus [19]. Given that the auditory cortex shows normal responses to pitch in this condition, the suggestion has been made that the impairments are due to reduced white matter functional connections between the auditory and inferior frontal cortices [19]. Traumatic brain injury is also known to affect pitch perception ability due to damage to the underlying pitch processing regions [20].

Auditory discrimination problems have been shown in people with functional voice disorders. Abur et al. [4] showed that patients with hyperfunctional voice disorders had poorer auditory discrimination and more atypical adaptive responses to fundamental frequency (F0) shifts than those without the condition. Stepp et al. [5] showed that patients with hyperfunctional voice disorders demonstrated different patterns of adaptive responses in pitch perturbation tasks compared with controls. They suggested a disruption between auditory processing and laryngeal motor control. The pitch-shift reflex shows how well individuals can adapt their own pitch according to auditory feedback and has been examined in patients with muscle tension dysphonia (MTD) [21]. Compared with a control group without dysphonia, MTD patients had a significantly larger magnitude adaptive response to changes in auditory feedback, suggesting some type of dysfunction or dysregulation between pitch perception and voice production [21]. There were signs of deficits in temporal auditory processing, auditory discrimination, and adaptive responses in those with voice disorders, compared with those without [6].

Despite some evidence that voice disorders are associated with auditory processing problems, there are several studies which show discrepant results. Davis and Boone [22] compared PD and tonal memory between 30 adult patients with hyperfunctional voice disorders and 30 control participants and showed no significant differences between the two groups. However, there were participants who demonstrated difficulties in PD or remembering a tonal sequence [22]. Another study showed no relationship between PD and voice production in children with and without vocal nodules [23]. The above-mentioned literature has therefore shown conflicting findings related to the association between PD and voice disorders.

In patients with neurological voice disorders, some studies have also reported dysfunctional auditory perception. In spasmodic dysphonia, a neurological voice dystonia of cortical origin, dysfunctional sensory-motor processing was shown when these patients were presented with altered pitch feedback [24]. Patients with unilateral vocal fold paralysis had reduced auditory-processing ability and vocal motor function compared with healthy controls after surgical vocal fold augmentation procedures, as well as differences in the neural areas associated with vocal motor function [25,26]. These studies provide evidence that damage to the lower motor neuron pathways involved in production affects the upper motor neuron pathways (i.e., cortical) involved in perception. Given that vocal production ability shares some neurological pathways in tasks such as speech and musical processing [27,28], it is reasonable to hypothesize that dysfunction in voice production might have effects on PD. Clarifying whether there is such a link between voice quality and pitch perception would be the basis to deliver relevant/specific perception training in parallel to voice restoration/treatment.

We used the Newcastle Assessment of Pitch Discrimination (NeAP) [29] as part of a routine comprehensive assessment of voice function and auditory discrimination. In a previous study, this tool was shown to be reliable and clinically applicable [30]. The aims of the present study were to (1) examine PD characteristics in patients with voice disorders in comparison with non-voice-disordered speakers; and (2) evaluate the value of PD testing in differentiating voice-disordered patients from non-disordered speakers. The overall purpose was to provide clinical data on the use of PD testing in voice-disordered patients, to determine whether patients' auditory perception function needs attention for successful voice treatment, and to provide insight into voice control mechanisms and the pathogenesis of voice disorders.

#### **2. Materials and Methods**

#### *2.1. Study Design*

This was a cross-sectional study where both voice and auditory discrimination data were collected at a teaching voice clinic at The University of XX. The clinic performed comprehensive standardized voice assessment, including PD testing.

#### *2.2. Participants*

#### 2.2.1. Voice-Disordered Groups

There were 71 patients (54 females and 17 males) with a confirmed diagnosis of primary or secondary MTVD or a neurological voice disorder. The mean age was 38.5 years (standard deviation, SD = 15.5 years, range = 18–82). Five (7.0%) were vocal performers, 29 (40.8%) were professional voice users, and 37 (52.1%) worked in other occupations. All patients were diagnosed by a laryngologist following standardized multidimensional voice assessment protocols in the University of XX's voice clinic. Diagnosis was based on patient-reported outcome measures, such as the Voice Handicap Index (VHI-10) [31], speech language pathologist's (SLP) voice assessment, voice recordings for acoustic analysis, and videostrobolaryngoscopy. There were no patients with hearing impairments, as confirmed through audiometric screening (i.e., passing a 20-decibel pure-tone threshold at 500 Hz, 1 kHz, 2 kHz, and 4 kHz). Participants were excluded if there were self-reported symptoms or clinical signs of speech disorders, cognitive impairments, neurodegenerative conditions, or hearing loss.

In the voice-disordered group there were two sub-groups: MTVD and neurological voice disorder. Table 1 shows patient numbers in each group. In the MTVD group, 26 were diagnosed with primary MTVD and 24 had secondary MTVD with lesions deemed related to phonotrauma such as vocal nodules, pre-nodular swellings, and mucosal thickening. There were 21 patients with neurological voice disorders including vocal fold paresis (*n* = 11), vocal fold paralysis (*n* = 4), tremor (*n* = 3), and laryngeal dystonia (*n* = 3). No patient in the neurological disorder group had Parkinson's disease, another neurodegenerative condition, or neuro-cognitive problems. In total, 45 voice-disordered patients had a musical training background and 26 did not.


**Table 1.** Number of participants by groups. MTVD: muscle tension voice disorder.

#### 2.2.2. Control Group

There were 80 participants, all female, with a mean age of 23.5 years (SD = 4.3 years, range = 18–40). All were speech language pathology students. They self-reported as having no voice problems at the time of the study and underwent voice screening using a case history questionnaire and the VHI-10 [31]. Inclusion criteria included no current voice symptoms, VHI-10 < 7.5 [32], normal hearing, and no current upper respiratory problems. Two certified practicing speech language pathologists perceptually assessed their voices using a standardized protocol and confirmed that their voices were non-dysphonic.

Participants in both groups completed a case history questionnaire to determine history of voice disorders, current voice problems, language backgrounds, musical background, and voice/musical training. Musical background was defined as having formally practiced a musical instrument for at least one year after the age of 5 years.

#### *2.3. Voice Assessment*

Mean VHI-10 score for the voice-disordered group was 20.48 (SD = 10.34, 95% confidence interval, CI = 18.03–22.93), which was above the cut-off score for a voice disorder (>7.5) [32]. Mean VHI-10 score for the control group was 2.28 (SD = 2.03, 95% CI = 1.82–2.73) which fell within the non-disordered range [32].
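The reported confidence interval for the voice-disordered group can be checked from the summary statistics alone. The sketch below assumes a conventional t-based interval for a mean (the text does not state which method was used); the function name `mean_ci` is illustrative, and the critical value is taken from a standard t table rather than from the study.

```python
import math

def mean_ci(mean, sd, n, t_crit):
    """Return (lower, upper) bounds of a t-based confidence interval for a mean."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Voice-disordered group as reported: mean 20.48, SD 10.34, n = 71.
# 1.9944 is the two-tailed 95% critical t value for 70 degrees of freedom
# (a table value, assumed here; it is not given in the text).
low, high = mean_ci(20.48, 10.34, 71, t_crit=1.9944)
print(f"95% CI = {low:.2f} to {high:.2f}")
```

Under this assumption the bounds agree with the reported interval to two decimal places.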

Acoustic analyses were performed as part of the voice assessment protocols on standardized vocal tasks (middle three seconds of sustained vowel /a/, the third CAPE-V phrase [33], and the second and third sentences of the Rainbow Passage [34]). Acoustic measures analyzed for each participant included the harmonics-to-noise ratio (HNR) and the cepstral/spectral index of dysphonia (CSID) [35,36]. Acoustic voice data for the groups are presented in Table 2.

**Table 2.** Mean (SD) and 95% confidence intervals for the mean of voice data for voice-disordered and control groups. The *p* value indicates significance level of independent *t*-test comparisons of the voice-disordered group (*n* = 71) and control group (*n* = 80) for each measure. HNR: harmonics-to-noise ratio; CSID: cepstral/spectral index of dysphonia.


#### *2.4. Pitch Discrimination Testing*

#### 2.4.1. Pitch Discrimination Testing Tool

We used the NeAP [29], which is a two-tone computer-based PD task, where listeners are required to stipulate which tone of a given pair is higher in pitch, or whether they are the same. One study reported on the use of this tool in assessing PD [30], showing it to be reliable, with a moderate to good prediction value in ascertaining one's musical background.

#### 2.4.2. Protocols

All PD tasks were performed in a sound-protected room with ambient noise measured between 50 and 55 dB sound pressure level (SPL) to avoid effects of noise on auditory discrimination. The NeAP program included 20 tone pairs of sine waves. Each tone pair had a lower frequency tone and a higher frequency tone, with a range of pitch differences between the lower tone and higher tone (Appendix A). The lowest and highest frequency of the lower tones was 123.47 Hz and 293.66 Hz, respectively. The lowest and highest frequency of the higher tones was 130.81 Hz and 311.13 Hz, respectively. The pitch differences between tone pairs ranged from 2.29 Hz to 32.03 Hz (29.98 to 200.01 cents). One semitone is equal to 100 cents.
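The relationship between a frequency difference in Hz and a pitch interval in cents can be sketched as follows. This is a minimal illustration using the standard definition (1200 cents per octave); the function names are hypothetical, and the two frequencies are used only as an example of a one-semitone step, since the actual tone pairings are those listed in Appendix A.

```python
import math

def cents_between(f_low_hz, f_high_hz):
    """Pitch interval in cents between two frequencies (1200 cents per octave)."""
    return 1200.0 * math.log2(f_high_hz / f_low_hz)

def shift_by_cents(f_hz, cents):
    """Frequency reached by raising f_hz by the given number of cents."""
    return f_hz * 2.0 ** (cents / 1200.0)

# Illustrative check: 123.47 Hz and 130.81 Hz are about one semitone apart.
interval = cents_between(123.47, 130.81)
raised = shift_by_cents(123.47, 100.0)
print(f"{interval:.2f} cents, {raised:.2f} Hz")
```

Because the scale is logarithmic, the same interval in cents corresponds to a larger difference in Hz at higher base frequencies.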

The tone pairs were played on a Dell computer (Latitude 7280) via two speakers (Harman/Kardon HK645) calibrated to 65.0–65.2 dBA hearing level (HL). Hearing level was measured at 5 cm lateral to the external ear meatus using a lingWAVES sound pressure level meter II model IEC 651. The participant was seated 1 m away equidistantly from the speakers. Participants completed the default protocol of the NeAP program. No training or trial was provided apart from instructions to listen to the tone pairs and to indicate which tone was higher in pitch or if the pitch sounded the same. Participants provided their responses by clicking on one of three buttons on the computer screen. Each button represented 'tone 1 was higher', 'tone 2 was higher' or 'both tones were the same'. The 20 tone pairs were presented a second time in a new random order in the same session for reliability analysis. The duration of each tone was 300 milliseconds (ms) and the pause between any two tones was 500 ms. The procedure lasted on average 6 min. The percentage of accurate responses was calculated for each tone pair by dividing the number of accurate responses by the total responses for that tone pair. Outcome measures included the percentage of accurate responses (%) and the mean PD threshold (cent) of correct responses.
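The two outcome measures can be illustrated with a short sketch for a single participant. It assumes that the mean PD threshold is the average pitch interval (in cents) of the correctly answered tone pairs, which is one reading of the description above; the response data and the function name `score_session` are hypothetical.

```python
def score_session(responses):
    """Score one participant.

    responses: list of (interval_cents, is_correct) tuples, one per tone pair.
    Returns (percentage of accurate responses, mean interval of correct responses).
    """
    if not responses:
        raise ValueError("at least one response is required")
    correct_intervals = [cents for cents, is_correct in responses if is_correct]
    accuracy_pct = 100.0 * len(correct_intervals) / len(responses)
    # The mean threshold is undefined if no tone pair was answered correctly.
    mean_threshold = (
        sum(correct_intervals) / len(correct_intervals) if correct_intervals else None
    )
    return accuracy_pct, mean_threshold

# Hypothetical responses for four tone pairs (not study data).
accuracy, threshold = score_session(
    [(30.0, False), (50.0, True), (100.0, True), (200.0, True)]
)
print(accuracy, threshold)
```

Under this reading, a participant who only answers wide intervals correctly obtains a higher mean threshold, which is consistent with a lower threshold indicating better discrimination.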

#### *2.5. Statistical Analysis*

Statistical analyses were completed using SPSS 28.0 [38] and MedCalc 20.014 [39]. Data were checked for normal distribution. Intraclass correlation coefficients (ICC) [40] were used to determine the level of agreement between the first and second (repeated) PD responses. ICC was calculated using a two-way mixed model consistency type and single measure analysis [ICC (3,1)]. To help interpret reliability, ICC < 0.5 indicates poor correlation, 0.5–0.75 moderate, 0.75–0.9 good, and >0.9 indicates excellent correlation [41]. Box-Cox transformation was implemented in SPSS for variables with non-normal distribution to obtain a near-normal distribution for parametric tests. A two-way analysis of variance (ANOVA) was used to compare PD scores between groups with a musical background as a fixed factor. Effect sizes are reported as partial Eta squared (η<sub>p</sub><sup>2</sup>). Effect sizes of 0.01, 0.1, and 0.25 indicated small, medium, and large statistical effects, respectively [42].
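For readers unfamiliar with ICC (3,1), the sketch below computes it from the mean squares of a two-way layout (participants by repeated trials) and applies the interpretation bands quoted above. It is an illustration of the standard consistency-type, single-measure formula, not the SPSS procedure used in the study; the function names and the example scores are hypothetical.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed model, consistency type, single measure.

    ratings: list of rows, one per participant, each holding k repeated scores.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                 # between-participant mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

def interpret_icc(icc):
    """Map an ICC value onto the bands quoted in the text [41]."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"

# Hypothetical first and second trial scores for three participants.
example = icc_3_1([[1, 2], [3, 3], [5, 4]])
print(round(example, 3), interpret_icc(example))
```

Because the consistency type ignores systematic shifts between trials, a constant offset between the first and second presentation does not lower the coefficient.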

A Receiver Operating Characteristic (ROC) Curve Analysis was calculated to evaluate the value of PD testing in differentiating the voice-disordered groups from the control group. Where there were multiple tests, we used Sidak's adjustment to the observed *p* values to minimize Type I error. In all calculations, statistical significance testing was two-tailed, *p* < 0.05.
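Sidak's adjustment to observed *p* values can be sketched as below, using the standard formula 1 − (1 − *p*)<sup>m</sup> for m comparisons. The input values are hypothetical and the function name is illustrative; the sketch does not reproduce the software output of the study.

```python
def sidak_adjust(p_values):
    """Return Sidak-adjusted p values for a family of m comparisons."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]

# Hypothetical unadjusted p values for three pairwise comparisons.
adjusted = sidak_adjust([0.01, 0.02, 0.5])
print([round(p, 4) for p in adjusted])
```

Each adjusted value is at least as large as the unadjusted one, so a comparison remains significant only if it survives the correction for the whole family of tests.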

#### **3. Results**

#### *3.1. Reliability of PD Testing*

Table 3 shows reliability results for PD testing for all groups. There was good to excellent agreement in PD responses between the first and second trials within all groups.

**Table 3.** Reliability of PD testing. ICC: intraclass correlation coefficient; CI: confidence interval; MTVD: muscle tension voice disorder.


#### *3.2. Percentage of Accurate Responses*

#### 3.2.1. Voice-Disordered vs. Non-Voice-Disordered Groups

The percentage of correct responses for the PD test is shown in Figure 1. A two-way ANOVA was calculated to compare the correct scores between the voice-disordered group (*n* = 71) and the control group (*n* = 80). Musical background was included as a factor given previous findings of better PD in people with a musical background than those without [12,13]. There were significant effects of group (F(1, 147) = 9.97, *p* = 0.002, η<sub>p</sub><sup>2</sup> = 0.064) and musical background (F(1, 147) = 57.94, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.28), but there was no significant interaction (*p* = 0.31). The mean percentage of accurate responses was lower by 10.32% (95% CI = 3.86–16.78) in the voice-disordered group compared with the control group (*p* = 0.002).
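Partial eta squared can be recovered from a reported F statistic and its degrees of freedom, which gives a quick consistency check on the effect sizes above. The sketch uses the standard identity η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>); the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Values reported for the group and musical background effects.
group_effect = partial_eta_squared(9.97, 1, 147)
music_effect = partial_eta_squared(57.94, 1, 147)
print(round(group_effect, 3), round(music_effect, 2))
```

The same identity applies to the sub-group analyses, where the effect has two degrees of freedom.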

**Figure 1.** Percentage of PD accuracy in voice-disordered and control groups. Error bars indicate standard errors.

#### 3.2.2. Sub-Group Comparisons

Figure 2 shows the percentage of correct PD responses for sub-groups. Sub-group comparisons were calculated using a two-way ANOVA, comparing across three groups (control, MTVD, neurological) and the two backgrounds (musical, non-musical). Again, there was a significant effect of group (F(2, 145) = 7.632, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.095) and musical background (F(1, 145) = 52.130, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.264) but no significant interaction (*p* = 0.376).

**Figure 2.** Percentage of PD accuracy in sub-groups. Error bars indicate standard errors. MTVD: muscle tension voice disorder; NVD: neurological voice disorder.

Post-hoc tests using Sidak's adjustment to the *p* values showed that, compared with the control group, the mean percentage of accurate responses was significantly lower by 18.55% (95% CI = 6.77–30.33%) in the neurological group (*p* < 0.001), but not in the MTVD group (mean difference = 6.75%, 95% CI = −1.93–15.43%, *p* = 0.176). The two voice-disordered groups were not significantly different (mean difference = 11.8%, 95% CI = −0.81–24.41%, *p* = 0.074).

There was no statistical difference (t = 0.153, *p* = 0.879) in the percentage of correct responses (%) between the primary MTVD (*n* = 26, mean = 68.08, SD = 23.24) and secondary MTVD groups (*n* = 24, mean = 68.96, SD = 17.38).

#### *3.3. Pitch Discrimination Threshold*

#### 3.3.1. Voice-Disordered vs. Non-Voice-Disordered Groups

Figure 3 shows the PD threshold data for the voice-disordered (mean = 108.08 cents) and control groups (mean = 98.65 cents) by musical background. For statistical analysis, the mean PD threshold for each participant was Box-Cox transformed due to non-normal distribution. A two-way ANOVA, as reported for PD accuracy, showed significant effects of group (F(1, 147) = 16.704, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.102) and musical background (F(1, 147) = 17.212, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.105), but no interaction (*p* = 0.122). Overall, the PD threshold in voice-disordered patients was 9.43 cents higher than that in the control group.

**Figure 3.** Pitch discrimination threshold in voice-disordered group and control group. Error bars indicate standard errors.

#### 3.3.2. Sub-Group Comparisons

Figure 4 shows the data of the mean PD threshold by sub-groups. Descriptively, mean PD threshold (cents) was higher in each voice-disordered group (MTVD = 105.12; neurological voice disorder = 109.18) than in the control group (Mean = 98.65).

**Figure 4.** Mean pitch discrimination threshold of sub-groups. The lower the pitch threshold, the better the discrimination ability. Error bars indicate standard errors. MTVD: muscle tension voice disorder; NVD: neurological voice disorder.

A two-way ANOVA showed significant sub-group (F(2, 145) = 8.723, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.107) and musical background (F(1, 145) = 12.735, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.081) effects, but there was no interaction (*p* = 0.163). Post hoc comparisons showed that the PD threshold was significantly higher in both the MTVD group (by 8.66 cents, *p* = 0.003) and the neurological voice disorder group (by 11.38 cents, *p* = 0.004) than in the control group. The two voice-disordered groups did not differ (*p* = 0.848).

Pair-wise comparison across sub-groups of the same musical background showed that in the non-musical background group, the mean PD threshold was significantly higher for the MTVD group than for the control group (by 13.49 cents, *p* = 0.002), but there were no differences between the neurological voice disorder group and control group (*p* = 0.080). In the musical background group, the mean PD threshold in the neurological voice disorder group was 10.806 cents higher than that in the control group (*p* = 0.047) whilst this measure was not statistically different between MTVD and controls (*p* = 0.570).

The mean (SD) of the PD threshold (cents) of the primary MTVD and secondary MTVD groups was 105.66 (22.59) and 104.53 (9.54). An independent samples *t*-test showed no statistically significant difference in the PD threshold between primary and secondary MTVD groups (t = 0.235, *p* = 0.816).

Pearson's correlation coefficients calculated for the combined sample of voice-disordered and control groups (*n* = 151) showed significant correlations between the percentage of correct responses and the mean PD threshold (r = −0.695, *p* < 0.001), median pitch threshold (r = −0.483, *p* < 0.001), and minimal pitch threshold (r = −0.488, *p* < 0.001). These correlations imply that the accuracy of responses was associated with the size of the pitch intervals of tone pairs.

#### *3.4. Predictive Value of PD Testing in Differentiating Voice-Disordered from Control Groups*

An ROC curve (as shown in Figure 5) was analyzed to evaluate the predictive value of PD testing in differentiating the voice-disordered group from the control group. This measure significantly differentiated the two groups (area under the ROC curve, AUC = 0.630, 95% CI = 0.547–0.707, Z-statistic = 2.828, *p* = 0.005). With a Youden index (J) of 0.243 and the associated cut-off value >106.28 cents, this measure differentiated the two groups at a specificity of 86.25% and a sensitivity of 38.03%. At a cut-off of >85.02 cents, sensitivity was 98.59% but specificity was low (2.5%). The cut-off value of >97.45 cents had a balance of both sensitivity (64.79%) and specificity (53.75%).
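The Youden index combines sensitivity and specificity as J = sensitivity + specificity − 1, and the cut-off with the largest J is usually taken as optimal. The sketch below applies this to the three cut-offs reported above; the dictionary layout and names are illustrative, and the sketch does not reproduce the MedCalc analysis.

```python
def youden_j(sensitivity_pct, specificity_pct):
    """Youden index from sensitivity and specificity given in percent."""
    return sensitivity_pct / 100.0 + specificity_pct / 100.0 - 1.0

# Reported cut-offs (cents) mapped to (sensitivity %, specificity %).
reported = {
    106.28: (38.03, 86.25),
    97.45: (64.79, 53.75),
    85.02: (98.59, 2.5),
}

best_cutoff = max(reported, key=lambda cutoff: youden_j(*reported[cutoff]))
best_j = youden_j(*reported[best_cutoff])
print(best_cutoff, round(best_j, 3))
```

Among these three cut-offs, J is largest at >106.28 cents, although that threshold trades a high specificity for a low sensitivity.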

**Figure 5.** ROC curve for mean pitch discrimination threshold (cents).

#### **4. Discussion**

#### *4.1. Pitch Discrimination in Voice-Disordered Patients*

As predicted, pitch discrimination accuracy was significantly lower in the voice-disordered group than in the non-voice-disordered group. Based on effect size calculations, the size of this effect was medium. However, with respect to the units of measurement, the difference in threshold might be considered small (i.e., 9.43 cents). These differences in pitch discrimination support a previous study [6] showing that patients with behavioral dysphonia had worse pitch perception ability than non-dysphonic speakers. The patients with either MTVD or neurological voice disorder required a larger pitch threshold (above 100 cents) to correctly discriminate the pitch differences compared with the healthy speaker control group (again yielding a medium effect size). Patients with a neurological disorder needed a slightly larger threshold (109.18 cents) than those with a functional voice disorder (105.12 cents), although this difference was not statistically significant.

These results suggest an impairment in auditory discrimination in both functional (MTVD) and neurological voice-disordered individuals. These data are also congruent with work by Abur et al. [4], who showed that the auditory discrimination threshold was significantly larger in patients with hyperfunctional voice disorders (mean = 47 cents, SD = 32 cents) than in control participants (mean = 35 cents, SD = 20 cents). The differences in the auditory discrimination thresholds between our current data and their study likely stemmed from the study design and type of test stimuli. Here, we did not test the just-noticeable-difference (JND) in pitch, but rather used pure-tone pairs with predetermined pitch intervals. In summary, reliable group differences were noted in pitch perception between voice-disordered and non-voice-disordered samples, although the clinical relevance of the difference remains to be studied.

In the voice-disordered group, patients with a neurological voice disorder did not show statistically significantly poorer PD than those with MTVD. This suggests that voice disorder type and/or dysphonic severity may not be linked to auditory perception function. This finding appeared to agree with observations by Abur et al. [4], who found no relationship between the overall severity of dysphonia and auditory discrimination threshold. It is important to note that their study [4] only included patients with hyperfunctional voice disorders, which might have had a smaller range of vocal dysfunction than in our study.

The pitch interval of the pure tones used in the PD testing tool (NeAP) ranged between 29.98 and 200.01 cents. At the smallest pitch interval, the accurate responses for control, MTVD, and neurological group were 29 (19.2%), 18 (11.9%), and 9 (6.0%), respectively. This suggests that the voice-disordered group, particularly the neurological group, had more difficulties discriminating small pitch intervals than controls. We recommend that in future work the JND for different types of voice disorders with different severity should be investigated. This would help to further understand the impact of voice disorders on the minimum pitch difference that a patient can detect, and explore the relationship between voice perception and production.

It is believed that aberrant auditory discrimination plays a role in the pathogenesis of hyperfunctional voice disorder [4]. Current neural models of voice/speech production can be used to explain the poorer PD in those with a voice disorder. One possibility is that auditory dysfunction occurs first. The DIVA neural model of phonation [3] states that the control of voice production includes two components: feedforward control (motor components) and feedback control (auditory and somatosensory targets). When the auditory discrimination system is dysfunctional, the ability to detect the mismatch between the expected and real feedback would be decreased. Consequently, this would lead to suboptimal use of the laryngeal motor system in phonation due to the feedforward system failing to update the corrective motor plan provided by the auditory feedback system [4]. This explanation appears to be applicable to MTVD and is supported by previous findings on the mismatch between the auditory-motor control system in patients with hyperfunctional voice disorders [4,5].

In patients with a neurological voice disorder, the model of neural plasticity [43] might explain the poorer PD compared with the non-voice-disordered controls. In these patients, neural plasticity may explain the adjustment (increase) in the auditory response threshold to allow for the variability in motor response. Neuroplastic models are well established in explaining voice and laryngeal syndromes that involve a sensory pathway dysfunction, such as the irritable larynx syndrome [44] or the laryngeal hypersensitivity syndromes [45]. A similar neuroplastic process may exist in those with a neurological voice disorder. Increasing the auditory discrimination threshold would benefit the auditory-motor control system in that auditory feedback would be less sensitive to feedback errors and the feedforward system would be less likely to provide motor commands that exceed the capability of the neurologically impaired laryngeal motor system. This model provides an explanation for shifting internal PD thresholds, or other auditory discrimination/perception thresholds, to adapt to a worsening voice quality. Over time, if laryngeal coordination is worsened, further feedback would be added to the system, exacerbating the threshold sensitivity. Eventually, there may be more adaptive adjustments in the auditory-motor control system, leading to compensatory/suboptimal laryngeal muscle use, or compensatory hyperfunction.

Musical background was factored in the between-subjects analysis due to its known impact on pitch perception. Despite overall differences between individuals in terms of musical background improving pitch discrimination, supporting previous research [12], there were no interactions involving the voice group. Group effects related to musical training background were descriptively similar to those due to voice for pitch accuracy, but for pitch threshold, musical background appeared to have larger and more reliable impacts on pitch perception than voice disorder.

The non-significant interaction effect between groups and musical background in this study was surprising given previous research indicating that both musicians and singers have a greater ability to compensate for pitch disturbances [46,47]. The above-mentioned mechanisms explaining the reasons for poor pitch discrimination in voice-disordered individuals might bypass or over-ride the well-established reflexes or processes formed in those with musical and/or singing training. This non-interaction between voice groups and musical background also implied that training does allow individuals, regardless of pathology, to improve pitch discrimination.

#### *4.2. Predictive Value of PD Testing*

Results of the ROC curve analyses showed that the PD threshold had a predictive ability to discriminate between voice-disordered and control groups. This suggests that it is possible to use PD testing as a method to differentiate a voice-disordered group from non-disordered speakers. It is necessary to develop/revise the PD testing tool to include a wider range of pitch intervals/differences and test its sensitivity and specificity in different levels of dysphonic severity and different voice disorder types. This development will allow validation of its applicability in clinical settings. In the present study, the sensitivity and specificity of this measure were relatively low if a balance between them is used in determining a cut-off value.

Previous research showed that people without musical training required thresholds between 1 and 3 semitones (100–300 cents) to be able to discriminate pitch intervals [48]. In the present study we found that a cut-off of >97.45 cents had a reasonable balance between sensitivity and specificity of testing. However, the relatively low sensitivity and specificity probably resulted from heterogeneity within the voice-disordered groups (i.e., including both functional and neurological voice disorders). We did not perform the ROC analyses separately for the MTVD and neurological voice disorder groups or for the two musical backgrounds due to the small sample size of each subgroup. It may also be the case that the current NeAP protocol was not associated with optimal prediction ability given the number of tone pairs used (20) and the range of PD threshold. Smaller thresholds would probably be more likely to differentiate the groups with better sensitivity and/or specificity.

This study had several limitations that should be addressed in future research to help with internal and external validity. Firstly, this was a cross-sectional observational study and not a prospective cohort study. Consequently, this design did not allow the determination of PD of the dysphonic speaker prior to having a voice disorder. Therefore, we cannot state that PD deteriorated in these patients when they acquired a voice disorder. A second issue related to validity was that the control group comprised only females in a younger age range than the dysphonic group. As auditory perception may vary as a function of age, better matched comparison groups will be needed to determine the size and reliability of any effects due to voice pathology. Lastly, despite its utility and functionality, there is a lack of literature exploring the sensitivity and specificity of the NeAP testing tool in differentiating those with and without voice disorders according to their PD abilities. Further studies are needed to validate this tool for clinical application.

#### **5. Conclusions**

Here we showed that patients with a voice disorder had poorer PD than non-voice-disordered controls. Patients with MTVD and neurological voice disorders had a lower percentage of accurate PD responses and required larger pitch discrimination thresholds to correctly identify pitch differences between tone pairs. These findings provided more evidence for a possible dysfunction or dysregulation of both auditory discrimination pathways and laryngeal motor control in these voice-disordered groups. The mechanisms for poorer PD might be different between functional/MTVD voice disorders and neurological voice disorders given the differences in the pathogenesis of each disorder type.

PD testing significantly differentiated voice-disordered patients (MTVD and neurological voice disorders) from non-disordered speakers. This finding is important as PD testing can serve as not only a diagnostic tool but also a follow-up tool during the treatment process. Moreover, the fact that musical background significantly distinguished PD ability irrespective of voice disorder suggests that problems in perception can be overcome with training. These data highlight the need to evaluate both auditory discrimination function and voice quality across the diagnosis, treatment, and follow-up stages for voice disorders. Further work is needed to clarify whether PD changes reflect treatment outcome.

**Author Contributions:** Conceptualization, D.D.N. and C.M.; methodology, C.M. and D.D.N.; formal analysis, D.D.N.; investigation, C.M. and D.D.N.; data curation, A.M.C. and D.D.N.; writing—original draft preparation, D.D.N. and C.M.; writing—review and editing, C.M., D.D.N., A.M.C., N.J.H., P.N.C. and D.N.; project administration, C.M.; visualization, D.N.; funding acquisition, C.M. and D.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by The Dr Liang Voice Program at The University of Sydney.

**Institutional Review Board Statement:** This project was approved by Human Research Ethics Committee of the University of Sydney (protocol number: 2020/027). All participants read a participant information sheet and signed a consent form prior to participating in the study.

**Informed Consent Statement:** All participants signed a written consent form prior to taking part in the present study.

**Data Availability Statement:** Data supporting reported results is retained by The University of Sydney in a de-identified form and is confidential under the conditions of the Human Research Ethics Committee of The University of Sydney approval.

**Acknowledgments:** We would like to thank the participants for participating in the study.

**Conflicts of Interest:** A.M.C., C.M., D.D.N. and D.N. are employees of The University of Sydney and are part or fully funded by the Dr Liang Voice Program, a philanthropically funded program of research and post-graduate education in laryngology. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**

**Table A1.** Frequency of tone pairs of the NeAP.


#### **References**


## *Article* **Supraglottic Botulinum Toxin Improves Symptoms in Patients with Laryngeal Sensory Dysfunction Manifesting as Abnormal Throat Sensation and/or Chronic Refractory Cough**

**Daniel Novakovic 1,2,3,\*, Meet Sheth 1,4, Thomas Stewart 1,3, Katrina Sandham 3, Catherine Madill 1, Antonia Chacon <sup>1</sup> and Duy Duong Nguyen 1,5**


**Citation:** Novakovic, D.; Sheth, M.; Stewart, T.; Sandham, K.; Madill, C.; Chacon, A.; Nguyen, D.D. Supraglottic Botulinum Toxin Improves Symptoms in Patients with Laryngeal Sensory Dysfunction Manifesting as Abnormal Throat Sensation and/or Chronic Refractory Cough. *J. Clin. Med.* **2021**, *10*, 5486. https://doi.org/10.3390/jcm10235486

Academic Editor: Renée Speyer

Received: 2 November 2021 Accepted: 19 November 2021 Published: 23 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Abstract:** Laryngeal sensory dysfunction (LSD) encompasses disorders of the vagal sensory pathways. Common manifestations include chronic refractory cough (CRC) and abnormal throat sensation (ATS). This study examined clinical characteristics and treatment outcomes of LSD using a novel approach of laryngeal supraglottic Onabotulinum toxin Type A injection (BTX). This was a retrospective review of clinical data and treatment outcomes of supraglottic BTX in patients with LSD. Between November 2019 and May 2021, 14 patients underwent 25 injection cycles of supraglottic BTX for treatment of symptoms related to LSD, including ATS and CRC. Primary outcome measures included the Newcastle Laryngeal Hypersensitivity Questionnaire (LHQ), Cough Severity Index (CSI), Reflux Symptom Index (RSI), and Voice Handicap Index-10 (VHI-10) at baseline and within three months of treatment. Pre- and post-treatment data were compared using a linear mixed model. After supraglottic BTX, LHQ scores improved by 2.6. RSI and CSI improved by 8.0 and 5.0, respectively. VHI-10 did not change as a result of treatment. Short-term response to SLN block was significantly associated with longer term response to BTX treatment. These findings suggest that LSD presents clinically as ATS and CRC along with other upper airway symptoms. Supraglottic BTX injection is a safe and effective technique in the treatment of symptoms of LSD.

**Keywords:** laryngeal sensory dysfunction; chronic refractory cough; botulinum toxin; larynx; laryngeal hypersensitivity; cough hypersensitivity syndrome; globus pharyngeus; laryngopharyngeal reflux; neuropathic cough; throat irritation

#### **1. Introduction**

The larynx is innervated by branches of the vagus nerve with complex coordination of afferent (sensory) and efferent (motor) pathways in the brainstem required for optimal physiological functioning [1,2]. Neurological dysfunction can occur secondary to central or peripheral pathology affecting the vagal pathways. Depending upon the level and nature of injury, vagal dysfunction can have either or both sensory and motor effects manifesting within and outside the larynx. Motor manifestations of vagal dysfunction involving the larynx can be broadly classified into hypofunctional (e.g., vocal fold paralysis or paresis) or hyperfunctional (e.g., inducible laryngeal obstruction) with laryngeal movement disorders affecting higher centers. Sensory manifestations of vagal dysfunction are less well understood but can present independently or in conjunction with apparent motor effects.

Laryngeal sensory dysfunction (LSD) represents disorders of laryngeal afferent sensory pathways presenting with abnormal laryngeal sensation. Several phenotypes related to hyperfunctional vagal sensation have been described manifesting in the larynx sharing similar features [3]. These include chronic refractory cough (CRC) [4], various forms of inducible laryngeal obstruction including recurrent laryngospasm, paradoxical vocal fold movement [5,6] and irritable larynx syndrome [7], globus pharyngeus [3] and laryngeal sensory neuropathy [8] with various proposed etiologies. We prefer to use the umbrella term laryngeal sensory dysfunction [3] which recognizes the role of abnormal laryngeal afferent sensory pathways in these conditions which may be affected at one or more levels (peripheral receptors, afferent vagal fibers, central pathways), and which present with abnormal/altered laryngeal sensation, without attribution to a specific underlying pathological process or cause. Accurate evaluation of laryngeal dysfunction and hypersensitivity would allow for accurate diagnosis and effective treatment [9].

Laryngeal hypersensitivity has been best described in the context of CRC [10], which is defined as a cough persisting beyond 8 weeks despite guideline-based treatment. Other terms for CRC include neurogenic cough, idiopathic cough, psychogenic cough, habitual cough and (laryngeal) cough hypersensitivity syndrome [11]. Increased sensitivity of the afferent limb of the cough reflex has been demonstrated in CRC [4], with those affected exhibiting a lower cough threshold in the capsaicin challenge test [12,13]. Furthermore, Vertigan and Gibson observed abnormal laryngeal sensation (laryngeal paresthesia) in 94% of patients with CRC [4], consistent with a sensory neuropathic disorder.

The concept of sensory neuropathy causing laryngeal symptoms was first proposed by Morrison et al. using the term irritable larynx syndrome [7]. They described laryngospasm, dysphonia, globus pharyngeus, pain and/or chronic cough as potential symptoms arising from a hyperexcitable state of the laryngeal neuronal sensory network. Laryngeal sensory neuropathy has also been described in the context of hypofunctional laryngeal sensation associated with a high risk of dysphagia in head and neck cancer patients [14]. Neuropathy represents a disturbance of function or pathological change in one or more nerves which can change the normal sensitivity or thresholds of afferent nerves causing neuropathic symptoms. Neuropathic pain is characterized by the clinical features of paresthesia, hyperalgesia and allodynia [15] with equivalent laryngeal features manifesting as abnormal throat sensation, hypertussia and allotussia (Table 1) [4].


**Table 1.** Equivalent laryngeal features of neuropathic pain.

Laryngeal sensory receptors project centrally towards the nucleus tractus solitarius, primarily via the internal branch of the superior laryngeal nerve (iSLN), where activation can lead to a variety of reflexive responses including cough, swallow and laryngospasm [16]. The laryngeal adductor reflex (LAR) is one such robust and well-studied response which causes bilateral involuntary airway protective closure in response to supraglottic stimuli [17]. Topographic mapping of sensory receptors related to the LAR has recently been achieved. The highest density of LAR sensory receptors and afferent nerve fibers is found in the posterior supraglottis, followed by the false vocal folds and epiglottic tip, with no LAR activation on stimulation of the membranous vocal folds [18].

Peripheral and central sensitization are features of neuropathy. Peripheral sensitization describes both nociceptive and non-nociceptive sensory afferents becoming sensitized [15,19], with a lowered threshold for signaling, and/or an increase in the magnitude of responsiveness at the peripheral ends of sensory nerve fibers. A wide range of signaling molecules are involved in mediating peripheral sensitization including such neuropeptides as calcitonin gene-related peptide (CGRP) and substance P (SP). Laryngeal sensory dysfunction may occur at the periphery when laryngeal sensory receptors and nociceptive fibers become dysfunctional and undergo peripheral sensitization. The prolonged process of peripheral sensitization can lead to sensitization of the central sensory pathways, where potentiation by neurotransmitter signaling results in a net increase in neuronal spinal output [15,19].

#### *1.1. Etiology of LSD/Mechanisms of Injury*

The etiologies of LSD are yet to be fully elucidated, although numerous causes have been proposed in the literature. Morrison et al. [7] suggested viral infection, emotional distress, chronic reflux and habitual muscle misuse as potential contributing factors amongst other more common organic causes of nerve injuries [7].

Amin and Koufman [8] reported cases with laryngeal electromyographic evidence of lesions to both superior and recurrent laryngeal nerves. They maintained that damage to the vagal nerves was linked to a preceding viral upper respiratory tract infection as a one-off phenomenon rather than an ongoing/progressive degeneration or regeneration process [8]. Rees, Henderson and Belafsky [20] proposed Post-Viral Vagal Neuropathy as a clinical entity resulting from upper respiratory tract infection presenting with chronic cough, excessive throat clearing, dysphonia, and vocal fatigue with laryngoscopic signs of laryngeal motor weakness.

Honey et al. proposed neurovascular compression of the vagus nerve rootlets identified on magnetic resonance imaging [21] as a potential cause of vagal dysfunction presenting in the larynx with sensory symptoms of abnormal throat sensations [22] associated with motor symptoms of laryngospasm/choking, neurogenic cough or intermittent stridor [23].

Altman et al. suggested various factors (including topical airway infectious agents, inflammatory cytokines, viscosity of the airway mucus, gene regulation producing altered mucus in disease, the temperature and pH of the airway surface) may act synchronously to sensitize the larynx [24]. They activate and upregulate multiple upper airway receptors, including TRPV1 (transient receptor potential vanilloid 1, stimulated by acids, protons, and capsaicin). There is evidence that sensitization of the TRPV1 channel underlies hypersensitivity in neuropathic pain [25].

#### *1.2. Assessment and Diagnosis of LSD*

To date, no diagnostic criteria have been established for LSD. Consequently, the assessment of abnormal laryngeal sensation is based largely on patient history, clinical evaluation, appropriate questionnaires/patient-reported outcome measures (PROMs) and laryngeal investigations [9], along with limited response to treatment of other conditions which can present with similar symptomatology.

Several PROMs can be used for assessment of LSD (see methods). These questionnaires provide easily obtainable subjective baseline data which can then be used to monitor patient progress and treatment outcomes [26].

#### *1.3. Superior Laryngeal Nerve Block*

Local anesthetics are used extensively during endotracheal intubation and other procedures requiring upper airway manipulation to suppress normal physiological responses including cough and laryngospasm. Topical lidocaine (lignocaine) applied to the larynx has been shown to suppress laryngeal reflexes activated by mechanoreceptor and chemoreceptor stimulation [27]. Superior laryngeal nerve block is another way to suppress these reflexes whereby the supraglottic larynx can be anesthetized in an awake patient by delivering local anesthesia around the internal branch of the superior laryngeal nerve at the thyro-hyoid membrane as it enters the larynx [28]. Lidocaine blockade of the SLN has been shown to temporarily relieve symptoms of laryngospasm due to known SLN injuries [29]. This opens the potential therapeutic pathway of modulating laryngeal sensation to treat conditions such as chronic refractory cough where LSD is a contributing factor. The temporary duration of this proposed modality and its ease of administration make it an excellent initial test to potentially predict response to treatments which can modulate sensation in the distribution of the SLN.

#### *1.4. Treatment of LSD*

Treatment of potential coexisting medical conditions that can present with similar symptoms is crucial in the management of LSD. A limited response will help support the diagnosis, but it is also important to control pathologies which can alter laryngeal sensitivity (including LPR, OSAS and chronic inflammation). Furthermore, any pathological process which can stimulate or irritate the laryngeal mucosa can act as a trigger of hypersensitized laryngeal sensory pathways and reflexes and reducing this sensory input can help with control of symptoms.

Centrally acting neuromodulators including amitriptyline, gabapentin, pregabalin and tramadol have some effectiveness in reducing symptoms linked to vagal neuropathy and have acceptance in the treatment of CRC [30,31]. There is evidence that gabapentin, which is effective mostly in pain due to nerve damage in postherpetic neuralgia and peripheral diabetic neuropathy [32], is also effective in treatment of odynophonia [8], neck pain [8], chronic cough and laryngospasm due to suspected sensory neuropathy of the SLN [33].

Behavioral treatment provided by a speech language pathologist (SLP) or physiotherapist has been found effective in management of CRC by reducing cough frequency [34] and cough reflex sensitivity [35]. Treatment typically includes some or all of the four elements described in the John Hunter Hospital Chronic Refractory Cough (JHCRC) Program: patient education regarding nature of cough, exercises to improve voluntary control over cough and/or suppression of the cough, reduction of behaviors that cause laryngeal irritation and psycho-educational counselling [36]. Improving voluntary control over one's cough and reducing the sources of irritation that trigger coughing are complementary approaches that are of equal importance in alleviating this behavior [37]. The treating clinician must emphasize the commitment required for behavioral change to occur and provide additional supports as necessary to facilitate the patient's independent management and control over their presenting symptoms.

#### *1.5. Botulinum Toxin in the Larynx, and Its Potential Role as a Sensory Neural Modulator*

Onabotulinum toxin Type A (BTX) is a proteolytic enzyme that cleaves neuronal SNARE proteins which play a crucial role in the mediation of neurotransmitter release. The primary studied effect of BTX is in motor nerves, where neuromuscular conduction is inhibited by the toxin, resulting in a localized but reversible chemical denervation of the associated muscle fibers.

The putative mechanism by which BTX may modulate laryngeal sensation can be best understood in the context of chronic refractory cough (CRC) and its correlation with neuropathic pain [30]. The therapeutic effects of BTX in CRC are thought to be due to its effects on sensory transmission and peripheral sensitization. Transient receptor potential (TRP) channels are a group of ion channels present on the plasma membrane of multiple mammalian cell types. In airway physiology, they play an important protective role in pathways inducing inflammation, mucus secretion, airway constriction, and reflexes such as cough and sneezing [38]. The reduced cough threshold in CRC is associated with increased expression of TRPV1 receptors on airway nerves [39,40]. Changes in these and associated channels, along with the development of sensitization, are understood to be the mechanism by which a chronic cough develops into a hypersensitivity syndrome [41].

In addition to motor effects, BTX also inhibits neurotransmitter release in sensory neurons, likely through the reduction in expression of neuropeptide transmitters, such as SP and CGRP. TRPA1 and TRPV1 [42] are associated with CGRP-dependent pathways. Administration of BTX has been demonstrated to disrupt the transfer of TRP receptors to synaptic membranes [43,44]. Studies have previously demonstrated that BTX reduces pain and neurogenic inflammation caused by capsaicin, which is an agonist of TRPV1 receptors [45]. As such, the sensory mechanism of BTX is at least partially via its effect on TRPV1 expression, with this modulation likely also interrupting the process of peripheral sensitization [46]. BTX is also thought to affect central sensitization; however, this remains controversial [46,47]. The interruption of these sensitivity pathways by peripheral administration of BTX is a potential way to modulate the symptoms experienced under the umbrella term laryngeal sensory dysfunction.

BTX was first used in the larynx by Blitzer in 1984 as a treatment for adductor spasmodic dysphonia [48] (a focal laryngeal dystonia). It has since become the gold standard for this condition. Injections are usually targeted to the involved intrinsic laryngeal adductor muscles to weaken them and prevent inappropriate contractions causing disruption of normal speech.

Several studies have reported the use of BTX targeted to the laryngeal adductor musculature for the treatment of chronic refractory cough [49–52]. Delivery of BTX into the supraglottic region is a more recent concept and was initially described by Young and Blitzer in 2007 as an adjunct treatment for patients with adductor spasmodic dysphonia who exhibited sphincteric closure of the supraglottic larynx during phonation [53]. In 2016, Simpson reported supraglottic BTX as an alternative primary treatment for adductor spasmodic dysphonia [54], showing improved voice outcomes with a favorable side effect profile compared with glottic BTX. To date, no study has examined the sensory effects of laryngeal BTX when delivered into the supraglottis rather than into the intrinsic laryngeal musculature.

#### *1.6. Current Study Aims*

The present study investigated a novel treatment of supraglottic BTX for LSD. The aims of the study were to: (1) describe the clinical characteristics of LSD in a cohort of patients referred for CRC and abnormal throat sensation (ATS); (2) describe a new treatment of supraglottic laryngeal botulinum toxin in the symptomatic management of laryngeal sensory dysfunction; (3) evaluate the efficacy of using botulinum toxin A in treatment of a pilot group of patients presenting with different phenotypes associated with laryngeal sensory dysfunction including CRC and ATS. We hypothesized that CRC and ATS can be manifestations of LSD and that treatment aimed at LSD would have therapeutic effects quantifiable using patient reported outcome measures of cough and throat sensation.

#### **2. Materials and Methods**

#### *2.1. Study Design*

This was a retrospective data review of an existing private specialized laryngology clinic database. The study was approved by the Human Research Ethics Committee of The University of Sydney (protocol number 2021/025).

#### *2.2. Participants*

A database search was implemented to identify all patients who underwent supraglottic BTX injections as part of treatment for clinical presentations associated with LSD.

Inclusion criteria were: (1) a history of sensory laryngeal symptoms (manifesting as CRC or ATS) for greater than 12 consecutive weeks despite assessment and treatment of potential/coexisting lower respiratory, sinonasal and laryngopharyngeal reflux pathology; (2) a Newcastle Laryngeal Hypersensitivity Questionnaire (LHQ) score of 17.1 or below [55].

Most patients had previously been offered neuromodulator medication and had either ceased this treatment due to poor response or negative side effects or remained on neuromodulators with partial symptom control whilst undergoing a trial of salvage laryngeal botulinum toxin treatment. All patients had been referred to a speech pathologist for behavioral treatment of their symptoms. Thirteen of the fourteen had seen a speech pathologist prior to BTX treatment. Speech pathology data was unavailable for one patient.

Fourteen patients were identified during the study period who underwent supraglottic BTX treatment for LSD, including six females and eight males. Mean age of patients was 54.9 years (standard deviation, SD = 12.5, range = 32–76).

Figure 1 shows a diagram of the study protocol. Table 2 presents information regarding demographics, onset, respiratory pathology, and neural modulator treatment for all patients.

**Figure 1.** Flowchart of study protocols.

**Table 2.** Characteristics of the treatment cohort. NM, neuromodulator; SLN, superior laryngeal nerve; Gaba, gabapentin; PR, partial response; Ami, amitriptyline; NR, no response; URTI, upper respiratory tract infection; NS, nonsmoker, FS, former smoker.




#### *2.3. Intervention: Supraglottic BTX Injection*

Patients presenting with LSD who had persistent symptoms despite medical and behavioral (speech pathology) management underwent trial superior laryngeal nerve (SLN) block in the clinic. Immediate response to SLN block was measured using a 10-point Likert scale questionnaire based upon the patient's specific presenting symptoms which was developed using the Newcastle Laryngeal Hypersensitivity Questionnaire (LHQ) [55]. Immediate response was measured 20 min after SLN block and an improvement of their primary symptom by three or more points compared with baseline was considered a positive response. In the case of no response at 20 min, contralateral SLN block was offered, and response was assessed after a further 20 min. Patients who had symptomatic but short-term (<2 weeks) improvement after SLN block were offered subsequent botulinum toxin Type A (Botox™, Allergan, Irvine, CA, USA). Some patients who did not respond to SLN block elected to undergo a trial of supraglottic BTX treatment as salvage therapy after failed medical management including a trial of neuromodulator therapy.
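The response criterion described above can be expressed as a small decision rule. The following is a minimal sketch only: the function name, the `higher_is_worse` flag and the default direction of the scale are illustrative assumptions, since the text specifies only that an improvement of three or more points in the primary symptom at 20 min counts as a positive response.

```python
def is_positive_sln_response(baseline: int, post_block: int,
                             higher_is_worse: bool = True,
                             threshold: int = 3) -> bool:
    """Return True if the primary symptom improved by `threshold` points or more.

    `higher_is_worse` is an assumption about the direction of the 10-point
    scale; the source text does not state it.
    """
    if higher_is_worse:
        improvement = baseline - post_block   # a lower score means improvement
    else:
        improvement = post_block - baseline   # a higher score means improvement
    return improvement >= threshold


# Hypothetical ratings taken before and 20 min after the block
print(is_positive_sln_response(8, 5))  # three-point improvement
print(is_positive_sln_response(8, 6))  # two-point improvement
```

Under this rule, a patient with no response on the first side would simply be re-rated 20 min after the contralateral block using the same comparison against baseline.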

BTX was usually given in an office-based outpatient setting. (In one patient with extreme hypersensitivity to flexible laryngoscopy, the BTX injection was given trans-orally during microlaryngoscopy under general anesthetic.) Patients were seated semi-reclined with the head extended. Decongestant with local anesthesia was administered topically to the nasal cavity (5% lidocaine + phenylephrine) prior to the procedure. Bilateral SLN blocks were performed using 2% lidocaine, 0.5 cc on each side for the purpose of anesthesia during the procedure. BTX injection was performed using a 1 cc syringe coupled to a 23 or 25 G needle which was introduced into the larynx via a trans thyro-hyoid approach with the needle directed inferiorly, posteriorly and slightly laterally toward the targeted supraglottic region of the false vocal fold and posterosuperior larynx, where sensory receptor density is thought to be highest [18]. Flexible transnasal videolaryngoscopy was used to help guide the injection into the desired region and confirm placement. The injectate was delivered whilst keeping the needle in a submucosal plane without breaching the laryngeal airway and correct placement was confirmed via the presence of a visible bleb at the injection site (Figure 2). The BTX concentration was kept constant at 2.5 U per 0.1 cc of injectate with dosage adjusted by varying volume of injectate.

**Figure 2.** Endoscopic image of larynx before (**left**) and immediately after (**right**) supraglottic BTX injection showing visible submucosal bleb at injection site.
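Because the concentration was fixed at 2.5 U per 0.1 cc, the injected volume follows directly from the intended dose. The sketch below illustrates this arithmetic only; the function and constant names are hypothetical and it is not a dosing tool.

```python
import math

UNITS_PER_CC = 25.0  # 2.5 U per 0.1 cc, the fixed concentration stated in the text


def injectate_volume_cc(dose_units: float) -> float:
    """Volume of injectate (cc) needed to deliver `dose_units` of BTX."""
    if dose_units <= 0:
        raise ValueError("dose must be positive")
    return dose_units / UNITS_PER_CC


# A 7.5 U dose corresponds to 0.3 cc at this concentration
assert math.isclose(injectate_volume_cc(7.5), 0.3)
```

At the same concentration, a dose of 7.74 U would correspond to roughly 0.31 cc of injectate.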

Nineteen treatments were given unilaterally and six bilaterally (2 synchronous, 4 staged). Mean dose for each supraglottic injection was 7.74 U (SD = 1.75 U). Mean time of post-treatment assessment was 7.1 weeks (SD = 3.2 weeks). The decision on which side to treat with BTX and whether to treat unilaterally or bilaterally was made based on a combination of the following factors: (i) the patient's self-perceived unilaterality of symptoms, (ii) laryngoscopic findings of motor asymmetry, particularly that of vertical height mismatch, and (iii) immediate response to SLN block on that side.

#### *2.4. Data Extraction*

One otolaryngologist and one registered nurse who were blind to the aims of the study performed data extraction from clinical records. The data described in the following subsection were collected during this review.

#### 2.4.1. Demographic Characteristics and History

Demographic characteristics (age, gender). Smoking history. Symptom duration and potential preceding factors. Past investigation/treatment of significant co-morbidities including gastro-esophageal or laryngopharyngeal reflux, lower respiratory tract pathology, sinonasal conditions and obstructive sleep apnea. Current/past medications including ACE inhibitors and neuromodulators.

#### 2.4.2. Videostrobolaryngoscopy Findings at Baseline

Videostrobolaryngoscopy is the gold standard clinical assessment for evaluating laryngeal structure and dynamic function [56]. All patients underwent neurolaryngological examination via transnasal videostrobolaryngoscopy at baseline using a standardized clinical voice assessment protocol designed to identify potential features of laryngeal motor dysfunction [57]. Findings of vocal fold motion anomalies, glottic insufficiency and mucosal wave anomalies are the most reliable signs for the diagnosis of vocal fold paresis [56], a laryngeal motor impairment which may coexist with sensory dysfunction in some LSD patients where both efferent and afferent functions of the laryngeal nerve/s are affected.

All strobolaryngoscopy exams were extracted and blindly rated by two otolaryngologists using a tool developed in Bridge2practice, an online education and research platform developed for health and medical learning and practice of allied health professionals and students [58]. The following parameters were assessed: (1) vocal fold movement; (2) mucosal wave; (3) laryngeal muscle tension patterns.

Videos of eight strobolaryngoscopy exams were repeated, randomized and re-rated to evaluate intra-rater reliability. Ratings from the two blinded assessors were compared to calculate inter-rater reliability for stroboscopic parameters that are subject to low reliability of ratings such as vertical vocal fold plane and phase symmetry [59]. Table 3 shows excellent intra-rater reliability and Table 4 shows good inter-rater reliability for key parameters.


**Table 3.** Intra-rater reliability (exact agreement in second rating/total repeated videos).

**Table 4.** Inter-rater reliability of strobolaryngoscopy ratings.
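The exact-agreement measure named in the caption of Table 3 (matching second ratings divided by total repeated videos) can be computed as in the sketch below. The function name and the example ratings are hypothetical and are not taken from the study data.

```python
from typing import Sequence


def exact_agreement(first: Sequence[str], second: Sequence[str]) -> float:
    """Proportion of items given the identical rating on both occasions."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


# Hypothetical ratings of one parameter across eight repeated videos
first_pass = ["normal", "reduced", "normal", "normal",
              "reduced", "normal", "normal", "reduced"]
second_pass = ["normal", "reduced", "normal", "reduced",
               "reduced", "normal", "normal", "reduced"]
print(exact_agreement(first_pass, second_pass))  # 7 of 8 ratings agree
```

The same function applies to inter-rater comparison if the two lists hold the ratings of the two assessors for the same videos; it does not correct for chance agreement.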


#### 2.4.3. Outcome Measures

Several patient-reported outcome measures (PROMs) were used to evaluate laryngeal symptoms and were administered to all patients prior to BTX treatment and within 3 months of treatment. Where bilateral staged treatment was given, outcomes were measured after the second treatment.

#### (a) Newcastle Laryngeal Hypersensitivity Questionnaire (LHQ) [55].

The Newcastle Laryngeal Hypersensitivity Questionnaire (LHQ) scores 14 items across three specific domains: obstruction, pain/thermal and irritation, providing a robust measure of laryngeal sensory disturbance. This tool has proved useful in discriminating patients with laryngeal hypersensitivity from healthy people and in measuring changes in symptoms of laryngeal hypersensitivity following speech pathology treatment [55]. A normal score is considered to be 17.1 or above [55]. The minimal clinically important difference for this questionnaire is 1.7 [55].

#### (b) Cough Severity Index (CSI)

CRC is the condition with which LSD has most often been associated. The CSI [60] is a validated PROM commonly utilized in evaluating patients with CRC arising from the upper airway and has been shown to be sensitive in detecting treatment outcome [61,62]. A score of 3 or more is considered abnormal [60].

#### (c) Reflux Symptom Index (RSI)

The Reflux Symptom Index (RSI) is a validated PROM initially developed to measure symptom severity for laryngopharyngeal reflux (LPR) [63]. An RSI score >13 is considered abnormal [63]. Although not specific for LPR [64] it serves as a useful and commonly used marker of throat irritation with which it has been correlated [65] and a marker of symptomatic response to treatment [66].

#### (d) Voice Handicap Index 10 (VHI-10)

The Voice Handicap Index 10 is a validated PROM to assess patients' perception of their voice function [67]. This tool was used in the present study given that patients with LSD and CRC frequently present with voice problems, e.g., muscle tension dysphonia [3]. It also allowed assessment of the frequency and severity of potential voice change which is a recognized potential side effect of laryngeal BTX treatment [68]. A score of greater than 11 is considered abnormal [69] with 6 considered as the minimal important difference [70].
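The cut-off values and minimal important differences quoted for the four PROMs can be gathered into a simple screening helper. This is an illustrative sketch: the function and dictionary names are hypothetical, and the LHQ is treated as abnormal below 17.1 on the basis of the statement that a normal score is 17.1 or above.

```python
def abnormal_flags(lhq: float, csi: float, rsi: float, vhi10: float) -> dict:
    """Flag each PROM score as abnormal using the cut-offs quoted in the text."""
    return {
        "LHQ": lhq < 17.1,     # normal is 17.1 or above (higher is better)
        "CSI": csi >= 3,       # 3 or more is abnormal
        "RSI": rsi > 13,       # greater than 13 is abnormal
        "VHI-10": vhi10 > 11,  # greater than 11 is abnormal
    }


# Minimal important differences stated in the text (LHQ and VHI-10 only)
MID = {"LHQ": 1.7, "VHI-10": 6.0}


def meets_mid(instrument: str, change: float) -> bool:
    """True if the absolute change reaches the minimal important difference."""
    return abs(change) >= MID[instrument]


# Hypothetical baseline scores for one patient
print(abnormal_flags(lhq=12.4, csi=20, rsi=25, vhi10=8))
```

For example, `meets_mid("LHQ", 2.6)` evaluates to `True`, whereas `meets_mid("VHI-10", 0.7)` evaluates to `False`.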

#### *2.5. Statistical Analyses*

Data were managed in Microsoft Excel 365 [71] and analyzed using IBM SPSS Statistics v.24.0 [72] and Prism v8.1.2 [73] for Windows. Descriptive statistics were used to describe the cohort's characteristics. Prior to analyses, normal distribution of the data was examined using Kolmogorov–Smirnov tests [74]. For continuous variables, mean, standard deviation (SD) and 95% confidence interval (normal distribution) or median and quartiles (nonnormal distribution) were used. For categorical data, frequencies and percentages were used. Changes in outcome measures over the treatment period were analyzed using a linear mixed model with patients as random effects and time points (i.e., baseline and post-BTX injection) and gender as fixed effects. Interaction between 'time' (treatment) and the fixed factors was also calculated to determine the impact of included factors on treatment outcome. Association between categorical variables was examined using Chi-square test (χ2). A significance level of two-tailed p of 0.05 was used. Where there were multiple calculations, Sidak-adjustment was applied to the *p* value. Effect sizes were calculated using Cohen's d (small = 0.2; medium = 0.5; large = 0.8) [75].
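Two of the quantities named above, Cohen's d and the Sidak adjustment, are simple enough to sketch with the Python standard library. The text does not state which variant of Cohen's d was used, so the pooled-standard-deviation form below is an assumption, and the scores in the example are hypothetical. The analyses themselves were run in SPSS and Prism, not with this code.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def cohens_d(pre: Sequence[float], post: Sequence[float]) -> float:
    """Cohen's d using the pooled SD of the two time points (assumed variant)."""
    pooled_sd = sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    return (mean(post) - mean(pre)) / pooled_sd


def sidak_adjust(p: float, n_comparisons: int) -> float:
    """Sidak-adjusted p value: 1 - (1 - p)^m for m comparisons."""
    return 1 - (1 - p) ** n_comparisons


# Hypothetical pre- and post-treatment scores
pre_scores = [10, 12, 14]
post_scores = [13, 15, 17]
print(cohens_d(pre_scores, post_scores))  # mean difference 3, pooled SD 2
print(sidak_adjust(0.01, 3))
```

With a single pairwise comparison (m = 1) the Sidak-adjusted value equals the unadjusted one, which is why an adjusted *p* can match the unadjusted *p* when only two time points are compared.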

#### **3. Results**

#### *3.1. Characteristics of LSD*

Table 5 presents primary presenting symptoms and secondary symptoms for all included patients. Primary symptoms were abnormal throat sensation (ATS) (12/14), followed by chronic cough (12/14) with a mean (SD) duration of 81 (110) months (min = 1; max = 360). Other symptoms included dysphonia (5/14), choking sensation (5/14), laryngeal dyspnea (5/14) and dysphagia (2/14).


**Table 5.** Clinical characteristics. CC, chronic cough; ATS, abnormal throat sensation; LD, laryngeal dyspnea.

Table 6 lists the results of PROMs at baseline and normative cut-off values from the literature. The scores on all of these scales were well within the pathological ranges.


**Table 6.** Descriptive statistics of patient reported outcome measures at baseline.

Table 7 shows findings for the relevant strobolaryngoscopy parameters. The predominant clinical feature on strobolaryngoscopy observed in 10/14 participants was vertical mismatch of the vocal folds, followed by some form of lateral or medial constriction of the supraglottic structures during phonation. Abduction lag and unilateral false vocal fold hyperfunction were observed in 6/14 participants and 5/14 participants were observed to have one vocal fold shorter than the other. Phase asymmetry and reduced mucosal wave amplitude were not features found in this population.

**Table 7.** Stroboscopy findings in LSD.


#### *3.2. Effects of Botox Injection on Outcome Measures*

#### 3.2.1. LHQ Score

Figure 3 shows the LHQ scores of all patients at baseline and post-BTX treatment. The majority of patients showed an improvement in LHQ following BTX treatment. Linear mixed model analysis was calculated with treatment ("time") and gender being the fixed factors and patients as random factors. There was a significant effect of the treatment on LHQ outcome (F(1, 25) = 12.335, *p* = 0.002). There was no significant effect of gender (*p* = 0.265) and no significant interaction effect between 'time' and gender (*p* = 0.078), indicating treatment effects were independent of gender. Parameter estimate showed that regression coefficient (b) for LHQ scores was statistically significant (b = −2.633, t(25.0) = −3.423, *p* = 0.002). After BTX treatment, mean LHQ score increased by 2.6 (95% CI = 1.1–4.2, Sidak-adjusted *p* = 0.002).

**Figure 3.** LHQ scores before and after BTX therapy with linear trend lines for male (M) and female (F). Higher score means better outcome. 0 = baseline; 1 = post-BTX treatment.

#### 3.2.2. CSI

Figure 4 shows CSI scores for both genders at baseline and after BTX. There were significant fixed effects of treatment on CSI scores (F(1, 18.998) = 15.068, *p* = 0.001) and no significant interaction between treatment and gender (*p* = 0.748). Parameter estimate showed that CSI score decreased significantly after injection (b = 5.444, t(18.998) = 2.900, *p* = 0.009). Pairwise comparison showed that CSI score decreased by 5.0 after treatment (95% CI = 2.3–7.7, Sidak-adjusted *p* = 0.001).

**Figure 4.** CSI scores before and after BTX therapy with linear trend lines for male (M) and female (F). Lower score indicates better outcome. 0 = baseline; 1 = post-BTX treatment.

#### 3.2.3. RSI

Mixed model analysis was calculated for total RSI scores, which are shown for both males and females in Figure 5. There was a significant fixed effect of treatment on total RSI score (F(1, 25.001) = 19.766, *p* < 0.001). There was no significant interaction between treatment and gender (*p* = 0.219). The decrease in RSI score after BTX injection was significant (b = 5.75, t(25.001) = 2.208, *p* = 0.037). Data from both genders showed that the mean RSI scores decreased by 8.0 after BTX injection (95% CI = 4.3–11.8, Sidak-adjusted *p* < 0.001).

**Figure 5.** RSI scores before and after BTX therapy with linear trend lines for male (M) and female (F). Lower score indicates better outcome. 0 = baseline; 1 = post-BTX treatment.

Sub-score analysis of the RSI data was also performed using paired *t* test comparing scores of each of the RSI items between pre- and post-BTX. Results of comparisons are presented in Table 8, which showed significant differences with large effect sizes for sensory items related to cough and "breathing difficulties or choking episodes".



#### 3.2.4. VHI-10

There was no significant fixed effect of treatment on this outcome measure (*p* = 0.734) and there was also no significant interaction between treatment and gender (*p* = 0.196). Pairwise comparison showed that VHI-10 score dropped by 0.7 after BTX (95% CI = −3.6–5.1, Sidak-adjusted *p* = 0.734).

#### *3.3. Effect Sizes of the Treatment*

Table 9 shows mean differences, *p* value of the paired *t* test and Cohen's d for all outcome measures. This table shows that the treatment effect was large for the LHQ and RSI outcomes and medium for the CSI.

**Table 9.** Mean, mean difference, and effect sizes (Cohen's d: small = 0.2; medium = 0.5; large = 0.8). MID, minimal clinically important difference; (\*), significant at *p* < 0.05.


#### *3.4. Prediction of SLN Block Response on BTX Improvement*

Short-term response to SLN block was evaluated using a 10-point Likert scale based upon the patient's specific presenting symptoms. Table 10 presents the number of patients who showed overall improvement after BTX injection versus those who responded to the SLN block. Response to SLN block was significantly associated with improvement in LHQ scores (χ<sup>2</sup> (1) = 6.618, *p* = 0.01).


**Table 10.** Overall BTX improvement versus outcome of SLN block.
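For a 2 × 2 table such as responder status versus BTX improvement, the Pearson chi-square statistic and its *p* value at one degree of freedom can be sketched as follows. The cell counts in the example are hypothetical and are not the study data, and the sketch assumes no continuity correction was applied, which the text does not state.

```python
from math import erfc, sqrt


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def chi2_p_value_df1(chi2: float) -> float:
    """Upper-tail p value of a chi-square statistic with one degree of freedom."""
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts: rows = SLN responder / non-responder,
# columns = improved / not improved after BTX
print(pearson_chi2_2x2(9, 1, 1, 3))

# p value corresponding to the reported statistic of 6.618 with df = 1
print(round(chi2_p_value_df1(6.618), 3))
```

Applied to the reported statistic, the second function gives a *p* value of approximately 0.01, in line with the value quoted above.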

#### *3.5. Adverse Effects of BTX Treatment*

Ten of the fourteen subjects experienced adverse effects of the BTX treatment. Dysphonia was the most common with weakness, breathiness or reduced volume and projection of the voice. These symptoms were mild and self-limiting, lasting for 2–3 weeks on average. There was no change in VHI-10 at reassessment. One person experienced mild dysphagia and a slower swallow mechanism which also resolved within three weeks.

#### *3.6. Repeat Treatments*

Six patients presented for repeat treatment. Two patients had a single repeat treatment at three months and five months, respectively. One patient had a further two treatments at six and nine months after the initial treatment. One patient had a total of three treatments at approximately three-month intervals. Two patients continue to present for repeat treatment with good effect at intervals of three to six months.

#### **4. Discussion**

#### *4.1. Clinical Presentation of Patients with LSD*

Several disorders triggered by one or more sensory stimuli and manifested by hyperkinetic laryngeal dysfunction such as MTD, PVFM, globus and chronic cough have been grouped under "irritable larynx syndrome" [7]. However, the exact role of the dysfunctional sensory pathway in those conditions has not been confirmed by experimental evidence. Unlike motor function which can be examined using electromyography, there is currently no equivalent objective test for sensory function. This has made it challenging to define, explain and evaluate syndromes involving laryngeal hypersensitivity such as LSD. Explanations for these syndromes have been proposed using neuroplastic [7] or neuropathic models [4,76,77]. Examining sensory symptoms of patients with laryngeal sensitivity is therefore necessary to provide the main clinical clusters that may be useful for diagnosis and treatment follow-up.

Symptoms of LSD have been linked to several umbrella conditions in laryngeal hypersensitivity. Vertigan et al. [3] maintained that laryngeal hypersensitivity existed in the context of CRC, PVFM, MTD and globus. They found that laryngeal hypersensitivity was characterized by significantly higher symptom scores than controls in the breathing, cough, swallowing and phonation domains. They also found that within each clinical group of CRC, PVFM and MTD, the scores for the dominant domain were the highest, e.g., the CRC group had the highest cough score and PVFM had the highest breathing scores. Laryngeal paresthesia scores were significantly higher in these groups compared with controls and there were no significant differences in this score across the groups. Laryngeal sensory dysfunction was therefore investigated in the general pivotal syndromes related to phonation, cough, respiration and swallowing rather than in specific throat sensory profiles. However, they did not specifically describe sensory profiles in relevant PROM scales such as LHQ and RSI.

From case history data, the primary presenting symptoms in this cohort of patients were an abnormal throat sensation and CRC. Other symptoms observed with a lower frequency included choking sensation, voice problems, laryngeal dyspnea and problems with swallowing. PROM data were within the pathological ranges for LHQ, CSI, RSI and VHI-10 (Table 6). Videostrobolaryngoscopy was used to exclude other gross laryngeal pathology but was also useful in identifying signs of laryngeal motor impairment associated with sensory dysfunction in patients with vagal neuropathy [8]. Decreased gross vocal fold movement (3/14), abduction lag (6/14) and unequal vocal fold vertical height (10/14) were the main findings in these patients and gave some indication of laterality of peripheral neuropathy.

When examining potential preceding factors associated with onset, several patterns appear evident. Three of the fourteen people reported a preceding URTI, which has previously been suggested as a cause of vagal neuropathy [8,20]. Three of the fourteen reported preceding occupational inhalational exposure, a recognized trigger factor of irritable larynx syndrome [78]. Trauma to laryngeal nerves is another recognized cause of neuropathic symptoms [29] and was reported in 2/14 people (one iatrogenic during thyroid surgery and one due to external trauma), both of whom exhibited motor signs of weakness on videostrobolaryngoscopy. Two of the fourteen reported preceding intubation; the relevance of this is unclear, but local irritation of the larynx is one potential mechanism by which sensitization can take place. Four of the fourteen patients could not recall any preceding event.

Ten of the fourteen patients had a favorable response to a trial SLN block, supporting a diagnosis of sensory neuropathy. When considering a diagnosis of LSD, the majority of the following components should be present: ATS or CRC that has failed conventional medical/behavioral therapy; symptoms easily triggered by sensory stimuli; abnormal patient reported outcome measures of laryngeal sensory function (e.g., LHQ +/− RSI); signs of motor asymmetry on laryngeal stroboscopy; favorable response to a trial SLN block.
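The component list above can be read as a simple checklist. The following is a minimal sketch only, assuming that "the majority" means at least three of the five listed components; that threshold, the class `LsdFindings` and its field names are illustrative assumptions and not a validated diagnostic rule.

```python
from dataclasses import dataclass, fields


@dataclass
class LsdFindings:
    """Hypothetical record of the five components listed in the text."""
    refractory_ats_or_crc: bool   # ATS or CRC that failed conventional medical/behavioral therapy
    sensory_triggers: bool        # symptoms easily triggered by sensory stimuli
    abnormal_proms: bool          # abnormal PROMs of laryngeal sensory function (e.g., LHQ +/- RSI)
    motor_asymmetry: bool         # signs of motor asymmetry on laryngeal stroboscopy
    sln_block_response: bool      # favorable response to a trial SLN block


def components_present(findings: LsdFindings) -> int:
    """Count how many of the five components are present."""
    return sum(bool(getattr(findings, f.name)) for f in fields(findings))


def majority_present(findings: LsdFindings) -> bool:
    """Assumption: 'majority' is read as more than half of the components."""
    return components_present(findings) > len(fields(findings)) / 2
```

A record with three components present would satisfy this reading of the criteria, whereas one with two would not; the clinical weighting of individual components is not addressed by this sketch.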

#### *4.2. Treatment Effects of BTX on LSD*

The present study is the first to describe the use and investigate the efficacy of supraglottic botulinum toxin type A injection for symptoms associated with Laryngeal Sensory Dysfunction. We postulated that BTX may affect the sensory afferent loop of the cough reflex via multiple mechanisms using a sensory neuropathic model [50,51]. The internal branch of the SLN is the primary laryngeal sensory afferent nerve contributing to a number of important reflexes including cough, swallow, respiration and laryngospasm [16]. It was thus hypothesized that targeting the peripheral sensory receptors in the distribution of this nerve would be a more effective and logical approach than targeting the intrinsic laryngeal musculature (previously described for the treatment of CRC [49–52]). Our hypothesis and treatment approach appear to be supported by the findings of this study.

There were statistically significant improvements in the primary patient reported outcome measures of LHQ (improving by 2.6 post-BTX) and CSI (improving by 5.0). The findings suggest a therapeutic effect of supraglottic BTX in the treatment of laryngeal sensory dysfunction. While not mechanistic proof, these findings are in support of the previously discussed peripheral and central sensitization model, and support the use of BTX in the treatment of neuropathic sensory dysfunction.

The findings relating to RSI score are noteworthy. Baseline RSI scores were within the abnormal range [63], despite ongoing medical and behavioral management of laryngopharyngeal reflux at the time of the BTX injection. Sub-item analysis (Table 8) showed significant improvement in the items relating to abnormal sensation: "excess throat mucus or post-nasal drip" and "sensation of something sticking in your throat or a lump in the throat", which are symptoms common to LSD. Improvement was also seen in the sub-items relating to cough and breathing difficulties/choking episodes, which are potential motor manifestations of laryngeal sensory dysfunction. These findings support the multi-faceted nature of LSD.

Despite RSI being developed as a tool for LPR symptoms [63], there is a lack of agreement between its score and laryngopharyngeal pH monitoring [79]. The findings of this study support the fact that symptoms reflected in the RSI are not always associated with LPR [80] and may be related to other etiologies including LSD. In light of this, the mechanism of action of BTX on ATS which resulted in improvement in RSI can be interpreted based upon findings from previous research on neuropathic pain involving peripheral nerve injury.

We found that CSI scores decreased significantly after supraglottic BTX injection, supporting its role as a potential treatment for CRC. The therapeutic effects of BTX on cough are thought to stem from its action upon the sensory pathways in modulating the cough and laryngeal adductor reflexes [50–52]. It is also possible that diffusion from the injection site into the intrinsic laryngeal adductor muscles may have occurred, producing the effects which have been reported and explained in some previous studies [50,51]; however, we would have expected an associated decrease in voice if this were the primary mechanism of action.

In this study, VHI-10 scores did not change significantly despite the common reports of voice change after BTX treatment. This is in line with the mild and temporary nature of dysphonia after laryngeal BTX injections reported elsewhere in the literature [54,68].

#### *4.3. The Role of SLN Block as Predictor of LSD, CRC and Efficacy of BTX*

Recent work has explored SLN block as an office-based treatment for chronic refractory cough with a suspected neuropathic cause. In 2018, Simpson [62] reported improvement in cough severity index scores in a cohort of 23 patients where superior laryngeal nerve block was performed using local anesthesia with steroid. In total, 44% of patients had lasting improvement after one treatment but the mechanism of this extended effect remains unclear. Bupivacaine, considered to be the longest lasting local anesthetic, has an analgesic duration of action of only 4–8 h [81]. The addition of steroid to the local anesthetic could theoretically address any inflammation of the superior laryngeal nerve if it happens to be delivered to the site of the nerve inflammation. Twenty-eight of the thirty patients treated by Dhillon reported at least a 50% reduction in symptoms along with significant improvement in CSI scores (the only outcome measure employed in this study) after a minimum of three injections [82,83]. Bradley et al. [84] described surgical section of the SLN as a viable option for treatment of selected patients with refractory neuropathic cough. However, they also recognized dysphagia and aspiration as potential complications of this treatment.

In our practice we find SLN block a useful tool to assist with diagnosis of LSD and help guide treatment. Patients with laryngeal sensory symptoms persisting despite medical management of laryngeal irritants such as postnasal drip and laryngopharyngeal reflux are offered a trial unilateral SLN block based upon laterality of symptoms and any laryngeal stroboscopic findings that may suggest superior laryngeal nerve paresis. If there is no improvement in symptoms at 20 min compared with baseline, SLN block is offered on the contralateral side. Where symptom improvement is reported, this suggests that the anesthetized nerve or its peripheral receptors and nerve endings play a significant part in the patient's presentation, supporting a neuropathic diagnosis and offering a potential target for treatment. It is our experience that symptomatic improvement of LSD after SLN block is short term with most patients reporting a duration of effect in the order of hours rather than days before symptoms return.

In the present study, short term response to SLN block was a significant predictor of longer-term response to supraglottic botulinum toxin. Where laryngeal symptoms do not improve with SLN block, a diagnosis of sensory neuropathy is still possible but is likely to involve other sensory branches of the larynx such as the recurrent laryngeal nerve or may be referred from other sites of a neuropathic process in the vagal pathways.

#### *4.4. Limitations of This Study*

This was a retrospective study; however, we had a high level of data completeness with no patient loss to follow up. When performing supraglottic BTX treatment, it is our experience that the procedure is tolerated much better by the patient with the assistance of laryngeal anesthesia. We used SLN block at the time of BTX treatment for this purpose. Theoretically, some of the treatment effect may be related to the SLN block; however, all patients had reported only short-term response to prior SLN block performed as an independent procedure as part of workup for LSD and a much longer effect of treatment with concurrent BTX treatment. Finally, due to the retrospective nature of this study we were unable to include a separate control group. Future prospective studies investigating this novel treatment for LSD using a control treatment group (perhaps SLN block with BTX vs. SLN block alone) are indicated based on the promising results of the current study.

#### *4.5. Recommendations for Assessment and Treatment of CRC*

This study identified a sub-group of patients presenting with various symptoms within the LSD syndrome and provided preliminary data on the therapeutic effects of BTX administered into a novel supraglottic region of the larynx. This method of BTX administration can be safely performed as an office-based procedure that does not require complicated equipment or concurrent invasive procedures such as laryngeal electromyography. The recommended treatment planning for these patients is summarized as a flowchart in Figure 6. Patients who present with LSD symptoms are offered superior laryngeal nerve block. If the symptoms improve, supraglottic BTX treatment is indicated. If LSD symptoms do not change after the block, patients will undergo alternative treatments such as medical treatment, neuromodulators, and speech pathology treatment. Those who do not respond to these alternative treatments can be offered supraglottic BTX as a salvage treatment, and they can revert to medical treatment and speech pathology treatment. It is important to note that clinical trials are now required to validate these findings.

**Figure 6.** Recommendations of treatment plans for patients with LSD.
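The treatment pathway described in the text can be sketched as a small decision function. This is an illustrative reading of the prose only, not a reproduction of Figure 6; the names `Step` and `next_step` and the two boolean inputs are assumptions introduced for the example.

```python
from enum import Enum


class Step(Enum):
    """Hypothetical labels for the treatment options named in the text."""
    SUPRAGLOTTIC_BTX = "supraglottic BTX"
    ALTERNATIVE = "medical treatment, neuromodulators, speech pathology treatment"
    SALVAGE_BTX = "supraglottic BTX as salvage treatment"


def next_step(improved_after_sln_block: bool,
              alternatives_failed: bool = False) -> Step:
    """Suggest the next step after a trial SLN block, following the prose description."""
    if improved_after_sln_block:
        # Symptom improvement after the block: supraglottic BTX is indicated.
        return Step.SUPRAGLOTTIC_BTX
    if alternatives_failed:
        # No change after the block and no response to the alternative treatments.
        return Step.SALVAGE_BTX
    # No change after the block: alternative treatments come first.
    return Step.ALTERNATIVE
```

For example, `next_step(False, alternatives_failed=True)` returns `Step.SALVAGE_BTX`, reflecting the salvage option for patients who do not respond to the alternative treatments.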

#### **5. Conclusions**

This study provided further evidence for defining, describing, and diagnosing a subgroup of patients presenting with various laryngeal symptoms related to altered laryngeal sensation. The major presenting symptoms for these patients were abnormal throat sensation and chronic cough. Diagnostic criteria for these patients should be based upon the onset and history of the sensory symptoms, resistance to medical and behavioral treatment, abnormal scores in PROMs evaluating abnormal laryngeal sensation including the LHQ and RSI, laryngeal videostroboscopy findings and responses to SLN block.

Symptomatic immediate response to SLN block supports the diagnosis of LSD affecting the supraglottic laryngeal afferent pathways. It was also a useful predictor of which patients were likely to respond to subsequent treatment with supraglottic BTX injection where the response to SLN block is short-lived.

Supraglottic BTX administration is a safe office-based procedure that effectively reduced sensory symptoms in a cohort of patients with various clinical presentations related to laryngeal sensory dysfunction. This treatment may be considered after the patient fails behavioral intervention and standard medical management for any related co-morbidities such as asthma, laryngopharyngeal reflux or sinonasal conditions including control of potential trigger factors. It can be used as an adjunct to neural modulators or as a standalone treatment to address neuropathic laryngeal symptoms related to LSD including reducing hypersensitivity of the laryngeal afferent pathways and protective reflexes manifesting as chronic refractory cough and throat clearing and reducing sensory symptoms of laryngeal paresthesia presenting as abnormal throat sensation.

**Author Contributions:** Conceptualization, D.N.; methodology, D.N., C.M. and D.D.N.; formal analysis, D.D.N.; investigation, D.N., M.S., T.S. and K.S.; data curation, T.S., A.C., K.S. and M.S.; writing—original draft preparation, D.N., D.D.N., A.C., M.S. and T.S.; writing—review and editing, D.N., D.D.N., C.M., A.C. and T.S.; project administration, D.N.; visualization, D.N.; funding acquisition, D.N. and C.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by The Dr. Liang Voice Program at The University of Sydney.

**Institutional Review Board Statement:** This study was a retrospective cohort study conducted by medical record review of a private laryngology specialized clinic for the period from November 2019 to May 2021. The study was conducted according to the guidelines of the Declaration of Helsinki. The study was approved by the Human Research Ethics Committee of The University of Sydney (protocol number 2021/025).

**Informed Consent Statement:** Patient consent was waived because it was impractical to seek consent from patients seen in the past, and implementing a process to locate and contact each individual participant to seek their consent was considered a threat to patient privacy. This waiver was approved under the Human Research Ethics Committee approval noted above.

**Data Availability Statement:** Data supporting reported results is retained by The University of Sydney in de-identified form and is confidential under the conditions of the Human Research Ethics Committee of The University of Sydney approval.

**Acknowledgments:** We acknowledge the clinicians referred to in the study and thank them for their rigorous data collection practices that supported this work.

**Conflicts of Interest:** First author D.N. is Director of Sydney Voice and Swallowing. C.M., D.D.N. and A.C. are employees of The University of Sydney and are partly or fully funded by the Dr. Liang Voice Program, a philanthropically funded program of research and post-graduate education in laryngology. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **Active Ingredients of Voice Therapy for Muscle Tension Voice Disorders: A Retrospective Data Audit**

**Catherine Madill 1,\*, Antonia Chacon 1, Evan Kirby 1, Daniel Novakovic 1,2 and Duy Duong Nguyen 1**


**Abstract:** Background: Although voice therapy is the first line treatment for muscle-tension voice disorders (MTVD), no clinical research has investigated the role of specific active ingredients. This study aimed to evaluate the efficacy of active ingredients in the treatment of MTVD. A retrospective review of a clinical voice database was conducted on 68 MTVD patients who were treated using the optimal phonation task (OPT) and sob voice quality (SVQ), as well as two different processes: task variation and negative practice (NP). Mixed-model analysis was performed on auditory–perceptual and acoustic data from voice recordings at baseline and after each technique. Active ingredients were evaluated using effect sizes. Significant overall treatment effects were observed for the treatment program. Effect sizes ranged from 0.34 (post-NP) to 0.387 (post-SVQ) for overall severity ratings. Effect sizes ranged from 0.237 (post-SVQ) to 0.445 (post-NP) for a smoothed cepstral peak prominence measure. The treatment effects did not depend upon the MTVD type (primary or secondary), the treating clinicians, the number of sessions, or the days between sessions. Implementation of individual techniques that promote improved voice quality and processes that support learning resulted in improved habitual voice quality. Both voice techniques and processes can be considered as active ingredients in voice therapy.

**Keywords:** Sob Voice Therapy; Optimal Phonation Task; Negative Practice; auditory-perceptual analysis; acoustic voice analysis

#### **1. Introduction**

A muscle-tension voice disorder (MTVD) is a commonly occurring dysphonia that results from disorganisation or dysfunction of the laryngeal musculature [1]. It can occur as a primary condition without organic changes to the vocal folds or as a secondary, compensatory condition to underlying organic or neurological laryngeal pathology. The aetiology of MTVD can be multifactorial and includes phonotrauma, excessive vocal load, glottic incompetence (vocal fold paresis and atrophy), psychological stress, and co-occurring medical conditions such as upper respiratory tract infection, laryngopharyngeal reflux, and sinusitis with post-nasal drip [2,3]. Within the voice-disordered population, functional dysphonia has documented prevalence rates of between 20.5 and 41%, while the prevalence of phonotraumatic lesions (e.g., vocal nodules and polyps) is 12–15% [4,5]. The majority of MTVDs are preventable [2] and early intervention is recommended to mitigate the negative impact of the disorder [6].

#### *1.1. Behavioural Voice Therapy Is the First-Line Treatment for MTVD*

Treatment of MTVD requires voice therapy as the first line of treatment [7], alongside the medical management of co-existing or contributing medical conditions. Both indirect and direct voice therapies are utilised in the treatment of MTVD in adults and children [8–10]. Indirect voice therapy, also termed vocal hygiene, aims to facilitate an individual's vocal rehabilitation by identifying and eliminating poor vocal behaviours or other constraints to good vocal health, while promoting vocal health. Direct voice therapy describes a large range of individual vocal techniques and structured programs designed to change the habitual movement of the vocal system during phonation [8] such that the vocal needs of the individual are met without deterioration in the sound or sensation of phonation. Numerous systematic reviews and an increasing body of evidence have demonstrated that voice therapy is effective for the majority of patients with MTVD [11]; however, there is insufficient evidence to determine if one treatment is more effective than another. While some research has demonstrated that speech and language pathologists (SLPs) use a common approach to therapy [12], it is also well documented that SLPs use more than one MTVD therapy technique at a time [9,10]. This prevents clear identification of the therapeutic effect of each component of the treatment regime prescribed by the clinician. Therapies for MTVD are also very heterogeneous and target different aspects of voice production. In addition, different therapies employ different conceptual approaches and there is a paucity of outcome data on the individual treatment components thought to modify voice production towards more optimal function.

**Citation:** Madill, C.; Chacon, A.; Kirby, E.; Novakovic, D.; Nguyen, D.D. Active Ingredients of Voice Therapy for Muscle Tension Voice Disorders: A Retrospective Data Audit. *J. Clin. Med.* **2021**, *10*, 4135. https://doi.org/10.3390/jcm10184135

Academic Editor: Renée Speyer

Received: 12 August 2021 Accepted: 8 September 2021 Published: 14 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There is a pressing need to ensure that the most cost-effective treatments are used, that is, treatments that provide evidence-based treatment effects with the maximum therapeutic effect in the minimum amount of time. Across 140 research publications, treatment for dysphonia averaged approximately 11 sessions, mostly of 30 or 60 min duration, with average clinician-to-client face-to-face time estimated at 8.17 h [13]. The authors acknowledge that this was a conservative analysis, with many studies using fixed-treatment designs and others documenting clinical outcomes in North America, in which health insurance rules may influence intervention length and cost. If treatment efficacy can be improved, time and health-care costs may be reduced without compromising treatment outcomes or patient-centred care [14].
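As a rough illustration of the figures cited above, the average face-to-face time per session can be derived from the two reported averages. The sketch below is arithmetic only; the function name is hypothetical, and because the session count is reported as approximate, the result should be read as indicative.

```python
def minutes_per_session(total_hours: float, sessions: float) -> float:
    """Average face-to-face minutes per session from total hours and session count."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return total_hours * 60 / sessions


# Averages cited in the text: 8.17 h of face-to-face time over about 11 sessions.
average_minutes = minutes_per_session(8.17, 11)
print(f"{average_minutes:.1f} min per session")
```

The result, roughly 45 min per session, sits between the 30 and 60 min session lengths that the text describes as most common.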

#### *1.2. What Is an Active Ingredient in Voice Therapy?*

The definition of an active ingredient has been recently considered in allied health and speech language pathology (SLP), specifically in [15–17]. Nevertheless, behaviours that generate a therapeutic effect can be difficult to identify in behavioural therapies due to a number of challenges. These include the lack of clarity surrounding rehabilitation ingredients, the fact that rehabilitation treatments often attempt to change multiple interacting patient functions, and a lack of standard nomenclature and definitions for specific treatment ingredients [18]. The treatment of voice disorders is one area in which significant efforts are being made to identify active ingredients in detail.

Quantifiable ingredients such as dosage, frequency, and intensity were initially proposed as active ingredients in SLP [16]. In recent times, a more expansive consideration of those components of a therapy that may have a therapeutic effect has been modelled in the Taxonomy of Voice Therapy [19]. This model proposes that treatment components may be classified into direct interventions (subdivided into auditory, somatosensory, musculoskeletal, respiratory, and vocal function), intervention delivery models (extrinsic and intrinsic), and indirect interventions (pedagogy and counselling) with more specific interventions listed under each sub-category [14]. The Rehabilitation Treatment Specification System (RTSS) [18] describes a simpler theoretical framework and proposed methodology by which treatments can be described according to a singular treatment target (the patient function that is to be changed by the ingredient(s)); one or more ingredients (what the clinician does to modify the target); and the mechanism(s) of action of the treatment [16]. Both the Taxonomy of Voice Therapy and the broader RTSS provide complex and detailed theoretical models that can inform our understanding; however, these models defining active ingredients are yet to be tested in clinic-based research.

Verdolini provides a simpler conceptualisation of the mechanisms of action as being divisible by the 'what' (the vocal technique) and the 'how' (the modality by which the change of function is learned) [20]. Across different voice therapies, the 'what' can vary from a single technique, such as Conversation Training Therapy (CTT) (use a clear voice) and Resonant Voice Therapy (RVT) (feel the buzz and notice the ease of phonation), to multiple technique therapies, such as Vocal Function Exercises (VFE) [21] (four distinct exercises targeting the whole vocal system), stretch and flow therapy [22], and the Accent Method [23,24]. The 'how' of learning to habituate the new vocal technique is remarkably homogenous across voice therapies [25] and involves processes originally described in motor learning research, such as task variation (hierarchical or end goal target) and negative practice.

There is little existing research on voice-disordered populations investigating the effectiveness of specific techniques and/or processes, as most research designs have evaluated the impact of the whole therapy rather than its component parts or stages. Most voice therapy programs that aim to provide a standardised series of voice exercises have been evaluated in controlled clinical trials [21–24,26,27]. All of these programs consist of multiple exercises that may be hierarchical in nature (e.g., Lessac Madsen Resonant Voice Therapy and RVT) or address different aspects of vocal function (e.g., VFE and the Accent Method). All have demonstrated efficacy with a range of effect sizes demonstrated across a variety of voice outcome measures; however, none have systematically evaluated the effect of each component or 'ingredient' in the treatment provided. Preliminary research investigating individual effects of components of VFE has isolated the therapeutic effects of practise dosage and the use of a semi-occluded vocal tract (nasal sound) [28,29]; however, this research was conducted in controlled experimental conditions with non-voice-disordered volunteers.

#### *1.3. VoiceCraft® Sob Voice Therapy*

VoiceCraft® Sob Voice Therapy (SVT) [30] is a direct voice therapy program whereby discrete individual techniques and processes are introduced at specific times and thus provides an opportunity to isolate possible effects of individual ingredients. VoiceCraft® is an SLP-directed voice therapy treatment model developed in the 1980s based on the work of numerous voice-science researchers and clinicians [31]. Described as a differentiated vocal tract model of vocal training that aims to develop the control of specific muscular movements in the larynx [32], it consists of a range of treatment programs for different patient populations (e.g., Yell Well for children with vocal nodules) that can be adjusted to the individual presentation of the patient depending on the type of voice condition, their individual muscular function in the larynx, and/or awareness of perceptual outcomes of phonation. This approach to the remediation of functional voice disorders has not been documented previously. VoiceCraft® training has proven to be effective in improving voice quality in healthy subjects [33] and to 'fatigue proof' the voice under conditions of sleep deprivation [33]. Despite being used across Australasia, Singapore, Europe, and the UK to treat voice and resonance disorders in adults and children, efficacy of VoiceCraft® programs, such as Sob Voice Therapy, has not been reported in voice-disordered populations.

Sob Voice Therapy is used to treat adolescent and adult patients with MTVD with or without organic change. The program consists of up to four techniques (as required) and utilises two common learning processes to support the generalisation and maintenance of the new voice techniques, namely task variation and negative practice (Table 1). It follows a hierarchical progression from an initial exercise utilising the most common features of voice therapy exercises, namely the optimal phonation task (OPT), followed by the introduction of sob voice quality (SVQ), the so-called heartbroken voice quality, and then habitual speech quality. Twang voice quality can be taught to assist in the production of loud voicing without effort, should the patient require this skill to meet their vocal needs. Task variation and negative practice are used in between the introduction of each technique. The difference between each technique can be physiologically and perceptually described according to the targeted activation of muscle groups that result in measurable movement outcomes. For example, the difference between OPT and SVQ involves targeting a lower larynx position and some degree of laryngeal tilt in SVQ compared to OPT.

**Table 1.** Name and brief description of each of the first four Sob Voice Therapy components.


NB: Practise recommendations are cumulative over the four components. Patients are instructed to randomise practise tasks in hourly practise sessions as different tasks are introduced.

VoiceCraft® and SVT describe voice therapy techniques that are based on a dynamical systems approach which acknowledges that the vocal system, like other complex movement systems, is self-organising [34]. Identifying the component of vocal function that is the most disorganised is the focus of the treatment and in the case of MTVD, it relates to some aspect of laryngeal function; for example, differentiated control of the adduction of the true vocal folds and retraction of the false vocal folds, and/or lowering of the larynx. Specifically, primary movements are targeted as these are implicated across a number of presenting symptoms (e.g., supraglottic constriction is associated with degraded voice quality and increased vocal effort). In this way, targeting a single movement, such as the widening of the supraglottic area via the release of laryngeal constriction manoeuvres, that then may address multiple aims, presents an efficient process of treatment, as multiple symptoms are addressed in one movement adjustment. Other aspects of the phonatory system such as breathing and resonance are de-emphasized unless they are the primary source of dysfunction, as it is presumed the neural system will automatically reorganise these functions around the biomechanical movement that is reorganised/optimised. For example, breathing is assumed to be mediated by communicative intent [35,36]. Different learning processes may have greater effect in the learning of the new, more optimal movement.

#### *1.4. Retrospective Cohort Analysis vs. Randomised Control Trial*

Given the value of voice therapy programs as the first line of treatment for commonly occurring MTVDs, understanding which treatment programs are effective and estimating their potential 'active ingredients' is essential. Despite being considered the highest level of evidence, the use of randomized controlled trials (RCTs) in investigating the treatment efficacy of voice therapies on voice disorders presents certain difficulties. Firstly, it is ethically challenging to allocate patients into different study arms given the need to recover the voice of professional voice users. Secondly, cost-effectiveness is a barrier to both clinicians and their patients, as most voice therapy programs require a course of weeks to months to complete. Lastly, patient compliance and the impact of various co-factors and comorbidities/medical conditions are amongst the burdens that can interfere with the intervention outcomes and how these are interpreted. A retrospective review of existing clinical databases has the advantage of bringing evidence from 'real-world' scenarios to help clinicians and researchers (1) determine whether a particular therapy program is effective and, in standardised treatment programs, (2) compare different therapy components with respect to their treatment efficacy.

The aims of the present study were to:


It was hypothesized that: (1) Sob Voice Therapy, which includes two vocal techniques (OPT and SVQ) and two training processes (SVQ variant and NP), would be effective in the treatment of MTVD; (2) processes (task variation and negative practice) rather than techniques (OPT and SVQ) would demonstrate statistically significant treatment effects; and (3) session number, treatment duration, and diagnostic and service delivery factors would have significant effects on treatment outcomes.

#### **2. Materials and Methods**

#### *2.1. Study Design*

This was a retrospective file audit of an existing private practice speech pathology clinical database. This study was approved by the Human Research Ethics Committee of the University of Sydney (protocol number: 2019/529).

#### *2.2. Participants*

#### 2.2.1. Selection Criteria

Participants were included if they had received a diagnosis of primary or secondary MTVD from an Ear, Nose and Throat specialist (ENT). 'Primary' referred to MTVD without visible vocal fold mucosal lesions and 'secondary' referred to MTVD with slight associated mucosal changes related to vocal trauma, such as pre-nodular and swelling lesions.

Inclusion criteria were: (1) over 18 years of age; (2) diagnosis of MTVD by an ENT report based on laryngoscopy; (3) had attended at least one voice assessment and one voice therapy session, enabling acoustic recordings both prior to and following the teaching and practise of the OPT; (4) received only Sob Voice Therapy components as described above; and (5) reported to have done some practise of the therapy component (technique or process) as recommended by the clinician.

Exclusion criteria included: (1) under 18 years of age; (2) missing an ENT laryngoscopy report/diagnosis; (3) had undergone surgery of the larynx or surrounding structures (e.g., thyroid surgery) throughout their voice intervention period; (4) neurological voice and speech problems (e.g., dysarthria) or predominant mucosal lesions (e.g., cysts, polyps, and neoplasms); (5) types of functional dysphonia not related to vocal trauma, e.g., puberphonia, presbyphonia, and transgender voice; (6) missing voice recordings for more than one data point other than the initial and final session; (7) voice recordings with severely aperiodic signals (type 3 and type 4 signals) [37], precluding fundamental frequency-based measures; (8) received instruction in another voice therapy technique or process not described in Sob Voice Therapy; and (9) patients who could not detect any change in the sound or sensation of their voice production regardless of their success in achieving voice change during the OPT trial therapy task in the initial assessment, as this would suggest a possible undiagnosed neurosensory or cognitive impairment.

#### 2.2.2. Sample Size Calculation

The required number of patients for the retrospective review was estimated using an online sample size calculation tool called GLIMMPSE [38], as this has been recommended for calculating sample sizes for repeated-measures study designs [39]. Parameters used included: power = 90%; Geisser–Greenhouse corrected test; Type I error rate α = 0.05; outcome measure = harmonics-to-noise ratio (HNR); number of measurements = 3 (baseline and two post-therapy assessments); predictor variable = type of muscle-tension voice disorder (primary and secondary); treatment effects = [MTD type × harmonics-to-noise ratio interaction]; mean scale factor = 2; and variability scale factor = 1. For the mean values entered into the calculation, we used baseline HNR values from a randomized controlled clinical trial by Nguyen and Kenny [21], in which the pre-treatment HNR of primary MTD was 18.6 decibels (dB). Considering there has been no similar study design in the literature, we assumed the first treatment and second treatment resulted in a 3.8 dB improvement in HNR for the primary MTD group as observed in the Nguyen and Kenny study [21]. Mean baseline HNR for secondary MTD was taken from Wenke et al. [40], in which baseline HNR was 16.6 dB, as their study used participants with both primary MTD and MTD with lesions such as vocal nodules. We assumed the first and second treatments resulted in a 2.9 dB improvement in HNR for the secondary MTD group as observed in their standard treatment protocols [40]. The standard deviation (SD) of HNR for the calculation was set at 4.5 dB according to the study of Wenke et al. [40]. The calculation resulted in a sample size of 74 patients.

#### *2.3. Voice Therapy Programs under Review: Sob Voice Therapy*

Sob Voice Therapy was delivered to the patients by six different SLPs who had completed a 4-day workshop in VoiceCraft® and SVT [30]. All were certified practicing speech pathologists with experience in treating patients with MTVD ranging from 1 to 15 years. Therapy was delivered in a face-to-face, one-on-one service delivery model across six different sites in an office setting. Patients were charged a fee for service in all cases. Eighteen out of sixty-eight participants were treated by more than one clinician. Patients were taught the specific technique or process and required to perform the technique or task to 80% accuracy as judged by the clinician before moving on to the next phase. All sessions were documented as being 60 min long (according to the clinical hour of 50 min face-to-face time and 10 min of note taking/administration). Patients were recommended to undertake a specific amount of daily practise in each technique and/or process. Recommendations were based on motor learning principles of high frequency, distributed variable, and randomised and context-variable practise [41]. Typically, patients were recommended to practise once an hour for between 1 and 3 min, aiming for 10 practise sessions/day. As the therapy is based on hierarchical additive fractionation, patients were required to add practise in a new technique or process to that of their previous practise, which also allowed for task variation and randomisation. Individual specific practise data were not collected routinely from patients; however, all patients reported some level of practise. The number of sessions required to meet 80% correctness in the technique/process ranged from 1.3 to 2.4, with the number of days between each technique/process ranging from 27.8 to 37.5.

Extracted data were collected at five time points: (1) at the initial session (baseline) after which the OPT was taught in the same session; (2) at the subsequent session in which it was judged by the clinician whether the OPT had been acquired and the next technique (SVQ) was taught (OPT-SVQ); (3) at the subsequent session in which the clinician judged that SVQ had been acquired and sob variants were taught (SVQ-SVQ variants); (4) at the subsequent session in which the clinician judged whether the SVQ variants had been acquired and NP was taught (SVQ variants-NP); and (5) at the beginning of the session following the introduction of the NP process (NP post-NP). The number of sessions and days between each of the time points varied due to variation in clinic attendance and time taken to acquire each technique/process. The modal number of sessions between each technique/process was 1 and the modal number of days was 14 (Table 2).

**Table 2.** Number of sessions and days between each technique and process of the Sob Voice Therapy. Abbreviations: SD, standard deviation, and CI, confidence interval.


#### *2.4. Data Extraction*

#### 2.4.1. Demographic Characteristics

During the initial voice assessment, a thorough case history interview was conducted. This supplemented the referral and case history information collected by a comprehensive case history questionnaire [42] and the patient-reported outcome measures (PROMs) data collected prior to the assessment session, including both the Voice Handicap Index-10 (VHI-10) [43] and Reflux Symptom Index (RSI) [44] as a standard (data not reported here). Data about age, gender, occupation, MTVD type (primary and secondary), vocal load, lifestyle, and history of comorbidities were extracted.

#### 2.4.2. Extraction of Voice Recordings

Patient data was extracted and de-identified by authors AC and EK to ensure the first author was blinded to the identification of patient data to remove any risk of bias. All patients included in this review had high-quality audio recordings of a comprehensive voice assessment undertaken at baseline including the reading of the Rainbow Passage [45], the Consensus Auditory Perceptual Evaluation–Voice (CAPE-V) phrases [46], and the prolonged vowel (/a/). All voice signals were captured using an AKG C520 cardioid ear-mounted microphone [47] placed at a constant distance of 6 cm, 45° off the mouth axis, and were analogue-to-digital converted using a professional external sound card (Roland Quadcapture [48]) at 44.1 kHz and 16-bit resolution. The signals were processed and saved to a laptop computer using the Audacity sound editing software [49] in \*.wav format. Calibration of the sound level in the voice signals was not undertaken. In subsequent treatment sessions, audio recordings were made at the beginning of each session of the Rainbow Passage, CAPE-V phrases, and prolonged vowel /a/ for a minimum of 3 s.

#### *2.5. Auditory–Perceptual Outcome Measures*

This retrospective review used four auditory–perceptual parameters for outcome measures, including overall severity of dysphonia, roughness, breathiness, and strain. These outcome measures were evaluated using auditory–perceptual analysis, which is considered the gold standard for clinical voice assessment [50].

#### 2.5.1. Listeners

Two certified practicing SLPs (2 and 3.5 years of experience in clinical voice assessment, respectively) and one ENT surgeon (19 years of experience in voice assessment) participated in the perceptual analyses. The raters reported normal hearing and vision at the time of the study.

#### 2.5.2. Stimuli

Voice samples were edited to include the middle three seconds of the second attempt of the sustained /a/ vowel production, the third CAPE-V phrase (CAPEV3), and the Rainbow Passage ('When the sunlight ... ... at the end of the rainbow'). These tasks were combined into a single file in Audacity. To avoid variabilities related to unequal sound pressure levels/hearing levels of the samples, all stimuli were normalized for loudness using the command 'Loudness Normalization' in the program to ensure that the perceived loudness of stimuli was −23 loudness units relative to full scale (LUFS). The intensity level of stimuli ranged from 70 to 72 dB as measured in Praat [51] using default intensity settings. Stimuli from 35 patients were randomly repeated for testing intra-rater reliability. In total, 285 samples were used.
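The trimming step described above can be illustrated with a minimal sketch. It assumes the recording is already available as a plain sequence of samples at a known sampling rate; the function name `middle_segment` is hypothetical, and the study itself performed this edit manually in Audacity.

```python
def middle_segment(samples, sample_rate, duration_s=3.0):
    """Return the centred `duration_s`-second slice of a sample sequence.

    Illustrative only: the study trimmed recordings by hand in Audacity.
    """
    n_keep = int(round(duration_s * sample_rate))
    if n_keep > len(samples):
        raise ValueError("recording is shorter than the requested segment")
    # Drop an equal number of samples from the start and the end.
    start = (len(samples) - n_keep) // 2
    return samples[start:start + n_keep]


# Usage with a hypothetical 5-s recording sampled at 44.1 kHz (silence here).
recording = [0.0] * (5 * 44100)
segment = middle_segment(recording, 44100)
```

Centring the window avoids the onset and offset of phonation, where the voice signal is typically least stable.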

#### 2.5.3. Procedure

Raters judged the level of the four voice dimensions, including overall severity, roughness, breathiness, and strain, using a 100-point visual analogue scale (VAS) based on the items described in the CAPE-V protocol [46] and embedded in an online auditory–perceptual rating tool called Bridge2practice, which is an education and research platform developed for auditory–perceptual learning and practise of speech pathology students [52]. Judgments were made by moving a slider between 1 and 100, representing the minimum and maximum level of the quality being rated, respectively. Listeners were allowed to listen to the voice tasks as many times as they wished using headphones and made each judgment by changing the position of the slider on the VAS line mentioned above. All voice tasks were randomized. Responses were registered in the rating platform and exported to an Excel spreadsheet. The CAPE-V rating includes other perceptual rating features such as pitch, volume, and resonance, as well as additional features such as fry and diplophonia; however, these features were not rated in this dataset.

#### 2.5.4. Reliability of Auditory–Perceptual Analyses

Reliability was assessed using SPSS 24.0 [53]. Intraclass correlation coefficients (ICC) [54] were used to determine the level of agreement between the first and second (repeated) ratings (intra-rater reliability) and across listeners (inter-rater reliability). ICC was calculated using a two-way mixed model, consistency type, and single measure analysis [ICC (3,1)]. To assess the level of correlation, ICC < 0.5 indicates poor correlation, 0.5–0.75 indicates moderate correlation, 0.75–0.9 indicates good correlation, and >0.9 indicates excellent correlation [55]. Table 3 shows good to excellent intra-rater reliability for most of the rated voice dimensions. Table 4 shows moderate to good inter-rater reliability for all rated voice dimensions.
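As an illustration of the ICC(3,1) statistic named above, the following sketch computes it from the mean squares of a two-way ANOVA without replication. It is not the SPSS procedure used in the study; the function names are hypothetical, the ratings are toy values rather than study data, and the handling of the boundary values (exactly 0.75 or 0.9) is an assumption because the cited bands overlap at their edges.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed model, consistency, single measure.

    `ratings` holds one row per rated sample and one column per rater.
    """
    n = len(ratings)          # number of rated samples
    k = len(ratings[0])       # number of raters (or rating occasions)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def interpret_icc(icc):
    """Map an ICC value to the bands quoted in the text (edges assumed)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Toy ratings: six samples scored by four raters (not study data).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_3_1(toy), 2), interpret_icc(icc_3_1(toy)))
```

Because the consistency form removes the rater (column) variance from the error term, a rater who is systematically stricter than the others does not lower the coefficient.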


**Table 3.** Intra-rater reliability of the perceptual analysis (*p* < 0.001 for all measures).


**Table 4.** Inter-rater reliability of the perceptual analysis.

#### *2.6. Acoustic Outcome Measures*

Voice samples were edited in Audacity to extract the middle three seconds (s) of the sustained /a/ vowels, CAPEV3, and the second and third sentences of the Rainbow Passage (RP23). RP23 is a standard task in the Analysis of Dysphonia in Speech and Voice (ADSV) program [56], which was used for the acoustic analysis in the present study. The use of RP23 would allow for cepstral measures to be comparable with the previous studies that used this task [57]. The quality of audio recordings for all samples was checked using the signal-to-noise ratio (SNR) using a Praat script called 'Speech-to-noise ratio/voice-to-noise ratio v.01.01' [58]. Only samples with an SNR ≥ 30 dB were used for the acoustic analyses [59].
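The 30 dB screening rule can be sketched as follows. This is not the cited Praat script; it assumes that a speech segment and a background-noise segment have already been isolated from the same recording, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sample sequence."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB from the RMS amplitudes of two segments."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


def passes_snr_check(speech, noise, threshold_db=30.0):
    """Apply the inclusion rule quoted in the text (SNR >= 30 dB)."""
    return snr_db(speech, noise) >= threshold_db


# Hypothetical segments: speech at amplitude 1.0, background at 0.01.
speech = [1.0, -1.0] * 100
noise = [0.01, -0.01] * 100
print(snr_db(speech, noise), passes_snr_check(speech, noise))
```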

#### 2.6.1. Harmonics-to-Noise Ratio (HNR)

HNR quantifies the level of noise in the voice signal, which is elevated in pathological voices [60]. It has been found that HNR is correlated with the perceptual assessment of hoarseness [60] and vocal clarity [61]. HNR has been an important and commonly used outcome measure of voice treatment [62,63]. Praat 6.1.40 [51] was used to measure HNR from the middle 3-s segments from three trials of vowel samples and the averaged result (in dB) was used for the statistical analysis.
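To make the measure concrete, the sketch below estimates HNR from the peak of the normalised autocorrelation, using the relation HNR = 10·log10(r / (1 − r)). It is a deliberately simplified illustration, not Praat's implementation (which additionally applies windowing, a window-autocorrelation correction, and interpolation); the pitch search range and the clamping constant are assumptions.

```python
import math


def normalized_autocorr(x, lag):
    """Normalised cross-product of the signal with itself shifted by `lag`."""
    a = x[:len(x) - lag]
    b = x[lag:]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den else 0.0


def hnr_db(x, sample_rate, f0_min=75.0, f0_max=500.0):
    """Simplified HNR estimate in dB (assumed search range 75-500 Hz)."""
    lag_lo = int(sample_rate / f0_max)
    lag_hi = int(sample_rate / f0_min)
    # r is the fraction of signal energy that repeats at the pitch period.
    r = max(normalized_autocorr(x, lag) for lag in range(lag_lo, lag_hi + 1))
    # Clamp to keep the logarithm finite for (near-)perfectly periodic input.
    r = min(max(r, 1e-6), 1.0 - 1e-6)
    return 10.0 * math.log10(r / (1.0 - r))


# Usage: a synthetic 200 Hz tone, 0.1 s long, sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 200 * i / sr) for i in range(800)]
print(hnr_db(tone, sr))
```

Adding aperiodic noise to the tone lowers the autocorrelation peak and therefore the HNR, which is the behaviour the measure relies on when a voice becomes noisier.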

#### 2.6.2. Fundamental Frequency (F0)

F0 remains one of the most important frequency-based measures that has been extensively used to reflect voice changes associated with different laryngeal configurations, e.g., vocal fold dimension [64] and vocal fold stiffness [65]. F0 was measured in Praat from CAPEV3 and the full Rainbow Passage. The standard deviation of F0 (F0SD), which represents vocal stability [66], was measured from the sustained vowel /a/. All voice data with severely aperiodic signals (signal types 3 and 4) [37] were excluded from the F0 and HNR measurements. F0 settings in Praat are presented in Appendix A.1.

#### 2.6.3. Cepstral Peak Prominence: Non-Smoothed (CPP) and Smoothed (CPPS)

A voice cepstrum is measured using a Fourier transform of the logarithm power spectrum [67]. A cepstral peak is identified within the dominant 'rahmonic' corresponding to the fundamental period from which the cepstral peak prominence (CPP) is calculated as the amplitude between the peak and the regression line directly below it [68]. A signal with a highly periodic waveform and a clear harmonic structure would have a higher cepstral peak than aperiodic signals [68]. CPP has been shown to have stronger weighted correlations with overall voice quality than any other acoustic measure [69]. It has also been considered a significant predictor of dysphonic severity [70].

The acoustic analysis program ADSV [56] was used to measure cepstral peak prominence (CPP) in dB for the vowel, CAPEV3, and RP23 vocal tasks. CPP settings in ADSV are presented in Appendix A.2. CPPS was measured in Praat using recommended settings [71,72], which are shown in Appendix A.3. Smoothing before calculating the cepstral peak can improve the accuracy of estimation [73]. In Praat, the smoothing of the cepstral measurement followed the procedures by Hillenbrand and Houde [73] using 20-ms (10 frame) time-smoothing windows and 1-ms (10-bin) quefrency smoothing [51]. The first step involves averaging cepstral values over time, while the second step involves cepstra being averaged across the quefrency [51]. Both CPP and CPPS were used to allow the data to be comparable to the other studies that used either of these measures. We also expected that CPPS would be more sensitive than CPP in detecting treatment outcome due to its smoothing algorithm.

#### 2.6.4. Cepstral/Spectral Index of Dysphonia

The Cepstral/Spectral Index of Dysphonia (CSID) reflects overall voice quality [57,74] and has been shown to have high sensitivity and specificity [57] in discriminating pathological aspects from normal voice quality [75]. CSID data were obtained automatically in ADSV for the vowel and CAPEV3 task, and were manually calculated for RP23 samples based on CPP, low/high spectral ratio (LH), and low/high spectral ratio standard deviation (SDLH) values measured in ADSV using the following formula [57]:

CSID of Rainbow Passage = 154.59 − 10.39 × CPP − 1.08 × LH − 3.71 × SDLH
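The manual calculation for the RP23 samples amounts to evaluating this linear formula for each recording. A minimal sketch follows; the function name and the input values are hypothetical and are not study data.

```python
def csid_rainbow(cpp, lh, sdlh):
    """CSID for the Rainbow Passage from CPP, L/H ratio, and L/H ratio SD.

    Coefficients are those of the formula quoted in the text [57].
    """
    return 154.59 - 10.39 * cpp - 1.08 * lh - 3.71 * sdlh


# Hypothetical ADSV readings for one sample (illustrative values only).
print(csid_rainbow(cpp=5.0, lh=30.0, sdlh=10.0))
```

All three coefficients are negative, so a higher CPP, L/H ratio, or L/H ratio SD yields a lower CSID.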

#### 2.6.5. Vocal Intensity

Vocal intensity was measured in Praat from the /a/ vowel, CAPEV3, and the whole Rainbow Passage. It was used to validate the cepstral measures as previous research has found CPP measures to be affected by vocal intensity: CPP would increase when vocal intensity was elevated [76].

#### 2.6.6. Reliability Analysis of Acoustic Measurements

Baseline acoustic data for 30 patients were reanalysed for two acoustic measures that involved the manual selection of the analysis samples (HNR of the vowel and F0 of CAPEV3). Results from the two analyses were compared using ICC statistics. The results showed that, for HNR, ICC values were 1 for both single measures and average measures (*p* < 0.001). For F0 of CAPEV3, ICC = 0.999 for single measures (*p* < 0.001) and ICC = 1 for average measures (*p* < 0.001). These results demonstrated excellent inter-rater reliability of the acoustic analysis. CPP, CPPS, and CSID measures were analysed using the entire edited vocal samples, which involved no manual selection of the waveform. Therefore, reliability analyses were deemed not necessary for those measures.

#### *2.7. Statistical Analyses*

Data were managed in Microsoft Excel [77] and analysed using IBM SPSS Statistics v.24.0 [53]. Descriptive statistics were used to describe cohort characteristics. Prior to the analyses, normal distribution of the data was examined using Kolmogorov–Smirnov tests [78]. For continuous variables, mean, standard deviation (SD), range, median, and the interquartile range were used. For categorical data, frequencies and percentages were used. Changes in outcome measures over the treatment period were analysed using a linear mixed model with patients representing random effects and time point (baseline and the four treatment technique points) representing fixed effects. Gender, diagnosis (MTVD primary vs. secondary), and treating clinicians also represented fixed effects. Interaction between time and the fixed factors was calculated to determine the impact of the factors on the treatment outcome. Significant fixed effects of time were further tested using pairwise comparison with the Sidak adjustment for *p* values. One-way repeated-measures analysis of variance (ANOVA) was used to examine the effects of each individual treatment ingredient on auditory–perceptual and acoustic outcome measures by comparing data between baseline and after each treatment. Effect size was calculated using partial eta squared (η<sup>2</sup>) with the values of 0.01, 0.1, and 0.25 indicating small, medium, and large effects, respectively [79].
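Two of the quantities named above have simple closed forms, sketched here under stated assumptions. Partial eta squared is written in terms of an ANOVA F statistic and its degrees of freedom, which is algebraically equivalent to SS_effect / (SS_effect + SS_error); the Sidak adjustment is shown in its standard single-step form for m comparisons. The function names and input values are hypothetical, and SPSS output was not reproduced here.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from F and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def label_effect(eta2):
    """Label an effect using the 0.01 / 0.1 / 0.25 cut-offs quoted in the text."""
    if eta2 >= 0.25:
        return "large"
    if eta2 >= 0.1:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"


def sidak_adjust(p, m):
    """Single-step Sidak-adjusted p value for m comparisons."""
    return 1.0 - (1.0 - p) ** m


# Hypothetical inputs: F(1, 20) = 4.0, and a raw p of 0.01 over 4 comparisons.
eta2 = partial_eta_squared(4.0, 1, 20)
print(eta2, label_effect(eta2), sidak_adjust(0.01, 4))
```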

Pearson's correlation coefficient (r) was used to assess the correlation of the number of therapy sessions and treatment duration with the treatment outcome, with r = 0.1, 0.3, and 0.5 indicating small, medium, and large effects, respectively [80]. Where there were multiple calculations, the Bonferroni adjustment was applied to the *p* value. In all statistical analyses, a significance of *p* < 0.05 was used.
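The correlation step and the Bonferroni adjustment can be sketched as below. The data are invented for illustration, the function names are hypothetical, and the Bonferroni adjustment is shown in its usual form (multiply the raw *p* value by the number of tests, capped at 1).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def label_correlation(r):
    """Label |r| using the 0.1 / 0.3 / 0.5 cut-offs quoted in the text."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"


def bonferroni_adjust(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)


# Invented example: number of sessions vs. change in a severity rating.
sessions = [2, 3, 4, 5, 6, 8]
change = [1.0, 2.5, 2.0, 4.0, 3.5, 6.0]
r = pearson_r(sessions, change)
print(round(r, 3), label_correlation(r), bonferroni_adjust(0.02, 3))
```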

#### **3. Results**

#### *3.1. Characteristics of the Study Population*

In total, 68 participants were included in this study. Of these, there were 60 females (88.7%) with a mean age of 34.5 years (SD = 13.0, range = 20–84). There were eight males (11.3%) with mean age of 43.6 years (SD = 16.3, range = 25–70). In brief, 11 were vocal performers (16.2%), 49 were professional voice users (72.1%), and 8 belonged to other occupations (11.8%). Twenty-six had a history of vocal training (38.2%), 36 had not had voice training before (52.9%), and 6 did not provide information about voice training history (8.8%). Laryngeal assessment via ENT was reported to have been conducted on all 68 patients, which showed that 34 had primary MTD and 34 had MTD with mucosal lesions. The mean duration of voice problems was 19.2 months (SD = 26.5; 95% CI for mean = 12.5–25.9; minimum = 1.0; maximum = 132.0; median = 12.0; interquartile range = 18.0). The mean VHI-10 score was 17.8 (SD = 9.4; 95% CI = 15.5–20.1; minimum = 1; maximum = 38; median = 18.0; and interquartile range = 14.0). The study cohort was therefore considered typical of previously documented treatment-seeking populations with voice disorders reported in other studies [81,82]. Data on vocal load, history of comorbidities, and lifestyle are presented in Tables A1–A3 in Appendix B.

Figure 1 shows the number of patients who underwent all four components of Sob Voice Therapy. For all participants (*n* = 68), the OPT was taught as the initial therapy exercise/laryngeal posture. Sixty-four participants (94.1%) went on to be taught SVQ as their second voice therapy exercise. Three (4.7%) were taught SVQ in addition to an SVQ variant (i.e., sob phrases or sob sirens) simultaneously in their second appointment. Of the 61 patients who were taught the OPT followed by SVQ, 43 (70.5%) were then taught SVQ variants, with most of these participants (*n* = 33) first being taught SVQ phrases. Fourteen out of sixty-one (22.9%) did not attend any further sessions following the successive teaching of the OPT and SVQ. Following teaching of the OPT, SVQ, and SVQ variants, 55.8% (*n* = 24/43) of participants were then taught the generalisation technique of negative practice, with the remaining 19 participants being lost to follow-up or having incomplete data sets.

**Figure 1.** Flowchart of the treatment techniques.

#### *3.2. Treatment Effects of Sob Voice Therapy on MTVD*

#### 3.2.1. Auditory-Perceptual Outcomes

The changes in perceptual outcome measures over time were calculated using a linear mixed model. Patients were treated as random effects and treatment (i.e., baseline and the four technique points) as fixed effects. Diagnosis (primary MTD and secondary MTD) was also a fixed factor to examine the interaction with treatment. The estimate of the fixed effects was based on the regression coefficient (b) for each effect associated with its 95% CI and the *p* value. Changes of the outcome measures over time were evaluated using multiple pairwise testing in which the Sidak adjustment for *p* values was applied.

• Overall severity ratings

Figure 2 shows rating scores of the overall severity of dysphonia for all time points. The overall progression, as indicated by the trend line, was that the rating scores were lower towards the final technique point (NP) for both diagnostic groups. There were significant fixed effects of treatment [F(4, 170.706) = 12.142, *p* < 0.001]. There was no significant effect of diagnosis (*p* = 0.125) and no significant interaction between treatment and diagnosis (*p* = 0.431). Parameter estimates showed a significant decrease in the overall severity ratings at the final technique point (NP) compared to baseline (b = 5.603, t = 3.047, *p* = 0.003). Compared with baseline, the mean (95% CI, Sidak-adjusted *p*) of the overall severity rating score decreased by 3.2 (0.4–5.9, *p* = 0.013), 6.9 (3.7–10.2, *p* < 0.001), 5.4 (1.7–9.0, *p* < 0.001), and 7.2 (3.0–11.4, *p* < 0.001) after treatments with OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 2.** Longitudinal plot of data for the overall severity ratings. Trend line is shown for each subgroup. 0 = baseline, 1 = OPT, 2 = SVQ, 3 = SVQ variant, and 4 = NP.

• Roughness ratings

Figure 3 shows the changes of roughness rating scores over time with a steady decrease towards the end of the treatment program. The effects of the fixed factor 'treatment' on this outcome measure were significant [F(4, 171.467) = 10.082, *p* < 0.001]. The effect of diagnosis (*p* = 0.090) and interaction effects between treatment and diagnosis (*p* = 0.231) were not significant. Parameter estimates showed a significant decrease in the rating score of roughness after NP as compared to baseline (b = 4.842, t = 2.493, *p* = 0.014). The mean (95% CI, Sidak-adjusted *p*) of the roughness rating scores decreased by 3.5 (0.6–6.4, *p* = 0.007), 5.7 (2.3–9.2, *p* < 0.001), 6.4 (2.5–10.2, *p* < 0.001), and 7.3 (2.9–11.7, *p* < 0.001) after OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 3.** Longitudinal plot of data for the roughness ratings. Trend line is shown for each subgroup. 0 = baseline, 1 = OPT, 2 = SVQ, 3 = SVQ variant, and 4 = NP.

• Breathiness ratings

Changes in the breathiness rating scores over the treatment period are presented in Figure 4, which shows a similar trend of decrease across the treatment techniques. There were significant effects of treatment [F(4, 170.294) = 5.482, *p* < 0.001], no significant effect of diagnosis (*p* = 0.102), and no significant interaction between treatment and diagnosis (*p* = 0.715). The decrease in breathiness rating scores after NP was significant as compared with baseline (b = 4.27, t = 2.13, *p* = 0.035). The mean (95% CI, Sidak-adjusted *p*) ratings of breathiness decreased by 2.1 (−0.9–5.1, *p* = 0.367), 4.7 (1.2–8.2, *p* = 0.002), 4.1 (0.1–8.0, *p* = 0.040), and 5.8 (1.3–10.3, *p* = 0.004) after OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 4.** Longitudinal plot of data for the breathiness ratings. Trend line is shown for each subgroup. 0 = baseline, 1 = OPT, 2 = SVQ, 3 = SVQ variant, and 4 = NP.

• Strain ratings

Figure 5 shows the changes in the rating scores for strain quality after each technique. Overall, rating scores of this voice dimension decreased over the technique points. The trajectory of the trend lines shows that the rating scores for primary MTD decreased immediately at OPT while the decrease was not so obvious for MTD with lesions. There were significant effects of the fixed factors 'treatment' [F(4, 171.739) = 9.743, *p* < 0.001] and 'diagnosis' [F(1, 73.367) = 5.033, *p* = 0.028], and marginally significant interaction between treatment and diagnosis [F(4, 171.739) = 2.422, *p* = 0.05]. There was a significant improvement in this voice quality after the last time point (NP) as compared to baseline (b = 5.01, t = 2.643, *p* = 0.009). There were decreases in the mean (95% CI, Sidak-adjusted *p*) of 3.8 (0.9–6.6, *p* = 0.002), 3.7 (0.3–7.0, *p* = 0.021), 6.6 (2.9–10.4, *p* < 0.001), and 7.4 (3.1–11.7, *p* < 0.001) after OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 5.** Longitudinal plot of data for the strain ratings. Trend line is shown for each subgroup. 0 = baseline, 1 = OPT, 2 = SVQ, 3 = SVQ variant, and 4 = NP.

#### 3.2.2. Acoustic Outcomes

• Harmonics-to-noise Ratio

Figure 6 shows the mean HNR (dB) at baseline and at all the treatment time points. Significant effects of the treatment were found [F(4, 168.921) = 3.672, *p* = 0.007], while no significant interaction between treatment and diagnosis was present (*p* = 0.327), meaning that the effects of the treatment did not depend upon MTD type (primary or with mucosal lesions). The improvement in HNR between baseline and NP was significant (b = −2.82, t = −2.470, *p* = 0.014). The mean (95% CI, Sidak-adjusted *p*) of HNR (dB) increased by 1.6 (−0.2–3.3, *p* = 0.099), 1.7 (−0.3–3.7, *p* = 0.141), 2.5 (0.2–4.8, *p* = 0.022), and 2.4 (−0.3–5.2, *p* = 0.11) after OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 6.** Mean harmonics-to-noise ratio. Error bars indicate 95% CI for the mean.

• Fundamental frequency

Table 5 presents F0 data at baseline for all three vocal tasks. For F0 of CAPEV3, there were no significant fixed effects of treatment (*p* = 0.585) and no significant interaction between treatment and diagnosis (*p* = 0.358). There were also no significant effects of treatment (*p* = 0.276) and no significant interaction between treatment and diagnosis (*p* = 0.523) for the F0 of the Rainbow Passage.


**Table 5.** Fundamental frequency data (Hz) of the cohort at baseline (*n* = 68).

F0SD (vowel) also showed no significant effects of treatment (*p* = 0.716) and no significant interaction between treatment and diagnosis (*p* = 0.111).

• CPP

Figure 7 shows CPP data for all three vocal tasks. A significant effect of treatment was found for the CPP of CAPEV3 [F(4, 168.369) = 4.721, *p* = 0.001] but there was no significant interaction between treatment and diagnosis (*p* = 0.737). The improvement of CPP at NP as compared to baseline was significant (b = −0.915, t = −2.726, *p* = 0.007). Compared to baseline, the CPP of CAPEV3 only improved by 0.5 dB after OPT (95% CI = −0.04–0.96, *p* = 0.088). After SVQ, the SVQ variants, and NP, the mean (95% CI, Sidak-adjusted *p*) of this measure (in dB) increased by 0.62 (0.04–1.21, *p* = 0.03), 0.8 (0.2–1.5, *p* = 0.006), and 0.8 (0.05–1.5, *p* = 0.03), respectively.

**Figure 7.** Cepstral peak prominence (CPP) of all vocal tasks. Error bars indicate 95% CI for the mean. Abbreviations: CAPEV3, third CAPEV phrase and RP, Rainbow Passage.

There was a marginal fixed effect of treatment on the CPP of the Rainbow Passage [F(4, 171.130) = 2.312, *p* = 0.06]. Significant improvement in this measure was found after NP as compared to baseline (b = −0.442, t = −2.137, *p* = 0.034). Pairwise comparisons with baseline showed that the mean (95% CI, Sidak-adjusted *p*) of the CPP (dB) of the Rainbow Passage increased by 0.14 (−0.17–0.45, *p* = 0.912), 0.17 (−0.2–0.54, *p* = 0.879), 0.31 (−0.1–0.72, *p* = 0.284), and 0.45 (−0.02–0.92, *p* = 0.069) after OPT, SVQ, the SVQ variants, and NP, respectively.

There was no significant fixed effect of treatment (*p* = 0.849) and no significant interaction between treatment and diagnosis (*p* = 0.227) on the CPP of the vowel.

• CPPS

Figure 8 shows CPPS data for all treatment time points. The CPPS of CAPEV3 demonstrated a steady increase from OPT towards the final technique (NP). There was a significant effect of treatment on this measure [F(4, 171.649) = 14.921, *p* < 0.001] but there was no significant interaction between treatment and diagnosis (*p* = 0.673), i.e., the changes in this measure over time were similar between primary MTD and MTD with lesions. The increase in the CPPS of CAPEV3 at NP as compared to baseline was significant (b = −1.985, t = −4.286, *p* < 0.001). The mean (95% CI, Sidak-adjusted *p*) of this measure (in dB) increased by 1.02 (0.33–1.7, *p* < 0.001), 1.43 (0.63–2.23, *p* < 0.001), 2.06 (1.17–2.95, *p* < 0.001), and 1.91 (0.88–2.94, *p* < 0.001) after OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 8.** Smoothed cepstral peak prominence of all vocal tasks. Error bars indicate 95% CI for the mean. Abbreviations: CAPEV3, third CAPEV phrase and RP, Rainbow Passage.

The CPPS of vowels and RP23 are also shown in Figure 8. No significant effects of treatment were found for the CPPS of the vowel (*p* = 0.819) and RP23 (*p* = 0.156).

• CSID

Figure 9 shows the CSID of all tasks. There was a significant effect of treatment on the CSID of the Rainbow Passage [F(4, 170.887) = 2.859, *p* = 0.025]. No significant interaction effect between treatment and diagnosis was found (*p* = 0.161). The model showed a significant decrease in CSID after NP as compared with baseline (b = 6.04, t = 2.327, *p* = 0.021). Pairwise comparisons across time points showed that the mean (95% CI, Sidak-adjusted *p*) of CSID decreased by 2.82 (−1.07–6.7, *p* = 0.344), 2.51 (−2.11–7.13, *p* = 0.736), 4.33 (−0.8–9.47, *p* = 0.164), and 6.03 (0.16–11.89, *p* = 0.04) after OPT, SVQ, the SVQ variants, and NP, respectively.

**Figure 9.** Mean CSID of all vocal tasks. Error bars indicate 95% CI for the mean. Abbreviations: CAPEV3, third CAPEV phrase and RP, Rainbow Passage.

The effects of the treatment for the CSID of the vowel (*p* = 0.683) and CAPEV3 (*p* = 0.935) were not statistically significant (*p* > 0.05).

• Vocal intensity

There were no significant fixed effects of treatment on the intensity of the vowel (*p* = 0.557), CAPEV3 (*p* = 0.357), and Rainbow Passage (*p* = 0.777).

#### *3.3. Estimates of Active Ingredients within the Sob Voice Therapy Program*

Apart from evaluating the treatment outcome of the whole Sob Voice Therapy program, we were also interested in estimating the effects of each of the individual therapy components (OPT, SVQ, the SVQ variants, and NP). This was evaluated via effect sizes, which were calculated as partial eta squared (η<sup>2</sup>) using one-way repeated-measures ANOVA for the differences in the outcome measures between baseline and after each technique point. This calculation was performed for auditory–perceptual and acoustic measures with statistically significant fixed effects of treatment. The data set for this calculation was *n* = 24 patients who had completed voice recordings at all mentioned time points. Patients with any missing data points were excluded from this analysis.

#### 3.3.1. Effect Size for Auditory–Perceptual Outcomes

Table 6 shows the mean (SD) and mean differences between baseline and each of the voice therapy techniques for all auditory–perceptual parameters. This table also presents the effect sizes corresponding to the results for the repeated-measures ANOVA. Overall, findings for auditory–perceptual ratings of overall severity, roughness, and breathiness showed that SVQ, the SVQ variants, and NP were active ingredients with large effect sizes. OPT did not demonstrate therapeutic effects. For strain ratings, only the SVQ variants and NP were the active ingredients.

**Table 6.** Auditory–perceptual outcomes after four stages of Sob Voice Therapy. Partial η<sup>2</sup> = 0.01, 0.1, and 0.25 indicate small, medium, and large effects, respectively. Abbreviation: MD, mean difference; (\*), significance at *p* < 0.05.




#### 3.3.2. Effect Size for Acoustic Outcomes

Table 7 shows effect sizes associated with the outputs of the repeated-measures ANOVA for the changes in acoustic measures after each voice therapy ingredient as compared with baseline. Findings on the CPPS of CAPEV3 showed that SVQ, the SVQ variants, and NP were the active ingredients, and the last two ingredients (the SVQ variants and NP) were associated with large effect sizes. Data for the CPP of CAPEV3 and the CSID of the Rainbow Passage suggested that NP was an active ingredient.

**Table 7.** Acoustic outcomes after four stages of Sob Voice Therapy. Partial η<sup>2</sup> = 0.01, 0.1, and 0.25 indicate small, medium, and large effects, respectively. Abbreviations: MD, mean difference and NA, not available; (\*), significance at *p* < 0.05.




Other acoustic measures did not show significant changes after the treatment techniques as compared with baseline. The effect sizes for acoustic measures with nonsignificant fixed effects of treatment are shown in Table A4 in Appendix C.

#### *3.4. Impact of Service Delivery Factors on the Treatment Outcome*

#### 3.4.1. Number of Sessions and Duration of Sob Voice Therapy

Bivariate correlation coefficients were calculated to examine the relationship between the treatment dose and the differences in the outcome measure values for each technique. For example, for OPT, the differences between baseline and post-OPT data were calculated, which were then used to calculate the correlation with the number of sessions and treatment duration. For OPT, there was no significant correlation between the number of therapy sessions, duration of voice therapy (weeks), and any of the pre/post differences in the auditory–perceptual and acoustic measures (*p* > 0.05). For SVQ and SVQ variants, there was no significant correlation between the number of sessions, duration of voice therapy, and the pre/post differences in the auditory–perceptual and acoustic outcome measures (*p* > 0.05). For NP, there were correlations between the number of sessions and the pre/post differences in both the roughness ratings (r = −0.49, *p* = 0.024) and strain ratings (r = −0.49, *p* = 0.024). After Bonferroni's adjustment for multiple correlation calculations, a significant *p* value would be 0.0035. Therefore, these were deemed not statistically significant.
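The Bonferroni reasoning above can be sketched in a few lines. The number of correlations tested is not stated in the text, so the value of 14 used here is purely an illustrative assumption, chosen because 0.05/14 ≈ 0.0036 lies close to the reported threshold of 0.0035. The function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni adjustment."""
    if n_tests < 1:
        raise ValueError("n_tests must be >= 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if p_value falls below the Bonferroni-adjusted threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Assumption: 14 correlations were tested (not stated in the text).
N_TESTS = 14

print(round(bonferroni_threshold(0.05, N_TESTS), 4))
# p = 0.024 is the value reported for the NP roughness and strain correlations.
print(is_significant(0.024, 0.05, N_TESTS))
```

Under this assumption, the reported *p* = 0.024 exceeds the adjusted threshold, in line with the conclusion that these correlations were not statistically significant.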

#### 3.4.2. Clinician Effects

Due to the involvement of six SLPs in the treatment process across patients, the effects of the treating clinicians were examined using a factorial two-way ANOVA test [clinician × treatment] with repeated measures on 'treatment' (baseline, OPT, SVQ, SVQ variants, and NP). The 'clinician × treatment' interaction effect was then tested. The results showed that there were no significant interaction effects between clinicians and the treatment for any perceptual or acoustic variable (*p* > 0.05). This suggested that all clinicians contributed the same amount of variance in the treatment outcome over time.

#### *3.5. Drop-Out Rate*

Ten out of 68 participants (14.7%) did not attend further therapy following their second appointment. Twelve participants (17.6%) did not attend further sessions after their third appointment. Eleven participants (16.2%) did not return to therapy following their fourth session.

#### **4. Discussion**

Voice therapy is a major therapeutic intervention that can be delivered as a stand-alone treatment or in combination with medical and/or surgical treatment. Early and effective voice therapy outcomes can prevent more complicated pathologies within the larynx that require costly treatment regimes. The purpose of this study was to retrospectively review clinical data from an SLP voice database to investigate the clinical outcomes of four components of a standardised voice therapy program (Sob Voice Therapy) and to provide preliminary data on the effects of its 'active ingredients'. Statistical analyses involved the use of a linear mixed model, which allowed for the robust estimation of the treatment effects, given that patients were treated as random effects [88]. Patient factors such as history of comorbidities, voice use, and previous training were therefore considered random and were not specifically analysed. Treatment outcomes were evaluated using CAPE-V auditory–perceptual analysis, which is the "gold standard" of voice evaluation, and acoustic analysis including spectral-based measures (CPP/CPPS and CSID), which is an objective, non-invasive, and reliable evaluation with great sensitivity and specificity to voice changes [57,69,89]. These were believed to accurately reflect the treatment effects of the Sob Voice Therapy. Treatment sessions and timeframes were comparable to averages reported in the literature [13].

#### *4.1. Treatment Effects of Sob Voice Therapy on Patients with MTVD*

The first aim in the present study was to evaluate the treatment effects of SVT on MTVD. The study population consisted of typical treatment-seeking patients with primary MTVD (without obvious vocal fold mucosal lesions) or secondary MTVD (with mild mucosal changes deemed related to vocal hyperfunction, such as pre-nodular swellings and mucosal thickening) as these are the most common voice disorder types, representing approximately 40% of the case load in voice clinics [90]. The findings showed significant treatment effects in all auditory–perceptual measures for the whole treatment when compared to pre-treatment levels. There was a significant positive effect of SVT as measured by the decreased auditory–perceptual ratings of overall severity, roughness, breathiness, and strain between baseline and NP. Significant effects of treatment were also observed for acoustic measures such as HNR (vowel), CPP (CAPEV3 and Rainbow Passage), CPPS (CAPEV3), and CSID (Rainbow Passage). Notably, the HNR (vowel) value post-treatment is likely to have been judged perceptually clear compared to being not clear prior to treatment, based on [61]. However, no significant changes were found for F0, F0SD, and intensity (*p* > 0.05). These findings suggested that this voice therapy program was more effective in improving voice quality than in modifying pitch and loudness. The non-significant effects on F0SD also stemmed from the finding that the values of this measure were within normal ranges for both genders (Table 5).

For both auditory–perceptual and acoustic measures, the treatment effects did not depend upon the MTVD type, whether primary or secondary. The significant effects of diagnosis observed for the auditory–perceptual ratings of breathiness and strain accurately reflected the MTVD type, with primary MTVD showing lower rating scores than secondary MTVD. This is expected with persistent associated laryngeal pathology that may affect voice quality.

Baseline values across outcome measures were indicative of predominantly mild MTVD in the cohort. For example, the mean auditory–perceptual rating score ranged from 18.7 for strain to 26.8 for overall severity (Table 6). Mean acoustic measure values were only marginally below cut-off values for voice disorder for CPP, while CSID values at baseline were within normative ranges (Table 7). The effects of the SVT on patients with more severe MTVD and on patients with predominantly mucosal lesions remain unclear and would need future studies to investigate if the same therapy components are 'active ingredients' in this cohort; signal typing as an outcome measure would be recommended in that case. Home practice dosage and frequency data were not collected, which precluded the analysis of home practice as an active ingredient. This study also lacked long-term follow-up, which limits inferences about the maintenance/sustainability of the outcome for this voice disorder. This study did not directly measure specific muscle-tension parameters or provide patient-reported outcome measures as outcome data, and not all participants were diagnosed by examination using videostrobolaryngoscopy. Prospective designs would address these issues.

#### *4.2. Active Ingredients of the Sob Voice Therapy Program*

Each technique within the SVT has a specific role. In OPT and SVQ, patients practised different techniques that targeted a clear and effortless voice. In the SVQ variants and NP, patients practised specific exercises for generalising a clear and effortless voice to connected speech with intonation variation. We hypothesised that treatment effects in habitual voice quality would be observed after the SVQ variants and NP were introduced, that is, after the patient had practised exercises designed to facilitate generalisation of improved vocal function to habitual, connected speech contexts. The findings revealed that the SVQ, SVQ variants, and NP were the most active ingredients with small to medium effect sizes across the auditory–perceptual and acoustic measures of voice quality.

#### 4.2.1. Effects of OPT

As hypothesised, the findings showed that OPT was not a statistically significant active ingredient for changing voice quality in the habitual phonation of the cohort, despite resulting in improved voice outcome measures after this component was introduced. Auditory–perceptual outcome measures (Table 6) and acoustic measures, except the CSID of CAPEV3 (Table 7), demonstrated that the effects of OPT were not significant. The data on OPT may be explained by a range of factors. Firstly, the task is taught at the end of the initial assessment session with the purpose of raising perceptual awareness of the auditory–perceptual and kinaesthetic features of the voice, as well as providing cues to prime improved laryngeal function. The sound produced, however, is brief (less than 2 s as modelled) and may not be sufficient for the generalisation of improved vocal function in habitual connected speech. As it is described, it is the 'sound we make when we say yes', ergo, it is cueing a habitual phonatory task, while cueing only subtle muscular or physiological improvements in phonation. The use of features that prime improved vocal function, including a semi-occluded vocal tract [29], voice onset at resting expiratory level [91], and cueing for a clear and effortless voice [19], may not be sufficient in this technique as gross changes in voice quality and increased activation of muscles not usually activated in habitual phonation (e.g., low larynx and cricothyroid activation) are not cued. These features are, however, repeated in SVQ, in which increased muscle activation and re-posturing of the larynx are also cued.

The finding of improved voice quality measures after OPT (/m/) and SVQ (/ŋ/) were taught and practised as single sounds was unexpected, as these tasks are individual sounds designed to assist the patient to re-posture the larynx for more optimal phonation, which is acquired (or re-acquired) as a new voice motor skill. They were not trained in connected speech and were not habitual speech task targets, and as such were not expected to generalise to habitual speaking after having just acquired the task (and met the target in a single sound). Consideration of these two techniques as active ingredients is therefore warranted. It is important to note that the effect size was calculated with *n* = 24, a rather small sample size. Significant findings in the CSID of the CAPEV3 phrase may be due to the increased sensitivity of CSID as a measure of voice quality. Therefore, the findings on OPT need further investigation in future studies.

#### 4.2.2. Effects of SVQ

The significant effect of SVQ (as measured in auditory–perceptual ratings and the CPPS of CAPEV3) on the habitual speaking voice of patients after practising the SVQ in isolation was not predicted, given that the task itself was to acquire (not immediately generalise) the desired laryngeal adjustments of the technique and practise in preparation for the next exercise, which was task variation using SVQ. That improved voice quality in habitual speech was observed after the practising of an isolated sound suggests that the postural adjustments cued by the SVQ are possibly primary muscular movements of optimal phonation that could be considered active ingredients in themselves. Alternatively, the likely increased activation of both muscular and neural systems may also be implicated.

SVQ requires the production of a clear, quiet, and effortless 'ng' sound, descending as if imitating a puppy whimper, to refine control of the optimal posture for phonation [30]. First described as 'light' registration by Vennard [92] and defined as "Falsetto break, expressive of grief" (p. 251) and "whine: Prolonged nasal or twangy sound, usually light in production, on descending portamento, expressing pain or disappointment" (p. 251), SVQ has subsequently been investigated as a voice quality mode named 'cry', compared to three other voice quality modes (speech, twang, and opera) [93]. Biomechanical and postural features observed in cry include low larynx position, increased space between the hyoid and thyroid, pharyngeal/supraglottic widening, increased aryepiglottic space, elongation of vocal folds, arytenoids not being tightly adducted, gentle and brief vocal fold closure, and possible increased activity of the cricothyroid and posterior crico-arytenoid [93]. Nearly all of these muscular parameters have been implicated in MTVD, including a raised larynx position, narrow supraglottic region, hyperadduction of the true vocal folds [94], and decreased hyoid/thyroid 'visor' [95].

This physiological description of SVQ suggests that all three biomechanical dimensions of the larynx are manipulated concurrently (medio-laterally, antero-posteriorly, and infero-superiorly) to correct the common biomechanical features of MTVD, with the added element of possibly activating the secondary neurological vocal pathway responsible for emotional vocalisation, as described by Simonyan [96]. Auditory–perceptual and kinaesthetic training is provided and encouraged in practice to strengthen the links between perception and production in the vocal system [97,98]. More efficient learning and re-organisation of motor movements has been demonstrated in other domains to require maximal tolerable task complexity [99,100] and the ability to recognise the target so that an internal reference of correctness is established for effective practice [41]. SVQ is a complex muscular task, the sound of which does not resemble habitual phonation (often a criticism made by patients) but is perceptually recognisable and distinct from habitual phonation. This may promote increased recognition of the target (clear and effortless voicing) more readily than voicing in habitual conversational speech, in which the suboptimal phonation automatically occurs, assisting in generalisation.

#### 4.2.3. Effects of SVQ Task Variation

Task variation of carrier phrases and sirening in SVQ was used in this treatment to generalise the features of clear voice quality and the perceptions of effortless phonation to contexts other than /ŋ/. Results confirmed, as hypothesised, that task variation was effective in improving habitual voice quality across auditory–perceptual and acoustic analysis outcome measures. This was hypothesised based on a large body of previous research in voice therapy and motor learning, as task variation is considered essential in the learning, generalisation, and maintenance of all motor skills [101,102], despite the use of SVQ in the task. Task variation using connected speech tasks such as phrases and conversational speech is common across voice therapy approaches [25]. The explicit vocal target, use of connected speech contexts with a communicative intent, and practice regimes of SVQ are similar to other voice therapies, e.g., CTT (clear speech), but the physiological mechanism by which it is achieved is markedly different. This suggests that the mechanism of action [18] as a concept could be expanded to include the physiological description of movement as well as the acquisition and learning processes.

#### 4.2.4. Effects of Negative Practice

The NP component of SVT was highly effective across outcome measures based on results from both the mixed model and the ANOVA analysis of effect size. This was observed in auditory–perceptual outcomes in patients with primary and secondary MTVD, and across the whole cohort in acoustic measures. Negative practice (also called old way/new way) is thought to be a form of proactive interference that promotes forgetting of the old movement [103] and is commonly used in SLP and voice therapy [25,104,105]. The plateau in outcome from SVQ variants and NP may be explained by the function of NP to maintain the improvements resulting from SVQ variants, which may have resulted in a reduction in performance in the short term in some cases. As NP reintroduces the 'old' pre-treatment movement pattern, it is also possible that the performance parameters of the 'new way' are temporarily shifted until a clear differentiation between the generalised motor programs for the two voice modes is well established. It is therefore conceivable that one session of NP with subsequent practice may have temporarily destabilised consistent access to the improved technique, resulting in a temporary reduction in voice quality. As NP is designed to assist with generalisation and maintenance of a newly acquired skill and to extinguish access to the old suboptimal movement, an improvement in voice quality may not occur but rather a stabilisation of improvement may be more likely, as was observed in this study. Analysis of subsequent sessions is required to evaluate if habitual voice quality returned to post-SVQ levels and was retained in the long term.

#### *4.3. Effect of Diagnosis and Service Delivery*

Diagnosis of primary or secondary MTVD had a significant effect on auditory–perceptual voice ratings of strain only and was consistent over the four stages of the treatment. The clinical population in this study was typical of other MTVD cohorts reported in the literature, with retention rates also similar to other studies in which therapy was provided at no charge. There is significant evidence across RCTs and clinical studies that the retention of clients in voice therapy is generally poor [106]. Although the consequence of this is undocumented, high attrition runs the risk of ineffective treatment outcomes if the session dosage for the therapeutic effect is insufficient. In this study, positive therapeutic effects were observed across multiple voice outcomes within one to two sessions of 60-min durations with minimum durations of 1–2 weeks. If positive effects can be measured and demonstrated to patients within these short time frames, it is hoped that this would reduce attrition and increase compliance with further therapy recommendations.

Researchers have speculated that clinicians can have a therapeutic effect independent of the treatment type [107]. This is the first study to evaluate whether therapy delivered by multiple clinicians has a significant effect on voice outcomes. In this study, neither clinician, length of time, nor number of sessions had a significant effect on efficacy. This suggests that the active ingredients and overall efficacy of SVT are independent of the clinician, number of sessions, and length of treatment.

#### *4.4. Comparison with Other Voice Therapy Outcomes Research*

Comparisons of effects found in this study with other treatments for patients with MTVD are difficult to make given the large range of outcome measures and different statistical analyses used across studies [11]. Numerous RCTs and prospective studies report a reduction in auditory–perceptual rating scores and improved acoustic analysis measures of voice quality including HNR, CPP, and CSID. Two retrospective cohort studies were found investigating the efficacy of VFE on patients with age-related dysphonia [108,109], only one of which documented the therapeutic outcome on voice quality auditory–perceptual and acoustic measures [109]. Small to medium effect sizes using Hedge's 'h' were reported across a number of prospective and retrospective studies for improvements in voice quality outcome measures (shimmer and jitter only) after therapy, utilising VFE in patients with voice disorders [110]. Only one voice therapy treatment study reported using a mixed-model statistical analysis to measure voice outcomes across multiple time points in a prospective study of CTT with patients with mild MTVD who were stimulable for CTT [106]. This study reported significant effects of 4 weekly sessions, conducted no more than 10 days apart, using CTT. Five outcome measures were comparable with our study, including auditory–perceptual ratings using the CAPE-V, mean F0, CPP and CSID of a prolonged vowel, and CSID on the third CAPE-V phrase (amongst other outcome measures). Increases in mean F0 and reductions in CAPE-V ratings of the six CAPE-V phrases were reported. Effect sizes for significant effects were not reported, however. Baseline measures of the cohort in the CTT study were similar for mean F0 and the CPP vowel; however, the CSID of the vowel and the third CAPE-V phrase were lower in our study. Significant improvements in habitual phonation as measured by acoustic voice analyses (CPP and CSID) were measured during and 1 week after the CTT therapy, but were not retained at 3 months. While the average number of sessions, average time between sessions, and practice recommendations were similar between the two studies, the retrospective nature of our study and the use of multiple clinicians meant that there was less control of the treatment variables, as occurs in real-life clinical contexts.

We used both CPP (measured from ADSV) and CPPS (measured from Praat) to ensure that researchers can compare their data with the present study depending upon which software is available to them. Although ADSV is a commercial specialised software package for clinical application, it is not accessible/available to many users, especially non-clinicians, while Praat is freeware. The discrepancy between the CPP and CPPS results for the CAPEV3 task (Table 7) probably resulted from the slight differences in the algorithms between these two programs rather than from the effects of the data distribution. CPPS showed more significant effects of treatment as the smoothing is believed to improve the cepstral estimation accuracy [73]; therefore, it would be more likely to detect finer changes in the voices given the mild dysphonic severity of the study cohort.

#### **5. Conclusions**

SVT was effective in reducing the signs and symptoms of mild MTVD in a typical treatment-seeking cohort, as measured by auditory–perceptual and acoustic voice outcomes. Three out of four individual components of the therapy program demonstrated statistically significant positive therapeutic effects, independent of the session number, duration of therapy, and clinician. This provides preliminary evidence that the SVQ technique and both the SVQ task variation and NP can be considered as active ingredients in the treatment of patients with MTVD.

**Author Contributions:** Conceptualisation, C.M.; methodology, C.M. and D.D.N.; formal analysis, D.D.N.; investigation, C.M.; data curation, A.C. and E.K.; writing—original draft preparation, C.M. and D.D.N.; writing—review and editing, C.M., D.D.N., A.C., E.K. and D.N.; project administration, C.M.; visualisation, D.N.; funding acquisition, C.M. and D.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Dr. Liang Voice Program at the University of Sydney.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and was approved by the Human Research Ethics Committee of the University of Sydney, protocol number: 2019/529.

**Informed Consent Statement:** Patient consent was waived due to it being impractical to seek consent for patients seen in the past and it was considered a threat to patient privacy to implement a process to locate and contact each individual participant to seek their consent. This waiver was approved by the Human Research Ethics Committee and the approval is provided above.

**Data Availability Statement:** Data supporting the reported results are retained by the University of Sydney in a deidentified form and are confidential under the conditions of the Human Research Ethics Committee of the University of Sydney approval.

**Conflicts of Interest:** The first author, C.M., is the director and sole shareholder of Voicecraft International Pty Ltd., the legal entity retaining ownership of the intellectual property of Voicecraft®. C.M., D.D.N., D.N., A.C. and E.K. are employees of the University of Sydney and are partly or fully funded by the Dr. Liang Voice Program, a philanthropically funded program of research and post-graduate education in laryngology. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Abbreviations**

MD, mean difference; CAPEV3, the third CAPEV phrase; RP, Rainbow Passage; RP23, second and third sentences of the Rainbow Passage; CPP, cepstral peak prominence; CPPS, cepstral peak prominence smoothed; and CSID, Cepstral/Spectral Index of Dysphonia.

#### **Appendix A. Settings for Acoustic Measurements**

*Appendix A.1. Settings for the Fundamental Frequency Measurement in Praat*

F0 range = 75–500 Hz, cross-correlation method, maximum number of candidates = 15, silence threshold = 0.03, voicing threshold = 0.45, octave cost = 0.01, octave-jump cost = 0.35, and voiced/unvoiced cost = 0.14. The check box of "very accurate" was checked.

#### *Appendix A.2. Settings for the CPP Measurement in the Analysis of Dysphonia in Speech and Voice (ADSV)*

Resampling rate = 25 kHz, spectral window size (pts) = 1024, maximum frequency for regression line calculation = 10,000, frame overlap = 75%, cepstral time-averaging (frames) = 7, CPP threshold (dB) = 0, and cepstral peak extraction range minimum–maximum = 60–300 Hz. Low/high spectral ratio cut-off = 4000 Hz.

#### *Appendix A.3. Settings for the CPPS Measurement in Praat*

Pitch floor (Hz) = 60, time steps (s) = 0.002, maximum frequency (Hz) = 5000, preemphasis from (Hz) = 50, time-averaging window (s) = 0.01, quefrency-averaging window (s) = 0.001, peak search pitch range (Hz) = 60–330, tolerance (0–1) = 0.05, interpolation = parabolic, subtract tilt before smoothing = no, tilt line quefrency range (s) = 0.001–0.0 (=end), line type = straight, and fit method = robust.
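For readers who script Praat from Python, the settings above could be passed to Praat roughly as follows. This is a sketch under stated assumptions, not the procedure used in the study. It assumes the third-party `praat-parselmouth` package and assumes that the settings map onto Praat's `To PowerCepstrogram` and `Get CPPS` commands in the order shown. The function name `measure_cpps` and the file path are hypothetical.

```python
# Settings transcribed from Appendix A.3. Splitting them across two Praat
# commands is an assumption about how they map onto Praat's dialogs.
CEPSTROGRAM_ARGS = (
    60,      # pitch floor (Hz)
    0.002,   # time step (s)
    5000,    # maximum frequency (Hz)
    50,      # pre-emphasis from (Hz)
)
CPPS_ARGS = (
    False,        # subtract tilt before smoothing = no
    0.01,         # time-averaging window (s)
    0.001,        # quefrency-averaging window (s)
    60, 330,      # peak search pitch range (Hz)
    0.05,         # tolerance (0-1)
    "Parabolic",  # interpolation
    0.001, 0.0,   # tilt line quefrency range (s); 0.0 means "end"
    "Straight",   # line type
    "Robust",     # fit method
)


def measure_cpps(wav_path: str) -> float:
    """Return CPPS (dB) for one recording using the settings above."""
    # Imported lazily so the settings can be inspected without the package.
    import parselmouth
    from parselmouth.praat import call

    sound = parselmouth.Sound(wav_path)
    cepstrogram = call(sound, "To PowerCepstrogram", *CEPSTROGRAM_ARGS)
    return call(cepstrogram, "Get CPPS", *CPPS_ARGS)
```

A call such as `measure_cpps("patient01_capev3.wav")` (a hypothetical file name) would then return a single CPPS value for that recording.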

#### **Appendix B**




**Table A2.** History of voice-related comorbidities.

**Table A3.** Lifestyle information.


#### **Appendix C**

**Table A4.** Changes in the acoustic outcomes before and after the four stages of Sob Voice Therapy. Partial η<sup>2</sup> = 0.01, 0.1, and 0.25 indicate small, medium, and large effects, respectively.




#### **References**


## *Article* **Effect of Progressive Head Extension Swallowing Exercise on Lingual Strength in the Elderly: A Randomized Controlled Trial**

**Jin-Woo Park \*, Chi-Hoon Oh, Bo-Un Choi, Ho-Jin Hong, Joong-Hee Park, Tae-Yeon Kim and Yong-Jin Cho**

Department of Physical Medicine and Rehabilitation, Dongguk University Ilsan Hospital, Goyang-si 10326, Gyeonggi-do, Korea; chejuoh@hanmail.net (C.-H.O.); moongirl33@naver.com (B.-U.C.); frischen@naver.com (H.-J.H.); s65271@hanmail.net (J.-H.P.); tinaccjj@naver.com (T.-Y.K.); pigboom@hanmail.net (Y.-J.C.)

**\*** Correspondence: jinwoo.park.md@gmail.com; Tel.: +82-31-961-7484

**Abstract:** Lingual strengthening training can improve the swallowing function in older adults, but the optimal method is unclear. We investigated the effects of a new progressive resistance exercise in the elderly by comparing it with a conventional isometric tongue strengthening exercise. Twenty-nine participants were divided into two groups randomly. One group performed a forceful swallow of 2 mL of water every 10 s for 20 min, for a total of 120 swallowing tasks per session, at 80% of the angle of maximum head extension. The other group performed five repetitions in 24 sets with a 30 s rest, and the target level was set at 80% of one repetition maximum using the Iowa Oral Performance Instrument (IOPI). A total of 12 sessions were carried out by both groups over a 4-week period. Blinded measurements (for maximum lingual isometric pressure and peak pressure during swallowing) were obtained using IOPI before exercise and at four weeks in both groups. After four weeks, both groups showed a significant improvement in lingual strength involving both isometric and swallowing tasks. However, there was no significant difference between the groups in strength increase involving both tasks. Regardless of the manner, tongue-strengthening exercises substantially improved lingual pressure in the elderly with equal effect.

**Keywords:** deglutition disorders; tongue; exercise; deglutition; ageing

#### **1. Introduction**

Presbyphagia refers to the characteristic alterations in the deglutition mechanism of healthy older adults [1]. Aging worsens the motor swallowing mechanism, which, in turn, leads to weakness of the tongue muscles [2]. This is significant because the tongue is the main source of propulsion in oropharyngeal swallowing [3], and abnormal tongue strength and coordination can decrease the safety and efficiency of swallowing [4,5].

Fortunately, tongue exercises can increase tongue strength and improve swallowing ability in older people. For example, resistive isometric exercises using an air bulb or pushing against the hard palate can improve tongue strength and swallowing function [6,7]. Real swallowing exercise can also improve tongue strength in the elderly [8]. However, the best method for increasing tongue strength is currently unclear.

Training methods based on the basic principles of exercise are considered the most effective [9]. Training specificity means that improvement in performance is most dramatic when the target movements closely coincide with the exercise. Applied to the tongue, this means that tongue strength is best improved by exercising during swallowing. According to the overload principle, exercise resistance should be gradually increased as the individual capabilities improve throughout the training. Exercises using an air bulb or tongue depressor [6,10,11] are resistive isometric exercises and appropriate for the overload principle but are not based on training specificity. Actual swallowing exercises such as effortful swallow [12] are based on training specificity, but they do not adhere to the overload principle because the exercise intensity cannot be adjusted.

**Citation:** Park, J.-W.; Oh, C.-H.; Choi, B.-U.; Hong, H.-J.; Park, J.-H.; Kim, T.-Y.; Cho, Y.-J. Effect of Progressive Head Extension Swallowing Exercise on Lingual Strength in the Elderly: A Randomized Controlled Trial. *J. Clin. Med.* **2021**, *10*, 3419. https://doi.org/10.3390/jcm10153419

Academic Editor: Renee Speyer

Received: 13 July 2021 Accepted: 29 July 2021 Published: 31 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

However, head extension swallowing exercises can increase lingual swallowing pressure and endurance in an older adult population [13]. Although the evidence comes from a single research group, involves a limited number of people, and has yet to be replicated elsewhere, the exercise offers the benefits of resistance exercise and can be performed easily, anytime and anywhere, without additional equipment. We reasoned that it might conform to both training specificity and the overload principle, and thus effectively improve tongue strength. We modified this exercise by adjusting the angle of head extension in order to control and increase the intensity of the exercise (progressive resistance exercise). We hypothesized that this new exercise would be effective in increasing tongue strength in older adults, and that it would be superior to the lingual elevation exercise. Therefore, in this study, we analyzed the effects of the new progressive resistance exercise in older adults and compared the results with those of a conventional isometric tongue-strengthening exercise.

#### **2. Materials and Methods**

#### *2.1. Participants*

Thirty-five healthy older volunteers were eligible for this study, which was conducted from August 2019 to February 2020. The inclusion criteria were: (1) healthy older people aged above 65 years without dysphagia, and (2) sufficient cognitive function to perform tongue-strengthening exercises (mini-mental status exam ≥ 26). The exclusion criteria were: (1) a history of odynophagia or dysphagia, (2) use of drugs that influence swallowing, and (3) a history of cervical spine disease that prohibits head extension. Before enrollment, all participants were examined by a doctor. This study adheres to CONSORT guidelines and was approved by the Institutional Review Board. Informed consent was obtained from each subject. Twenty-nine volunteers participated in this study, and the 26 participants who completed all 12 exercise sessions were included in the analysis (Figure 1). The other three participants dropped out after performing the exercise 2 to 3 times because they either had no time to visit the hospital or lived too far from it. The mean age of the study group was 72.9 ± 6.4 years, and the group comprised 5 males and 21 females. The general characteristics of these volunteers are shown in Table 1.


**Table 1.** General characteristics of participants in this study.

**Figure 1.** Flow diagram and exercise protocol.

#### *2.2. Experimental Protocol*

This study was designed as a randomized controlled study with a total duration of 4 weeks. Using a computer randomization program, the study participants were randomly allocated in a 1:1 ratio to one of two groups: the tongue progressive resistance exercise group (G1) or the tongue isometric exercise group (G2). The assessor and statistical analyst were unaware of the group assignment. Before strengthening training, we measured baseline data, including maximum lingual isometric pressure and peak pressure during swallowing, using the Iowa Oral Performance Instrument (IOPI) (model 2.1; IOPI Medical LLC, Carnation, WA, USA), a handheld tool that measures the pressure exerted on a small air-filled bulb [14]. Each strengthening program was then administered to the participants over 4 weeks, followed by reassessment of strength to evaluate the training effects of the tongue-strengthening exercise.

#### *2.3. Tongue Strengthening Training*

The G1 group performed an effortful swallow of 2 mL of water every 10 s for 20 min, for a total of 120 swallowing tasks per session, at 80% of the maximum head extension (MHE) angle. One session consisted of two 10 min exercise periods separated by a 5 min rest to avoid muscle fatigue. All participants were instructed to maintain the same posture by staring at one point during the swallowing attempts. The point was chosen on a grid on the wall 1 m away so that participants could look at it comfortably while maintaining the determined head extension angle. The G2 group performed lingual pressing against the IOPI bulb in 24 sets of five repetitions, with 30 s of rest between sets, for a total of 120 lingual pressing tasks per session; the target level was set at 80% of one repetition maximum (RM). Participants held the press on the bulb for 3 s, guided by the light-emitting diode (LED) display. MHE and one RM were remeasured every week and the exercise levels were readjusted accordingly. Both groups performed three sessions per week for 4 weeks (12 sessions in total). All exercises were carried out in the University Hospital under supervision.
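The dosing arithmetic of the two programs can be checked with a short sketch. The function names and the example MHE value of 40 degrees are illustrative assumptions and are not taken from the study protocol.

```python
def swallows_per_session(active_minutes: int, interval_s: int) -> int:
    """Number of tasks when one task is performed every `interval_s` seconds."""
    return (active_minutes * 60) // interval_s


def target_angle(mhe_deg: float, fraction: float = 0.8) -> float:
    """Training angle as a fraction of the maximum head extension (MHE) angle."""
    return fraction * mhe_deg


# G1: two 10 min periods (20 min of exercise), one effortful swallow every 10 s
g1_tasks = swallows_per_session(active_minutes=20, interval_s=10)

# G2: 24 sets of five repetitions
g2_tasks = 24 * 5

# Three sessions per week over 4 weeks
total_sessions = 3 * 4

# Hypothetical participant with an MHE of 40 degrees (assumed value)
example_target = target_angle(40.0)

print(g1_tasks, g2_tasks, total_sessions, example_target)
```

Both programs therefore involve the same number of tasks per session, so the groups differ in the type of exercise rather than in the number of repetitions.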

#### *2.4. Head Extension Measurements*

Each participant in the G1 group sat on a chair with the thoracic vertebrae in constant contact with the back of the chair and the lumbar vertebrae filling the gap between the seat and the back. The participant's feet were placed flat on the floor and the arms hung freely at the sides. Next, the inclinometer (Baseline® Bubble Inclinometer, FEI, White Plains, NY, USA) was mounted on the vertex of the participant's head. The tester then instructed each participant to extend his or her head until volitional swallowing was no longer possible, and measured the MHE angle using the inclinometer (Figure 2).

**Figure 2.** Head extension angle. (**A**) Neutral position. (**B**) Maximal head extension (MHE). (**C**) 80% of MHE.

#### *2.5. Tongue Strength Measurements*

Lingual pressures were measured by a blinded assessor using the IOPI, with participants seated comfortably in an upright position, during two different tasks: (1) maximum isometric pressure and (2) peak pressure during saliva swallowing [15]. The bulb was positioned 10 mm anterior to the most posterior circumvallate papilla, and pressures (expressed in kPa) were displayed on a liquid crystal display (LCD) panel on the device. For the isometric task, volunteers were instructed to press the bulb against the "roof of the mouth" with the tongue as "hard as possible." For the swallowing task, the participants were instructed to swallow saliva as they would normally with the bulb in place. Three trials were attempted for each task, and the highest pressure was taken as the measure of tongue strength.

#### *2.6. Statistical Analysis*

The statistical analysis was carried out using SPSS version 12.0 (SPSS, Inc., Chicago, IL, USA). For the sample size calculation, the predicted difference (d) in IOPI measurements was set to 5 and the standard deviation (S) was set to 5. With an alpha error of 0.05 and a beta error of 0.2, the calculation yielded a total of 32 subjects. Group comparisons of baseline demographics were performed using Student's *t*-test for continuous variables and the χ<sup>2</sup> test for categorical variables to test for imbalance between groups. The paired *t*-test was used to compare pre- and post-training values within each group. Finally, the absolute increase in strength was compared between groups with Student's *t*-test. The significance level was set at *p* < 0.025 to account for alpha-level adjustments for multiple comparisons.
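The text does not state which formula was used for the sample size. As a hedged illustration, the standard normal-approximation formula for comparing two means with a two-sided test, n per group = 2(z<sub>1−α/2</sub> + z<sub>1−β</sub>)<sup>2</sup>S<sup>2</sup>/d<sup>2</sup>, reproduces the reported total of 32 subjects. The function name below is an assumption made for this sketch.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, beta: float = 0.20) -> int:
    """Per-group sample size for a two-sided comparison of two means
    (normal approximation; assumed here, not confirmed by the source)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(1 - beta)        # about 0.84 for beta = 0.20
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


per_group = n_per_group(delta=5, sd=5)
total = 2 * per_group
print(per_group, total)
```

With d = 5 and S = 5 the raw value is about 15.7 per group, which rounds up to 16 per group and 32 in total.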

#### **3. Results**

The mean baseline maximum head extension angle in G1 was 39.6 ± 9.9 (25–55) degrees, which significantly increased to 57.7 ± 7.8 (40–70) degrees after 4 weeks. The increase in maximum head extension angle was positively correlated with the increase in tongue strength in the G1 group (Spearman's rho, r = 0.651, *p* = 0.016).

The average baseline maximum isometric pressures (average ± standard deviation) of G1 and G2 were 40.5 ± 9.2 kPa and 43.5 ± 10.4 kPa, respectively, with no significant difference between groups (*p* = 0.455). The average baseline peak pressures during swallowing of G1 and G2 were 26.1 ± 12.4 kPa and 31.3 ± 12.6 kPa, respectively, again with no significant difference between the groups (*p* = 0.297). After four weeks of exercise, tongue strength in both the isometric and swallowing tasks had increased significantly in both groups (G1, *p* < 0.001, Cohen's d = 2.222 and G2, *p* < 0.001, Cohen's d = 1.469 for isometric pressure; G1, *p* = 0.001, Cohen's d = 0.882 and G2, *p* = 0.003, Cohen's d = 0.763 for pressure during swallowing) (Figure 3). However, no significant difference in strength increment was detected between the groups in either task (G1, 17.6 ± 7.5 kPa and G2, 14.0 ± 7.9 kPa, *p* = 0.244 for isometric pressure; G1, 11.9 ± 10.3 kPa and G2, 10.2 ± 10.1 kPa, *p* = 0.662 for pressure during swallowing) (Figure 4).
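The between-group comparison of strength increments can be approximated from the reported means and standard deviations. The sketch below computes a pooled-variance Student's *t* statistic and a between-group Cohen's d. The group sizes are not given in this section, so an equal split of the 26 analyzed participants (13 per group) is an assumption made here for illustration only, and the function names are hypothetical.

```python
from math import sqrt


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Student's t statistic (equal variances) and degrees of freedom."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    t = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# ASSUMPTION: 13 participants per group (not stated in the text above)
N_G1 = N_G2 = 13

# Increment in maximum isometric pressure (kPa): G1 17.6 ± 7.5, G2 14.0 ± 7.9
t_iso, df = t_from_summary(17.6, 7.5, N_G1, 14.0, 7.9, N_G2)
d_iso = (17.6 - 14.0) / pooled_sd(7.5, N_G1, 7.9, N_G2)

# Increment in peak swallowing pressure (kPa): G1 11.9 ± 10.3, G2 10.2 ± 10.1
t_swallow, _ = t_from_summary(11.9, 10.3, N_G1, 10.2, 10.1, N_G2)

print(round(t_iso, 2), round(t_swallow, 2), df, round(d_iso, 2))
```

Under these assumptions both *t* values lie well below the two-sided 5% critical value of about 2.06 for 24 degrees of freedom, which is in line with the non-significant between-group results reported above.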

**Figure 3.** Comparisons of maximal tongue pressure between baseline and post-training sessions in both groups. G1, Tongue progressive resistance exercise group; G2, Tongue isometric exercise group. (**A**) Maximum isometric pressure. Tongue strength was increased significantly in both exercise groups (G1, *p* < 0.001; G2, *p* < 0.001). (**B**) Peak pressure during swallowing. Tongue strength was also increased significantly in both exercise groups (G1, *p* = 0.001; G2, *p* = 0.003).

**Figure 4.** Comparison of the degree of strength increment between the two groups. G1, Tongue progressive resistance exercise group; G2, Tongue isometric exercise group. (**A**) Maximum isometric pressure. There were no significant differences between the groups (G1, 17.6 ± 7.5 kPa and G2, 14.0 ± 7.9 kPa, *p* = 0.244). (**B**) Peak pressure during swallowing. No significant differences were detected between groups (G1, 11.9 ± 10.3 kPa and G2, 10.2 ± 10.1 kPa, *p* = 0.662). Box: 1st quartile and 3rd quartile; Whisker: minimum and maximum; Line: median; X: average.

#### **4. Discussion**

Four weeks of progressive head extension swallowing exercise improved tongue strength in older volunteers. However, this method was not superior to conventional isometric strengthening exercise. The head extension swallowing exercise strengthens the tongue and suprahyoid muscles. Head extension was originally a compensatory method administered to inpatients with head and neck cancer, who generally present with problems associated with oral food intake [16]. Subsequently, the use of head extension as a resistance mechanism to strengthen the tongue was shown to be applicable to young and old alike [13,17]. We modified this exercise by progressively increasing the angle of head extension to control the intensity of exercise. Because progressive head extension swallow training meets both the training specificity criterion and the overload principle, we expected it to be the most effective method to increase lingual strength.

However, lingual strengthening training does not follow standard exercise principles. In fact, the unique physiology of the lingual musculature may defy many exercise principles [18]. The tongue is a muscular hydrostat, which generates force by contracting muscle fibers to produce hydraulic pressure within a limited area. The muscles of the human tongue are unusual in that they are attached either to a single static support (mandible or styloid process) or to a floating support (hyoid bone). The tongue is a cylindrical structure with a constant volume that adjusts its shape and size by co-activating many of its muscular components. The implication is that, because these muscles cannot contract against a bony support as in the arm or leg, it is the hydrostatic pull on the muscles that results in a net productive movement. In contrast, skeletal muscles usually act across joints to create force, and most of the theory underlying exercise physiology is based on skeletal muscle studies. Regardless of the direction, most tongue motions require simultaneous contraction of several tongue muscles to produce hydraulic pressure, which alters the functional strength even in untrained tongue movements [10].

Robbins et al. reported that average baseline peak isometric pressure was 41 (36–46) kPa and that the pressure increased by 7 kPa in older adults after an 8-week program of lingual resistance exercise entailing compression of an air-filled bulb [6]. Van den Steen et al. performed tongue-strengthening exercises for 8 weeks using the IOPI in healthy older adults and reported an approximate increase in strength of 26.0 kPa in the anterior maximum isometric pressure (baseline 35.9 ± 6.0 kPa) [14]. Park et al. performed a home-based program for older adults involving tongue-pressing effortful swallow exercise. Baseline mean tongue pressure was 37.51 ± 15.26 kPa, and four weeks after exercise the average maximum tongue pressure had increased by 8.17 kPa [8]. In the present study, four weeks of progressive head extension swallowing exercise increased the maximal isometric pressure by 17.6 kPa.

Few studies have reported attempts to strengthen the tongue muscles in the form of resistance-swallowing exercise (consistent with exercise principles). Repetitive tongue-holding swallowing exercise was proposed for improving swallowing function in young healthy people, but its effects were the same as those of normal swallowing exercise [19]. Park et al. showed that chin-down swallowing exercise improved the lingual strength of healthy young people. However, this exercise was not superior to other tongue-strengthening training [20]. These results are consistent with the findings of the present study.

This study has a few limitations. First, although increasing the degree of head extension requires additional effort during swallowing, there is insufficient evidence that the resistance increases in proportion to the angle of head extension. However, the maximum head extension angle increased with exercise, and this increase correlated significantly with the increase in tongue strength, which might support the role of increasing head extension as an appropriate mechanism for achieving overload. Second, we had the participants stare at a fixed point in order to maintain the same posture during exercise, but we could not ensure that this instruction was followed perfectly in each case. However, we supervised the exercise of all participants to ensure that they followed our instructions correctly. Third, the head extension exercise was conducted with effortful swallows, but lingual pressure during swallowing was measured during non-effortful swallows. In terms of training specificity, this limitation might have affected the results of this study.

#### **5. Conclusions**

Swallowing exercise with progressive head extension increased tongue strength in the older participants. It was easy to monitor the participants anytime and anywhere without any equipment. However, the benefits of this training intervention were no greater than those of conventional tongue-strengthening exercise. Because the lingual musculature exhibits an atypical response to strength training and all tongue-strength training interventions yield favorable results regardless of type, the results suggest that it is best to select an exercise option that is easy and most appropriate for the participant and the specific circumstances.

**Author Contributions:** Conceptualization, J.-W.P.; methodology, C.-H.O., B.-U.C., H.-J.H. and J.-H.P.; validation, J.-W.P.; formal analysis, T.-Y.K. and Y.-J.C.; investigation, T.-Y.K. and Y.-J.C.; writing—original draft preparation, J.-W.P.; writing—review and editing, J.-W.P.; visualization, J.-W.P.; funding acquisition, J.-W.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by a grant (NRF-2019R1F1A1043950) of the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT, Republic of Korea. The funder had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Dongguk University Ilsan Hospital Institutional Review Board (Approval No. 2019-07-010-002).

**Informed Consent Statement:** Informed consent was obtained from all subjects participating in the study.

**Data Availability Statement:** Data presented in this study are available from the corresponding authors upon reasonable request.

**Conflicts of Interest:** The authors have no conflict of interest.

#### **References**


## *Article* **Validation and Classification of the 9-Item Voice Handicap Index (VHI-9i)**

**Felix Caffier 1,†, Tadeus Nawka 1,†, Konrad Neumann 2, Matthias Seipelt 3 and Philipp P. Caffier 1,\*,†**


**Abstract:** The international nine-item Voice Handicap Index (VHI-9i) is a clinically established short-scale version of the original VHI, quantifying the patients' self-assessed vocal handicap. However, the current vocal impairment classification is based on percentiles. The main goals of this study were to establish test–retest reliability and a sound statistical basis for VHI-9i severity levels. Between 2009 and 2021, 17,660 consecutive cases were documented. A total of 416 test–retest pairs and 3661 unique cases with complete multidimensional voice diagnostics were statistically analyzed. Classification candidates were the overall self-assessed vocal impairment (VHIs) on a four-point Likert scale, the dysphonia severity index (DSI), the vocal extent measure (VEM), and the auditory–perceptual evaluation (GRB scale). The test–retest correlation of VHI-9i total scores was very high (r = 0.919, *p* < 0.01). Reliability was excellent regardless of gender or professional voice use, with negligible dependency on age. The VHIs correlated best with the VHI-9i, whereas statistical calculations proved that DSI, VEM, and GRB are unsuitable classification criteria. Based on ROC analysis, we suggest modifying the former VHI-9i severity categories as follows: 0 (healthy): 0 ≤ 7; 1 (mild): 8 ≤ 16; 2 (moderate): 17 ≤ 26; and 3 (severe): 27 ≤ 36.

**Keywords:** Voice Handicap Index (VHI-9i); international short scale; VHI-9i severity levels; test–retest reliability; validation of classification ranges; self-assessed vocal impairment (VHIs); hoarseness; dysphonia severity categories; voice diagnostics

#### **1. Introduction**

A patient's self-assessment of his or her own voice is an important tool for diagnosing voice disorders and evaluating treatment outcomes [1,2]. Only the patients themselves can quantify how much a voice disorder impacts their daily lives. For instance, mild hoarseness affects professional voice users such as opera singers in a different way than non-professional voice users such as office workers [3,4].

The Voice Handicap Index (VHI) was developed and validated as a statistically robust method to measure the subjective impact of voice disorders [5]. The original questionnaire consists of 30 items (VHI-30) addressing functional, physical and emotional impairments in the context of dysphonia according to the patient's own experience. Each question is answered on a scale from 0 (never) to 4 (always), resulting in an overall score ranging from 0 to 120. The VHI-30 was translated and validated cross-culturally to form international variants (e.g., [6–11]) which were proven to be equivalent with each other [12,13].

From our own clinical experience, many patients and medical staff perceive the original 30-item questionnaire as rather time-consuming. To increase overall acceptance and practicability, shortened versions with fewer items were developed. A 12-item questionnaire [14,15] was soon followed by another reduction to 10 items [16,17]. Since 2009, the commonly used variant at the Charité-Universitätsmedizin Berlin has been the VHI-9i international questionnaire [14]. It consists of only nine items, after item reduction based on the original VHI-30 and European translations. A detailed discussion of the item and scale development can be found in the original VHI-9i publication [14]. In everyday diagnostic practice, the German translation of the VHI-9i is widely used by laryngologists and phoniatricians in German-speaking countries (e.g., [18–22]). Despite its clinical adoption, the reliability and validity of this VHI short scale as well as its classification have not yet been statistically verified. Instead, the current classification scale is based on the 25th, 50th, and 75th percentiles, dividing the scores into four severity classes. Thus far, clinical experience suggests that this classification plausibly reflects the self-perceived voice impairment. However, to overcome this arbitrary percentile-based exploration, we looked for a sound statistical basis for VHI-9i severity levels by revising the current cut-off points. In the context of expert opinion, thorough classifications of vocal parameters are essential for the assessment of dysphonia. In addition, a reliable and valid VHI-9i severity classification is needed to improve clinician-rated evaluations of treatment outcomes (e.g., better characterization of the quantified extent of subjective vocal impairment, more comprehensible assessment of individual pre- vs. post-therapeutic comparisons).

**Citation:** Caffier, F.; Nawka, T.; Neumann, K.; Seipelt, M.; Caffier, P.P. Validation and Classification of the 9-Item Voice Handicap Index (VHI-9i). *J. Clin. Med.* **2021**, *10*, 3325. https://doi.org/10.3390/jcm10153325

Academic Editor: Renee Speyer

Received: 31 May 2021 Accepted: 25 July 2021 Published: 28 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This study aims to address these shortcomings. Initially, we investigated whether the VHI-9i produces reliable results independent of age, gender or professional voice use. Next, the questionnaire validity was examined. For this purpose, the relationship between VHI-9i total scores and other established vocal parameters was statistically analyzed to establish cut-off values for healthy voices and mild to severe dysphonia. For external criteria, we intended to use objective acoustic–aerodynamic voice function diagnostics including voice range profile (VRP) measurements, dysphonia severity index (DSI) and vocal extent measure (VEM) calculations, as well as the subjective auditory–perceptual evaluation of voices by experienced examiners (GRB scale). Furthermore, the overall self-assessed vocal impairment (VHIs) served as an internal criterion.

#### **2. Materials and Methods**

#### *2.1. Study Design and Patients*

This study was conducted in accordance with the Declaration of Helsinki and approved by the local ethical review board. Selection criteria involved informed consent and the completion of the standard phoniatric examination procedures. After the medical history had been taken, all patients presenting in the Department of Audiology and Phoniatrics, Charité-Universitätsmedizin Berlin, Germany, underwent digital videolaryngostroboscopy to assess the laryngeal findings and to establish a medical diagnosis. Subsequently, multidimensional voice function diagnostics were carried out as recommended by the European Laryngological Society (ELS) [1], starting with subjective evaluations (GRB, VHI-9i) and followed by objective voice function diagnostics (VRP, DSI, VEM). For subjective vocal self-assessment, patients completed the VHI-9i questionnaire. To estimate the voice use of every study participant, we also asked about their occupation and categorized them according to Koufman and Isaacson [23]: elite vocal performers (Level 1; e.g., actors, singers, voice artists), professional voice users (Level 2; e.g., teachers, politicians, moderators), non-vocal professionals (Level 3; e.g., lawyers, medical personnel, civil service employees), and non-vocal non-professionals (Level 4; e.g., IT staff, office workers, mechanics).

Between May 2009 and March 2021, a total of 17,660 consecutive cases were documented in the clinical database. To analyze the reliability of the VHI-9i, 718 patients were asked to complete the same questionnaire a second time, without therapeutic intervention. The retest form had to be returned within one week to study the differences between the original answers and the retest. The second VHI-9i questionnaire was returned by 517 patients, corresponding to a response rate of 72%. Some questionnaires containing unanswered items or ambiguous checkmarks (e.g., between items) had to be excluded, resulting in 416 test–retest pairs.

The remaining 16,942 consecutive cases were analyzed to establish the validity of the questionnaire and to calculate statistically valid classification ranges. Since the VHI-9i was to be compared with other established vocal parameters, only 7766 cases with complete multi-dimensional diagnostic assessment were considered. Cases with unreliable perturbation measures (jitter > 5%) were excluded, as recommended in the literature [1,24], resulting in a sample size of 6882. After another exclusion of follow-up visits, 3661 complete and unique cases were left for statistical analysis.

#### *2.2. Subjective Examination Instruments*

The VHI-9i represents an item-reduced short scale of the established VHI-30 [14], available in several languages (i.e., Dutch, English, French, German, Italian, Portuguese and Swedish). In this study, the German translation of the questionnaire was used (see Appendix A). Study participants were asked to answer all 9 items on a scale from 0 to 4 (0: never, 1: almost never, 2: sometimes, 3: almost always, 4: always), resulting in a total score between 0 and 36. The total score was then assigned to one of four dysphonia severity categories: 0 (healthy; 0 ≤ 5), 1 (mild; 6 ≤ 13), 2 (moderate; 14 ≤ 22), and 3 (severe; 23 ≤ 36). These categories correspond to a classification proposed by Nawka et al., based on the percentiles of a representative investigation of 716 patients [25]. Since these classification ranges have not yet been validated, statistical calculation of potential cut-off values for the VHI-9i classification was a main goal of this study.
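The scoring rule described above can be sketched in a few lines. The sketch reads the ranges written as "0 ≤ 5", "6 ≤ 13", and so on as inclusive score intervals, which is an interpretation. The second set of ranges is the revised classification proposed in the abstract of this article. All function and variable names are illustrative assumptions.

```python
# Inclusive (low, high, level, label) ranges for the VHI-9i total score
FORMER_RANGES = [(0, 5, 0, "healthy"), (6, 13, 1, "mild"),
                 (14, 22, 2, "moderate"), (23, 36, 3, "severe")]
PROPOSED_RANGES = [(0, 7, 0, "healthy"), (8, 16, 1, "mild"),
                   (17, 26, 2, "moderate"), (27, 36, 3, "severe")]


def vhi9i_total(items):
    """Sum of nine item answers, each scored 0 (never) to 4 (always)."""
    if len(items) != 9:
        raise ValueError("the VHI-9i has exactly 9 items")
    if any(answer not in (0, 1, 2, 3, 4) for answer in items):
        raise ValueError("each item must be scored 0, 1, 2, 3 or 4")
    return sum(items)


def severity(total, ranges=FORMER_RANGES):
    """Map a total score (0 to 36) to its (level, label) severity category."""
    for low, high, level, label in ranges:
        if low <= total <= high:
            return level, label
    raise ValueError("total score must lie between 0 and 36")


# Hypothetical answers of one patient (invented for illustration)
answers = [1, 2, 0, 3, 4, 2, 1, 0, 1]
total = vhi9i_total(answers)
print(total, severity(total), severity(total, PROPOSED_RANGES))
```

For this invented example the total score of 14 falls into the moderate category under the former ranges but into the mild category under the proposed ranges, which shows how a change in cut-off points can shift individual classifications.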

Additionally, participants were asked to rate their overall voice impairment at present on a scale from 0 to 3 (0: normal, 1: mild, 2: moderate, 3: severe), the VHI summary assessment (VHIs). This index allows patients to assess how they feel about their voice with only one number. The relationship between VHI-9i and VHIs scores was examined to determine whether patients would rate themselves differently when asked about specific situations in their lives (VHI-9i items) or directly about their overall impairment (VHIs).

Apart from self-assessment, voices were also evaluated by auditory–perceptual assessment using the GRB system [26–28]. Based on the GRBAS scale, our department developed the modified GRB classification [29,30]. Only the first three criteria are used, focusing on the overall grade of hoarseness (G) and both main pathophysiological hoarseness components: roughness (R) and breathiness (B). This allows voice quality to be assessed more quickly and easily. The system has therefore become established in German-speaking countries and is also recommended in the ELS protocol [1]. Patients were asked to read the standardized text "The north wind and the sun" (German version), while the perceived G, R and B were scored on a scale from 0 to 3. To increase objectivity, each voice recording was rated independently by one experienced phoniatric physician and one senior speech–language therapist. The means were used for further exploration. While the degree of G serves as the overall indicator of dysphonia in the original GRBAS scale, it is regarded as the gold standard for hoarseness evaluation in the GRB system presented here [31].

#### *2.3. Objective Acoustic Assessment*

For objective external validation criteria, we applied acoustic–aerodynamic voice function diagnostics. Voice recordings of all participants were conducted at the voice lab of our outpatient department, which is a sound-treated room with a background noise <40 dB(A). Study participants were asked to wear a head-mounted microphone with a stable mouth–microphone distance of 30 cm [32]. The equipment used for this purpose was the XION microphone headset (model number 352,009,010; XION GmbH, Berlin, Germany), which enables the realization of speech and singing VRP measurements and voice analyses under reproducible conditions. Technical microphone specifications include a frequency response of 70 Hz–20 kHz and a dynamic range of 40–120 dB(A). The microphone headset incorporates a calibrated audio interface that transmits digitized data to the PC via USB. The built-in electronics ensure the automatic calibration of the microphone connection without additional adjustments. The audio was processed via the DiVAS 2.8 software using the Singing Voice Analysis module (product number 350,020,013) and the Speaking Voice Analysis module (product number 350,020,024; XION GmbH, Berlin, Germany). VRP measurements were performed to show the functional interactions of different components of voice generation regarding vocal frequency and intensity [33,34]. The detailed procedure of VRP recordings is described in previous publications [35,36].

The established parameter DSI was automatically calculated as a weighted combination of the highest possible fundamental frequency, the lowest phonation intensity, maximum phonation time and jitter [37]. Regarding jitter, the waveform matching method was used for fundamental frequency extraction as it meets the high-precision criterion of being able to extract a 1% frequency change per cycle with a 1% accuracy, as long as the signal-to-noise ratio is greater than about 40 dB and concomitant amplitude modulations are below about 5% [24]. Measurements were conducted in a standing position. Subjects were asked to produce a sustained vowel (/na/ or /a/) for about 3 seconds at comfortable pitch and loudness. The most stable recording out of 3 trials was chosen for DSI calculation. Based on Gonnermann's investigation of 495 subjects [38], the DSI scores were sorted into 4 severity categories, discriminating healthy voices (≥4.2) from mildly (<4.2 to ≥1.8), moderately (<1.8 to ≥−1.2), or severely (<−1.2) dysphonic voices. Since the DSI quantifies dysphonia as a negative criterion and involves the risk of imprecise results due to its multidimensional data acquisition, the one-dimensional parameter VEM was recently developed [35].

VEM calculation was performed automatically after VRP recording via the proprietary AVA software [39,40]. The VEM quantifies a subject's dynamic performance and frequency range. It is calculated as a relation of the area and perimeter of the VRP and describes the vocal function by an interval-scaled value without unit, usually between 0 and 120. These limits may be exceeded at both ends by either severely impaired or exceptionally capable voices with a large ambitus and dynamic range. A small vocal capacity is described by a low VEM, a large VRP by a high VEM. The VEM emphasizes the vocal abilities and enables a classification of voice performance as a positive criterion [21,31,41]. Based on Müller's investigation of 994 subjects [36], the resulting VEM scores were divided into percentiles, distinguishing a normal vocal capacity (≥108) from mildly reduced (<108 to ≥93), moderately (<93 to ≥69) and severely reduced (<69) vocal capacities.
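The DSI and VEM severity categories given above both follow the same pattern: a value at or above a lower bound belongs to the corresponding level, and anything below the last bound is level 3. A minimal sketch of this threshold lookup is shown below. The helper name and the constant names are assumptions made for illustration and do not come from the DiVAS or AVA software.

```python
def classify_by_lower_bounds(value, lower_bounds):
    """Return severity level 0..3 given descending lower bounds for levels 0, 1, 2.

    A value at or above the first bound is level 0, and so on;
    values below the last bound are level 3.
    """
    for level, lower in enumerate(lower_bounds):
        if value >= lower:
            return level
    return len(lower_bounds)


# Lower bounds quoted in the text (level 0, level 1, level 2)
DSI_BOUNDS = (4.2, 1.8, -1.2)   # healthy, mild, moderate; below -1.2 is severe
VEM_BOUNDS = (108, 93, 69)      # normal, mildly, moderately; below 69 is severely reduced

# Hypothetical measurements (invented for illustration)
print(classify_by_lower_bounds(2.5, DSI_BOUNDS))   # DSI of 2.5
print(classify_by_lower_bounds(85, VEM_BOUNDS))    # VEM of 85
```

As the caption of Table 1 points out, equal levels on different parameters do not imply equivalence, so a level obtained from the DSI should not be read as interchangeable with the same level obtained from the VEM.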

Table 1 summarizes the severity classification of different objective and subjective vocal parameters by reference range. In contrast to the ordinally scaled GRB and VHIs, the classifications of metrically scaled parameters (VEM, VHI-30, VHI-9i) are based on the percentiles of the respective study cohorts (Level 0: 100th percentile/4th quartile; Level 1: 75th percentile/3rd quartile; Level 2: 50th percentile/2nd quartile; Level 3: 25th percentile/1st quartile).

**Table 1.** Severity classification of different vocal parameters, assessed by study participants (VHI-30, VHI-9i, VHIs), experienced clinicians (GRB), and acoustic–aerodynamic analysis (VEM, DSI). Although all parameters share the same classification scale (0–3), equal levels of severity among different parameters do not imply equivalence (**\*** classification ranges based on percentiles).


#### **3. Data Analysis**

Statistical analysis was performed using IBM SPSS version 26.0.0.1. To establish the reliability of the questionnaire, the absolute differences in total VHI-9i scores between test and retest were compared. Differences in individual questionnaire items may be of interest in their own right, but only the total scores are relevant in diagnostic practice. Paired-sample *t*-tests were used to check for biases, and correlations were established through Pearson's r. To test the dependency of the VHI-9i total score on age, a regression analysis was performed. Gender differences were analyzed through independent-sample *t*-tests. We checked for a dependency on voice use by means of the nonparametric Kruskal–Wallis H-test.
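The core test–retest quantities (mean difference, SD of the differences, paired *t* statistic and Pearson's r) can be sketched in a few lines. The study itself used SPSS; the helper name `retest_summary` and the example scores below are made up for illustration and are not study data.

```python
import math
from statistics import mean, stdev


def retest_summary(test, retest):
    """Return mean difference, SD of differences, paired t statistic and Pearson's r."""
    if len(test) != len(retest) or len(test) < 3:
        raise ValueError("need two equally long score lists with at least 3 pairs")
    diffs = [b - a for a, b in zip(test, retest)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample SD of the paired differences
    # Paired t statistic; undefined if all differences are identical.
    t_stat = mean_diff / (sd_diff / math.sqrt(n)) if sd_diff > 0 else math.nan

    # Pearson's r computed from sums of squares and cross-products.
    mx, my = mean(test), mean(retest)
    sxy = sum((a - mx) * (b - my) for a, b in zip(test, retest))
    sxx = sum((a - mx) ** 2 for a in test)
    syy = sum((b - my) ** 2 for b in retest)
    if sxx == 0 or syy == 0:
        raise ValueError("Pearson's r is undefined for constant scores")
    r = sxy / math.sqrt(sxx * syy)
    return {"mean_diff": mean_diff, "sd_diff": sd_diff, "t": t_stat, "r": r}


# Hypothetical VHI-9i total scores (0-36) for six participants.
summary = retest_summary([4, 10, 15, 22, 30, 8], [5, 9, 17, 21, 31, 8])
```

The *p*-value for the *t* statistic would additionally require the *t* distribution with n − 1 degrees of freedom, which is omitted here to keep the sketch free of third-party dependencies.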

Before the cut-off points for the VHI-9i severity categories could be validated, the correlations between the VHI-9i and the severity classifications for VHIs, DSI, VEM, G, R and B had to be determined using Spearman's rho (ρ), in order to choose which of them was best suited for classification. These vocal parameters had to be balanced in terms of sensitivity (i.e., true positive rate, TPR) and specificity (i.e., true negative rate, TNR) when applied to the VHI-9i scores. Receiver operating characteristic (ROC) curves were used, which plot the TPR against the false positive rate (FPR = 1 − TNR). Since ROC analysis applies to binary classification, the curves had to be plotted three times to establish possible cut-off points for every severity level (0 vs. 1–3, 0–1 vs. 2–3, 0–2 vs. 3). The area under the curve (AUC) was used to rank the performance of every curve to distinguish between two severity classes. Values between 0.8 and 0.9 are considered excellent, 0.7 to 0.8 acceptable, 0.5 to 0.7 poor.
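The three dichotomizations described above can be illustrated with a short sketch. It assumes that higher VHI-9i scores indicate greater severity and computes the AUC through its rank interpretation (the probability that a randomly chosen case from the more severe group scores higher than one from the less severe group, with ties counted as one half). The function names and example data are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as P(score of positive > score of negative), ties counted as 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must contain at least one case")
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_by_boundary(scores, levels):
    """AUC for the splits 0 vs. 1-3, 0-1 vs. 2-3 and 0-2 vs. 3 (keys 1, 2, 3)."""
    result = {}
    for boundary in (1, 2, 3):
        pos = [s for s, lv in zip(scores, levels) if lv >= boundary]
        neg = [s for s, lv in zip(scores, levels) if lv < boundary]
        result[boundary] = auc(pos, neg)
    return result


# Made-up VHI-9i scores with severity levels from a reference parameter.
scores = [2, 9, 5, 12, 18, 14, 25, 30]
levels = [0, 0, 1, 1, 2, 2, 3, 3]
aucs = auc_by_boundary(scores, levels)
```

The pairwise comparison is quadratic in the number of cases, which is acceptable for a few thousand participants; a rank-based implementation would scale better.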

Several methods exist to determine good class boundaries from ROC curves. As a starting point, we used Youden's index (J) [42]. The highest J (Max J) is achieved when sensitivity and specificity are at optimal balance (J = TPR − FPR = TPR + TNR − 1). As a second possible class boundary, we determined the point where the number of correctly classified cases (CCCs) was the highest. The CCC is calculated as follows:

$$\mathrm{CCC} = \mathrm{TP} + \mathrm{TN},$$

where TP and TN denote the numbers of true positive and true negative cases at a given cut-off.


To find plausible cut-off values or categories of reasonable size, we selected a value between the two suggested class boundaries based on the median between Max J and Max CCC, also taking into account well over a decade of clinical experience with the VHI-9i.
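A minimal sketch of the cut-off search is given below. It rests on several assumptions that are not spelled out in the text: a score at or above the cut-off counts as "positive" (more severe), CCC is taken as true positives plus true negatives, ties are resolved towards the lowest cut-off, and the "median" of the two suggestions is taken as their midpoint. The function names and example data are hypothetical.

```python
def cutoff_table(scores, is_positive):
    """For every observed score used as a cut-off, compute Youden's J and the CCC."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    rows = []
    for c in sorted(set(scores)):
        tp = sum(s >= c for s in pos)   # positives correctly at or above the cut-off
        fp = sum(s >= c for s in neg)   # negatives wrongly at or above the cut-off
        tn = len(neg) - fp
        tpr = tp / len(pos)
        fpr = fp / len(neg)
        rows.append({"cutoff": c, "J": tpr - fpr, "CCC": tp + tn})
    return rows


def suggest_cutoff(scores, is_positive):
    """Return (Max J cut-off, Max CCC cut-off, midpoint between the two)."""
    rows = cutoff_table(scores, is_positive)
    # max() keeps the first maximum, i.e., the lowest cut-off in case of ties.
    max_j = max(rows, key=lambda r: r["J"])["cutoff"]
    max_ccc = max(rows, key=lambda r: r["CCC"])["cutoff"]
    return max_j, max_ccc, (max_j + max_ccc) / 2


# Made-up example: ten scores, six of them in the more severe class.
scores = [2, 4, 6, 7, 8, 9, 11, 13, 15, 20]
labels = [False, False, False, True, False, True, True, True, True, True]
print(suggest_cutoff(scores, labels))  # (9, 7, 8.0)
```

In the study, the final boundaries were additionally informed by clinical experience and the minimum class width, which no automatic rule of this kind captures.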

#### **4. Results**

#### *4.1. Test–Retest Reliability*

After eliminating all incomplete questionnaires, 416 test–retest pairs were left. The mean age (±SD) was 50 (±17), with males skewing generally older at 56 (±16) compared to female patients at 46 (±17) years of age. A total of 26 participants (6.3%) were classified as elite vocal performers, 59 as professional voice users (14.2%), 78 as non-vocal professionals (18.7%) and 253 as non-vocal non-professionals (60.8%). An overview of the test–retest population is given in Figure 1 and Table 2.

**Figure 1.** Overview of the test–retest population (age, gender, voice use classification).


**Table 2.** Study participant distribution and VHI-9i score differences between test and retest.

The median gap between test and retest was 2 days, with a mean of 3.3 days. The overall mean difference between VHI-9i scores (±SD) was very small at 0.25 (±3.52). Differences by gender, voice use and age were similarly minor (see Figure 2 and Table 2).

**Figure 2.** VHI-9i score difference between test and retest (total differences, by gender, by voice use, by age group). Age dependency was analyzed using discrete age values; age groups were only used in the diagram to improve the graphical representation. Circles (○) mark outliers (beyond 3rd quartile + 1.5 × interquartile range or 1st quartile − 1.5 × interquartile range) and asterisks (\*) mark far outliers (beyond 3rd quartile + 3 × interquartile range or 1st quartile − 3 × interquartile range).

A paired-sample *t*-test between the VHI-9i total scores showed no significant differences (*p* = 0.146). Test and retest scores also correlated very well (r = 0.919, *p* < 0.01), indicating a highly reliable questionnaire. Only 5% of the population had a difference larger than 7 points. Gender had no impact on the reliability of the questionnaire: the independent-sample *t*-test for the absolute VHI-9i score difference between males and females was not significant (*p* = 0.589). The level of voice use also did not affect reliability: the Kruskal–Wallis H-test showed no significant difference among the four voice use classifications (*p* = 0.701). The absolute score differences depended slightly on age: for every year of life, the difference rose by 0.016 points (*p* = 0.028).

#### *4.2. Validation*

Of the 3661 participants remaining for VHI-9i validation, 1456 were male (39.8%) and 2205 were female (60.2%). The mean age (±SD) was 48 (±17), with males being on average slightly older at 50 (±18) years compared to females at 47 (±17) years of age. Vocal impairment was caused by functional dysphonia in 40.8% of the study population. Patients with organic dysphonia (50.8%) showed various pathologies: mostly lesions of the lamina propria (e.g., vocal fold nodules, polyps, cysts, Reinke's edema), followed by benign and malignant changes of the epithelium (e.g., leukoplakia, papillomatosis, carcinoma), as well as neurogenic voice disorders (e.g., unilateral paralyses of the recurrent laryngeal nerve, spasmodic dysphonia). The remaining 8.4% were healthy subjects without dysphonia, mainly college applicants who presented to receive a vocal fitness examination, or prior to starting a profession associated with high vocal demands (e.g., teachers, singers, lecturers). The population pyramid and pathology classification are shown in Figure 3.

**Figure 3.** Overview of the validation population (age, gender, pathology classification).

As the test–retest examinations demonstrated, the reliability of VHI-9i scores is not affected by gender or voice use. Although statistically significant, the age dependency is so small that it can be neglected in clinical practice. Therefore, all further observations and calculations were conducted for the entire population of 3661 participants. Using the old VHI-9i classification scale based on percentiles [25], 15.5% of our participants had healthy voices (total score 0 to 5), 25.7% mild dysphonia (6 to 13), 32.3% moderate (14 to 22) and 26.5% severe dysphonia (23 to 36). Applying the same percentile method to the current database, 25% of patients had a score between 0 and 9, 50% up to 16, and 75% up to 22 points. The severity distribution for the other vocal parameters can be found in Table 3. Regarding VHIs, 63 cases had to be excluded (*n* = 3598 instead of 3661), because these test subjects had marked this question outside or in between the provided options for the severity levels, rendering their answers invalid.

**Table 3.** Collected voice data by vocal parameter, classified according to the associated level of severity as shown in Table 1.


The size and mean of each severity category as well as the distribution of scores were notably different between parameters. The VHI-9i histogram shows a centered flat curve (skewness 0.063, kurtosis −0.90), the DSI is still centered but steeper (skewness −0.04, kurtosis 0.48) and the VEM is even steeper and skewed towards lower VEM values (skewness −1.08, kurtosis 1.94), with most patients falling into severity category 3 (Figure 4).

**Figure 4.** Observed VHI-9i, DSI and VEM scores with their associated severities.

The VHI-9i total scores correlated most strongly with the VHIs, even though ρ was only moderate (ρ = 0.592; see Table 4). All other parameters correlated notably more weakly with the VHI-9i. The objective DSI and VEM were also moderately correlated with each other at ρ = 0.663. The distribution of subjects into G and R severity levels was rather similar, while B showed a different result with over 50% of all cases falling into the "healthy" category. G and R also had the strongest correlation with each other (ρ = 0.871), reinforcing the clinical experience that G serves as the gold standard for hoarseness evaluations via the GRB scale.

**Table 4.** Results of correlation analysis between vocal parameters (Spearman's rho). All correlation coefficients were significant (*p* < 0.001).


Figure 5 shows the distribution of VHI-9i total scores using the classifications for VHIs, DSI, VEM and G. The boxplots reveal a clear tendency: the higher the severity level, the higher the associated median. However, there is also a lot of overlap between the quartiles of different severity levels. This especially applies to DSI and VEM, which makes these parameters less suitable for VHI-9i classification.

**Figure 5.** Distribution of VHI-9i total scores classified by VHIs, DSI, VEM and G severity levels. Upper row: stacked bar chart showing the number of subjects with their VHI-9i scores. Lower row: boxplots showing the percentiles of patients' VHI-9i scores by severity level. Circles (○) and asterisks (\*) mark outliers and far outliers.

The ROC plots (Figure 6) also favor the VHIs as the best classifying index. DSI, VEM and G are visibly less suitable classifiers, because their curves are closer to the hypothetical diagonal through the ROC plot, signifying weaker discriminating performance.

**Figure 6.** Combined ROC plots to determine cut-off points between severity categories 0 and 1 (blue), 1 and 2 (red), 2 and 3 (green).

The AUC results (Table 5) mirror the correlations of vocal parameters (compare Table 4). The best performance was achieved by the VHIs with excellent AUCs, followed by acceptable values for G. The parameters DSI and VEM turned out to be poor discriminators, with AUCs below 0.7.

As shown by our reliability analysis, severity categories must be at least 7 points in size to account for significant changes and minimize the possibility of retest artifacts. Neither optimizing for sensitivity and specificity (Max J) nor for correctly classified cases (Max CCC) alone produced classes that were all wide enough (≥7 points). Apart from the VHIs, Max CCC even produced cut-off recommendations that would eliminate the lowest (VEM) or lowest and highest (DSI, G) severity categories (highlighted in Table 5). Since neither method produced plausible cut-off values or categories of reasonable size, medians between the Max J and Max CCC measurements had to be calculated.

**Table 5.** ROC results for potential cut-offs between severity categories (0–1, 1–2, 2–3) using Max J, Max CCC and Median calculations. Yellow cells mark impossible cut-offs. Median calculations for every ROC parameter (TPR, FPR, J, CCC) resulted in slightly different class boundaries, which were specified by the ranges of cut-off values.




However, both median calculations did not always return the exact same result, which is why the J–CCC–Median cut-off values are expressed as ranges in Table 5. In general, the difference between both medians was below 0.25 points most of the time and very rarely exceeded 0.5 points. The medians for all vocal parameters agreed on the first boundary (i.e., between severity levels 0 and 1) at 7 or 8. Between "mild" and "moderate" (severity levels 1 and 2), the median recommendations ranged from 14 to 20. Except for the VEM, the medians led to a cut-off point between 26 and 28 for the boundary distinguishing "moderate" from "severe" impairment (i.e., severity levels 2 and 3).

#### **5. Discussion**

The VHI-9i short scale has proven to be a valuable diagnostic tool in our clinical practice for well over a decade. The total number of 17,660 consecutively completed questionnaires documented in our database confirms its high acceptance among patients and medical staff. In our test–retest analysis, the VHI-9i questionnaire demonstrated very high reliability independent of gender or voice use. Age had a minor influence, which we do not consider clinically relevant: For every year of life, the absolute score difference between test and retest increased by 0.016. If we applied that difference to the entire age range of our study population, the VHI-9i total score of an adolescent compared to a senior person would differ by about 1. The reliability analysis also showed that the severity classes for the VHI-9i need to be at least 7 points in size (2\*SD of paired sample *t*-test), since only differences of 7 points and above account for significant changes and minimize the possibility of retest artifacts. Our interpretation of the ROC analysis had to consider this requirement. Unfortunately, neither optimizing for Max J nor Max CCC resulted in categories that were all large enough. Calculating the median between them for each cut-off point, however, yielded satisfactory results for clinical use.
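As a rough check of the minimum class width, and assuming that the SD referred to here is the SD of the test–retest score differences reported in Section 4.1 (±3.52), the arithmetic works out as:

$$2 \times \mathrm{SD} = 2 \times 3.52 = 7.04 \approx 7 \ \text{points}$$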

All classification ranges are listed in Table 6. The Median J method strikes a good balance between sensitivity, specificity and the minimum class width of 7 points. The new boundary of a score of 7 corresponds directly with the VHIs Median J result for healthy voices (class 0). Finding a reasonable upper boundary for severity level 1 is more difficult: using VHIs Median J (a score of 14) would result in a category that is too small. The median for the expert auditory–perceptual assessment (G) points towards an even higher boundary (a score of 19). Since we were trying to find a mid-point for our severity classes, we decided to use the upper boundary of the 50th percentile (a score of 16). The upper boundary for severity level 2 (moderate impairment) can be taken once again from the VHIs Median J row, placing class 2 at scores of 17 to 26 and class 3 at 27 to 36.

**Table 6.** Sizes of severity classes based on Max J, Max CCC and Median calculations. Green cells serve as the basis for our proposed new VHI-9i severity classification.


Compared to the old VHI-9i classification scale based on percentiles [25], the revised severity ranges classify more patients towards the lower categories. Severity level 3 is reduced by 4 points and is no longer the largest category. Levels 1 and 2 start at higher class boundaries due to the size increase in level 0.

The best correlation was observed between VHI-9i and VHIs, making the overall self-assessed vocal impairment the best candidate for the validation process. However, the VHI-9i did not correlate well with the two objective parameters DSI and VEM, and had only slightly higher correlations with GRB. This supports recent studies showing that all these vocal parameters measure different aspects of a patient's voice and are neither mutually interchangeable nor redundant [31,36,41,43]. Due to the weak correlations, poor discriminating performance and sometimes impossible cut-off points, DSI, VEM and G ultimately had no part in our recommendation for the revised VHI-9i cut-off points. It is important to remember that the VHI-9i does not measure objective voice impairment (DSI) or vocal capacity (VEM), but personal suffering due to a subjectively perceived vocal handicap. None of the parameters allow conclusions to be drawn about the diagnoses or underlying causes of the voice disorder.

#### *Study Limitations*

Over 60% of our test–retest population were categorized as non-vocal non-professionals. Ideally, the study would have included more subjects with professional backgrounds in singing, acting or teaching, especially since establishing independence from voice use was one of our goals during the test–retest analysis. A larger population of elite vocal performers and professional voice users would have been preferable, but would not represent the actual proportions of our clinic clientele.

Furthermore, males are underrepresented in our study, so there may be participation bias. Despite the limited number of male subjects, we concluded that the VHI-9i was independent of gender, but a more balanced gender involvement would have been more representative. However, our clinical experience shows that women are generally more likely to see a doctor for voice problems.

In addition, signal-to-noise ratio (SNR) analysis and signal typing are considered to be important for valid and reliable perturbation measurements [44–46]. Unfortunately, this functionality is not included in the DiVAS software, which was specified in our study design as the main tool for objective voice analysis. One of the fundamental limitations of the DSI is the inclusion of jitter without sufficient evaluation of the signal type. In general, only type 1 and 2 are considered viable for perturbation analysis. The 5% jitter cut-off applied in our study was established to exclude type 4 signals only [46]. However, the categorization of a small test sample (*n* = 40) revealed signal type 1 and 2 exclusively, even for patients with low DSI and high jitter values. Furthermore, the majority of SNR results were between 42 and 50 dB ("recommended"), with a smaller number between 30 and 42 dB ("acceptable") [45]. Therefore, we believe that our exclusion criteria were sufficient to eliminate voices which are not suitable for perturbation analysis. We recognize that this estimate cannot be taken as proof for the entire dataset and plan to include SNR and signal typing analyses in our future studies from the outset. It should also be noted that jitter was only used for DSI calculation, which proved to be irrelevant for the main goal of our study, i.e., a revised VHI-9i classification. Therefore, our recommendations regarding VHI-9i severity categories should not have been distorted.

Moreover, our initial ROC analysis produced boundary recommendations that were not feasible for diagnostic purposes. The resulting severity categories would have been either too small (<7 points) or would not have existed at all. Calculating the median between Max J and Max CCC is not a commonly used method for solving these problems. However, based on the frequent use of the VHI-9i in clinical investigations [18–22,31,36,41], it appears that the new classification will be a practical option for clinical settings.

In general, the auditory–perceptual assessment of voices via GRB was conducted by only two experienced examiners. More robust judgments by a larger group of raters were not obtained. Due to the enormous number of cases (*n* = 17,660) and over a decade of diagnostic voice recordings, a retrospective blinded voice evaluation with 4–5 raters was not an option.

#### **6. Conclusions**

The VHI-9i is a reliable questionnaire which is independent of gender and professional voice use. Its dependency on age is negligible. Based on many years of clinical experience, it also has high acceptance among patients and medical staff, making it a valuable diagnostic tool.

The old cut-off values for the VHI-9i severity categories based on percentiles had to be adjusted. We recommend setting class 0 (healthy) at total scores of 0 to 7, class 1 (mild impairment) at 8 to 16, class 2 (moderate impairment) at 17 to 26 and class 3 (severe impairment) at 27 to 36.
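The recommended classification can be expressed as a simple lookup. The sketch below is illustrative only; the names `vhi9i_class`, `NEW_UPPER_BOUNDS` and `OLD_UPPER_BOUNDS` are hypothetical, while the boundaries themselves are the revised values recommended here and the previous percentile-based values [25] quoted in the Results.

```python
# Upper bounds (inclusive) of severity classes 0, 1 and 2; class 3 extends to 36.
NEW_UPPER_BOUNDS = (7, 16, 26)   # revised classification recommended in this study
OLD_UPPER_BOUNDS = (5, 13, 22)   # previous percentile-based classification [25]


def vhi9i_class(total_score: int, upper_bounds=NEW_UPPER_BOUNDS) -> int:
    """Return severity class 0-3 for a VHI-9i total score (9 items, 0-4 points each)."""
    if not 0 <= total_score <= 36:
        raise ValueError("VHI-9i total score must lie between 0 and 36")
    for level, upper in enumerate(upper_bounds):
        if total_score <= upper:
            return level
    return 3


# A total score of 14 was "moderate" under the old scale but is "mild" under the new one.
old_level = vhi9i_class(14, OLD_UPPER_BOUNDS)  # 2
new_level = vhi9i_class(14)                    # 1
```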

The subjective VHI-9i does not correlate well with objective vocal parameters (DSI, VEM) or subjective auditory–perceptual assessment (GRB), reinforcing the notion that all these parameters measure different dimensions of a patient's voice and are neither mutually interchangeable nor redundant.

**Author Contributions:** Conceptualization, T.N., M.S. and P.P.C.; Methodology, F.C., T.N. and P.P.C.; Literature Review, F.C. and P.P.C.; Investigation, T.N., M.S. and P.P.C.; Data Analysis, F.C. and K.N.; Original Draft Writing, F.C. and P.P.C.; Draft Review and Editing, F.C., T.N., K.N. and P.P.C.; Visualization, F.C. and K.N.; Supervision, T.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of Charité-Universitätsmedizin Berlin, Berlin, Germany (reference number: EA4/140/10).

**Informed Consent Statement:** Informed consent was obtained from all study participants.

**Data Availability Statement:** All data of the study are available in the Department of Audiology and Phoniatrics, Charité-Universitätsmedizin Berlin, Berlin, Germany.

**Acknowledgments:** The authors wish to thank Tatiana Ermakova for the statistical advice.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** VHI-9i questionnaire items (*German translation*) as used in the study.


Scoring: 0 = never (*nie*), 1 = almost never (*selten*), 2 = sometimes (*manchmal*), 3 = almost always (*oft*), 4 = always (*immer*).

**Table A2.** Global VHIs question added to the study questionnaire.


Scoring: 0 = normal (*normal*), 1 = mildly (*leicht*), 2 = moderately (*mittelgradig*), 3 = severely disturbed (*hochgradig gestört*).

#### **References**


## *Article* **Reliability of Machine and Human Examiners for Detection of Laryngeal Penetration or Aspiration in Videofluoroscopic Swallowing Studies**

**Yuna Kim 1, Hyun-Il Kim 2, Geun Seok Park 1, Seo Young Kim 1, Sang-Il Choi 2,3,\* and Seong Jae Lee 1,4,\***


**Abstract:** Computer-assisted analysis is expected to improve the reliability of videofluoroscopic swallowing studies (VFSSs), but its usefulness is limited. Previously, we proposed a deep learning model that can detect laryngeal penetration or aspiration fully automatically in VFSS video images, but the evidence for its reliability was insufficient. This study aims to compare the intra- and interrater reliability of the computer model and human raters. The test dataset consisted of 173 video files from which the existence of laryngeal penetration or aspiration was judged by the computer and three physicians in two sessions separated by a one-month interval. Intra- and inter-rater reliability were calculated using Cohen's kappa coefficient, the positive reliability ratio (PRR) and the negative reliability ratio (NRR). Intrarater reliability was almost perfect for the computer and two experienced physicians. Interrater reliability was moderate to substantial between the model and each human rater and between the human raters. The average PRR and NRR between the model and the human raters were similar to those between the human raters. The results demonstrate that the deep learning model can detect laryngeal penetration or aspiration from VFSS video as reliably as human examiners.

**Keywords:** dysphagia; swallowing; laryngeal penetration or aspiration; deglutition; reliability; videofluoroscopic swallowing study; deep learning; machine learning

#### **1. Introduction**

The videofluoroscopic swallowing study (VFSS) is currently regarded as the gold standard method for evaluating swallowing function because it allows real-time visualization of bolus movement along with the dynamics of anatomical structures associated with the swallowing process [1,2]. A VFSS makes it possible to detect the presence and timing of laryngeal penetration or aspiration and helps to identify its physiological mechanisms [2–4].

The videofluoroscopic images are recorded while the patients swallow boluses mixed with contrast, and physicians or speech–language pathologists analyze the recorded videos [2]. VFSS analysis depends on the subjective visual judgment of the reviewers and is inevitably susceptible to human bias [5–7]. Human examiners usually have the burden of reviewing the images dozens of times for one patient because the swallowing process is repeated 10 to 15 times per test and repeated replay is required due to the fast and complex nature of swallowing. Consequently, it is difficult to avoid human error due to the fatigue that results from high concentration and repetitive examination. Because of this vulnerability to human error, the reported reliability of VFSS analysis is not excellent; wide variation is present in both intra- and inter-rater agreement (intrarater κ = 0.530–1.00, interrater κ = 0.269–0.700) [5–9].

**Citation:** Kim, Y.; Kim, H.-I.; Park, G.S.; Kim, S.Y.; Choi, S.-I.; Lee, S.J. Reliability of Machine and Human Examiners for Detection of Laryngeal Penetration or Aspiration in Videofluoroscopic Swallowing Studies. *J. Clin. Med.* **2021**, *10*, 2681. https://doi.org/10.3390/jcm10122681

Academic Editor: Eng Ooi

Received: 7 May 2021 Accepted: 15 June 2021 Published: 18 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

As an alternative to overcome the limitations of human reading, recent studies have attempted to develop computer-assisted analysis [10–15]. Aung et al. suggested that automated reading enables more objective and immediate analysis with a quantifiable level of accuracy, eliminates the need for high levels of training for analysis and reporting, and provides a platform for larger-scale screening of populations with dysphagia [10]. Computer-assisted analysis typically tracks anatomical landmarks automatically after they are demarcated by humans in the first few frames of the videos [10–15]. However, its clinical usefulness has been limited because most of the models use obsolete semiautomated tracking and segmentation algorithms that require manual demarcation of anatomical landmarks.

Recently, deep learning technology has increased the accuracy of image classification to a level exceeding that of human eyes and is expected to reduce error in reading medical images [16–19]. In a previous study, we developed and proposed a model capable of detecting laryngeal penetration or aspiration from VFSS images in a fully automated manner without any human intervention by applying deep learning algorithms [20]. The model showed an overall accuracy of 97.2% in classifying image frames and 93.2% in classifying video files in which laryngeal penetration or aspiration was evident, exceeding the accuracy of previous semiautomated computer-assisted analysis. The results showed the potential value of the model for clinical practice in many respects, but the evidence for its reliability still seems to be insufficient.

This study aims to examine and compare the intra- and inter-rater reliability of our deep learning model and human examiners for the detection of laryngeal penetration or aspiration from VFSS images. We anticipate that the results of this study may provide further evidence to support the clinical application of deep learning technology in VFSS analysis, although dichotomous results of whether penetration/aspiration was detected or not on VFSS do not always represent the degree of pathology in the swallowing mechanism.

#### **2. Materials and Methods**

#### *2.1. Dataset*

We collected a total of 205 VFSS video files from 49 patients, aiming for an even distribution of attributes including gender, age, viscosity of diet and degree of laryngeal penetration or aspiration. The presence of penetration or aspiration was determined using the Penetration/Aspiration Scale (PAS) [21], and videos scored as PAS 2 or higher were included. The video files were selected from the database of Dankook University Hospital, which contains the videos of VFSSs conducted between January 2015 and June 2020. The VFSS was performed according to the protocol described by Logemann [22] with minor modifications. Briefly, video images were acquired via lateral projection at 30 frames per second (fps) while the seated patients swallowed boluses of various consistencies mixed with contrast medium; the videos were stored digitally. The types of boluses swallowed were as follows: 3 mL of thick liquid (water-soluble barium sulfate diluted to 70%); 3 mL of rice porridge; 3 mL of curd-type yogurt; 3 mL of thin liquid (water-soluble barium sulfate diluted to 35%) from a spoon; or 5 mL of thin liquid from a cup. The video files were selected by an investigator who had more than two years of experience in VFSS analysis. Every effort was made to select videos in which the presence or absence of penetration or aspiration was evident. The video files were edited to contain only one swallowing event. Each swallow was defined as the process from the backward movement of the bolus in the oral cavity to the return of the larynx to its original position. A short margin was left before and after the swallowing event so that the whole event was included. When the bolus was not fully swallowed in the first attempt, subsequent swallows were also included until the bolus was completely swallowed. Videos were excluded if they showed residue in the larynx of a bolus aspirated during a previous swallow. Among those files, 32 were excluded due to poor image quality.

Ultimately, 173 video files from 42 patients were included in the VFSS dataset; the distribution of their attributes is shown in Table 1. The shortest video lasted 4 s, and the longest video lasted 240 s. The depth of penetration/aspiration was categorized as shallow (PAS 2 or 3), deep (PAS 4 or 5) and aspiration (PAS 6 or higher), and their distribution is shown in Table 1. The proportions of presence and depth were set to match the overall distribution in the database of the authors' institution.


#### *2.2. Analysis of VFSS*

#### 2.2.1. Machine Reading

The video files were examined for the presence of laryngeal penetration or aspiration using the computer model described in a previous study [20]. In summary, the model consisted of three phases: (1) image normalization, (2) dynamic ROI (region of interest) determination, and (3) detection of laryngeal penetration or aspiration (Figure 1). After the input images were normalized using CLAHE (contrast-limited adaptive histogram equalization) [23], an ROI was defined with reference to the cervical spinal column segmented using U-net. The ROI was set to include the larynx, the cervical spine, and adjacent areas. Noise from movement of the head and neck could be minimized by setting the ROI to move dynamically with the cervical spine. Within the ROI, the presence of laryngeal penetration or aspiration was classified by the deep learning network trained with the Xception module [24]. The output was reported and displayed in the form of histograms as shown in Figure 2. The classification and reporting process was conducted in a fully automated manner without any human intervention except for inputting the image data. The display of at least one peak was considered a "positive" result.

**Figure 2.** Example output of the deep learning model represented as histograms: (**A**) No laryngeal penetration or aspiration was detected in any frame of the video. (**B**) Laryngeal penetration or aspiration occurred in approximately the 100th to 115th frames and the 180th to 200th frames of the video.
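The video-level decision rule (at least one peak means "positive") can be sketched as follows. This is not the authors' implementation: it assumes that the model yields one score per frame and that a frame counts as part of a peak when its score reaches a threshold of 0.5; both the per-frame score format and the threshold are assumptions made for illustration, as are the function names.

```python
def positive_runs(frame_scores, threshold=0.5):
    """Return inclusive (start, end) frame ranges whose scores reach the threshold."""
    runs = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold:
            if start is None:
                start = i            # a new peak begins at this frame
        elif start is not None:
            runs.append((start, i - 1))  # the peak ended on the previous frame
            start = None
    if start is not None:            # peak running until the last frame
        runs.append((start, len(frame_scores) - 1))
    return runs


def video_is_positive(frame_scores, threshold=0.5):
    """A video is rated positive if at least one peak is present."""
    return len(positive_runs(frame_scores, threshold)) > 0


# Synthetic scores mimicking panel (B): peaks at frames 100-115 and 180-200.
scores = [0.0] * 100 + [0.9] * 16 + [0.1] * 64 + [0.8] * 21 + [0.0] * 20
print(positive_runs(scores))  # [(100, 115), (180, 200)]
```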

#### 2.2.2. Human Reading

The human raters were three physicians: "Human 1", with more than 20 years of experience in VFSS analysis; "Human 2", with 10 years; and "Human 3", the novice with 1 year. Working in separate locations, the three human examiners judged the existence of laryngeal penetration or aspiration, regardless of severity or depth, in the same video files. When multiple swallowing attempts were included in the video clip, the result was rated as "positive" if any one of the attempts showed penetration/aspiration. Discussion was not allowed, and no information about the subjects in the videos (including gender, age, and medical history) or the viscosity of the bolus was given to the raters.

#### *2.3. Analysis of Intra- and Inter-Rater Reliability*

#### 2.3.1. Intrarater Reliability

Trials were conducted in two sessions, separated by four weeks, to calculate the intrarater reliability of machine and human reading. In both sessions, the presence or absence of laryngeal penetration or aspiration was judged by three human raters and the deep learning model. In the second session, 173 video files were reordered and randomly assigned to the raters by an investigator who was blinded to the results of the first session. The results were collected from the three human raters and the model in both sessions, and Cohen's kappa coefficient was calculated. However, the meaning of epidemiological statistics derived in this way can be limited because there is no absolute gold standard for VFSS analysis. Therefore, we used the positive reliability ratio (PRR) and negative reliability ratio (NRR), as suggested by Kuhlemeier et al. [8]. In the absence of a gold standard, PRR and NRR can provide statistics about the agreement between session results from the same interpreter [8]. According to the definition of Kuhlemeier et al. [8], we calculated the PRR as the percentage of cases a given rater judged abnormal in the first session that he or she also judged abnormal in the second session. The NRR was calculated in the same way for normal ratings.

Therefore, the PRR and NRR were calculated by the following formulas:

PRR = Abn(1 and 2)/Abn(1), where Abn(1 and 2) = number rated abnormal in both the first and second sessions and Abn(1) = number rated abnormal in the first session.

NRR = Normal(1 and 2)/Normal(1), where Normal(1 and 2) = number rated normal in both the first and second sessions and Normal(1) = number rated normal in the first session.
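The two formulas can be written as a short sketch. The function name `reliability_ratios` and the encoding of ratings as booleans (`True` = abnormal) are illustrative assumptions; the study itself used SPSS.

```python
from typing import Optional, Sequence, Tuple


def reliability_ratios(first: Sequence[bool],
                       second: Sequence[bool]) -> Tuple[Optional[float], Optional[float]]:
    """Return (PRR, NRR) for two sessions rated by the same rater.

    PRR = Abn(1 and 2) / Abn(1); NRR = Normal(1 and 2) / Normal(1).
    A ratio is None when its denominator is zero.
    """
    if len(first) != len(second):
        raise ValueError("both sessions must rate the same videos")
    abn_1 = sum(1 for a in first if a)
    normal_1 = len(first) - abn_1
    abn_both = sum(1 for a, b in zip(first, second) if a and b)
    normal_both = sum(1 for a, b in zip(first, second) if not a and not b)
    prr = abn_both / abn_1 if abn_1 else None
    nrr = normal_both / normal_1 if normal_1 else None
    return prr, nrr


# Toy data (not from the study): 10 videos rated in two sessions.
session_1 = [True, True, True, True, False, False, False, False, False, False]
session_2 = [True, True, True, False, False, False, False, False, False, True]
prr, nrr = reliability_ratios(session_1, session_2)
```

In this toy example, 3 of the 4 videos rated abnormal in the first session are rated abnormal again, and 5 of the 6 rated normal are rated normal again.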

#### 2.3.2. Interrater Reliability

The interrater reliability was verified in the same way as the intrarater reliability. As with the intrarater reliability, the interrater PRR and NRR were defined according to the definition by Kuhlemeier et al. [8]. PRR and NRR were calculated between each possible combination of human raters and machine, not between sessions. For interrater reliability, PRR denoted the percentage of cases judged abnormal (i.e., having laryngeal penetration or aspiration) by rater "A" that were also judged abnormal by rater "B". In the same way, NRR was calculated based on the cases judged to be normal.

Thus, interrater PRR and NRR were calculated by the following formulas:

PRR = Abn(A and B)/Abn(A), where Abn(A and B) = number rated abnormal by both "A" and "B" and Abn(A) = number rated abnormal by "A".

NRR = Normal(A and B)/Normal(A), where Normal(A and B) = number rated normal by both "A" and "B" and Normal(A) = number rated normal by "A".
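Because the denominator depends on rater "A" only, the interrater ratios are directional: PRR from "A" to "B" need not equal PRR from "B" to "A". The sketch below shows this on toy data; the function names and the ratings are hypothetical and are not the study data.

```python
from itertools import permutations
from typing import Dict, Optional, Sequence, Tuple


def prr(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> Optional[float]:
    """Share of cases rated abnormal by A that B also rated abnormal."""
    b_given_a_abnormal = [b for a, b in zip(rater_a, rater_b) if a]
    if not b_given_a_abnormal:
        return None
    return sum(b_given_a_abnormal) / len(b_given_a_abnormal)


def nrr(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> Optional[float]:
    """Share of cases rated normal by A that B also rated normal."""
    b_normal_given_a_normal = [not b for a, b in zip(rater_a, rater_b) if not a]
    if not b_normal_given_a_normal:
        return None
    return sum(b_normal_given_a_normal) / len(b_normal_given_a_normal)


def pairwise_ratios(ratings: Dict[str, Sequence[bool]]
                    ) -> Dict[Tuple[str, str], Tuple[Optional[float], Optional[float]]]:
    """Map every ordered pair (A, B) of raters to its (PRR, NRR)."""
    return {
        (a, b): (prr(ratings[a], ratings[b]), nrr(ratings[a], ratings[b]))
        for a, b in permutations(ratings, 2)
    }


# Toy ratings for 8 videos: the "model" flags fewer videos than the "human".
toy = {
    "human": [True, True, True, True, False, False, False, False],
    "model": [True, True, False, False, False, False, False, False],
}
table = pairwise_ratios(toy)
```

Here the human-to-model PRR is 0.5 while the model-to-human PRR is 1.0, because every video flagged by the model was also flagged by the human but not the other way round.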

All statistical analysis was performed with SPSS for Windows version 26.0, and the whole study protocol was approved by the institutional review board of Dankook University Hospital (approval No. 2020-11-015).

#### **3. Results**

#### *3.1. Intrarater Reliability*

Intrarater reliability is shown in Table 2. The kappa coefficients of all human raters showed almost perfect agreement except for Human 3 (a novice physician), who had only moderate agreement. The kappa coefficients of the model showed perfect agreement (intrarater kappa = 1.00), as expected. The PRRs of all human raters were above 90%. The NRRs of experienced human raters (Human 1 and Human 2) were above 90%, but Human 3 showed an NRR of only 68%. The PRR and NRR of the model were both 100%.


**Table 2.** Intrarater reliability represented by kappa coefficients, PRR and NRR.

#### *3.2. Interrater Reliability*

The interrater kappa coefficients are shown in Table 3. All pairs of two human raters showed substantial agreement in both sessions, except that there was only moderate agreement between Human 2 and Human 3 in the second session. The machine and every human rater also showed substantial agreement in both sessions, except that there was only moderate agreement between the machine and Human 3 in the second session.

**Table 3.** The interrater Cohen's kappa coefficients.


Scale for kappa coefficient: below 0.00 = poor agreement; 0.00–0.20 = slight agreement; 0.21–0.40 = fair agreement; 0.41–0.60 = moderate agreement; 0.61–0.80 = substantial agreement; 0.81–1.00 = almost perfect agreement.
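For reference, Cohen's kappa for two binary raters and the interpretation scale above can be sketched as follows. The function names are illustrative, and the handling of scores that fall exactly on a boundary is an assumption.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must rate the same non-empty set of videos")
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    p_a = sum(1 for a in rater_a if a) / n  # share rated abnormal by A
    p_b = sum(1 for b in rater_b if b) / n  # share rated abnormal by B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1 - expected)


def agreement_label(kappa: float) -> str:
    """Map a kappa value to the scale quoted above."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

For two raters who agree on 8 of 10 videos and each rate 4 as abnormal, the observed agreement is 0.80 and the chance agreement is 0.52, giving a kappa of about 0.58, which the scale labels as moderate.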

The calculated PRRs and NRRs are shown in Table 4. Overall, the PRR values ranged from 62% to 100%, and the NRR values ranged from 50% to 100%. No particular pattern was found in the distribution of PRR or NRR among the human and machine ratings. The ratios were somewhat variable among the raters and between sessions. In order to delineate the difference in reliability, the PRR and NRR values were averaged and compared. The average PRR was 86.6% when measured between each pair of human raters and 85.5% when measured between the machine and each human rater. The average NRRs were 82.4% and 81.3%, respectively. PRR and NRR values were not significantly different regardless of whether they were between human raters or between machine and human raters (Figure 3).


**Table 4.** PRR and NRR values calculated between each human rater and the machine.

<sup>1</sup> Positive reliability ratio = Abn(A and B)/Abn(A); <sup>2</sup> negative reliability ratio = Normal(A and B)/Normal(A). A varies by row (Human 1, Human 2, Human 3, Model) and B varies by column (Human 1, Human 2, Human 3, Model). See Section 2.3 for further details.

**Figure 3.** PRR and NRR values averaged between human raters and between machine and human raters.

#### **4. Discussion**

One of the major limitations of VFSS is unsatisfactory interrater reliability. Its poor reliability may originate from the rapidity and complexity of the swallowing process and resultant difficulties in its analysis [25], as well as incomplete standardization of the definitions and judgment criteria of parameters [9]. Several methods have been used to improve the reliability of VFSS, including training and education [26], group discussion [25], directed search [27], frame-by-frame observation [5] and computer-assisted automated analysis [10–15]. Most previously proposed computer-assisted analyses use semiautomated algorithms that require human manual demarcation of salient anatomical structures [10–15]. To our knowledge, the deep learning model we proposed in our previous study was the first fully automated model capable of detecting laryngeal penetration or aspiration in VFSS images [20]. The model showed more than 90% accuracy, but its reliability has not been tested sufficiently. The reliability of computer-assisted analysis, whether with semiautomated or deep learning models, has never been compared with that of human examination. This is the first study designed to compare the reliability of machine and human examiners for VFSS analysis and demonstrate the reliability of VFSS analysis using a deep learning model.

Since there is not yet an absolute gold standard for the analysis of VFSS results, the significance of classical epidemiologic statistics, such as the kappa coefficient, intraclass correlation coefficient or positive and negative predictive values, may be limited for assessing the reliability or validity of VFSS analysis. Kuhlemeier et al. [8] proposed that the PRR and NRR, modified from the positive and negative predictive values, can be useful for verifying the reliability or agreement among raters in the absence of a gold standard. They used the PRR to denote the probability that a condition that has been judged to be abnormal by a rater will also be judged the same by a separate rater or in a second rating by the same rater [8]. Similarly, the NRR was used to denote the probability that a rating of "normal" would be followed by a second rating of "normal" either by a different rater or by the same rater at a different time [8]. In this study, we used the PRR and NRR in addition to the kappa coefficient to increase statistical strength.

The results of reliability analysis for VFSS data can be influenced by test videos because VFSS data frequently shows diverse findings according to the severity and type of dysphagia. If the test videos contain only mild or vague laryngeal penetrations and aspirations, raters may have difficulties in judgment, and the reliability will be lowered. If the videos contain only severe laryngeal penetrations and aspirations, agreement between the raters may appear excessively high because judgment of definite laryngeal penetration or aspiration might be easy for all raters. We made our best effort to include test videos with a balanced distribution of characteristics, including the gender and age of patients and the viscosity of the diet. Efforts were also made to include patients with diverse degrees of penetration and aspiration in the test dataset. In this way, we believe that selection bias was minimized in the measurement of reliability.

The experience of the raters may also affect the results of reliability analysis [25]. Experienced raters usually have highly accurate standards of judgment, while less experienced raters can have confusion or difficulty in making decisions. We invited and compared three human raters with different levels of experience to minimize the effect of experience. The raters comprised one with more than 20 years of experience, one with approximately 10 years and one with approximately one year. We believe that the bias caused by different degrees of experience was minimized by comparing human raters with different experience levels. In addition to experience, more extensive training may also have contributed to the difference between experienced and less experienced examiners, because such training has been recommended for precise use of the Penetration/Aspiration Scale [26].

As expected, the intrarater reliability was excellent for human and machine reading except in the novice physician (Human 3). Regarding interrater reliability, the kappa coefficients showed substantial agreement between every pair of raters, human or machine, except for Human 2 vs. Human 3 and the machine vs. Human 3 in the second session, which showed only moderate agreement. Human 3 showed the lowest agreement with the other human raters and the machine as well as the lowest intrarater reliability, suggesting that experience may play an important role in the analysis of VFSS results by humans. It is reasonable to speculate that our deep learning model might be more reliable than an inexperienced human reader for VFSS analysis.

The PRRs showed inconsistent results both between human raters and between the machine and human raters, but the values were generally above 70%, except for Human 3 in the second session. It can be speculated that the agreement between experienced human raters and the deep learning model is high for positive results (the presence of penetration or aspiration). The lower PRR values between Human 3 and the other human raters as well as the machine may again suggest that interrater agreement may be affected by the raters' experience level. The PRRs of the machine to the human raters showed almost perfect agreement (above 80%), although the PRRs of the human raters to the machine showed much lower values. The meaning of the difference between "machine-to-human" and "human-to-machine" PRRs is unclear. The NRRs, meaning the agreement for negative results (the absence of laryngeal penetration or aspiration), were generally lower, but not by a wide margin. To compare the agreement between the human raters and the agreement between the machine and human raters, we averaged and compared the PRRs and NRRs. The differences were not significant, suggesting that the overall agreement between the machine and human raters was noninferior to that between the human raters for both positive and negative results.

These results indicate that computer-assisted analysis using a deep learning model is a reliable method for detecting laryngeal penetration or aspiration through a VFSS. Considering its consistency and efficiency, deep learning computer analysis could provide good assistance to human examiners, who are vulnerable to fatigue and variability. It is anticipated that machine reading with a deep learning model will be able to improve the reliability and accuracy of VFSS analysis by reducing the time and effort required of human observers. The concept of computer-assisted detection of penetration or aspiration is of great clinical value for many reasons such as the potential for lower cost screening for aspiration or the facilitation of telehealth.

This study has several limitations. In the present study, human raters and the machine judged the existence of laryngeal penetration or aspiration only, although most VFSS examiners evaluate the depth and amount of laryngeal penetration or aspiration as well as its presence. The ultimate purpose of VFSS is not only to detect penetration or aspiration, but also to evaluate the pathophysiology and mechanism of swallowing. However, variables other than laryngeal penetration and aspiration were not considered in the analysis because the deep learning model was designed and trained only for the detection of laryngeal penetration or aspiration. Therefore, the machine described in this study is at best a prototype that proves that penetration/aspiration can be detected by computers, but in no way resembles human interpretation of VFSS, at least for now. There was no distinction between penetration and aspiration in this study, although they have different clinical meanings [28]. The dynamics of continuous eating were not assessed in this study because the analysis was limited to videos containing only one swallowing event. Additionally, the meaning and usefulness of the reliability results might be limited by the absence of a gold standard for comparison. For the same reason, selection bias could not be eliminated completely in the choice of video files, although we made every effort to avoid it. Despite these limitations, we believe that machine reading by a deep learning algorithm can assist human observers, helping to minimize the variability and improve the efficiency of VFSS analysis. Further studies are required to develop more sophisticated models that can assess VFSS images more comprehensively. The results presented in this study are only descriptive statistics. This study did not aim to determine the superiority or inferiority of machine reading, only to demonstrate its usefulness.

#### **5. Conclusions**

Computer analysis using a deep learning model can provide a reliable method for detecting the existence of laryngeal penetration or aspiration in VFSS images. This deep learning model has promising prospects for use in VFSS analysis although further research will be required to increase its reliability and accuracy.

**Author Contributions:** Conceptualization, S.J.L. and Y.K.; methodology, S.-I.C., H.-I.K., S.J.L. and Y.K.; software, H.-I.K. and S.-I.C.; validation, S.J.L. and S.-I.C.; formal analysis, S.J.L. and Y.K.; investigation, Y.K., S.J.L., G.S.P. and S.Y.K.; resources, S.J.L. and Y.K.; data curation, Y.K. and S.J.L.; writing—original draft preparation, Y.K.; writing—review and editing, S.J.L., S.Y.K. and S.-I.C.; visualization, Y.K. and H.-I.K.; supervision, S.J.L. and S.-I.C.; project administration, S.J.L.; funding acquisition, S.J.L. and S.-I.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Research Foundation of Korea through the Korean Government (MSIT) under 2021R1A2B5B01001412 and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant Number 2018R1D1A3B07049300).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Dankook University Hospital (IRB No. 2020-11-015).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available from the corresponding author upon reasonable request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **The Effectiveness of Rehabilitation of Occupational Voice Disorders in a Health Resort Hospital Environment**

**Anna Sinkiewicz 1,\*, Agnieszka Garstecka 1, Hanna Mackiewicz-Nartowicz 1, Lidia Nawrocka 1, Wioletta Wojciechowska <sup>2</sup> and Agata Szkiełkowska <sup>3</sup>**


**Abstract:** Background: The aim of this study was to present a rehabilitation program of occupational voice disorders for teachers, conducted in the form of health resort stays, and evaluate its effectiveness depending on job seniority. Methods: The study included 420 teachers who participated in a complex vocal prophylactic and rehabilitation program carried out during a 24-day stay at a health resort hospital. Employment time varied from 4 to 45 years (mean 28.3 years). The participants were divided into three groups: employment time < 21 years (57 teachers), 21–30 years (182 teachers) and > 30 years (181 teachers). All of the subjects underwent maximum phonation time assessment as well as jitter, shimmer and NHR (noise to harmonic ratio) parameters assessment before and after the program; they also underwent perceptual evaluation using the GRBAS scale and voice self-assessment using the VHI-30 scale. Results: The perceptual evaluation using the GRBAS scale and self-report measures of voice function assessed using the VHI scale revealed improvement (*p* < 0.001). The parameters of jitter, shimmer and NHR improved significantly: jitter *p* < 0.001, shimmer *p* < 0.001 and NHR *p* < 0.003. Maximum phonation time increased slightly but significantly (*p* < 0.001). For all of the studied groups regardless of their employment time, maximum phonation time increased (*p* < 0.001). Initially, the lowest values of maximum phonation time were observed in teachers with longer job seniority, which improved after the rehabilitation but remained <15 s. Conclusions: Voice care for teachers is crucial regardless of their job seniority. Early prophylaxis for voice disorders is effective, as the results of rehabilitation are better in teachers with a shorter employment time.

**Keywords:** occupational voice disorders; prevention; prophylaxis; teachers; occupational health; voice training; balneotherapy

**Citation:** Sinkiewicz, A.; Garstecka, A.; Mackiewicz-Nartowicz, H.; Nawrocka, L.; Wojciechowska, W.; Szkiełkowska, A. The Effectiveness of Rehabilitation of Occupational Voice Disorders in a Health Resort Hospital Environment. *J. Clin. Med.* **2021**, *10*, 2581. https://doi.org/10.3390/jcm10122581

Academic Editor: Renée Speyer

Received: 8 April 2021 Accepted: 8 June 2021 Published: 11 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

For teachers, the ability to tolerate strain on their vocal organ is essential for safe and comfortable work. Vocal hygiene and stress resistance also play an important role. School teaching is considered to be a profession at a higher risk for developing voice disorders [1,2]. The percentage of teachers with voice problems ranges from 13% [3] to 94% [4]. Lack of sufficient preparation of some teachers for frequent use of their voice at work [5–7], difficult working conditions such as noise, working long hours without rest and poor climatic conditions in classrooms result in a higher prevalence of voice disorders than in the general population [8]. The influence of other significant factors on the occurrence of voice disorders, such as age and gender, is also important [9]. Long periods of treatment, surgical interventions and sick leave are associated with high financial costs [2]. This is a widespread social problem involving not only health but also economic aspects [10,11]. It is therefore important to search for effective methods of prevention and rehabilitation programs for occupational voice disorders.

The effectiveness of complex voice rehabilitation programs in ambulatory care has been assessed by many authors [12–15]. It was observed that vocal hygiene training significantly improves voice quality and reduces disorder symptoms [16,17]. Multicenter efforts to improve quality of care for persons professionally and strenuously using their voice resulted in the development of an interdisciplinary 24-day vocal prophylactic and rehabilitation program conducted in health resort hospitals [18].

The aim of this study was to evaluate the effectiveness of the prevention and rehabilitation program for voice disorders in teachers conducted in a health resort hospital, with analyses of the factors affecting the outcomes.

#### **2. Materials and Methods**

The study was completed in accordance with the ethical standards of the institutional research committee and principles of the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research involving Human Subjects. Ethical approval for this study was obtained from the Ethics Committee of the Collegium Medicum, Nicolaus Copernicus University. Written informed consent was obtained from patients before the study.

This program has been implemented in 5 health resort hospitals in Poland, located in places with a mild climate favorable to the treatment of respiratory diseases. A total of 3685 participants, 3440 female (93.3%) and 245 male (6.7%), participated in a complex rehabilitation program conducted in one health resort hospital between 2015 and 2019.

The study included teachers who had participated in a 24-day vocal prophylactic and rehabilitation program in 2019. The study group consisted of 420 participants aged 28–64 years (mean 51.4 years) with employment time that varied from 4 years to 45 years (mean 28.3 years). As the teaching profession in Poland is female dominated, all participants included in the study were females who were diagnosed with hyperfunctional dysphonia. Dysphonia had been diagnosed by a referring physician, and the diagnosis was confirmed by an initial phoniatric examination. In order to unify the assessment of voice rehabilitation results and the evaluation of voice acoustic parameters, the study group excluded males and females diagnosed with other diseases, such as glottic insufficiency or chronic hypertrophic laryngitis, which are often permanent voice disorders. Male teachers experience voice disorders less frequently and they constituted only 6.7% of the respondents.

Depending on their employment time, the participants were divided into three groups (Table 1).


**Table 1.** Employment time.

All study participants underwent an initial medical examination comprising family history taking and laryngological and phoniatric examinations. Maximum phonation time (MPT) was obtained as the maximum value of three subsequent trials in which each participant sustained the vowel /a/ for as long as possible using a comfortable pitch and volume [19].

Perceptual evaluation of voice disorders was performed using the GRBAS scale: overall grade (G), the degree of hoarseness; roughness (R), rough voice; breathiness (B), breathy character of voice; asthenia (A), weak voice; and strain (S), voice tension. Each parameter was evaluated on a 4-point scale: 0 (normal), 1 (mild), 2 (moderate), 3 (severe) (Figure 1). The following were also evaluated:


**Figure 1.** Changes in perceptual voice assessment on the GRBAS scale after the rehabilitation program in groups by their job seniority (the range for the GRBAS scale was 0–15 points).

The assessments were made during the initial examination by a phoniatrist and a speech therapist. The VHI voice self-assessment scale proposed by Jacobson et al. [20] in 1997, the Polish version of which was developed by Pruszewicz et al. in 2004 [21], comprises ten voice disorder variables in each of three domains: emotional, physical and functional. Patients are requested to note the frequency of each variable on a five-point scale (never, almost never, sometimes, almost always, always). The score ranges from 30 (unaffected) to 120 (severely affected) (Figure 2) [20,21].

Analysis of voice acoustic parameters (jitter, shimmer, NHR) was performed using the DiagnoScope Specialist software [22] before and after the treatment.

The vocal prophylactic and rehabilitation program included educational lectures, voice therapy, physiotherapy and psychotherapy. Educational lectures consisted of vocal hygiene, voice emission mechanisms, voice control, proper voice emission and vocal effort, as well as disorders and laryngeal problems caused by voice abuse, misuse or overuse. The lectures were conducted by a phoniatrist and a speech therapist 5 times per week with durations of 45 min.

Voice rehabilitation consisted of individual and group classes, including relaxation techniques, proper breathing technique, posture, voice emission, articulation and activation of resonators. The aim of the exercises was to eliminate improper breathing, speech and articulation habits, and develop correct habits. Particular attention was paid to voice stabilization and the extension of the phonation time [23]. The exercises were conducted to gain and consolidate the ability to produce a soft voice attack, as well as to enhance the upper vocal tract resonance. A speech therapist conducted individual exercises once a day for 20 min 5 days a week and 30-min group meetings twice a week.

**Figure 2.** Changes in voice self-assessment on VHI scale after the rehabilitation program in groups by their job seniority.

Physiotherapy included manual therapy, calcium iontophoresis and inhalations. Individual and group psychotherapy was an important part of the program, and focused on stress therapy and stress management techniques. Phoniatric assessment was carried out twice during the program.

All participants were taken care of by the same team of 2 phoniatrists, 3 speech therapists, 3 physiotherapists and 1 psychologist.

The data were statistically analyzed using IBM SPSS 25.0.0.1 (Toruń, Poland). A repeated-measures analysis of variance was conducted: the therapy effect was tested as the within-subject (repeated) factor, and the 3 groups defined by employment time were compared as the between-group factor. The Greenhouse–Geisser correction was used when the assumption of sphericity was violated.

#### **3. Results**

In the perceptual voice evaluation using the GRBAS scale, a statistically significant improvement after therapy (*F*(1, 417) = 730.33; *p* < 0.001; *η*<sub>p</sub><sup>2</sup> = 0.64) was achieved in all voice qualities.
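As a plausibility check, partial eta squared can be recovered from an *F* statistic and its degrees of freedom. The sketch below assumes that the degrees of freedom for the therapy effect are 1 and 417 (420 participants in three groups); the function name is illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared from an F statistic.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# GRBAS therapy effect, assuming df = (1, 417): 730.33 / (730.33 + 417)
grbas_effect = round(partial_eta_squared(730.33, 1, 417), 2)
```

Under this assumption, the value rounds to the reported effect size of 0.64.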

Voice self-assessment on the VHI scale improved by more than 6 points after therapy in all the subjects, and the improvement was statistically significant (*F*(1, 417) = 35.96; *p* < 0.001; *η*<sub>p</sub><sup>2</sup> = 0.08).

In all groups, regardless of the employment time, MPT prolongation was observed (*F*(1, 417) = 39.48; *p* < 0.001; *η*<sub>p</sub><sup>2</sup> = 0.09). The initial MPT was the shortest in the group with the longest job seniority. After the rehabilitation, MPT improved, as in the other groups, but remained <15 s. Job seniority had a significant main effect (*F*(1, 417) = 3.67; *p* = 0.026; *η*<sub>p</sub><sup>2</sup> = 0.02). Group comparison showed that MPT in the group with job seniority of up to 20 years differed significantly (*p* = 0.038) from MPT in the patient group with job seniority of over 30 years (Figure 3).

**Figure 3.** Changes in MPT after the rehabilitation program in groups by their job seniority.

In the present study, the perceptual voice evaluation using the GRBAS scale, in all features combined, showed a statistically significant improvement and was consistent with both the results of voice self-assessment (VHI questionnaire) and the objective acoustic analyses of the jitter (*F*(1, 417) = 28.27; *p* < 0.001; *η*<sub>p</sub><sup>2</sup> = 0.06), shimmer (*F*(1, 417) = 10.26; *p* = 0.001; *η*<sub>p</sub><sup>2</sup> = 0.02) and NHR parameters (*F*(1, 417) = 9.12; *p* = 0.003; *η*<sub>p</sub><sup>2</sup> = 0.02) (Figure 4).

**Figure 4.** Changes in acoustic parameters after the rehabilitation program in groups by their job seniority: (**a**) jitter; (**b**) shimmer; (**c**) NHR.

#### **4. Discussion**

Complex voice rehabilitation in the form of stationary health resort treatments sets up conditions for focusing solely on this activity for 24 days, and gives the opportunity to combine systematic exercises, simultaneous physiotherapy and mental relaxation. It is important that the therapy does not cause any voice strain. A break from work without active voice rehabilitation is just a rest, and returning to work means a return to abnormal voice emission patterns and the recurrence of symptoms. Harmful habits, such as an uneconomical breathing pattern practiced for years, lack of control over the laryngeal muscles, speaking too loudly or clearing the throat by grunting, cannot be changed by a one-time recommendation from a physician.

The main problem of rehabilitation psychology is to stimulate the motivation to implement a rehabilitation program [24]. Health resort treatments give the opportunity to start and maintain a healthy lifestyle. This is facilitated by a comfortable sense of well-being related to rest and relaxation, as well as climatic conditions beneficial to the respiratory tract. An important part of the primary and secondary prevention of voice disorders is physical activity, which is often neglected by teachers. A survey by Rosłaniec et al. showed that over 40% of the respondents did not practice physical activity on a regular basis [25]. Other studies have revealed a relationship between the prevalence of voice disorders and a lack of physical activity. Teachers who did not practice physical activity were diagnosed with dysphonia more often than those who exercised three or more times a week [26]. The rehabilitation program offers daily breathing and relaxation exercises. Moreover, participants receive individual recommendations on how to continue exercising at home.

The conditions of health resort-based treatments are particularly conducive to health education, because highly qualified professionals have extensive experience in conducting lectures, talks or interactive workshops. The patients also have free time during their stay, and therefore are positive about participating in educational activities. An educated patient is more independent, has a better quality of life, understands medical recommendations better and turns to specialists for advice less frequently [27].

Data presented on the basis of extensive meta-analyses show that occupational voice disorders are not only caused by the excessive use of voice but are also related to working environments and general health, as well as psychological and sociodemographic factors [9,13,28]. The presented study did not show any worse results from the health resort treatment in patients with comorbidities according to the MPT, jitter, shimmer and NHR acoustic parameters and the GRBAS perceptual evaluation. On the other hand, better initial MPT values were found in teachers with the shortest job seniority, which made their phonation time the longest after the therapy, with a similar improvement in all study groups. The results of the study showed that voice rehabilitation is important in each group, regardless of the employment time; however, the initial breathing capacity and laryngeal muscles are better in younger patients.

A study by Vaca et al. showed that an age above 50 is associated with an increased risk of voice disorders [29]. Weaker tension of the respiratory and laryngeal muscles can have a negative impact on vocal endurance and voice quality, especially when both deficits occur concomitantly. Voice changes usually refer to difficulties in maintaining the fundamental frequency and shorter phonation time [30]. Patients with the longest work experience are less likely to achieve the desired outcomes of voice rehabilitation, which may not result only from the physiological changes related to age. The study by Rosłaniec et al. showed that teachers over 50 years of age complied with the rules of voice emission and hygiene to a much lesser extent than younger teachers. The VHI voice self-assessment questionnaire is a recognized and useful tool for assessing the progress of voice therapy [31–33]. Teachers' high sensitivity and expectations regarding their own voice make the VHI scale particularly useful in this professional group. However, it is not the numerical value of the VHI test itself that is important but the degree of improvement after treatment [34]. In the study group, after the health resort stay, the voice self-assessment based on the VHI scale improved in all respondents by more than six points (*p* < 0.001).

An improvement in voice parameters after 24 days of an intensive complex rehabilitation program is an expected result. Many authors demonstrated an improvement in the voice of teachers undergoing outpatient rehabilitation [14,35,36].

Therefore, does the presented rehabilitation program allow the intended aim to be achieved more effectively?

Launching a preventive and rehabilitative program based on a health resort hospital environment requires the initial organization of a diagnostic and rehabilitation base with a team of specialists, and the development of a code of conduct. It is also important to adopt uniform criteria to qualify participants. According to the program assumptions, people with the greatest chance of improving their vocal endurance and voice quality should qualify for the program, which will then enable them to continue their professional career.

Based on over 5 years of experience with complex health resort-based rehabilitation and the meta-analysis by Byeon, it can be concluded that the essential preconditions for the effectiveness and durability of the treatment are a vocal apparatus free of permanent disorders, the absence of comorbidities that affect the vocal function of the larynx, and active participation in all conducted activities [32].

Given the benefits of this type of therapy, but also limitations such as a 24-day absence from work and considerable costs of the stay and treatment, it is necessary to determine the optimal feasible frequency of participation in such a rehabilitation program. Repetition of health resort treatments offers a chance to consolidate acquired skills and habits, especially in patients with shorter job seniority.

#### **5. Conclusions**

In the search for effective methods of prevention and therapy of voice disorders in teachers, it should be recognized that health resort rehabilitation is an attractive form of treatment, as it combines vocal rest with active rehabilitation and health education. An additional advantage of such rehabilitation is climate therapy. Various studies confirmed the purposefulness of voice care at every career stage; however, from the perspective of health and labor economics, early prevention is more appropriate because there is a better chance for voice regeneration for people with shorter work experience.

The co-financing of such rehabilitation is also of great importance, as its multidisciplinarity is associated with considerable costs. In the end, however, the benefits outweigh the otherwise possible expenses related to illness treatment, sick leave and other problems related to the continuation of participants' professional careers.

**Author Contributions:** Conceptualization, A.G. and A.S. (Anna Sinkiewicz); Methodology, A.S. (Anna Sinkiewicz), A.G., H.M.-N. and L.N.; Literature Review, A.S. (Anna Sinkiewicz), A.G. and H.M.-N.; Data Analysis, A.S. (Anna Sinkiewicz), A.G. and H.M.-N.; Original Draft Preparation, A.S. (Anna Sinkiewicz), L.N. and W.W.; Review and Editing, A.S. (Anna Sinkiewicz), A.G. and A.S. (Agata Szkiełkowska); Visualization, L.N. and W.W.; Validation, A.S. (Anna Sinkiewicz), A.G., W.W. and A.S. (Agata Szkiełkowska); Data Curation, H.M.-N. and W.W.; Supervision, A.S. (Anna Sinkiewicz); Project Administration, L.N. and A.S. (Agata Szkiełkowska). All authors have read and agreed to the published version of the manuscript.

**Institutional Review Board Statement:** The study was completed in accordance with the ethical standards of the institutional research committee and principles of the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research involving Human Subjects. Ethical approval for this study was obtained from Ethics Committee of the Collegium Medicum, Nicolaus Copernicus University.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** All data used to support the finding of this study are available from the corresponding author upon request.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **T1a Glottic Cancer: Advances in Vocal Outcome Assessment after Transoral CO2-Laser Microsurgery Using the VEM**

**Wen Song 1, Felix Caffier 1, Tadeus Nawka 1, Tatiana Ermakova 2, Alexios Martin 3, Dirk Mürbe 1 and Philipp P. Caffier 1,\***


**Abstract:** Patients with unilateral vocal fold cancer (T1a) have a favorable prognosis. In addition to the oncological results of CO2 transoral laser microsurgery (TOLMS), voice function is among the outcome measures. Previous early glottic cancer studies have reported voice function in patients grouped into combined T stages (Tis, T1, T2) and merged cordectomy types (lesser- vs. larger-extent cordectomies). Some authors have questioned the value of objective vocal parameters. Therefore, the purpose of this exploratory prospective study was to investigate TOLMS-associated oncological and vocal outcomes in 60 T1a patients, applying the ELS protocols for cordectomy classification and voice assessment. Pre- and postoperative voice function analysis included: Vocal Extent Measure (VEM), Dysphonia Severity Index (DSI), auditory-perceptual assessment (GRB), and 9-item Voice Handicap Index (VHI-9i). Altogether, 51 subjects (43 male, eight female, mean age 65 years) completed the study. The 5-year recurrence-free, overall, and disease-specific survival rates (Kaplan–Meier method) were 71.4%, 94.4%, and 100.0%. Voice function was preserved; the objective parameter VEM (64 ± 33 vs. 83 ± 31; mean ± SD) and subjective vocal measures (G: 1.9 ± 0.7 vs. 1.3 ± 0.7; VHI-9i: 18 ± 8 vs. 9 ± 9) even improved significantly (*p* < 0.001). The VEM best reflected self-perceived voice impairment. It represents a sensitive measure of voice function for quantification of vocal performance.

**Keywords:** T1a glottic carcinoma; transoral laser microsurgery; treatment outcome; vocal function; objective voice diagnostics; vocal extent measure (VEM)

**1. Introduction**

Laryngeal cancer is the most frequent malignant tumor in the head and neck area and one of the most common tumors of the respiratory tract [1–3]. GLOBOCAN estimates that more than 177,000 people worldwide developed laryngeal cancer in 2018, with men being affected significantly more often than women (155,000 vs. 22,000) [4]. The prognosis depends mainly on the localization, the TNM classification and the R-status, but also the differentiation and the presence of lymphangiosis carcinomatosa are relevant predictors [5–7]. In the glottis, squamous cell carcinomas are the most frequent type (60 to 80%) compared to other tumor sites within the larynx [8–10]. In early glottic cancer, carcinoma in situ (Tis) must be differentiated from T1 and T2 laryngeal cancer. Invasive T1 glottic cancer is limited to one (T1a) or both (T1b) vocal folds (VF) with normal respiratory but impaired phonatory VF mobility.

T1 and early T2 glottic carcinomas have a very good prognosis due to the early symptom of hoarseness, which usually leads to a quick diagnosis and prompt initiation of therapy. In addition, metastasis rates are low [11–13]. In the literature, the 5-year overall survival after therapy of early glottic cancer is reported to be in the 74–100% range [14,15]. Involvement of the anterior commissure is associated with higher local recurrence and lower laryngeal preservation, but with no statistically significant difference in 5-year overall survival [16,17]. In Steiner's landmark study of 240 patients with laryngeal cancer, early-stage carcinomas had an overall 5-year survival rate of 86.5% (disease-specific 100%), 6% local recurrences, with 99.4% larynx preservation [18]. Ledda and Puxeddu evaluated the oncologic efficacy in 103 patients with early glottic carcinoma, reporting for T1 a 5-year recurrence-free rate of 96% (local control 98%, larynx preservation 100%) [19]. Canis et al. showed in 404 pT1a patients the following 5-year Kaplan–Meier estimates: local control 86.8%, overall survival 87.8%, disease-specific survival 98.0%, recurrence-free survival 76.1%, and larynx preservation 97.3% [20]. Batra et al. presented in 53 patients with Tis and T1 comparable results: local control 86.7%, ultimate local control (with CO2-laser alone) 90.5%, 3-year overall survival 92.4%, 3-year disease-specific survival and larynx preservation 98.1% [21]. An analysis of 2436 transorally treated T1/T2 carcinomas showed a 5-year overall survival of 82% [22]. For disease-specific survival after T1 and T2 transoral resection, 5-year survival rates of 89–100% are reported in the literature [23]. Meta-analyses on laryngeal preservation after transoral laser resection of T1 and T2 report rates of 83–100% [24].

**Citation:** Song, W.; Caffier, F.; Nawka, T.; Ermakova, T.; Martin, A.; Mürbe, D.; Caffier, P.P. T1a Glottic Cancer: Advances in Vocal Outcome Assessment after Transoral CO2-Laser Microsurgery Using the VEM. *J. Clin. Med.* **2021**, *10*, 1250. https://doi.org/10.3390/jcm10061250

Academic Editor: Renée Speyer

Received: 15 February 2021 Accepted: 15 March 2021 Published: 17 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Early detection of laryngeal cancer can minimize surgical trauma, improve therapeutic outcome and reduce mortality [25]. It is a general consensus that the larynx should be examined laryngoscopically in all patients with hoarseness lasting more than 3 to 4 weeks [26,27]. Videolaryngostroboscopy (VLS) can indicate invasive tissue growth by eliminated mucosal wave propagation and reduced or absent phonatory VF mobility [28,29]. Electronic chromoendoscopy can improve the recognition of tumor margins [30]. A recording of connected speech to document the impaired vocal function is considered a minimum requirement for functional assessment [31]. Small glottal findings suspected of malignancy such as precursor lesions, Tis, and T1a carcinomas, can be completely removed during diagnostic microlaryngoscopy to confirm the diagnosis by excision biopsy [32,33]. Apart from the health status, the quality of life in patients with T1 glottic cancer depends mainly on the voice quality and thus on the extent of the resected VF tissue [34–36]. Surgical therapy is preferred [37,38]; primary radiotherapy, however, can also be used as a conservative VF preserving procedure [39,40].

Transoral CO2-laser microsurgery (TOLMS) was introduced by Strong and Jako for the therapy of early laryngeal cancer in the 1970s [41], and Steiner gave further impetus in the propagation of this technique [18,42]. Today, TOLMS is established for the treatment of early glottic carcinoma with highly satisfying oncological and functional outcomes (e.g., [20,43,44]). However, many studies predominately focus on oncological results and not on functional outcomes. As the vocal outcome depends on the amount of removed tissue, the consistent classification of endoscopic cordectomies of the European Laryngological Society (ELS) allows interpretation of postoperative results with regard to the surgical strategy and comparison between different surgical centers [45]. The main objective of this exploratory study was to examine in detail the vocal outcome in patients with T1a glottic cancer. The hypothesis was that voice function can be preserved after TOLMS. Therefore, we planned to explore the pre- and postoperative vocal function using specific subjective and objective parameters including the vocal extent measure (VEM) based on the voice range profile (VRP) [46].

#### **2. Materials and Methods**

#### *2.1. Study Design and Patients*

Patients diagnosed with suspected T1a glottic carcinoma underwent direct microlaryngoscopy under general anaesthesia with TOLMS in a prospective study. Clinical examination and data acquisition took place at the initial pre-therapeutic visit, during operation, and at regular follow-ups postoperatively. The voice was examined the day before TOLMS and 3 months after in-sano resection and completed wound healing. Study participants were patients presenting with hoarseness at the Department of Audiology and Phoniatrics, Charité–University Medicine Berlin, Germany. Altogether, 60 consecutive patients were recruited between June 2009 and October 2019. Selection criteria comprised histologically confirmed pT1a cN0 cM0 glottic carcinoma, complete treatment documentation, and informed consent. Patients with Tis, T1b and T2 glottic cancer were not included in this investigation.

#### *2.2. Surgical Procedure and Postoperative Regimen*

Microlaryngoscopy was conducted via the operating microscope type OPMI Sensera (Zeiss, Jena, Germany) and the Kleinsasser laryngoscope suspension system (Storz, Tuttlingen, Germany). TOLMS was performed with the AcuPulse 30W/40 ST CO2-laser system (Lumenis, Yokneam, Israel) using the following parameters: output power 2 to 5 W, super pulse mode, continuous wave, spot size 200 μm, focal length 400 mm. Conventional intraoperative safety precautions were respected (patient covering with moist cloths, safety goggles, laser-resistant endotracheal tube, ventilation with oxygen concentration below 40%). After inspection and palpation under the microscope, saline containing epinephrine (1 mg/mL; 10 gtt. in 10 mL NaCl) was injected into the VF. The resulting stretching of the epithelium made it possible to assess the fixation of the lesion to deeper structures. The saline also protected the healthy surrounding VF tissue from thermal damage. Laser incisions were made at the site where the suspicious lesions could be distinguished from normal epithelium, considering a safety margin of at least 1 mm. Depending on the pre- and intraoperative findings, cordectomy was conducted. After having removed the suspicious cancerous tissue, the surgeon classified the resection type according to the cordectomy types of the ELS [45]. Lesions within the epithelial level without fixation or signs of infiltration were superficially removed en bloc. Marginal resections were taken if the complete tumor removal was uncertain. All excision biopsies were sent for histopathological examination. The guidelines of the American Joint Committee on Cancer (AJCC) were used for tumor staging [47]. Patients with histopathologically confirmed R1 status were rescheduled for follow-up resection. All TOLMS operations were performed by 5 experienced laryngologists. After surgery, patients were monitored on the ward for 1–2 nights. Before discharge, all treated patients received vocal hygiene counseling. In the event of recurring voice impairment, they were asked to present again between regular follow-up intervals. Postoperative voice rest was not recommended.

#### *2.3. Examination Instruments and Criteria*

The analysis of treatment outcome was based on postoperative histopathological findings, pre- and postoperative VLS, and voice function diagnostics. Digital 2D or 3D VLS was carried out via rigid transoral or flexible transnasal endoscopes with integrated microphones (XION GmbH, Berlin, Germany) [28,48]. According to the ELS protocol, voice function diagnostics consisted of established subjective (i.e., auditory-perceptual assessment, self-evaluation of voice) and objective procedures (i.e., VRP measurement, acoustic-aerodynamic analysis) [49–51]. Objective procedures quantify the investigated aspects of vocal function in an apparatus-based and neutral manner. Subjective tests describe the individual self-perceived vocal impairment from the examined person's point of view as well as auditory-perceptual assessments from the examiner's viewpoint.

Auditory-perceptual assessment of the recorded voice samples was conducted using the GRB system [31]. The perceived overall grade of hoarseness (G), roughness (R), and breathiness (B) were independently rated on a scale from 0 to 3 (0 = not existing, 1 = mild, 2 = moderate, 3 = severe) by two senior phoniatricians. From each audio recording the mean score of both GRB evaluations served for further analysis.

Subjective self-assessment of voice was obtained using the 9-item Voice Handicap Index (VHI-9i) including 9 questions rated on a scale from 0 to 4 (0 = never, 1 = almost never, 2 = sometimes, 3 = almost always, 4 = always) [52]. The VHI-9i reflects the functional, physical and emotional impact of the voice disorder on the patient's quality of life. Additionally, an estimation of the self-perceived overall vocal impairment (VHIs) at the time of questioning was scored between 0 and 3 (0 = normal, 1 = mild, 2 = moderate, 3 = severe).
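As a small illustration of the scoring described above, the following sketch adds up nine item ratings. It assumes that the VHI-9i total is the plain sum of the nine ratings (range 0 to 36), which the text does not state explicitly; the function name `vhi9i_total` and the sample ratings are hypothetical.

```python
def vhi9i_total(item_scores):
    """Sum nine item ratings (each 0-4) into a total between 0 and 36.

    Assumption: the total score is the unweighted sum of the nine items.
    """
    if len(item_scores) != 9:
        raise ValueError("VHI-9i expects exactly 9 item ratings")
    if any(score not in (0, 1, 2, 3, 4) for score in item_scores):
        raise ValueError("each rating must be an integer from 0 to 4")
    return sum(item_scores)


# Hypothetical questionnaire: every item answered "sometimes" (2).
print(vhi9i_total([2] * 9))
```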

VRP measurements and acoustic-aerodynamic analyses were performed with the DiVAS software (XION GmbH) to obtain objective quantitative data of the speaking and singing voice. The following parameters were collected: soft phonation threshold, highest and lowest pitch, maximum phonation time (MPT), jitter, dysphonia severity index (DSI) [53], and VEM [46]. The VEM is the logarithmised product of the area of the VRP (AVRP) and the quotient of the circumference of a circle with the same area and the actual VRP circumference (PVRP), supplemented by the addition of a coefficient (50) and an offset (−200). The mathematical formula is:

$$\mathrm{VEM} = 50\ln\left(A_{VRP}\,\frac{2\pi\sqrt{\frac{A_{VRP}}{\pi}}}{P_{VRP}}\right) - 200 \tag{1}$$

The VEM quantifies the patient's dynamic performance and the frequency range as documented in the VRP. It expresses the vocal capacity as an interval-scaled value, mostly between 0 and 120. A high vocal capacity is characterized by a high VEM; conversely, a small VRP results in a small VEM.
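Equation (1) can be evaluated directly once the area and the perimeter of a VRP are known. The sketch below is only an illustration of that arithmetic and not the DiVAS implementation; the function name `vocal_extent_measure` and the numerical inputs are hypothetical, and how area and perimeter are extracted from a recorded profile is outside its scope.

```python
import math


def vocal_extent_measure(area_vrp, perimeter_vrp):
    """Evaluate Equation (1): VEM = 50 * ln(A * C_circle / P) - 200.

    C_circle is the circumference of a circle with the same area as the VRP.
    """
    if area_vrp <= 0 or perimeter_vrp <= 0:
        raise ValueError("area and perimeter must be positive")
    circle_circumference = 2.0 * math.pi * math.sqrt(area_vrp / math.pi)
    ratio = circle_circumference / perimeter_vrp
    return 50.0 * math.log(area_vrp * ratio) - 200.0


# Hypothetical profiles with the same area but different outlines:
# the more irregular outline (longer perimeter) yields the lower VEM.
compact = vocal_extent_measure(area_vrp=500.0, perimeter_vrp=100.0)
ragged = vocal_extent_measure(area_vrp=500.0, perimeter_vrp=160.0)
print(round(compact, 1), round(ragged, 1))
```

The quotient of the two circumferences acts as a shape penalty: for a given area it equals 1 only for a circular outline and decreases as the outline becomes more irregular.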

#### **3. Data Analysis**

Descriptive statistics were used to describe the quantitative features of all pre- and postoperative parameters and their changes. As graphical techniques to display the data, we chose histograms and violin plots, i.e., box plots with kernel density plots rotated and surrounding them on each side. Being suitable for both continuous and ordinal variables, Spearman's rank-order correlation (rs) was used to investigate the strength and direction of association between the pre- and postoperatively measured characteristics and their differences. The Wilcoxon signed-rank test was used to test whether vocal function parameters significantly improved as the result of TOLMS. Mean values and 95% confidence intervals for these changes were calculated. The impact of patient-related, tumor-related, and treatment-related factors on disease control and survival was analyzed using the Kaplan–Meier method. All statistical tests and graphics were done using R version 4.0.1 (GNU project, Free Software Foundation, Boston, MA, USA). The level of significance was set at α = 0.05. Due to the exploratory nature of the study no adjustment for multiple testing was performed. To show different significance levels, the following abbreviations were used: \* = 5%; \*\* = 1%; \*\*\* = 0.1%.
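To make the rank-based correlation concrete, the following sketch computes Spearman's rs as the Pearson correlation of average ranks, which also handles tied values. It is written in Python for illustration only and is not the R code used in the study; the helper names and the paired sample values are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank-order correlation of two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical paired measurements (not study data): an objective score
# that tends to fall as a self-rated handicap score rises.
objective = [40, 55, 64, 83, 97]
handicap = [25, 20, 18, 9, 12]
print(round(spearman_rs(objective, handicap), 2))
```

A negative rs, as in this toy sample, corresponds to the direction of association expected between a capacity measure and a handicap score.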

#### **4. Results**

#### *4.1. Sample Description and Preoperative Assessment*

Of the 60 patients initially recruited with histopathologically confirmed diagnosis of pT1a, six subjects (10.0%) were lost to follow-up and three subjects (5.0%) had to be excluded due to incomplete treatment documentation. In the remaining 51 patients, all diagnostic tests and therapeutic procedures were carried out as planned. The total sample consisted of 43 men and 8 women, with a mean age of 65 years (range 31–84). At the time of intervention, women were on average 16 years younger than men (52 ± 14 vs. 68 ± 10, mean ± SD, *p* < 0.01). Regarding medical history, 39 subjects (76.5%) gave information about current or past tobacco abuse, with 12 subjects (23.5%) having smoked rarely or not at all. While 15.7% of the patients (8/51) never drank alcohol, 62.7% (32/51) reported regular and 21.6% (11/51) daily consumption of alcohol. Relevant preoperative patient characteristics within the examined cohort are shown in Table 1 (left side).

VLS revealed an almost equal distribution of tumor growth on both VF (28 right, 23 left). The lesions appeared flat and hyperkeratotic in 20/51 (39.2%), exophytic in 29/51 (56.9%), and ulcerating in 2/51 (3.9%) subjects. Concerning macroscopic assessment of tumor size at initial presentation, 51.0% of the patients (26/51) showed involvement of the entire VF, while in 27.4% (14/51) two-thirds and in 21.6% (11/51) one-third of the VF were affected. During phonation, phonatory VF mobility was reduced or absent on the affected tumor side in all subjects. Additionally, patients with bulged VF due to exophytic tissue growth displayed highly impaired glottal closure.

Subjective auditory-perceptual evaluation of patients' voices was categorized preoperatively with a mean of G2 R2 B1 (range 0–3). The VHI-9i had an average score of 18 ± 8, corresponding to moderate self-assessed patient complaints. The objective acoustic and aerodynamic parameters also indicated moderate impairment (e.g., VEM 64 ± 33; DSI 1.2 ± 2.4; MPT 13 ± 6 s). Correlation analysis performed on preoperative values showed that both VEM and DSI correlated with VHI-9i (rs = −0.62\*\*\* and rs = −0.29\*, respectively), G (rs = −0.42\*\* and rs = −0.34\*), R (rs = −0.41\*\* and rs = −0.37\*\*), B (rs = −0.47\*\*\* and rs = −0.30\*), and with each other (rs = 0.51\*\*\*).

**Table 1.** Patient characteristics (*n* = 51) before TOLMS (left) and after TOLMS (right). Unless otherwise specified, data expressed as number of patients and percentage of group.


#### *4.2. Postoperative Assessment*

Via TOLMS, 24 patients received subepithelial cordectomy (type I; 47.1%), 18 patients subligamental cordectomy (type II; 35.3%), and nine patients transmuscular cordectomy (type III; 17.6%). Histopathology confirmed squamous cell carcinoma limited to one VF (pT1a) in all subjects. The grading classification revealed moderately differentiated tissue in most patients (G2; 66.7%), well-differentiated tissue less frequently (G1; 29.4%), and poorly differentiated tissue seldom (G3; 3.9%). Through primary operation, the pT1a was completely excised (R0 status) in 29 patients (56.9%). Following the piecemeal strategy, a second excision was necessary in 22 subjects (43.1%), as a residuum could not be ruled out (close tumor margin vs. R1 status). Of these 22 subjects with suspicious findings, 17 patients (77.3%) had no visual or histopathological malignant residue in the scheduled control TOLMS. Among the remaining five patients, the follow-up resections revealed residual invasive tumor in three patients (13.7%), Tis in one patient (4.5%), and a precursor lesion (squamous intraepithelial neoplasia SIN III) in the other patient (4.5%). All these lesions were completely excised during the second TOLMS.

The operative procedures were conducted without complications. Postoperatively, no patient complained about swallowing dysfunction. VLS check-ups showed fibrin formation on the wound surfaces followed by formation of scar tissue during healing. While extensive tumor growth was associated with larger glottal defects after removal, in smaller superficial findings treated via type I cordectomy a stable epithelium regenerated on the preserved lamina propria without relevant defects or scarring. In some patients, the scarred VF regained phonatory mobility after about 6 months. Figure 1 gives an impression of pre- and postoperative VLS findings with videostrobokymographic illustration of VF oscillations.

**Figure 1.** Videolaryngostroboscopic pictures and videostrobokymographic illustration of vocal fold anatomy and function, preoperative (**upper row**) vs. postoperative (**lower row**). Example A (**left side**): 45-year-old male professional theater actor with a flat hyperkeratotic lesion of the right vocal fold. Example B (**right side**): 32-year-old female medical doctor with an exophytic tumor of the right vocal fold. Findings three months postoperatively show: pT1a completely removed, healing process finished, vocal folds with straight margin, complete glottal closure, and restored phonatory mobility (A: normalized, regular and symmetric oscillations; B: oscillations with scarring-related reduced amplitude and phase shift).

Within the mean postoperative observation period of 45 ± 26 months (median: 41 months), 10 patients (19.6%) suffered from a local recurrence (1× Tis, 7× rpT1a, 1× rpT1b, 1× cT3) with an average tumor-free interval of 15 months (median 10 months). Eight of these subjects had only one recurrence within the follow-up period. Among the remaining two, further recurrences occurred: one patient with the initial diagnosis of pT1a (G3) suffered from two recurrences of rpT1a after 17 and 80 months. The other subject with the initial diagnosis of pT1a (G2) had altogether four recurrences: after 13 (rpT1a), 27 (rpT2), 44 (rT3), and 92 months (rpT4a). During follow-up, a secondary glottic pT1a on the contralateral VF was detected in two patients, 1 and 3 years after removal of the primary tumor, respectively. All recurrent and secondary laryngeal carcinomas were successfully treated: Tis, T1 and T2 via secondary TOLMS, both T3 recurrences via radio-chemotherapy, and the T4 recurrence via total laryngectomy. One subject died of a secondary pancreatic carcinoma; another died of an intercurrent illness. The 5-year recurrence-free, overall, and disease-specific survival rates (Kaplan–Meier method) were 71.4%, 94.4%, and 100.0% (Figure 2). Relevant postoperative and oncological patient characteristics are shown in Table 1 (right side).

**Figure 2.** Five-year Kaplan–Meier estimates for recurrence-free survival, overall survival, and disease-specific survival.
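Survival estimates of this kind come from the product-limit (Kaplan–Meier) estimator, which multiplies the conditional survival fractions at each observed event time and keeps censored patients in the risk set until their last follow-up. The sketch below is a minimal Python illustration of that principle, not the R analysis used in the study; the function names and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimator).

    times  -- follow-up time in months for each patient
    events -- True if the event (e.g., recurrence) occurred at that time,
              False if the patient was censored at that time
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        observed = sum(1 for u, e in zip(times, events) if u == t and e)
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
    return curve


def survival_at(curve, horizon):
    """Read the step function: estimated survival probability at `horizon`."""
    estimate = 1.0
    for t, value in curve:
        if t > horizon:
            break
        estimate = value
    return estimate


# Hypothetical follow-up of five patients (months); False marks censoring.
months = [5, 10, 10, 20, 30]
recurred = [True, True, False, True, False]
curve = kaplan_meier(months, recurred)
print(curve)
print(survival_at(curve, 60))
```

Reading the curve at 60 months gives the 5-year estimate for the chosen endpoint; the three rates in Figure 2 differ only in which event is counted.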

Three months after TOLMS, vocal function improved considerably compared to the preoperative measurements (Table 2). With respect to auditory-perceptual GRB evaluation, the pre- vs. post-therapeutical comparison revealed that the voices were less hoarse (1.9 ± 0.7 vs. 1.3 ± 0.7), rough (1.8 ± 0.7 vs. 1.2 ± 0.7), and breathy (1.0 ± 0.6 vs. 0.6 ± 0.6). The subjective vocal self-assessment via VHI-9i questionnaire demonstrated a mean reduction from 18 ± 8 to 9 ± 9 points. The VHIs criterion indicated a change from moderately (2 ± 1) to mildly disturbed voices (1 ± 1). The improvements regarding all these subjective parameters were found significant at the 0.1% level (*p* < 0.001). The subjective vocal parameters both pre- and postoperatively are displayed by histograms in Figure 3.

**Figure 3.** Subjective vocal parameters before and after pT1a removal. Upper row: Comparison of pre- and postoperative voice parameters according to the GRB-classification. Lower row: Comparison of pre- and postoperative VHI-9i and VHIs scores.


**Table 2.** Pre- and posttherapeutic parameters of vocal function in all patients and all cordectomy types (mean ± SD), their mean therapeutic differences (Diff) and 95% confidence intervals (CI) for changes in vocal measures three months after pT1a removal.

B: breathiness; DSI: dysphonia severity index; G: (overall) grade of hoarseness; MPT: maximum phonation time; R: roughness; VEM: vocal extent measure; VHI-9i: 9-item voice handicap index, VHIs: self-perceived overall vocal impairment. The level of significance is indicated as follows: \* significant at *p* < 0.05; \*\* significant at *p* < 0.01; \*\*\* significant at *p* < 0.001 (Wilcoxon signed-rank test).

Regarding objective measures, the VEM improved significantly in the total cohort (from 64 ± 33 to 83 ± 31; *p* < 0.001), in both genders (males *p* < 0.01; females *p* < 0.05) and all cordectomy types (*p* < 0.05). In contrast, the decrease of jitter (0.9 ± 1.1 to 0.6 ± 0.4) and the increase of DSI (1.2 ± 2.4 to 1.5 ± 2.3) did not reach the level of significance in the total group, only in females (*p* < 0.05) and cordectomy type III (*p* < 0.05). VEM and DSI correlated significantly with each other also postoperatively (rs = 0.62\*\*\*). The VEM showed a significant negative correlation with VHI-9i (rs = −0.29\*) but not with age (rs = −0.18), while the DSI correlated significantly with age (rs = −0.39\*\*) but not with VHI-9i (rs = −0.11). Selected objective parameters before and after pT1a removal are graphically displayed via boxplots in Figure 4 with regard to the total cohort and cordectomy type.

To provide insights into the magnitude of changes induced by TOLMS, Table 2 also presents the mean differences (and 95% confidence intervals) between pre- and posttherapeutic values. The numerical improvement in all subjective and objective parameters was larger in women than in men. Similarly, the improvement of these parameters in cordectomy type III was higher compared to the other cordectomy types.

**Figure 4.** Objective acoustic parameters VEM, DSI, and jitter before and after pT1a removal concerning the total cohort and cordectomy types. Data are compared pre- vs. postoperatively via violin plots, i.e., box plots with kernel density plots rotated and surrounding them on each side. The boxplots display the median, quartiles, and the range of values covered by the data. The density curves display the full distribution of the data including any outliers. The level of significance is indicated as follows: \* significant at *p* < 0.05; \*\* significant at *p* < 0.01; \*\*\* significant at *p* < 0.001 (Wilcoxon signed-rank test).

#### **5. Discussion**

Given the established favorable oncological results of CO2-TOLMS in T1a glottic carcinoma, functional aspects should be another treatment objective. We successfully examined the oncological and functional outcomes after TOLMS in pT1a patients, focusing on the evaluation of voice with subjective and objective parameters. Our T1a cohort is consistent with the literature in terms of patient characteristics, treatment methods, and oncological results (see Table 1, Figure 2). Therefore, a closer comparison of our vocal outcomes with the results of previous investigations is warranted.

Many studies were conducted to compare TOLMS with radiotherapy in patients with early glottic cancer [54–56]. The vocal outcomes were either superior in radiotherapy [57,58] or in TOLMS [59,60], or they did not show relevant differences between both treatment groups [61–64]. In general, pre-therapeutic voice data was often not collected [57–59,61,63–69]. In these investigations, it is impossible to relate the postoperative voice function to the pretherapeutic baseline. Some studies evaluated vocal function before and after TOLMS according to the cordectomy type [70–74]. Mainly, voice quality differed depending on the amount of tissue resected: vocal outcomes after lesser-extent cordectomies (ELS type I, II) were superior compared to larger-extent cordectomies. However, a multidimensional, detailed pre- and post-therapeutic documentation and evaluation of voice was only carried out in a few studies [62,70,71,74,75]. To compare the vocal outcomes after TOLMS, Table 3 summarizes the main results of previous investigations including the number of T1a patients treated and the parameters used for evaluation.






phosphate; MFR—mean flow rate; MPT—maximum phonation time; NHR—noise-to-harmonic ratio; NNE—normalized noise energy; N/S—not specified; PSS-H&N—performance status scale for head & neck cancer patients; RBH—Roughness, Breathiness, (overall grade of) Hoarseness; SNR—signal-to-noise ratio; SPI—soft phonation index; SPL—sound pressure level; UW-QoL—University of Washington Quality of Life questionnaire; VAS—visual analogue scale; VC—vital capacity; VHI—voice handicap index; VHI-10—10-item VHI; VHI-12—12-item VHI; VoiSS—voice symptom scale; VRP—voice range profile; V-RQOL—Voice-Related Quality-of-Life survey.

The comparability of published studies is limited due to the lack of standardization regarding (1) vocal outcome assessment (different parameters, follow-up), (2) patient selection (e.g., all early glottic cancer patients, low number of T1a), as well as (3) inclusion and treatment criteria (e.g., combined T stages and cordectomy types).

The usefulness of objective acoustic measures has been questioned. Some studies indicated that TOLMS results in an increase of F0, jitter, shimmer, and a moderate decrease of MPT in extended cordectomies when compared with healthy controls (e.g., [79]). Other studies found either a TOLMS-associated improvement [74,75,77], or no relevant changes throughout the postoperative course [70,78]. In our investigation, patients showed postoperative changes in all objective and subjective parameters. Similar to the literature, subjective parameters improved significantly [71,72,77,79]: GRB, VHI-9i and VHIs substantially improved in our total cohort, both genders, and in each cordectomy group. Among objective measures, the MPT showed non-specific, undirected changes without any significance. This is in concordance with the results of Hamzany et al., confirming that aerodynamic parameters seem to be less suitable for outcome assessment in T1a glottic carcinoma [70]. Regarding acoustic parameters, VEM seems to be very well suited to assess the resulting voice function after T1a excision compared to other objective acoustic parameters, as only this measure responded significantly in the total cohort and in all subgroups. Among cordectomy types, the larger the resection, the greater the postoperative subjective numerical benefit (Table 2). Similarly, the improvement of acoustic parameters in cordectomy type III was larger than in the other cordectomy types. This is related to the fact that larger tumors are associated with more severe voice impairment preoperatively. In contrast, better voice function in smaller tumors results in less postoperative numerical benefit, even if the final voice outcome is better. The relevant differences between the cordectomy groups (types I–III) suggest that pooling these types, as done in previous studies, does not seem appropriate. Although all subjective and objective improvements were larger in women than in men, we cannot draw general conclusions due to our limited number of female patients.

While the VEM is not yet widely applied in voice diagnostics, the multidimensional DSI represents an established parameter of instrumental voice evaluation based on a weighted combination of highest possible frequency, lowest intensity, MPT and jitter [53]. Former investigations showed that the DSI might be influenced by using different registration programs, as well as by age or gender [80,81]. These age and gender effects were also confirmed in our study. The DSI appears susceptible to extreme measures (e.g., highest frequency, lowest intensity), which are likely to be influenced by age or gender. In contrast, the VEM, calculated from area and shape of the VRP, is less affected by the above-mentioned extreme measures. Since VEM correlated highly significantly with DSI, both measurements can be seen as related and comparable parameters. Part of their shared variance could be attributable to age, although the linear relationship with age is considerably weaker for the VEM than for the DSI. However, the VEM as a positive criterion characterizes the vocal abilities and enables a classification of voice performance, while the DSI as a negative criterion particularly describes the severity of dysphonia [80,82]. Of the two parameters, the VEM better reflected the subjective vocal impairments. However, DSI, VEM, VHI, and GRB represent different aspects of the voice: they are complementary in objective and subjective evaluation of voice quality, vocal performance, or perceived vocal handicap.
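To make the "weighted combination" behind the DSI concrete, the following minimal sketch uses the commonly cited weights of the original DSI publication (Wuyts et al., 2000). That these exact weights were applied by the registration software used in this study is an assumption, and the function name and example values are purely illustrative.

```python
def dysphonia_severity_index(mpt_s: float, f0_high_hz: float,
                             i_low_db: float, jitter_pct: float) -> float:
    """DSI as a weighted sum of four voice measures.

    Weights follow the commonly cited formulation (Wuyts et al., 2000);
    assumed here, not taken from the study's own software.
    """
    return (0.13 * mpt_s           # maximum phonation time in seconds
            + 0.0053 * f0_high_hz  # highest attainable frequency in Hz
            - 0.26 * i_low_db      # lowest attainable intensity in dB
            - 1.18 * jitter_pct    # jitter in percent
            + 12.4)


# Illustrative (hypothetical) values, not patient data.
baseline = dysphonia_severity_index(20, 600, 50, 0.5)
louder_minimum = dysphonia_severity_index(20, 600, 60, 0.5)
print(round(baseline, 2), round(louder_minimum, 2))
```

In this formulation, a 10 dB rise in the softest attainable intensity alone lowers the index by 2.6 points, which illustrates why the DSI reacts strongly to the extreme measures mentioned above.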

Depending on preoperative T1a tumor characteristics, individual postoperative voice function might be better, similar, or slightly reduced. In general, objective and subjective voice quality improved during long-term postoperative follow-up. This is in line with the results of previous investigations [70,83]. Although voice diagnostics according to ELS protocol is more time-consuming, we consider this effort justified for evidence-based therapy and necessary for documentation of voice preservation. To preserve voice function, the intraoperative laser power should be selected as low as possible to avoid thermal damage in the surrounding healthy tissue. In addition, focused excision achieves better vocal outcomes than defocused vaporization [62]. The application of the KTP laser may be able to offer improved voice preservation with similar oncological control compared to CO2-TOLMS [76,77]. The focus on voice preservation may increase the number of interventions in cases with histologically questionable tumor margins [84,85]. Our experience confirms the finding in the literature that re-operation can sometimes be avoided by close monitoring of local control using VLS [44,66].

#### *Study Strengths and Limitations*

Our study is characterized by the application of multidimensional voice evaluation, extended by the objective VEM. Further strengths comprise cohort homogeneity restricted to T1a instead of all early glottic cancer patients, and evaluation of specific cordectomy types in a sufficient number of patients rather than generalization or grouping into lesser- vs. larger-extent cordectomies. Applying the ELS protocols both for cordectomy classification and multidimensional voice evaluation enables a systematic comparison of our results with the outcomes of future studies.

Some limitations must be considered before drawing general conclusions. First, our results stem from a single-centre study. To prevent centre bias, multicentre trials with a larger number of subjects are needed. Second, females are underrepresented in our study; thus, there may be participation bias. With a limited number of female patients, general gender-specific conclusions cannot be drawn. However, our study sample reflects the well-known higher prevalence of laryngeal cancer in male patients. Third, a more precise preoperative assessment of the exact extent of the pathology would be useful. The importance of tumor size and shape should not be underestimated regarding voice function. The histopathologically determined tumor extent does not replace this information, because resections via TOLMS are not always performed en bloc and may lead to thermal tissue artefacts (e.g., shrinkage, coagulation, vaporization). Fourth, there were differences regarding the individual number of interventions as well as rehabilitation strategies. Voice therapy could influence the vocal outcome in operated patients. Not having accounted for this may also result in performance bias. Lastly, some factors influencing the VRP registration have to be considered. One limitation is the fact that in aphonic patients no perimeter of the VRP can be measured. However, in our study no T1a patient suffered from aphonia. Other factors comprise the routine of the examiner, motivation of the patients, and varying quantities of registered tones. Most of these influential factors are of minor importance in our investigation because all VRPs were recorded by one experienced examiner under practically equal conditions. Since precise VEM calculation is based on the actual VRP shape and circumference, future multicentre studies should be standardized by defining the number of registered tones per interval.

#### **6. Conclusions**

TOLMS has been proven to be an established and safe standard oncologic therapy for T1a glottic carcinoma with satisfactory preservation of vocal function both subjectively and objectively. Among objective voice parameters, the VEM seems to best reflect self-perceived subjective voice impairment, showing significant changes after T1a treatment that incorporates phonosurgical principles. It represents a sensitive, positive measure of voice function, as well as an understandable and easy-to-use parameter for quantifying vocal performance as documented in the VRP. Therefore, it is reasonable to include the VEM as a diagnostic addition to the established voice measures of the ELS protocol.

**Author Contributions:** Conceptualization, T.N. and P.P.C.; Methodology, W.S., A.M. and P.P.C.; Literature Review, W.S., T.N. and P.P.C.; Investigation, T.N., A.M. and P.P.C.; Data Analysis, W.S. and T.E.; Original Draft Writing, W.S., F.C. and P.P.C.; Draft Review & Editing, W.S., T.N., D.M. and P.P.C.; Visualization, F.C. and T.E.; Supervision, D.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of Charité–Universitätsmedizin Berlin, Berlin, Germany (reference number: EA4/140/10).

**Informed Consent Statement:** Informed consent was obtained from all study participants.

**Data Availability Statement:** All data of the study are available in the Department of Audiology and Phoniatrics, Charité–Universitätsmedizin Berlin, Berlin, Germany.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


#### *Review*

## **Neurostimulation in People with Oropharyngeal Dysphagia: A Systematic Review and Meta-Analyses of Randomised Controlled Trials—Part I: Pharyngeal and Neuromuscular Electrical Stimulation**

#### **Renée Speyer 1,2,3,\*, Anna-Liisa Sutt 4,5, Liza Bergström 6,7, Shaheen Hamdy 8, Bas Joris Heijnen 3, Lianne Remijn 9, Sarah Wilkes-Gillan 10 and Reinie Cordier 2,11**


**Abstract:** *Objective.* To assess the effects of neurostimulation (i.e., neuromuscular electrical stimulation (NMES) and pharyngeal electrical stimulation (PES)) in people with oropharyngeal dysphagia (OD). *Methods.* Systematic literature searches were conducted to retrieve randomised controlled trials in four electronic databases (CINAHL, Embase, PsycINFO, and PubMed). The methodological quality of included studies was assessed using the Revised Cochrane risk-of-bias tool for randomised trials (RoB 2). *Results.* In total, 42 studies reporting on peripheral neurostimulation were included: 30 studies on NMES, eight studies on PES, and four studies on combined neurostimulation interventions. In meta-analyses, pre-post treatment effects were significant and large for NMES (11 studies) and significant and moderate for PES (five studies). Between-group analyses showed small effect sizes in favour of NMES, but no significant effects for PES. *Conclusions.* NMES may have more promising effects compared to PES. However, NMES studies showed high heterogeneity in protocols and experimental variables, the presence of potential moderators, and inconsistent reporting of methodology. Therefore, only conservative generalisations and interpretation of meta-analyses could be made. To facilitate comparisons of studies and determine intervention effects, there is a need for more randomised controlled trials with larger population sizes, and greater standardisation of protocols and guidelines for reporting.

**Keywords:** deglutition; swallowing disorders; RCT; intervention; neuromuscular electrical stimulation; pharyngeal electrical stimulation; PES; NMES

**Citation:** Speyer, R.; Sutt, A.-L.; Bergström, L.; Hamdy, S.; Heijnen, B.J.; Remijn, L.; Wilkes-Gillan, S.; Cordier, R. Neurostimulation in People with Oropharyngeal Dysphagia: A Systematic Review and Meta-Analyses of Randomised Controlled Trials—Part I: Pharyngeal and Neuromuscular Electrical Stimulation. *J. Clin. Med.* **2022**, *11*, 776. https://doi.org/10.3390/jcm11030776

Academic Editor: Michael Setzen

Received: 7 December 2021 Accepted: 27 January 2022 Published: 31 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

The aerodigestive tract facilitates the combined functions of breathing, vocalising, and swallowing. Any dysfunction in this system may lead to oropharyngeal dysphagia (OD) or swallowing problems [1]. OD can be the result of underlying diseases such as stroke or a progressive neurological disease (e.g., Parkinson's disease, multiple sclerosis) or an adverse effect after head and neck oncological interventions (e.g., radiation or surgery) or intensive care treatment (e.g., intubation and tracheostomy). Prevalence estimates of OD have been reported to be as high as 50% in cerebral palsy [2], 80% in stroke and Parkinson's disease, and over 90% in people with community-acquired pneumonia [3]. OD can have a severe impact on a person's health as it may lead to dehydration, malnutrition, and even death. Research has identified inverse bidirectional relationships between decreased health-related quality of life and increased OD severity [4].

Traditional OD therapy may include physical interventions such as: bolus modification and management (e.g., adjusting the viscosity, volume, temperature and/or acidity of food and drinks); oromotor exercises; body and head postural adjustments; and swallow manoeuvres (e.g., manoeuvres to improve food propulsion into the pharynx and airway protection) [1]. Therapy may also include sensory stimulation, which involves applying techniques like thermal stimulation and chemical stimulation using natural agonists of polymodal sensory receptors (e.g., capsaicin, the spicy component of peppers) [5].

Another type of stimulation considered to be beneficial for promoting rehabilitation of swallowing dysfunction is acupuncture. This practice emerged from traditional Chinese medicine and exerts therapeutic effects by inserting thin needles at strategic places, termed acupuncture points, on the body surface aiming to rebalance the flow of energy or life force ('qi'). Needles are then activated through specific manual movements or electrical stimulation. Although stimulation of acupuncture points seems to be associated with places where nerves, muscles, and connective tissues may be stimulated [6], their intrinsic mechanisms are still part of a continuing scientific debate on acupuncture.

Recently, an increasing number of studies have been published on alternative interventions aiming to enhance neural plasticity by using non-invasive brain stimulation (NIBS) techniques. Repetitive transcranial magnetic stimulation (rTMS) and transcranial direct current stimulation (tDCS) are cortically or centrally applied NIBS techniques. Using electromagnetic induction, rTMS results in depolarisation of post-synaptic connections, whereas tDCS uses direct electrical current to shift the polarity of nerve cells [7]. Alternatively, electrical stimulation techniques like pharyngeal electrical stimulation (PES) and neuromuscular electrical stimulation (NMES) target the peripheral neural pathways [8]. NMES aims to strengthen muscular contractions during swallowing and uses stimulation by electrodes placed on the skin over the anterior neck muscles to activate sensory pathways [9–11]. In contrast, PES has been shown to drive neuroplasticity in the pharyngeal motor cortex through direct stimulation of the pharyngeal mucosa via intraluminal catheters [7].

Over the past decade, several reviews have been published on the effects of neurostimulation in patients with OD. Most of these reviews focused on selected types of neurostimulation: NMES [10,12], rTMS [13,14], tDCS [15], or rTMS and tDCS [16,17]. Only two systematic reviews included both cortical (rTMS and tDCS) and peripheral neurostimulation (PES and NMES) [18,19]. All reviews targeted interventions in post-stroke populations except one review that broadened inclusion criteria to patients with acquired brain injury including stroke [16]. To date, all systematic reviews on neurostimulation as a treatment for OD set boundaries for inclusion based on medical diagnoses.

The aim of this systematic review is to determine the effects of neurostimulation in people with OD without excluding populations based on medical diagnoses. Findings are based on the highest level of evidence only, namely randomised controlled trials (RCTs), and summarised by conducting meta-analyses. The results of this review will be presented in two companion papers. This paper (Part I) reports on pharyngeal and neuromuscular electrical stimulation (PES and NMES) while the second paper (Part II) will report on brain stimulation (i.e., rTMS and tDCS).

#### **2. Methods**

The methodology and reporting of this systematic review were based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement and checklist (Supplementary Tables S1 and S2) which aim to enhance the essential and transparent reporting of systematic reviews [20,21]. The protocol for this review was registered at PROSPERO, the international prospective register of systematic reviews (registration number: CRD42020179842).

#### *2.1. Information Sources and Search Strategies*

Literature searches to identify studies were conducted on 6 March 2021, across four databases: CINAHL, Embase, PsycINFO, and PubMed. Database coverage spanned 1937–2021, 1902–2021, 1887–2021, and 1809–2021, respectively. Two main categories of terms were used in combination: (1) dysphagia and (2) randomised controlled trials. Search strategies were performed in all four electronic databases using subheadings (e.g., MeSH and Thesaurus terms) and free text terms. The full electronic search strategies for each database are reported in Table 1. To identify literature beyond that found using these strategies, additional searches were performed, including checking the reference lists of each eligible article.

**Table 1.** Search strategies.


#### *2.2. Inclusion and Exclusion Criteria*

Studies were included in this systematic review if they met the following criteria: (1) participants had a diagnosis of oropharyngeal dysphagia; (2) the study included non-invasive neurostimulation interventions aimed at reducing swallowing or feeding problems; (3) the study included a control group or comparison intervention group; (4) participants were randomly assigned to one of the study arms or groups; and (5) the study was published in the English language.

Interventions such as non-electrical peripheral stimulation (e.g., air-puff or gustatory stimulation), pharmacological interventions, and acupuncture were considered out of the scope of this review, and thus were excluded. Invasive techniques and/or those that did not specifically target OD (i.e., deep-brain stimulation studies after neurosurgical implantation of a neurostimulator) were also excluded. Conference abstracts, doctoral theses, editorials, and reviews were excluded.

Finally, only studies reporting on peripheral neurostimulation (i.e., PES and NMES) were included in this review (Part I). Studies on brain neurostimulation (i.e., rTMS and tDCS) will be reported on in a companion paper (Part II).

#### **3. Systematic Review**

#### *3.1. Methodological Quality and Risk of Bias*

The methodological quality of the included studies was assessed using the Revised Cochrane risk-of-bias tool for randomised trials (RoB 2) [22]. The RoB 2 tool identifies five domains to consider when assessing where bias may have been introduced into a randomised trial: (1) bias arising from the randomisation process; (2) bias due to deviations from intended interventions; (3) bias due to missing outcome data; (4) bias in measurement of the outcome; and (5) bias in selection of the reported result. The RoB 2 gives a series of signalling questions for each domain whose answers give a judgement (i.e., "low risk of bias," "some concerns," or "high risk of bias"), which can be evaluated to determine a study's overall risk of bias [22].
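The step from the five domain-level judgements to an overall judgement can be sketched as follows. This is a simplified illustration, not the official RoB 2 algorithm: the guidance leaves it to reviewer judgement when "some concerns" in multiple domains should be upgraded to a high overall risk, so the numeric cut-off `many_concerns_threshold` and the short domain labels are hypothetical.

```python
from typing import Dict

LOW, SOME, HIGH = "low risk of bias", "some concerns", "high risk of bias"

# Short labels (hypothetical) for the five RoB 2 domains listed above.
DOMAINS = (
    "randomisation process",
    "deviations from intended interventions",
    "missing outcome data",
    "measurement of the outcome",
    "selection of the reported result",
)


def overall_risk_of_bias(judgements: Dict[str, str],
                         many_concerns_threshold: int = 3) -> str:
    """Combine domain-level judgements into one overall judgement.

    The threshold is an assumed stand-in for the reviewers' judgement on
    when several 'some concerns' ratings substantially lower confidence.
    """
    values = []
    for domain in DOMAINS:
        if judgements.get(domain) not in (LOW, SOME, HIGH):
            raise ValueError(f"missing or invalid judgement for: {domain}")
        values.append(judgements[domain])

    if HIGH in values:
        return HIGH                      # any high-risk domain dominates
    n_some = values.count(SOME)
    if n_some == 0:
        return LOW                       # all five domains at low risk
    if n_some >= many_concerns_threshold:
        return HIGH                      # assumed upgrade rule
    return SOME


example = {d: LOW for d in DOMAINS}
example["missing outcome data"] = SOME
print(overall_risk_of_bias(example))
```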

#### *3.2. Data Collection Process*

A data extraction form was created to extract data from the included studies under the following categories: participant diagnosis, inclusion and exclusion criteria, sample size, age, gender, intervention goal, intervention agent/delivery/dosage, outcome measures, and treatment outcome.

#### *3.3. Data, Items and Synthesis of Results*

Titles and abstracts of retrieved records were screened for eligibility by two independent reviewers, after which the eligibility of selected original articles was assessed by these same two reviewers. If agreement could not be reached between the first two reviewers, a third reviewer was consulted to reach consensus. Two independent researchers also assessed the methodological study quality and, where necessary, consensus was reached with involvement of a third reviewer. As none of the reviewers have formal or informal affiliations with any of the authors of the included studies, no evident bias in article selection or methodological study quality rating was present.

Data points across all studies were extracted using comprehensive data extraction forms. Risk of bias per individual study was assessed using the RoB 2 tool [22]. Extracted data were synthesized using the following categories: participant characteristics, inclusion criteria, intervention conditions, outcome measures and intervention outcomes. Effect sizes and significance of findings were the main summary measures for assessing treatment outcome.

#### **4. Meta-Analysis**

*Data Analysis.* Data were extracted from each study to compare the effect sizes for the following: (1) pre-post outcome measures of OD and (2) mean difference between neurostimulation and comparison controls in outcome measures from pre- to post-intervention. Control groups could have received no treatment, sham stimulation and/or traditional dysphagia therapy (DT; e.g., bolus modification, oromotor exercises, body and head postural adjustments, and swallow manoeuvres). Only studies using instrumental assessment (e.g., videofluoroscopic swallow study (VFSS) or fiberoptic endoscopic evaluation of swallowing (FEES)) to confirm OD were included.

Data collected using outcome measures based on visuoperceptual evaluation of instrumental assessment were preferred over clinical non-instrumental assessments. Oral intake measures were only included if no other clinical data were available, whereas screening tools and patient self-report measures were excluded from meta-analyses altogether. When selecting outcome measures for meta-analyses, reducing heterogeneity between studies was a priority. Consequently, measures other than the authors' primary outcomes may have been preferred if these measures contributed to greater homogeneity.

To compare effect sizes, group means, standard deviations, and sample sizes for pre- and post-measurements were entered into Comprehensive Meta-Analysis Version 3.3.070 [23]. If only non-parametric data were available (i.e., medians, interquartile ranges), data were converted into parametric data for meta-analytic purposes. Studies with multiple intervention groups were analysed separately for each experimental-control comparison. If studies included the same participants, only one study was included in the meta-analysis. For studies providing insufficient data for meta-analysis, authors were contacted by e-mail to request additional data.
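The review does not state which method was used to convert medians and interquartile ranges into means and standard deviations. As an illustration only, the sketch below applies one widely used approximation (Wan et al., 2014), which assumes roughly normally distributed data; the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist
from typing import Tuple


def mean_sd_from_quartiles(q1: float, median: float, q3: float,
                           n: int) -> Tuple[float, float]:
    """Approximate mean and SD from quartiles and sample size.

    Assumed method (Wan et al., 2014), not necessarily the one used in
    the review; relies on approximate normality of the outcome.
    """
    if n < 2:
        raise ValueError("sample size must be at least 2")
    mean = (q1 + median + q3) / 3
    # Expected z-score of the upper quartile for a sample of size n.
    z = NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    sd = (q3 - q1) / (2 * z)
    return mean, sd


# Hypothetical scale scores: median 4, interquartile range 2 to 6, n = 30.
print(mean_sd_from_quartiles(2, 4, 6, 30))
```

For large samples the divisor approaches 1.35, which is the familiar rule of thumb SD ≈ IQR / 1.35.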

Effect sizes were calculated in Comprehensive Meta-Analysis using a random-effects model since it was unlikely that studies would have similar true effects due to variations in sampling, participant characteristics, intervention approaches, and outcome measurements. Heterogeneity was estimated using the *Q* statistic to determine the spread of effect sizes about the mean, and *I*<sup>2</sup> was used to estimate the ratio of true variance to total variance. *I*<sup>2</sup> values of less than 50%, 50% to 74%, and 75% or higher denote low, moderate, and high heterogeneity, respectively [24]. Effect sizes were generated using the Hedges' *g* formula for standardized mean difference with a 95% confidence interval. Effect sizes were interpreted using Cohen's *d* convention as follows: *g* ≤ 0.2 as no or negligible effect; 0.2 < *g* ≤ 0.5 as small effect; 0.5 < *g* ≤ 0.8 as moderate effect; and *g* > 0.8 as large effect [25].
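The quantities named in this paragraph can be illustrated with a short sketch. It is not the Comprehensive Meta-Analysis implementation: it assumes two independent groups (pre-post comparisons additionally require the pre-post correlation), uses the DerSimonian-Laird estimator for the between-study variance, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def hedges_g(m1: float, sd1: float, n1: int,
             m2: float, sd2: float, n2: int) -> Tuple[float, float]:
    """Return (g, variance of g) for two independent groups."""
    df = n1 + n2 - 2
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sd_pooled
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return j * d, j ** 2 * var_d


def interpret_g(g: float) -> str:
    """Label |g| with the cut-offs quoted in the text."""
    size = abs(g)
    if size <= 0.2:
        return "no or negligible"
    if size <= 0.5:
        return "small"
    if size <= 0.8:
        return "moderate"
    return "large"


def random_effects(effects: List[float],
                   variances: List[float]) -> Dict[str, float]:
    """DerSimonian-Laird random-effects pooling with Q and I-squared."""
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum_w
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {"pooled_g": pooled, "ci_low": pooled - 1.96 * se,
            "ci_high": pooled + 1.96 * se, "Q": q, "tau2": tau2, "I2": i2}


# Hypothetical study: intervention mean 10 (SD 2, n 20) vs control mean 8.
g, var_g = hedges_g(10, 2, 20, 8, 2, 20)
print(round(g, 3), interpret_g(g))
```

Because the between-study variance is added to every study's own variance, the random-effects weights are more even across studies than inverse-variance weights alone, which is the practical consequence of assuming that true effects differ between studies.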

Forest plots of effect sizes for OD outcome scores were generated for PES and NMES separately: (1) pre-post neurostimulation and (2) neurostimulation interventions versus comparison groups. Subgroup analyses were used to explore effect sizes as a function of various moderators depending on neurostimulation type, for example outcome measures, medical diagnoses, total treatment duration, total neurostimulation time, and stimulation characteristics (e.g., pulse duration, pulse rate, electrode configuration). To account for the possibility of spontaneous recovery during the intervention period, only between-subgroup meta-analyses were conducted using post-intervention data.

Comprehensive Meta-Analysis software was also used to evaluate publication bias. Begg and Mazumdar's rank correlation test [26] was used to calculate the rank correlation between the standardised effect sizes and their variances. The test yields both a tau and a two-tailed *p* value; tau values close to zero indicate no correlation, whereas values closer to 1 suggest a correlation. Where asymmetry is the result of publication bias, high standard errors would correspond with larger effect sizes. Where larger effects are represented by low values, tau would be positive; conversely, where larger effects are represented by high values, tau would be negative.
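The rank correlation test can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: it correlates the standardised deviations from the inverse-variance pooled effect with the study variances, ignores ties, and uses a normal approximation without continuity correction. Sign conventions and corrections may differ from those of the software used in the review, and the input values are hypothetical.

```python
import math
from itertools import combinations
from statistics import NormalDist
from typing import List, Tuple


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def begg_rank_correlation(effects: List[float],
                          variances: List[float]) -> Tuple[float, float]:
    """Return (Kendall's tau, two-tailed p) for effect size vs variance."""
    k = len(effects)
    w = [1 / v for v in variances]
    sum_w = sum(w)
    pooled = sum(wi * t for wi, t in zip(w, effects)) / sum_w
    # Standardise each deviation by its own variance, v_i - 1 / sum(w).
    std = [(t - pooled) / math.sqrt(v - 1 / sum_w)
           for t, v in zip(effects, variances)]
    # Concordant minus discordant pairs (no tie correction).
    s = sum(sign(a - b) * sign(va - vb)
            for (a, va), (b, vb) in combinations(zip(std, variances), 2))
    tau = s / (k * (k - 1) / 2)
    z = s / math.sqrt(k * (k - 1) * (2 * k + 5) / 18)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return tau, p


# Hypothetical data in which less precise studies report larger effects.
tau, p = begg_rank_correlation([0.1, 0.3, 0.5, 0.9],
                               [0.01, 0.04, 0.09, 0.25])
print(round(tau, 2), round(p, 3))
```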

Publication bias was also evaluated using a fail-safe N test. This measure addresses the question of how many omitted studies would be necessary to nullify the effect: it is the number of studies with an effect size of zero that would have to be added to the meta-analysis before the result became statistically non-significant [27]. When this value is comparatively low, there may be reason to treat the results with caution. When the value is comparatively high, however, it can reasonably be concluded that the treatment effect is not nil, although it may be inflated by the omission of some studies.
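As an illustration of the idea, the sketch below implements the classic formulation of the fail-safe N (Rosenthal's approach), in which study-level z-scores are combined and the number of added null studies needed to reach the critical value is solved for. Whether the review used exactly this variant, and with one or two tails, is an assumption, so the number of tails is exposed as a parameter; the z-scores are hypothetical.

```python
import math
from statistics import NormalDist
from typing import List


def fail_safe_n(z_scores: List[float], alpha: float = 0.05,
                tails: int = 1) -> int:
    """Number of zero-effect studies needed to make the combined
    result non-significant (classic Rosenthal-type calculation)."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha / tails)
    # Combined z is sum(z) / sqrt(k + N); solve for N at z_crit.
    n = sum(z_scores) ** 2 / z_crit ** 2 - k
    return max(0, math.ceil(n))


# Five hypothetical studies, each with z = 2.0.
print(fail_safe_n([2.0] * 5), fail_safe_n([2.0] * 5, tails=2))
```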

#### **5. Results**

#### *5.1. Study Selection*

A total of 8059 studies were identified through subject heading and free text searches from the four databases: CINAHL (*n* = 239), Embase (*n* = 4550), PsycINFO (*n* = 231), and PubMed (*n* = 3039). Removing duplicate titles and abstracts (*n* = 1113) left a total of 6946 records. A total of 261 original articles were assessed at a full-text level, with articles grouped based on type of intervention. Four additional studies were found through reference checking of the included articles. At this stage, no studies were excluded based on type of intervention (e.g., behavioural intervention, neurostimulation). Of the reviewed 261 articles, 58 studies on neurostimulation were identified that satisfied the inclusion criteria. As this systematic review reports on PES and NMES interventions only, a final number of 42 studies reporting on peripheral neurostimulation were included in this review. Figure 1 presents the flow diagram of the reviewing process according to PRISMA.
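The record counts reported above can be cross-checked with a few lines of arithmetic. The sketch only re-adds numbers stated in the text and abstract; the variable names are illustrative.

```python
# Records retrieved per database, as reported in the text.
records = {"CINAHL": 239, "Embase": 4550, "PsycINFO": 231, "PubMed": 3039}
identified = sum(records.values())

duplicates_removed = 1113
after_deduplication = identified - duplicates_removed

# Included peripheral neurostimulation studies by type (from the abstract).
included = {"NMES": 30, "PES": 8, "combined": 4}
total_included = sum(included.values())

print(identified, after_deduplication, total_included)
```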

**Figure 1.** Flow diagram of the reviewing process according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).

#### *5.2. Description of Studies*

All included studies are described in detail within Tables 2 and 3. Specifically, Table 2 presents data on study characteristics including methodological study quality, inclusion and exclusion criteria, and details on participant groups. The following information is provided for all study groups (control and intervention groups): medical diagnosis, sample size, age and gender. Table 3 reports on intervention goals of included studies, intervention components, outcome measures, intervention outcomes, as well as main conclusions.



**Table 2.** Study characteristics of studies on NMES and PES interventions for people with oropharyngeal dysphagia.































CT–computed tomography; DOSS–dysphagia outcome and severity scale; DT–dysphagia therapy; FEES–fiberoptic endoscopic evaluation of swallowing; FOIS–functional oral intake scale; ICH–intracranial haemorrhage; MMSE–Mini-Mental State Exam; MRI–magnetic resonance imaging; MS–multiple sclerosis; NIHSS–National Institutes of Health Stroke Scale; NMES–neuromuscular electrical stimulation; OD–oropharyngeal dysphagia; OST–oral sensorimotor treatment; PAS–penetration–aspiration score; PES–pharyngeal electrical stimulation; rTMS–repetitive transcranial magnetic stimulation; SAH–subarachnoid haemorrhage; sEMG–surface electromyography; SLT–Speech and Language Therapist; TBI–traumatic brain injury; tDCS–transcranial direct current stimulation; TOR-BSST–Toronto Bedside Swallowing Screening test; VFSS–videofluoroscopic swallowing study.

