1. Introduction
Speech and voice disorders are becoming increasingly common in the 21st century. The voice is formed by the oscillation of the two vocal folds within the larynx. The vocal folds oscillate on average between 100 Hz (males) and 300 Hz (females) for normal phonation, but can reach up to 1581 Hz during singing [1]. A normal voice or phonation is assumed to be produced by symmetric and periodic vocal fold oscillations [2,3]. Additionally, glottis closure during vocal fold oscillations is assumed to be important for a normal voice (see Figure 1) [4]. To capture and assess vocal fold oscillations, digital high-speed video (HSV) systems have been used for more than 30 years [5]. Many studies applied HSV imaging to subjectively assess and judge vocal fold vibrations [6,7]. To quantitatively assess and judge vocal fold oscillations in HSV data, image-processing techniques have been suggested to segment the glottal area or detect the vocal fold edges over time (see Figure 1), which is a prerequisite for the subsequent computation of quantitative parameters [8,9,10,11,12]. The first image-processing approaches go back to the 1990s, when classical techniques such as region growing were suggested [13]. Since then, many other classical image-processing techniques such as thresholding [14], edge detection [15,16], or active contours [17,18] have been successfully applied. These classical image-processing techniques have been further developed [19] and combined with machine learning methods, e.g., active contours with k-means clustering [20]. Machine learning methods, and especially computationally expensive deep neural networks (DNNs), have become increasingly popular due to the growing computational power of computers and, in particular, the effective use of graphics processing units (GPUs) [21]. Specifically, convolutional neural networks (CNNs) based on the U-Net architecture [22] are a popular and commonly used method for glottis segmentation in HSV videos [23,24,25].
The main advantage of DNNs is that, although they have high computational costs during the training process, they are much faster during application when performing segmentation tasks. Kist et al. [26] reported a segmentation time of <1 min for 1000 HSV frames (<0.06 sec/image) for their DNN on a GPU (GeForce GTX 1080 Ti), in contrast to ref. [27], who reported a mean segmentation time of 3.8 sec/image for their fully automated wavelet- and active contour-based method on a CPU (Intel® Core™ i5-2400, 2 GB RAM). Although user-friendly semi-automatic glottis segmentation is highly reliable [28], it is also significantly more time-consuming (approx. 0.9 sec/image) than current DNNs [26]. Another major advantage is that DNNs remain highly reliable even under image quality degradation caused by factors such as blurring or poor lighting conditions [26]. Current DNN-based methods report segmentation accuracies of over 80%, e.g., refs. [18,24,25]. Current approaches also successfully apply DNNs for automatic glottis midline detection in HSV videos [29]. A comprehensive overview of recent machine learning and DNN approaches for HSV image segmentation is provided in ref. [21].
To the best of our knowledge, except for the BAGLS data set [30], all previous studies considered only one HSV camera system. Naturally, the trained DNNs may be biased with respect to other HSV systems using varying camera manufacturers (see Figure 2), CCD sensors, spatial resolutions (from 256 × 256 to 1024 × 1024 pixels), light sources, and endoscopes. This may be a disadvantage for other researchers or clinicians who want to use existing DNN-based image processing but have a different HSV system than the one the DNN was trained on. In addition, new HSV systems will be developed in the coming years, which will also have different recording modalities, leading to so-called “concept drifts” in the resulting images [31]. Especially for these new and hence unknown HSV systems, the segmentation accuracy might decrease significantly, requiring existing DNNs to be adapted [32]. One possibility, although time-consuming, is the (re-)training of a model from scratch [31,33]. The other option is provided by so-called re-training or fine-tuning methods, which allow for easy and fast adaptation of existing, pre-trained neural networks.
In this work, we suggest, discuss, and analyze re-training approaches for HSV image segmentation. To the best of our knowledge, the effect and usefulness of re-training strategies for laryngeal HSV segmentation have not been investigated yet. However, re-training has to be kept in mind and will have to be considered in HSV image processing to enable sufficiently accurate segmentation for new camera systems in the future.
2. Materials and Methods
2.1. Data Set
The BAGLS data set contains 59,250 annotated images from 640 HSV videos. Seven international cooperation partners contributed to the data set, yielding a high diversity in recording modalities. A detailed description of the BAGLS data set can be found in ref. [30].
The new BAGLS-RT data set contains 267 HSV videos from eight different cameras and institutions, yielding 21,050 annotated images. The BAGLS-RT data set expands the BAGLS data set with five new cameras (Figure 2), four new light sources, one flexible endoscope, one new frame rate, and 14 new spatial resolutions; see Tables A1–A5 for details. The subject distribution is as follows: mean age 42 ± 20 years, age range 18–93 years, 177 females and 90 males, 154 subjects with healthy voices, and 123 patients with various pathologies; see Table A6. All recordings were performed during sustained phonation.
2.2. U-Net Architecture
The U-Net is a commonly used convolutional neural network for image segmentation [22]. Its skip connections within the encoder–decoder architecture allow for effective and fast learning based on a relatively small database [34]. The basic structure of the U-Net is illustrated in Figure 3.
In the following, some essential terms are briefly described for readers who are not familiar with deep learning:
Training data: The data used for training a model on the task, herein glottis segmentation: BAGLS (54,750 images) and BAGLS-RT (18,250 images).
Validation data: During training, the segmentation quality is judged on data not used for training or testing, herein 5% of each training set.
Test data: After training is finished, the performance of the final model is evaluated on previously unseen test data: BAGLS (4000 images) and BAGLS-RT (2800 images).
Batch: The share of the training data that is used for training a model. Batches can contain the entire available training data or parts of it. In this work, we used batch sizes of b = {25%, 50%, 75%, 100%} of the available training data within the corresponding BAGLS or BAGLS-RT data.
Epoch: One learning cycle, i.e., adaptation or optimization of the U-Net parameters (i.e., parameter update within the U-Net) over all included training data (i.e., the defined batch size). This network parameter optimization (backpropagation algorithm) does not use the entire batch at once, but splits it up into smaller subsets (mini-batches), herein 8 images.
Evaluation of segmentation performance: For judging image segmentation performance, we used the commonly applied Intersection over Union (IoU) [26]. The IoU is a metric that quantifies the overlap between the ground truth (manually annotated data) and the prediction of the U-Net. It divides the number of pixels shared by prediction and ground truth (intersection) by the number of pixels contained in either of them (union) (Figure 4). Thereby, IoU = 1 corresponds to a perfect prediction.
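As a minimal illustration (not the exact implementation used in this work), the IoU between two binary segmentation masks can be computed in Python as follows:

import numpy as np

def iou(prediction: np.ndarray, ground_truth: np.ndarray) -> float:
    # Intersection over Union between two binary segmentation masks.
    pred = prediction.astype(bool)
    gt = ground_truth.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(intersection) / float(union)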
The U-Net was implemented in Python using TensorFlow 2.5.0 and trained on an NVIDIA GeForce RTX 3080 GPU.
2.3. Data Preprocessing
Before training, the images were preprocessed with one of the following two methods.
The U-Net within the TensorFlow framework requires standardized image sizes: To meet the internal pooling operations of the U-Net, Gomez et al. [30] resized the training and validation images to 512 × 256 pixels (Resize Method). This 2:1 proportion was chosen because it approximates the glottis dimensions. However, this often yielded an undesired deformation of the images, resulting in an unnatural glottis geometry (Figure 5). Hence, we now suggest a different method, called the Region of Interest (ROI) method.
Region of Interest Method (ROI): For resizing the images to the desired 2:1 scale (based on the glottis geometry), the following new approach was applied. Within each video, bounding boxes were generated and combined to B_Ref, enclosing all included segmentation masks (Figure 6a). Afterwards, the smallest (B_2:1 ≥ B_Ref) and largest possible bounding boxes in the desired 2:1 scale were determined, defining the boundaries of the available ROIs (Figure 6b,c). The region of interest (ROI) for each image may now be an automatically and randomly chosen box B_Var within the defined area, yielding more variety in the training data regarding the position of the glottis in the image as well as the surrounding information (Figure 6d,e).
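The following sketch illustrates the idea behind the random ROI selection. The function and variable names (e.g., b_ref) are ours, coordinates are (y, x) pixel indices, a 2:1 crop is assumed to mean height:width = 2:1, and the reference box is assumed to fit into such a crop within the image; the actual implementation may differ.

import random

def random_roi(b_ref, img_h, img_w):
    # Choose a random 2:1 (height:width) crop that encloses the reference
    # bounding box b_ref = (y0, x0, y1, x1) and still fits into the image.
    y0, x0, y1, x1 = b_ref
    box_h, box_w = y1 - y0, x1 - x0
    min_w = max(box_w, (box_h + 1) // 2)   # smallest 2:1 box enclosing b_ref (B_2:1)
    max_w = min(img_w, img_h // 2)         # largest 2:1 box fitting into the image
    w = random.randint(min_w, max_w)       # random ROI size within the allowed range
    h = 2 * w
    # Random position such that the ROI covers b_ref and stays inside the image.
    x_start = random.randint(max(0, x1 - w), min(x0, img_w - w))
    y_start = random.randint(max(0, y1 - h), min(y0, img_h - h))
    return y_start, x_start, y_start + h, x_start + w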
2.4. U-Net Training
U-Net training: If not otherwise specified, hyperparameters were chosen as provided in ref. [30]. First, the model parameters were initialized randomly, forming the initial model M_0. The validation data comprised 5% of the training data. A 3-fold cross validation was performed. For model training, an ADAM optimizer with a cyclic learning rate between 10⁻³ and 10⁻⁶ was used. The mini-batch size was set to 8 images. Training was restricted to a maximum of 100 epochs with early stopping, i.e., if the Dice loss (i.e., the overlap of prediction and ground truth) [35] did not improve on the validation data for 10 epochs, training was terminated. The final segmentation quality was then computed as the mean IoU (mIoU) on the test data.
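A rough sketch of the corresponding training setup in TensorFlow/Keras is given below; build_unet(), dice_loss(), train_dataset, and val_dataset are placeholders for the network, loss, and data pipelines described above, and the triangular cyclic learning rate shown here is only one possible realization of such a schedule.

import tensorflow as tf

class CyclicLR(tf.keras.callbacks.Callback):
    # Simple triangular cyclic learning rate between 1e-6 and 1e-3.
    def __init__(self, base_lr=1e-6, max_lr=1e-3, step_size=2000):
        super().__init__()
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size
        self.iteration = 0

    def on_train_batch_begin(self, batch, logs=None):
        cycle_pos = abs(self.iteration / self.step_size % 2 - 1)  # 1 -> 0 -> 1
        lr = self.base_lr + (self.max_lr - self.base_lr) * (1 - cycle_pos)
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr)
        self.iteration += 1

model = build_unet()  # placeholder for the U-Net of Figure 3
model.compile(optimizer=tf.keras.optimizers.Adam(), loss=dice_loss)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# train_dataset/val_dataset are assumed to yield mini-batches of 8 images.
model.fit(train_dataset, validation_data=val_dataset,
          epochs=100, callbacks=[CyclicLR(), early_stopping])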
Augmentation: To enhance the variability of the data, images were augmented using the Python package Albumentations. Variations were stochastically applied to brightness and contrast (p = 0.75), gamma value (p = 0.75), Gaussian noise (p = 0.5), blurring (p = 0.5), random rotation between 0° and 30° (p = 0.75), and horizontal mirroring (p = 0.5) [36]. Augmentation was performed anew for each epoch during training, yielding different training data in each epoch. In addition, for the previously described ROI method, different ROIs were generated for each epoch. Such augmentation approaches help to avoid overfitting of the model on the training data [37] and may also improve model performance [38].
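A corresponding augmentation pipeline could look roughly as follows; the probabilities match the values stated above, while all other transform parameters are Albumentations defaults and thus assumptions on our part.

import albumentations as A

augment = A.Compose([
    A.RandomBrightnessContrast(p=0.75),
    A.RandomGamma(p=0.75),
    A.GaussNoise(p=0.5),
    A.Blur(p=0.5),
    A.Rotate(limit=(0, 30), p=0.75),
    A.HorizontalFlip(p=0.5),
])

# image and mask are numpy arrays; the mask is transformed consistently with the image.
augmented = augment(image=image, mask=mask)
image_aug, mask_aug = augmented["image"], augmented["mask"]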
2.5. Re-Training the U-Net
The following re-training strategies were tested to investigate segmentation quality on existing data (BAGLS) and new data (BAGLS-RT).
2.5.1. Re-Training from Scratch
Here, the entire U-Net was trained anew. For training, both data sets, BAGLS and BAGLS-RT, were used. The validation data contained 5% of each data set. The BAGLS-RT training set was split into batches b of different sizes b = {25%, 50%, 75%, 100%} and individually added to the entire BAGLS training set. This allows investigating how the amount of new BAGLS-RT data influences the training and hence the segmentation performance on BAGLS and BAGLS-RT. The training process is illustrated in Figure 7.
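As a simple sketch of how the training data were composed for this strategy (the variable and function names are hypothetical, and the image lists stand in for the actual data pipeline):

import random

def scratch_training_data(bagls_train, bagls_rt_train, share):
    # Combine the full BAGLS training set with a random share
    # (25%, 50%, 75%, or 100%) of the BAGLS-RT training set.
    n_rt = int(len(bagls_rt_train) * share)
    return bagls_train + random.sample(bagls_rt_train, n_rt)

# e.g., b = 50%: all BAGLS training images plus half of the BAGLS-RT training
# images, used to train a randomly initialized U-Net from scratch.
combined = scratch_training_data(bagls_train, bagls_rt_train, share=0.5)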
2.5.2. Incremental Finetuning
Here, a baseline model, trained solely with BAGLS data, was used as the starting model. Then, only the BAGLS-RT data were used to re-train this model, a procedure commonly known as finetuning. To simulate continuously arriving new data based on this finetuning concept, incremental learning was simulated using different batch sizes b = {25%, 50%, 100%} of BAGLS-RT (Figure 8). For, e.g., b = 25%, this means that four incremental finetuning steps were performed, as shown in the bottom row of Figure 8 and in the sketch below.
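A minimal sketch of the incremental finetuning loop for b = 25% is given below; load_baseline_model() and make_datasets() are hypothetical helpers, and the training call reuses the setup from Section 2.4.

import numpy as np

model = load_baseline_model()                    # U-Net trained on BAGLS only
increments = np.array_split(bagls_rt_train, 4)   # b = 25% -> four increments

for part in increments:
    train_ds, val_ds = make_datasets(part, validation_split=0.05)
    # Finetune the current model on the new increment only;
    # old BAGLS data are not revisited in this strategy.
    model.fit(train_ds, validation_data=val_ds,
              epochs=100, callbacks=[CyclicLR(), early_stopping])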
2.5.3. Incremental Finetuning Using a Mixed Data Set
Again, the baseline model was used as the starting model. Then, both the BAGLS and BAGLS-RT data were used to re-train this model using different batch sizes b = {25%, 50%, 100%}, where half of the data was from BAGLS-RT and the other half was a representative share of BAGLS data, i.e., for b = 100%, the entire BAGLS-RT training set (18,250 images) and 18,250 images from BAGLS were included. This approach uses the same finetuning process as shown in Figure 8, except that the temporary models M_Ti are trained with BAGLS and BAGLS-RT. This approach was chosen since considering only new data (BAGLS-RT) during re-training might yield decreased segmentation accuracy for old data (BAGLS), as will be seen in the Results section for the incremental finetuning described above.
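The only difference to the previous strategy is the composition of each increment, which could be sketched as follows (helper and variable names are again hypothetical):

import random

def mixed_increment(bagls_rt_part, bagls_train):
    # Half of each increment comes from BAGLS-RT; the other half is a
    # randomly drawn, representative share of the old BAGLS training data.
    bagls_share = random.sample(bagls_train, len(bagls_rt_part))
    return list(bagls_rt_part) + bagls_share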
2.5.4. Finetuning with Knowledge Distillation (FKD)
Here, the model performance is judged by considering both the Dice loss of the baseline model prior to re-training (teacher model) and the Dice loss of the model currently being re-trained (student model). The influence of the teacher model is controlled by a parameter α ∈ [0, 1]. The higher the chosen α value, the smaller the influence of the teacher model during re-training, i.e., α = 1.0 corresponds to the same incremental finetuning as described above in Section 2.5.2. For a detailed description of the FKD approach, we refer to refs. [39,40,41]. The training data were chosen as described in Section 2.5.2, i.e., only BAGLS-RT. We chose a balanced value of α = 0.5 and investigated three batch sizes b = {25%, 50%, 100%}. Additionally, we considered (1) a static model, where the teacher model is always the baseline model, and (2) a dynamic model, where the teacher model is replaced after each batch with the student model (Figure 9). Consequently, the static and dynamic models are equivalent for the batch size b = 100%.
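The core of the FKD approach can be sketched as a combined loss; dice_loss() stands for the Dice loss from Section 2.4, load_baseline_model(), finetune_with_fkd(), and increments are hypothetical placeholders, and the exact formulation in refs. [39,40,41] may differ.

def fkd_loss(y_true, y_student, y_teacher, alpha=0.5):
    # Supervised term: student prediction vs. manual annotation.
    student_term = dice_loss(y_true, y_student)
    # Distillation term: student prediction vs. (frozen) teacher prediction.
    distill_term = dice_loss(y_teacher, y_student)
    # alpha = 1.0 removes the teacher entirely (plain incremental finetuning).
    return alpha * student_term + (1.0 - alpha) * distill_term

teacher = load_baseline_model()   # static variant: teacher stays the baseline model
student = load_baseline_model()

for part in increments:
    student = finetune_with_fkd(student, teacher, part, alpha=0.5)
    # dynamic variant: replace the teacher with the finetuned student after each batch
    # teacher = student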
4. Discussion
Several methods for re-training purposes were discussed and applied to HSV data. In summary, the results showed that diverse training data already enable the model to deal with new modalities to a large extent. Here, this was achieved by random ROI selection and image augmentation. The subsequent re-training methods showed further improvements, although these were smaller than those achieved by the new ROI preprocessing method. When re-training is performed, the phenomenon of catastrophic forgetting should be kept in mind. The results showed that finetuning with dynamic knowledge distillation seems most promising for re-training with laryngeal HSV data, even outperforming re-training from scratch. Further, this re-training strategy is rather convenient, since no old data are necessary for re-training and therefore do not have to be stored. However, it is also evident that re-training with new data that do not differ significantly from the existing training data can be avoided when the initial model training is already based on data with great variety. As Figure 10 shows, re-training for HSV data provided the best results when new data with significant modality changes were considered, e.g., flexible endoscopic HSV data with honeycomb patterns induced by the light fibers.
Regarding future model adjustments, finetuning with knowledge distillation can be adapted depending on the use case. The results showed that FKD with a static or dynamic teacher model is beneficial towards old or new data, respectively, due to the corresponding update procedure for the teacher model. Therefore, depending on whether users prioritize old or new camera systems, a static or dynamic model should be selected. Furthermore, the overall influence of the teacher model is controlled by the parameter α. For this paper, we chose a balanced value of α = 0.5. However, if the performance of the model is to be increased primarily for old or new data, this value of α can be decreased or increased accordingly.
The segmentation models resulting from this work will be integrated into the Glottis-Analysis-Tools (GAT) [26] and OpenHSV [11] and made available to other research groups. A limitation of the study is that the U-Net used may have too few parameters (i.e., the model may be too simple) to achieve further performance improvements by incorporating new training data. Hence, future work may concentrate on more complex deep learning models containing more parameters [42], which may then lead to further improvement of segmentation performance by utilizing additional information in the HSV images that has not been considered by the U-Net. Additionally, deep learning approaches should be applied to the three-dimensional dynamics of the vocal folds [43], potentially enabling improved insight into the correlation between vocal fold dynamics and acoustic voice quality.