3.1. Datasets
We used the following three datasets in our experiments:
The Google Speech Commands dataset [14] was released in August 2017 under a Creative Commons license. The dataset contains around 100,000 one-second-long utterances of 30 short words by thousands of different people, as well as background noise samples such as pink noise, white noise, and human-made sounds. Following the Google implementation [14], our task is to discriminate among 12 classes: “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”, unknown, and silence.
The Russian dataset is a private dataset that contains around 400,000 one-second-long utterances of 80 words by 100 different people. These utterances were recorded on mobile devices. The dataset lacks background noise samples, so we reused the samples from the Google Speech Commands dataset [14]. We discriminate the following 12 classes: “один” (one), “два” (two), “три” (three), “четыре” (four), “пять” (five), “да” (yes), “нет” (no), “спасибо” (thanks), “стоп” (stop), “включи” (turn on), unknown, and silence.
Furthermore, specifically for this and future work on voice activation in Lithuanian, we collected a similar dataset for Lithuanian [23]. The collection methodology is described in Section 4. This dataset consists of recordings of 28 people, each of whom uttered 20 words on a mobile phone. These recordings were segmented into one-second-long files. The segments between words were used as background noise samples; they contained silence, human-made sounds, and background audio such as street or car noises. We chose the following 15 classes: “ne” (no), “ačiū” (thanks), “stop” (stop), “įjunk” (turn on), “išjunk” (turn off), “į viršų” (top), “į apačią” (bottom), “į dešinę” (right), “į kairę” (left), “startas” (start), “pauzė” (pause), “labas” (hello), “iki” (bye), unknown, and silence.
3.2. Model
We used two types of audio features and two types of neural network architectures in our experiments. We used either log-Mel filter banks or pre-trained audio features from the wav2vec model [22].
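For reference, the following is a minimal sketch of how such wav2vec features might be extracted, assuming the fairseq implementation of wav2vec and its published checkpoint loading pattern; the checkpoint path is a placeholder, and this is not necessarily the exact pipeline used in our experiments:

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load a pre-trained wav2vec checkpoint (path is a placeholder).
cp = torch.load("wav2vec_large.pt", map_location="cpu")
model = Wav2VecModel.build_model(cp["args"], task=None)
model.load_state_dict(cp["model"])
model.eval()

# One second of 16 kHz audio; the context representations c serve as input features.
wav_input_16khz = torch.randn(1, 16000)
with torch.no_grad():
    z = model.feature_extractor(wav_input_16khz)  # latent representations
    c = model.feature_aggregator(z)               # context representations
print(c.shape)  # (1, feature_dim, num_frames)
```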
Log-Mel filter bank features are often chosen for building voice activation or speech recognition systems [1,32]. We used the kaldi [33] implementation of the feature computation with the following parameters: frame width of 25 ms, frame shift of 10 ms, and 80 Mel bins. Thus, for each one-second sample from the datasets we obtained a 98 × 80 feature matrix of log-Mel filter banks. The method torchaudio.compliance.kaldi.fbank can be used in PyTorch [34] to reproduce this computation.
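A minimal sketch of this computation, assuming a one-second 16 kHz mono recording (the file name is a placeholder):

```python
import torchaudio

# Load a one-second clip; torchaudio.load returns (waveform, sample_rate).
waveform, sample_rate = torchaudio.load("sample.wav")

# 25 ms frames, 10 ms shift, 80 Mel bins, as described above.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    frame_length=25.0,  # ms
    frame_shift=10.0,   # ms
    num_mel_bins=80,
)
print(fbank.shape)  # torch.Size([98, 80]) for a one-second 16 kHz clip
```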
We used the following neural network architectures: a three-layer fully-connected neural network and residual neural networks (ResNets), as described in [32].
Our fully-connected neural network consisted of the following blocks:
fully-connected layer of size 128,
rectified linear unit (ReLU) as an activation function [35],
fully-connected layer of size 64,
ReLU,
flattening of the resulting T × 64 matrix into a vector, where T is the number of frames in a sample (98 in all our experiments),
fully-connected layer of size C, where C is the number of classes to discriminate,
softmax layer.
This neural network architecture is presented in Figure 1.
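A minimal PyTorch sketch of this architecture follows; the class name is hypothetical, and the assumption that the two fully-connected layers act on each frame independently (so that a T × 64 matrix remains to be flattened) is our reading of the block list above:

```python
import torch
import torch.nn as nn

class FeedForwardKWS(nn.Module):
    def __init__(self, num_frames: int = 98, num_bins: int = 80, num_classes: int = 12):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(num_bins, 128),  # fully-connected layer of size 128
            nn.ReLU(),
            nn.Linear(128, 64),        # fully-connected layer of size 64
            nn.ReLU(),
        )
        self.classifier = nn.Linear(num_frames * 64, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, T, 80) log-Mel filter banks
        x = self.frame_net(features)  # (batch, T, 64)
        x = x.flatten(start_dim=1)    # flatten the T x 64 matrix into a vector
        return torch.softmax(self.classifier(x), dim=-1)  # softmax layer

model = FeedForwardKWS()
probs = model(torch.randn(4, 98, 80))  # -> (4, 12)
```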
The ResNets that we used in our experiments were based on [36] and repeat the solutions found in [32]. The authors of [36] proposed that it may be easier to learn the residual mapping F(x) = H(x) − x instead of the underlying mapping H(x), since it is empirically difficult for a stack of layers to learn the identity mapping when the model has unnecessary depth. In ResNets, residuals are expressed via connections between layers (see Figure 2), where the input of some layer is added to the output of some downstream layer.
The architectures that we used from [32] consisted of the following blocks:
bias-free convolutional layer,
optional average pooling layer (e.g., used in res8),
several residual blocks consisting of repeated convolutions, ReLUs, and batch normalization layers [37] (see Figure 2b),
convolutional layer,
batch normalization layer,
average pooling layer,
fully-connected layer of size C, where C is the number of classes to discriminate,
softmax layer.
All the layers were zero-padded. For some variants, dilated convolutions were applied to increase the receptive field of the model. The parameters of all the ResNet architectures used can be seen in Table 1.
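The following sketch illustrates a single residual block in the spirit of Figure 2b; the number of feature maps, the kernel size, and the exact ordering of ReLU and batch normalization here are illustrative assumptions rather than the precise configuration of [32]:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 45, dilation: int = 1):
        super().__init__()
        # Zero-padded, bias-free convolutions; dilation > 1 enlarges the receptive field.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # the block input is added to the downstream output
```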
The number of trainable parameters in the architectures used is reported in Table 2. More details about the residual architectures can be found in [32].
3.3. Training Procedure
Our experiments followed exactly the same procedure as the TensorFlow reference for the Google Speech Commands dataset [14]. The Google Speech Commands dataset was split into training, validation, and test sets, with 80% for training, 10% for validation, and 10% for testing. This resulted in roughly 80,000 examples for training and 10,000 each for validation and testing. For the Russian dataset, these numbers were roughly 320,000 and 40,000. For the Lithuanian dataset [23], we had 326 records for training, 75 for validation, and 88 for testing (we skewed the distribution to ensure more stable test results). For consistency across runs, the SHA1-hashed name of each audio file in the dataset determined the split.
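A minimal sketch of such a hash-based split, inspired by the TensorFlow Speech Commands reference (the split percentages and the exact hashing details here are illustrative assumptions):

```python
import hashlib

def which_set(filename: str, validation_pct: float = 10.0, testing_pct: float = 10.0) -> str:
    # Hash the file name so the same file always lands in the same split across runs.
    h = int(hashlib.sha1(filename.encode("utf-8")).hexdigest(), 16)
    percentage = (h % 10000) / 100.0
    if percentage < validation_pct:
        return "validation"
    if percentage < validation_pct + testing_pct:
        return "testing"
    return "training"

print(which_set("yes/0a7c2a8d_nohash_0.wav"))  # e.g., "training"
```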
To generate the training data, we followed the Google preprocessing procedure: at every epoch, background noise was added to each sample with a fixed probability, with the noise chosen randomly from the background noise samples provided in the dataset.
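A minimal sketch of this augmentation step; the mixing probability and the noise volume range shown are illustrative assumptions, since the exact values come from the Google preprocessing procedure:

```python
import random
import torch

def add_background_noise(sample: torch.Tensor, noises: list,
                         prob: float = 0.8, max_volume: float = 0.1) -> torch.Tensor:
    # sample: one second of audio as a 1-D tensor; noises: 1-D tensors of the same length.
    if random.random() > prob:
        return sample
    noise = random.choice(noises)
    volume = random.uniform(0.0, max_volume)
    return torch.clamp(sample + volume * noise, -1.0, 1.0)
```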
Accuracy was our main quality metric; it is simply the fraction of classification decisions that are correct. For each input utterance, the model outputs the most likely class.
We ran an extensive random hyperparameter search [38] for all experiments in order to reliably compare audio features and architectures. We used stochastic gradient descent with initial learning rate L, momentum, and mini-batch size BS (see Appendix A for the specific values of the hyperparameters). The validation metrics (cross-entropy loss and accuracy) were computed every S optimization steps, and the best validation accuracy seen so far was stored. If the new validation accuracy did not improve on the stored best, or if the cross-entropy loss became “not a number”, the weights of the best step (by validation accuracy) were loaded and the learning rate was dropped by a fixed factor. The training process stopped when the learning rate had been dropped six times. The test accuracy was computed exactly once, on the best model (by validation accuracy) at the end of the training process, and it is the value we report in this work.
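A minimal sketch of this validation-driven schedule; train_steps and evaluate are user-supplied callables, and the drop factor shown is an illustrative placeholder:

```python
import copy
import math

def train_with_lr_drops(model, optimizer, train_steps, evaluate,
                        lr_drop_factor=0.5, max_drops=6):
    # train_steps(model, optimizer): runs S optimization steps.
    # evaluate(model): returns (validation cross-entropy loss, validation accuracy).
    best_acc = -math.inf
    best_state = copy.deepcopy(model.state_dict())
    drops = 0
    while drops < max_drops:
        train_steps(model, optimizer)
        val_loss, val_acc = evaluate(model)
        if val_acc > best_acc and not math.isnan(val_loss):
            best_acc = val_acc
            best_state = copy.deepcopy(model.state_dict())
        else:
            # Reload the best weights so far and drop the learning rate.
            model.load_state_dict(best_state)
            for group in optimizer.param_groups:
                group["lr"] *= lr_drop_factor
            drops += 1
    model.load_state_dict(best_state)  # final model: best by validation accuracy
    return model
```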
We sampled the initial learning rate L, the momentum, the evaluation interval S, and the mini-batch size BS from uniform distributions U (discrete uniform distributions in the case of S and BS).
3.4. Results
In this section, we present only the test metrics in order not to clutter the description. For the choice of hyperparameters, see Appendix A.
In order to get baseline metrics, we ran experiments on the full datasets with both log-Mel filter bank and wav2vec features. The best results of these runs are presented in Table 3 for the English dataset. We got slightly better results than in [32]. This can be explained by the following reasons:
We made the following conclusions from the results:
wav2vec audio features give a competitive result for the voice activation problem with very simple downstream models such as the feedforward neural network,
the benefit of unsupervised pre-training vanishes as the model becomes deeper and more sophisticated.
We repeated the same experiments for the Russian dataset and got similar results (Table 4): ff and ResNet8-narrow, as the simplest models, got better results with wav2vec audio features. However, the overall best result, 97.22%, was still achieved with log-Mel filter banks. The best result among the wav2vec runs was 96.62%, which was worse but still very competitive.
It is worth noting that the wav2vec model was trained on the LibriSpeech dataset [39], which contains only English audiobooks. It is promising that, using this model, it was possible to get good accuracy on both the Russian and Lithuanian datasets (see Table 5). Moreover, we got better results on the Lithuanian dataset using wav2vec than using log-Mel filter banks (90.77% vs. 89.23%).
Next, we ran experiments with a small amount of training data. To do that, we limited the number of training samples per keyword to 3, 5, 7, 10, and 20 for all the datasets. Note that the limit of 20 is effectively the same as using the whole dataset for the Lithuanian language. The sizes of the validation and test sets remained the same in order to get reliable and comparable results. We used random search with all the models and report the test accuracy of the best runs. The motivation for these experiments is as follows. First, the authors of [22] reported state-of-the-art results in automatic speech recognition with unsupervised pre-training when only limited training data were available. Second, our first set of experiments showed that wav2vec audio features are superior when the machine learning model is simple, and simpler models tend to perform better when a dataset is small. Therefore, it might be beneficial to use unsupervised pre-trained audio features in this scenario. The results of these experiments are summarized in Table 6.
It can be seen that the use of pre-trained audio features such as wav2vec increases the system accuracy by approximately 10% when up to 10 samples per keyword are used, for both the English and the Russian languages, despite the fact that the wav2vec model was trained only on English audio recordings. The increase is even bigger when only five samples are used, and it almost vanishes when up to 20 samples are used.