A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
Abstract
1. Introduction
- We explore different design options and implement an improved speech transformer that relies on a single decoder equipped with BCE for bidirectional decoding, which significantly reduces computational complexity compared with methods that employ two decoders.
- We train the BCE end to end with a unique sentence start token for each decoding direction, allowing the single decoder to generate a right-to-left output directly, without first producing a left-to-right output. This alleviates the information leakage in the attention mechanism encountered by other works (a minimal sketch of this setup follows the list).
- We implement a bidirectional beam search (BBS) method that generates output sequences in both directions at the decoding stage, and use it for an extensive analysis of the model's effectiveness under different beam sizes. We also analyze the model's performance on sequences of different lengths.
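To illustrate the single-decoder setup with direction-specific start tokens, the following minimal Python sketch shows how one label sequence could be turned into a left-to-right and a right-to-left training pair for the same decoder. The token ids and the helper name are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed, not the authors' code): building direction-specific
# training pairs for a single shared decoder. The special-token ids EOS, SOS_LR,
# SOS_RL and the helper name are illustrative assumptions.
import torch

EOS, SOS_LR, SOS_RL = 1, 2, 3   # assumed special-token ids

def make_bidirectional_targets(token_ids):
    """Return (decoder_input, decoder_output) pairs for both decoding directions."""
    ids = list(token_ids)
    lr_in  = torch.tensor([SOS_LR] + ids)        # left-to-right:  <sos_lr> w1 ... wN
    lr_out = torch.tensor(ids + [EOS])           #                 w1 ... wN <eos>
    rl_in  = torch.tensor([SOS_RL] + ids[::-1])  # right-to-left:  <sos_rl> wN ... w1
    rl_out = torch.tensor(ids[::-1] + [EOS])     #                 wN ... w1 <eos>
    return (lr_in, lr_out), (rl_in, rl_out)

# Both pairs are fed to the same decoder during training, so a single set of
# parameters learns both contexts and the start token alone selects the direction.
(lr_in, lr_out), (rl_in, rl_out) = make_bidirectional_targets([7, 8, 9])
print(lr_in.tolist(), rl_in.tolist())            # [2, 7, 8, 9] [3, 9, 8, 7]
```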
2. Related Works
2.1. Existing Works on Transformers for ASR
2.2. Existing Works on Transformers with Bidirectional Decoders for ASR
3. Materials and Methods
3.1. Overview of Transformers for ASR
3.1.1. Dot-Product Self-Attention
3.1.2. Position-Wise FFN
3.1.3. Positional Encoding
3.2. Bidirectional Context Embedding Transformer (Bi-CET)
3.2.1. Structure of Bi-CET
3.2.2. Setup of the BCE
- Avoiding the need to wait for a second input before switching the decoding direction: the direction can be changed with a single input.
- Minimizing the decoding time in cases where only a right-to-left output is needed (illustrated in the sketch below).
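The second point can be illustrated with a short sketch: assuming a hypothetical `decoder_step` callable that wraps the trained single decoder, a right-to-left hypothesis is generated directly from the right-to-left start token, without a preceding left-to-right pass.

```python
# Sketch under assumptions: `decoder_step` is a hypothetical callable wrapping the
# trained single decoder (returns next-token logits given the encoder output and
# the prefix generated so far). Starting from the right-to-left start token alone
# yields a right-to-left hypothesis directly, with no left-to-right pass first.
import torch

def greedy_decode(decoder_step, enc_out, start_token, eos_id, max_len=100):
    prefix = [start_token]
    for _ in range(max_len):
        logits = decoder_step(enc_out, torch.tensor(prefix))  # shape: [vocab_size]
        next_id = int(logits.argmax())
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]  # drop the start token

# e.g. right_to_left = greedy_decode(decoder_step, enc_out, SOS_RL, EOS)
# The result only needs one reversal to read left-to-right.
```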
3.2.3. Masking Method
3.2.4. Character Decoding
Algorithm 1: Bidirectional Beam Search Method
Data: source (x), direction start tokens (sosLR, sosRL), beam size (β), max length (T), scores (s1, s2)
1  Initialize: BLR ← {〈0, sosLR〉}, BRL ← {〈0, sosRL〉}
2  while not converged do
3    for t in range(T) do //update left-to-right beam
4      SLR ← empty
5      for 〈s1, u〉 ∊ BLR do //u is the current state
6        if u.last() == EOS then
7          SLR.add(〈s1, u〉) //s1 is the score for u
8          continue
9        for n ∊ N do //N contains the neighbors of u
10         s1 = score(x, u ∘ n)
11         SLR.add(〈s1, u ∘ n〉)
12       end
13       BLR ← SLR.top(β)
14     end
15     SLR.path ← SLR.max()
16     for t in range(T) do //update right-to-left beam
17       SRL ← empty
18       for 〈s2, v〉 ∊ BRL do //v is the current state
19         if v.last() == EOS then
20           SRL.add(〈s2, v〉) //s2 is the score for v
21           continue
22         for m ∊ M do //M contains the neighbors of v
23           s2 = score(x, v ∘ m)
24           SRL.add(〈s2, v ∘ m〉)
25         end
26         BRL ← SRL.top(β)
27       end
28       SRL.path ← SRL.max()
29       OUTPUT ← best score from (SLR.path, SRL.path)
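The sketch below restates Algorithm 1 in Python for clarity. It is a simplified illustration rather than the authors' implementation: `score` and `neighbors` are assumed callables (hypothesis scoring against the acoustic input x and candidate next-token enumeration), and the special-token ids follow the earlier sketch.

```python
# Simplified restatement of Algorithm 1 (an illustrative sketch, not the authors'
# implementation). Assumptions: `score(x, hyp)` returns the hypothesis score given
# the acoustic input x, `neighbors(hyp)` enumerates candidate next tokens, and the
# special-token ids match the earlier sketch.
SOS_LR, SOS_RL, EOS = 2, 3, 1   # assumed ids for <sos_lr>, <sos_rl>, <eos>

def beam_search_one_direction(x, score, neighbors, sos, beam_size, max_len):
    beam = [(0.0, [sos])]                        # set of <score, hypothesis> pairs
    for _ in range(max_len):                     # corresponds to "for t in range(T)"
        candidates = []
        for s, hyp in beam:
            if hyp[-1] == EOS:                   # finished hypotheses are kept as-is
                candidates.append((s, hyp))
                continue
            for n in neighbors(hyp):             # expand with each neighbor token
                ext = hyp + [n]
                candidates.append((score(x, ext), ext))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return max(beam, key=lambda c: c[0])         # best <score, path> in this direction

def bidirectional_beam_search(x, score, neighbors, beam_size, max_len):
    lr = beam_search_one_direction(x, score, neighbors, SOS_LR, beam_size, max_len)
    rl = beam_search_one_direction(x, score, neighbors, SOS_RL, beam_size, max_len)
    return lr if lr[0] >= rl[0] else rl          # OUTPUT: the higher-scoring path
```

Note that both direction-specific searches score hypotheses against the same source x, so only the decoder passes are duplicated, not the encoder computation.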
4. Experiment
4.1. Dataset
4.2. Setup
5. Results and Discussion
5.1. Unidirectional vs. Bidirectional
5.2. Discussion
5.3. Further Analysis
5.3.1. Effect of Beam Size
5.3.2. Effect of Sequence Length
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gonzalez-Dominguez, J.; Eustis, D.; Lopez-Moreno, I.; Senior, A.; Beaufays, F.; Moreno, P.J. A real-time end-to-end multilingual speech recognition architecture. IEEE J. Sel. Top. Signal Process. 2014, 9, 749–759. [Google Scholar] [CrossRef]
- Bosch, L.T.; Boves, L.; Ernestus, M. Towards an end-to-end computational model of speech comprehension: Simulating a lexical decision task. In Proceedings of the INTERSPEECH, Lyon, France, 25–29 August 2013. [Google Scholar]
- Chorowski, J.; Bahdanau, D.; Cho, K.; Bengio, Y. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. arXiv 2014, arXiv:1412.1602. [Google Scholar]
- Chan, W.; Jaitly, N.; Le, Q.V.; Vinyals, O. Listen, Attend and Spell. arXiv 2015, arXiv:1508.01211. [Google Scholar]
- Emiru, E.D.; Xiong, S.; Li, Y.; Fesseha, A.; Diallo, M. Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings. Information 2021, 12, 62. [Google Scholar] [CrossRef]
- Wang, X.; Zhao, C. A 2D Convolutional Gating Mechanism for Mandarin Streaming Speech Recognition. Information 2021, 12, 165. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Zhou, S.; Dong, L.; Xu, S.; Xu, B. Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese. arXiv 2018, arXiv:1804.10752. [Google Scholar]
- Zhang, Q.; Lu, H.; Sak, H.; Tripathi, A.; McDermott, E.; Koo, S.; Kumar, S. Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7829–7833. [Google Scholar]
- Karita, S.; Yalta, N.; Watanabe, S.; Delcroix, M.; Ogawa, A.; Nakatani, T. Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019. [Google Scholar]
- Miao, H.; Cheng, G.; Gao, C.; Zhang, P.; Yan, Y. Transformer-based online CTC/attention end-to-end speech recognition architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6084–6088. [Google Scholar]
- Chen, X.; Zhang, S.; Song, D.; Ouyang, P.; Yin, S. Transformer with Bidirectional Decoder for Speech Recognition. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020. [Google Scholar]
- Wu, D.; Zhang, B.; Yang, C.; Peng, Z.; Xia, W.; Chen, X.; Lei, X. U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition. arXiv 2021, arXiv:2106.05642. [Google Scholar]
- Zhang, C.-F.; Liu, Y.; Zhang, T.-H.; Chen, S.-L.; Chen, F.; Yin, X.-C. Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition. arXiv 2021, arXiv:2109.06684. [Google Scholar]
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
- Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5884–5888. [Google Scholar]
- Paul, D.B.; Baker, J.M. The Design for the Wall Street Journal-based CSR Corpus. In Proceedings of the HLT, Harriman, NY, USA, 23–26 February 1992. [Google Scholar]
- Le, H.; Pino, J.; Wang, C.; Gu, J.; Schwab, D.; Besacier, L. Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation. arXiv 2020, arXiv:2011.00747. [Google Scholar]
- Shi, Y.; Wang, Y.; Wu, C.; Fuegen, C.; Zhang, F.; Le, D.; Yeh, C.-F.; Seltzer, M.L. Weak-Attention Suppression For Transformer Based Speech Recognition. arXiv 2020, arXiv:2005.09137. [Google Scholar]
- Xu, M.; Li, S.; Zhang, X.-L. Transformer-based end-to-end speech recognition with local dense synthesizer attention. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5899–5903. [Google Scholar]
- Luo, H.; Zhang, S.; Lei, M.; Xie, L. Simplified self-attention for transformer-based end-to-end speech recognition. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 75–81. [Google Scholar]
- Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X. A comparative study on transformer vs. rnn in speech applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 14–18 December 2019; pp. 449–456. [Google Scholar]
- Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F. Transformer-based acoustic modeling for hybrid speech recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6874–6878. [Google Scholar]
- Tsunoo, E.; Kashiwagi, Y.; Kumakura, T.; Watanabe, S. Transformer ASR with contextual block processing. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 14–18 December 2019; pp. 427–433. [Google Scholar]
- Wu, C.; Wang, Y.; Shi, Y.; Yeh, C.-F.; Zhang, F. Streaming transformer-based acoustic models using self-attention with augmented memory. arXiv 2020, arXiv:2005.08042. [Google Scholar]
- Li, M.; Zorila, C.; Doddipatla, R. Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 1–7. [Google Scholar]
- Huang, W.; Hu, W.; Yeung, Y.T.; Chen, X. Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition. arXiv 2020, arXiv:2008.05750. [Google Scholar]
- Jiang, D.; Lei, X.; Li, W.; Luo, N.; Hu, Y.; Zou, W.; Li, X. Improving Transformer-based Speech Recognition Using Unsupervised Pre-training. arXiv 2019, arXiv:1910.09932. [Google Scholar]
- Lu, L.; Liu, C.; Li, J.; Gong, Y. Exploring transformers for large-scale speech recognition. arXiv 2020, arXiv:2005.09684. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Bleeker, M.; de Rijke, M. Bidirectional Scene Text Recognition with a Single Decoder. arXiv 2020, arXiv:1912.03656. [Google Scholar]
- Wang, C.; Wu, Y.; Du, Y.; Li, J.; Liu, S.; Lu, L.; Ren, S.; Ye, G.; Zhao, S.; Zhou, M. Semantic Mask for Transformer based End-to-End Speech Recognition. arXiv 2020, arXiv:1912.03010. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019. [Google Scholar]
- Meister, C.; Cotterell, R.; Vieira, T. Best-First Beam Search. Trans. Assoc. Comput. Linguist. 2020, 8, 795–809. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Hsu, W.-N.; Lee, A.; Synnaeve, G.; Hannun, A.Y. Semi-Supervised Speech Recognition via Local Prior Matching. arXiv 2020, arXiv:2002.10336. [Google Scholar]
- Kahn, J.; Lee, A.; Hannun, A.Y. Self-Training for End-to-End Speech Recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7084–7088. [Google Scholar]
- Lüscher, C.; Beck, E.; Irie, K.; Kitza, M.; Michel, W.; Zeyer, A.; Schlüter, R.; Ney, H. RWTH ASR Systems for LibriSpeech: Hybrid vs. Attention—w/o Data Augmentation. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019. [Google Scholar]
- Ling, S.; Liu, Y.; Salazar, J.; Kirchhoff, K. Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6429–6433. [Google Scholar]
Direction | Test Clean | Dev Clean | Test Other | Dev Other |
---|---|---|---|---|
left-to-right | 21.77 | 20.64 | 35.58 | 34.79 |
right-to-left | 20.20 | 21.07 | 36.39 | 35.98 |
bidirectional | 16.83 | 17.67 | 29.18 | 29.93 |
Direction | Test Clean | Dev Clean | Test Other | Dev Other |
---|---|---|---|---|
left-to-right | 10.82 | 10.76 | 22.44 | 22.97 |
right-to-left | 9.98 | 10.64 | 23.12 | 23.26 |
bidirectional | 7.65 | 7.85 | 18.97 | 19.33 |
Hours | Model | Train Type | Test Clean | Test Other | Dev Clean | Dev Other | Network
---|---|---|---|---|---|---|---
100 | Hsu et al. [36] | Supervised | 14.85 | 39.95 | 14.00 | 37.02 | Seq2Seq/TDS
100 | Kahn et al. [37] | Supervised | 14.90 | 40.00 | 14.00 | 37.00 | TDS/Attention
100 | Lüscher et al. [38] | Supervised | 14.70 | 40.80 | 14.70 | 38.5 | E2E/Attention
100 | Bi-CET (small) | Supervised | 16.83 | 29.18 | 17.67 | 29.93 | Transformer
460 | Ling et al. [39] | Semi-supervised | 7.11 | 24.31 | - | - | BLSTM/CTC
460 | Hsu et al. [36] | Supervised | 7.99 | 26.59 | 7.20 | 25.32 | Seq2Seq/TDS
460 | Bi-CET (big) | Supervised | 7.65 | 18.97 | 7.85 | 19.33 | Transformer