#### 2.1.2. Speech Recognition Module

The speech recognition module converts a person's speech into text and sends the recognized text to the scenario manager module. A non-directional close-talking microphone captures the voice, and Dragon Speech 11 (Nuance Communications Inc., Burlington, MA, United States) performs the speech recognition. The module also detects utterance sections by thresholding the sound pressure. Specifically, it samples the sound pressure at 44.1 kHz and computes the average for every millisecond. When the average pressure remains above the specified threshold for more than 500 milliseconds continuously, the module determines that the person is talking (talking status). When, during the talking status, the average pressure remains below the threshold for more than 1500 milliseconds, the module determines that the person has finished speaking (talk-end status). The module sends flags representing the talking and talk-end statuses to the scenario manager module. To avoid forwarding noise that has been misrecognized as speech, the module sends the text recognized by Dragon Speech only during the talking status.
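The utterance detection rule described above can be sketched as a small two-state machine. This is a minimal illustration, not the module's actual implementation: the class name `UtteranceDetector`, its method names, and the event strings are hypothetical, the threshold value is left to the caller, and the sketch assumes that the input is already the per-millisecond average pressure and that both the 500 ms and the 1500 ms conditions refer to continuous runs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UtteranceDetector:
    """Hypothetical sketch of threshold-based utterance detection."""

    threshold: float      # pressure threshold; the actual value is not given in the text
    start_ms: int = 500   # loud run (ms) that must be exceeded to enter talking status
    end_ms: int = 1500    # quiet run (ms) that must be exceeded to leave talking status
    talking: bool = field(default=False, init=False)
    _run: int = field(default=0, init=False)  # length of the current run in ms

    def update(self, avg_pressure: float) -> Optional[str]:
        """Feed one 1 ms average; return "talking", "talk_end", or None."""
        loud = avg_pressure > self.threshold
        if not self.talking:
            # Count consecutive loud milliseconds; any quiet one resets the run.
            self._run = self._run + 1 if loud else 0
            if self._run > self.start_ms:
                self.talking = True
                self._run = 0
                return "talking"
        else:
            # Count consecutive quiet milliseconds; any loud one resets the run.
            self._run = self._run + 1 if not loud else 0
            if self._run > self.end_ms:
                self.talking = False
                self._run = 0
                return "talk_end"
        return None

    def gate_text(self, text: str) -> Optional[str]:
        """Forward recognized text only while in talking status."""
        return text if self.talking else None


if __name__ == "__main__":
    detector = UtteranceDetector(threshold=0.1)
    # 501 ms of sound followed by 1501 ms of silence (synthetic values).
    levels = [0.5] * 501 + [0.0] * 1501
    for t, level in enumerate(levels):
        event = detector.update(level)
        if event is not None:
            print(t, event)
```

Because the counters reset whenever the run is interrupted, a short burst of noise never starts an utterance and a short pause inside an utterance never ends it, which is the behavior the two time constants are meant to provide.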

#### 2.1.3. Text-To-Speech Module

The text-to-speech module converts text received from the scenario manager module into sound data and plays it. Using AITalk Custom Voice (AI Inc., Tokyo, Japan) as the voice synthesizer, the module performs real-time synthesis, in which voice synthesis and playback on the audio device run in parallel. Along with the text, the module receives one of three types of orders from the scenario manager module: start playing, stop playing, or pause playing. Playback can be stopped or paused either immediately or, optionally, at the end of the current phrase. The start playing order takes variable parameters: volume, speed, pitch, range, the length of short pauses within a sentence, the length of long pauses within a sentence, and the length of the pause at the end of a sentence.
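One way to picture the orders exchanged between the scenario manager and the text-to-speech module is as a small message type. The following is only a sketch under stated assumptions: all class and field names (`TtsOrder`, `SynthesisParams`, `OrderKind`, `Timing`) are hypothetical, no AITalk API is used or implied, and parameter units and defaults are not specified in the text, so unset parameters are modeled as `None` (meaning "use the synthesizer's default"). The validation rules reflect one reading of the description, namely that parameters belong to the start order and phrase-end timing to the stop and pause orders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class OrderKind(Enum):
    START = "start"
    STOP = "stop"
    PAUSE = "pause"


class Timing(Enum):
    IMMEDIATE = "immediate"    # act at once
    PHRASE_END = "phrase_end"  # act at the end of the current phrase


@dataclass(frozen=True)
class SynthesisParams:
    """Variable parameters of the start order; None means 'synthesizer default'."""

    volume: Optional[float] = None
    speed: Optional[float] = None
    pitch: Optional[float] = None
    range_: Optional[float] = None  # "range" in the text; underscore avoids the builtin name
    short_pause: Optional[float] = None
    long_pause: Optional[float] = None
    sentence_end_pause: Optional[float] = None


@dataclass(frozen=True)
class TtsOrder:
    """Hypothetical message from the scenario manager to the text-to-speech module."""

    kind: OrderKind
    text: str = ""
    timing: Timing = Timing.IMMEDIATE
    params: SynthesisParams = field(default_factory=SynthesisParams)

    def __post_init__(self) -> None:
        # Assumption of this sketch: phrase-end timing applies to stop/pause only.
        if self.kind is OrderKind.START and self.timing is Timing.PHRASE_END:
            raise ValueError("phrase-end timing applies to stop/pause orders")
        # Assumption of this sketch: synthesis parameters apply to start only.
        if self.kind is not OrderKind.START and self.params != SynthesisParams():
            raise ValueError("synthesis parameters apply to start orders")


if __name__ == "__main__":
    start = TtsOrder(OrderKind.START, text="Hello.", params=SynthesisParams(speed=1.2))
    pause = TtsOrder(OrderKind.PAUSE, timing=Timing.PHRASE_END)
    print(start)
    print(pause)
```

Keeping the orders immutable (`frozen=True`) makes them safe to pass between modules without one side changing a message the other side is still acting on.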
