Situated Speech Synthesis: Beyond Text-to-Waveform Mapping

A special issue of Multimodal Technologies and Interaction (ISSN 2414-4088).

Deadline for manuscript submissions: closed (1 March 2018)

Special Issue Editors


Prof. Dr. Martti Vainio
Guest Editor
Department of Modern Languages, University of Helsinki, Helsinki, Finland
Interests: speech synthesis; phonetics; speech evolution; prosody

Dr. Juraj Simko
Guest Editor
Department of Modern Languages, University of Helsinki, Helsinki, Finland
Interests: speech synthesis; embodied cognition; prosodic variation; speech production; auditory processing

Special Issue Information

Dear Colleagues,

A typical text-to-speech synthesis (TTS) system, in essence, mimics a particular type of human performance, namely reading a given text aloud with only a rudimentary understanding of what is being said. This “newsreader scenario” generally takes no account of the listener or of the situation in which the spoken word is delivered. Consequently, most current TTS technologies fall short in applicability: modern applications, such as dialogue systems, voice-enabled assistants, and gaming, require artificial speech to be situated and expressive. Synthetic speech should take the audience into consideration and flexibly adapt to the context in which the interaction with humans takes place.

The focus of this Special Issue is on current trends in developing speech synthesis systems that go beyond the traditional TTS paradigm. In particular, we encourage submissions presenting novel synthesis approaches that address some of the following interrelated issues:

  • situatedness, i.e., the ability to adapt to the environment and situation in which the speech is produced;
  • adaptivity to the audience, whether a human or an artificial listener;
  • expressivity, i.e., the ability to produce rich and meaningful prosodic variation using appropriate signal attributes ranging from pitch to voice quality;
  • non-trivial cognitive abilities in understanding and co-creating the content of the produced speech.

Submissions addressing these questions at any relevant stage of the synthesis process, including corpus design and annotation, text processing, feature extraction and representation, learning scenarios, and signal generation, are welcome.

We encourage authors to submit original research articles, reviews, and theoretical perspectives. Of particular interest are articles that explore situated speech synthesis using state-of-the-art signal processing techniques and machine learning methods, such as deep neural networks.
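To make the invitation concrete, the sketch below illustrates one way a neural acoustic model could be conditioned on a situational context label, so that the same text can yield different prosodic realizations in different situations. This is a minimal, illustrative PyTorch sketch, not a method prescribed by this call; the class name, the discrete context label, and all dimensions are our own assumptions.

```python
# Minimal sketch: injecting a situational context label into a neural
# acoustic model by concatenating a context embedding with phone embeddings.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SituatedAcousticModel(nn.Module):
    def __init__(self, n_phones=64, n_contexts=8, emb=128, hidden=256, n_mels=80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, emb)      # linguistic input
        self.context_emb = nn.Embedding(n_contexts, emb)  # situation/listener label
        self.rnn = nn.GRU(2 * emb, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)           # frame-level spectra

    def forward(self, phones, context_id):
        # phones: (batch, time) phone indices; context_id: (batch,) situation label
        p = self.phone_emb(phones)
        c = self.context_emb(context_id).unsqueeze(1).expand(-1, phones.size(1), -1)
        h, _ = self.rnn(torch.cat([p, c], dim=-1))
        return self.to_mel(h)  # mel-spectrogram frames for a downstream vocoder

model = SituatedAcousticModel()
mels = model(torch.randint(0, 64, (2, 50)), torch.tensor([0, 3]))
print(mels.shape)  # torch.Size([2, 50, 80])
```

Conditioning on richer situational signals (listener state, acoustic environment, dialogue history) rather than a single discrete label is exactly the kind of extension this Special Issue invites.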

Prof. Dr. Martti Vainio
Dr. Juraj Simko
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Multimodal Technologies and Interaction is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • speech synthesis
  • situatedness
  • expressivity
  • prosodic variation
  • deep learning
  • signal processing
  • text understanding
  • artificial intelligence
  • multimodal interaction
  • spoken interaction

Published Papers (1 paper)

Research

21 pages, 738 KiB  
Article
Interactive Hesitation Synthesis: Modelling and Evaluation
by Simon Betz, Birte Carlmeyer, Petra Wagner and Britta Wrede
Multimodal Technol. Interact. 2018, 2(1), 9; https://doi.org/10.3390/mti2010009 - 2 Mar 2018
Cited by 17 | Viewed by 7302
Abstract
Conversational spoken dialogue systems that interact with the user rather than merely reading the text can be equipped with hesitations to manage dialogue flow and user attention. Based on a series of empirical studies, we elaborated a hesitation synthesis strategy for dialogue systems, which inserts hesitations of a scalable extent wherever needed in the ongoing utterance. Previously, evaluations of hesitation systems have shown that synthesis quality is affected negatively by hesitations, but that they result in improvements of interaction quality. We argue that due to its conversational nature, hesitation synthesis needs interactive evaluation rather than traditional mean opinion score (MOS)-based questionnaires. To validate this claim, we dually evaluate our system’s speech synthesis component, on the one hand, linked to the dialogue system evaluation, and on the other hand, in a traditional MOS way. We are thus able to analyze and discuss differences that arise due to the evaluation methodology. Our results suggest that MOS scales are not sufficient to assess speech synthesis quality, leading to implications for future research that are discussed in this paper. Furthermore, our results indicate that synthetic hesitations are able to increase task performance and that an elaborated hesitation strategy is necessary to avoid likability issues.
(This article belongs to the Special Issue Situated Speech Synthesis: Beyond Text-to-Waveform Mapping)
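For readers unfamiliar with the convention the abstract contrasts with interactive evaluation, the sketch below shows a traditional MOS computation: each stimulus is rated on a 1–5 scale and ratings are averaged per condition. It is a minimal illustration; the ratings and the normal-approximation confidence interval are our own assumptions, not data from the article.

```python
# Minimal sketch of a mean opinion score (MOS) computation on a 1-5 scale.
# The ratings below are invented for illustration; they are NOT data from
# the article, which argues such scores miss interactive quality effects.
import statistics

def mos(ratings):
    """MOS with a normal-approximation 95% confidence interval."""
    m = statistics.mean(ratings)
    half = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return m, (m - half, m + half)

fluent = [4, 4, 5, 3, 4, 4, 5, 4]      # plain read-aloud synthesis
hesitating = [3, 4, 3, 3, 4, 2, 3, 3]  # hesitation-enriched synthesis

print("fluent:    ", mos(fluent))
print("hesitating:", mos(hesitating))
# A lower MOS for the hesitating condition can coexist with better task
# performance in interactive evaluation, which is the article's point.
```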