**4. Participation of Multiple Robots**

Although the question–answer–response dialogue model is expected to be robust against speech recognition failures, the dialogue it generates might be tedious for users. This is because the dialogue system is a one-way model: the robot asks a question, the user answers it, and the robot responds to the answer. If this pattern continues for a while, the user may become bored and end the dialogue early.

We let two robots participate in the dialogue to reduce this tediousness. Several advantages of using multiple robots in a dialogue have been reported. For example, when multiple robots participate in a conversation, users appear to become less sensitive to inconsistencies in the dialogue [46,47]. In addition, users tend to find it easier to talk with multiple robots [48,49] and are more likely to experience eye contact with them [50]. Karatas et al. [51] developed a multi-agent system that interacts with a driver in a car and showed that using multiple agents reduces the driver's cognitive load compared to using a single agent. Sakamoto et al. [52] conducted a field trial in which two robots provided information to passersby at a station and found that passersby were more likely to stop when two robots were talking to each other than when a single robot was talking. Iio et al. [47] demonstrated that visitors in an exhibition hall tended to have longer dialogues with multiple robots than with a single one.

In this study, we defined the participation framework of a dialogue in which two robots and one user participate, following Goffman [53]. The participation framework consists of a speaker, an addressee, and a side-participant. The turn-taking system was implemented based on Sacks et al. [54]. The turn-taking rules for each state in our system are described below.

In the question state, the rule for selecting a speaker differs between the first question and subsequent ones. For the first question, one of the robots is randomly selected as the speaker, and the other robot becomes the side-participant. For subsequent questions, the speaker of the previous question is selected as the speaker again; however, when the topic changes from the previous question, the side-participant of the previous question is selected as the speaker instead.

This approach is based on the concept of common ground [55] among the participants. When a speaker asks a question on a certain topic, the addressee and the side-participant acquire common ground that the speaker appears to be interested in that topic. Such common ground allows the speaker to continue asking questions on the same topic; therefore, it is reasonable for the speaker of the previous question to continue asking. However, when the topic of a question changes, the sudden topic shift is difficult to interpret in terms of that common ground. Arimoto et al. [46] reported that the unnaturalness of a sudden topic shift can be alleviated by changing the speaker. Therefore, it is reasonable for the side-participant of the previous question to become the speaker when the topic changes.
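The question-state selection rule described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the robot identifiers, the dictionary-based state representation, and the function name are all assumptions made for the example.

```python
import random

ROBOTS = ("robot_a", "robot_b")  # hypothetical identifiers for the two robots

def select_question_speaker(prev_question=None, topic=None):
    """Select (speaker, side_participant) for the question state.

    prev_question: None for the first question; otherwise a dict holding
    the previous question's 'speaker', 'side_participant', and 'topic'.
    """
    if prev_question is None:
        # First question: either robot is selected at random.
        speaker = random.choice(ROBOTS)
    elif topic != prev_question["topic"]:
        # Topic change: the previous side-participant takes over.
        speaker = prev_question["side_participant"]
    else:
        # Same topic: the previous questioner keeps asking.
        speaker = prev_question["speaker"]
    side_participant = ROBOTS[1] if speaker == ROBOTS[0] else ROBOTS[0]
    return speaker, side_participant
```

Keeping the questioner fixed while the topic is unchanged, and swapping roles on a topic shift, directly encodes the common-ground rationale discussed above.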

In the answer state, the user becomes the speaker. The speaker of the previous question is regarded as the addressee from the viewpoint of the concept of the adjacency pair [56].

In the backchannel state, the speaker of the previous question is selected as the speaker again. This is grounded in the concept of the sequence-closing third [57]. Since the backchannel state is regarded as a post-expansion of the question–answer adjacency pair, it is reasonable that the speaker of the previous question, who was the addressee in the previous answer state, becomes the speaker.

In the comment state, the speaker depends on the speech recognition result of the previous answer state. Although the speaker of the previous question is usually selected as the speaker, the side-participant of the previous question is selected when the result contains a negative expression (such as "No", "Nothing", or "Never") or when recognition times out. When the side-participant is selected, the side-participant speaks to the previous speaker; in other words, that speaker becomes the addressee. In this manner, the speaker can easily continue asking a question in the next question state. A user's answer containing negative expressions may indicate low interest in the topic. If the side-participant then expresses interest in the topic by commenting on the previous question, it appears reasonable, from the viewpoint of common ground [55], for the speaker to continue asking questions on the same topic, because at least one participant has shown interest in it.
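The comment-state rule can likewise be sketched in a few lines. The keyword list and the simple substring matching are illustrative simplifications (a real system would need more careful negative-expression detection), and the robot/user identifiers are assumptions of the example, not the authors' code.

```python
NEGATIVE_EXPRESSIONS = ("no", "nothing", "never")  # illustrative keyword list

def select_comment_speaker(recognized_answer, prev_speaker, prev_side_participant):
    """Select (speaker, addressee) for the comment state.

    recognized_answer: the speech recognition result of the answer state,
    or None when recognition timed out.
    """
    timed_out = recognized_answer is None
    # Naive substring check; real negative-expression detection would
    # need tokenization to avoid false matches (e.g. "know" contains "no").
    negative = (not timed_out and
                any(word in recognized_answer.lower()
                    for word in NEGATIVE_EXPRESSIONS))
    if timed_out or negative:
        # The side-participant comments, addressing the previous questioner,
        # so that the questioner can naturally continue the topic.
        return prev_side_participant, prev_speaker
    # Otherwise the previous questioner comments, addressing the user.
    return prev_speaker, "user"
```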

**5. System**

We developed a twin-robot dialogue system incorporating the two features described above: the question–answer–response dialogue model and the participation of two robots in a dialogue. The hardware components of the system are shown in Figure 3, and the system architecture is shown in Figure 4.

A microphone array collects sounds, which are integrated through a noise reduction process performed by the array. The integrated sound is then sent to the automatic speech recognition module, which recognizes the user utterance. We used a cloud speech recognition service provided by NTT Docomo; the service receives a voice signal and returns speech recognition results, which are sent to the utterance selection module. According to the selection rules (see Section 3.3), the utterance selection module selects an utterance from the database. The selected utterance is sent to the robot controllers. The speech recognition results are also sent to the nodding generation module as a signal of user speech. The nodding generation module sends a nodding motion to the robot controllers; nodding expresses that the robot is listening to the user. This motion is executed in the answer state whenever the system receives a speech recognition result. The robot controllers interpret the utterance together with its motions and execute them. After the execution is completed, a completion signal is sent to the utterance selection module, which then selects the next utterance. In this manner, the system repeatedly selects and executes utterances according to the speech recognition results and its own behavior execution.
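One cycle of this pipeline can be sketched with stub components. Everything here is a minimal illustration: the class and function names are invented for the example, and the real system uses a cloud ASR service and networked robot controllers rather than in-process stubs.

```python
class RobotController:
    """Stub robot controller; the real one drives a CommU robot."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, command, text=""):
        # The real controller plays speech with synchronized motion;
        # here we only record the command for illustration.
        self.log.append((command, text))
        return "done"  # completion signal back to utterance selection


def run_turn(recognition_result, select_utterance, speaker, listener):
    """One cycle: ASR result -> utterance selection -> execution.

    While the user is speaking, each recognition result also triggers
    a nod from the listening robot (the nodding generation module).
    """
    listener.execute("nod")  # nod whenever a recognition result arrives
    utterance = select_utterance(recognition_result)
    completion = speaker.execute("speak", utterance)
    return completion  # on "done", the next utterance is selected
```

The completion signal returned by the controller is what lets the utterance selection module alternate between reacting to recognition results and reacting to its own finished behaviors, as described above.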

A social-conversational robot developed by VSTONE, CommU, was used as the dialogue partner in our system. The robot is desktop sized, at 304 mm high, 180 mm wide, and 131 mm deep, and weighs 938 g. CommU has three degrees of freedom (df) in its waist, 3 df in its neck, and 2 df in each eye. The robot has two LEDs in its cheeks. The robot controller was a software server that received commands, such as "speak" or "nod", and controlled the robot according to the received command.

**Figure 3.** Hardware components of the system and the connections between them.

**Figure 4.** The system architecture.
