1. Introduction
Because humans are extremely communicative, any species that cohabits with us was likely selected to take advantage of this characteristic [1]. Dogs (Canis familiaris) are a good model for studying interspecific communication because they can develop and use a flexible signaling system when dealing with humans [2]. Zimen [3] showed that this capacity is derived from wolves. Both subspecies use visual and acoustic signals; however, many canine visual signals involve tail and/or ear movements, which have no counterpart in humans, making the understanding and use of such signals more challenging for us. Likewise, human visual cues often include hand movements for which there is no canine parallel. With regard to acoustic signals, some can result in a lasting communicative pattern based on repeated interactions between sender and receiver [4] and, in the case of dogs, this ability was thought to have been intensified through domestication [5]. Studies have pointed to a synergy between phylogeny and ontogeny in the development of dog abilities, including interspecific communication [6].
Considering that humans are a very vocal species, it is plausible that their communicative approach towards dogs is spontaneously based on the vocal channel. In Western cultures, it is common for humans to adopt a special type of speech when they “talk” to pets [7], known as pet-directed speech (PDS). It shares some of the acoustic characteristics of infant-directed speech (IDS), including increased pitch and exaggerated, affected vowel articulation, as well as a slower rhythm of words compared to adult-directed speech (ADS). PDS and IDS may show similarities because both dogs and babies are non-verbal listeners, and the affective bond between owners and dogs is known to reflect the human parent–baby bond [8]. In addition, both owners and dogs have been shown to experience oxytocin secretion after a brief period of petting [9], and a study highlighted brain activation of connected areas when mothers viewed images of both their child and their dog [10]. Furthermore, it has been shown that the acoustic characteristics of PDS attract the attention of dogs significantly more than those of ADS [8].
Ben-Aderet and colleagues [11] were the first to investigate both the production of speech used specifically for dogs (dog-directed speech, DDS) and the behavioral responses of puppies, adult dogs, and old dogs to DDS. They found that although humans produce DDS for animals of different age groups, the preference of dogs for this kind of speech decreases with age: puppies showed greater behavioral responses to DDS than to ADS, while adult and old dogs showed no preference for either type of speech. According to the authors, directing DDS to adult and old dogs may simply constitute a “spontaneous attempt to facilitate interactions with non-verbal listeners”. This interpretation may be related to the “hyperphony” hypothesis [12], according to which speakers use optimized speech patterns to improve speech intelligibility with animals, which are expected to be more sensitive to this special modulation of the voice tone.
However, alternative explanations for the lack of effect of DDS on adult dogs still deserve investigation. Ben-Aderet and colleagues [11] suggested that adult dogs may need additional cues (e.g., gestures) to respond to unfamiliar speakers. In their initial study, DDS and ADS were recorded, and only their playbacks were used in the experiments. Therefore, adult dogs did not have the opportunity to receive extra cues. This condition may have affected communication by influencing sound reception and by precluding social interaction; the dogs may therefore not have seen any social benefit in reacting preferentially to either kind of speech. In contrast, puppies, having little experience in terms of environment and social interactions, may lack focus on this aspect, responding to DDS even in the absence of a physical experimenter.
Taking this matter as a starting point, Benjamin & Slocombe [
13] carried out an experiment to investigate the possible effects of DDS and ADS on the levels of attention and affiliation of adult dogs. The authors also investigated whether possible behavioral preferences were modulated by prosody and/or content, under conditions ecologically more relevant compared to the study by Ben-Aderet and colleagues [
11]. Adult dogs attended more and sought more proximity to an experimenter whose speech had dog-relevant content (e.g., good boy/girl!) and was spoken with elevated pitch and exaggerated prosody than to the experimenter whose speech lacked any of these characteristics. This finding suggested that DDS might fulfill a dual function: improving attention and social connection. This last aspect is in line with the current understanding reached by research with children, which suggests that not only is IDS crucial for the development of meaningful social relationships with caregivers [
14], but that it is useful for facilitating language acquisition [
15]. Another study has corroborated the results of Benjamin & Slocombe [
13] by pointing out that dogs have shown preferences towards a target object associated with DDS in the DDS versus ADS condition [
16]. Despite the existence of studies investigating the perception and processing of human vocalizations by dogs, little is known about how their ancestors, the wolves, respond to these stimuli.
In line with the fact that DDS is often characterized by high pitch [
7], Ben-Aderet and colleagues [
11] also found that dogs reacted more to high pitch than low pitch. Interestingly, however, pitch did not seem to determine their ability to differentiate between DDS and ADS nor did it explain preferences for DDS [
16], questioning the importance of single acoustic parameters.
We here focus on communication between humans and animals during Positive Reinforcement Training (PRT), a type of Operant Conditioning technique, which was shown to promote improvements in animal welfare, possibly through the association between the animals’ behavior and its pleasant consequences [
17,
18]. In essence, PRT may benefit animals by providing positive feedback and, eventually, by promoting opportunities for the animal to exert some control over their environment [
19]. However, the beneficial effects recorded with PRT might also be promoted by the opportunity to interact with a familiar person with whom the animal may have developed a social bond [
20].
Vasconcellos and colleagues [
18] investigated the effects of regular interactions of equally raised and kept timber wolves and mixed-breed dogs with familiar humans during PRT sessions. In addition to their training performance, the animals’ behavioral and physiological responses (salivary cortisol variations) were evaluated. Apart from a good training performance, a reduction in salivary cortisol concentrations was recorded in both subspecies, as well as low rates of non-training-related behaviors (NTBs) [21]. Interestingly, up to 22.8% of the variation in the animals’ responses was due to trainer identity. These results showed, for the first time, that not only dogs but also wolves may benefit from PRT.
Considering that the way humans communicate/interact with dogs has been shown to affect the owner-dog relationship, animal behavior, and possibly animal welfare [
22], in the current study, we explored how animals’ responses might differ according to trainer communication style within a training session. We reasoned that one important point might be the use of voice and thus set out to evaluate the effects on animals’ responses of the duration and frequency (number of occurrences) of the types of speech used within a training session, and of the average acoustic characteristics of trainers’ voices during sessions. We focused on two hypotheses: (1) the communication style (duration/frequency of nice, neutral or reprehensive speech) and the specific acoustic parameters (such as pitch) that characterize the different speech types are associated with different behavioral and physiological responses of dogs and wolves during training; (2) the domestication process affected the perception of and responses to the communication style of humans and acoustic sound parameters. From these hypotheses, we developed two predictions: (1) the duration/frequency of “nice” speech and/or high pitch will be associated with behaviors indicative of affability and interest in training, such as increased duration of tail wagging, attention and performance in the animals, and greater proximity within the dyad, whereas (2) the duration/frequency of “reprehensive” speech will be associated with an increase in behaviors unrelated to training, greater distance within the dyad, and longer duration of the tail being retracted, for both subspecies.
2. Materials and Methods
2.1. Ethics Statement
All study animals were kept at the Wolf Science Center (www.wolfscience.at: license n°: AT00012014). The CITES (www.cites.org) import permits for the animals are 2008: Zoo Herberstein, Austria: AT08-B-0998, AT08-B-0996, AT08-B-0997; 2009: Triple D Farm, USA: AT09-E-0018; 2012: Minnesota Wildlife Connection, USA: 12AT330200INEGCJ93. The animals were housed in accordance with the Austrian Federal Act on the Protection of Animals (Animal Protection Act, TSchG, BGBl. I Nr. 118/2004). Hence, in accordance with the Austrian Animal Experiments Act (BGBl. I Nr. 114/2012, Tierversuchsgesetz 2012, TVG 2012), no ethical approval was officially required, but we still obtained one from the University of São Paulo, Brazil (Committee of Ethics for Animal Research from the Institute of Psychology, University of São Paulo, Brazil, approval number 016.2009).
2.2. Subjects
Eighteen animals were studied: nine timber wolves (mean age = 15 months ± 2.04) and nine dogs (mean age = 21 months ± 3.37), all born in captivity, raised and kept following the same protocol [
23] at the Wolf Science Center, an institution located in the Game Park Ernstbrunn, Austria.
Table 1 shows the animals’ names, sexes, and ages at the onset of the study.
All study animals were hand-raised and maintained in close contact with the five participating trainers during the first 20 weeks of life. From the third week onwards, sporadic contact with conspecifics (including adult individuals; wolves in the case of wolf pups and dogs in the case of dog pups) was provided, and at this age we also started more formal PRT interactions.
2.3. Saliva Collection
Prior to the beginning of the study, dogs and wolves were trained using PRT techniques to allow saliva collection. Saliva was used for the physiological assessment of stress by measuring salivary cortisol. The collection procedure consisted of placing two surgical hydrocellulose sponges (Sorbette, by Salivette®) in the animal’s cheek pouch long enough for them to become soaked. Immediately after saliva collection, the sponges were transferred to a plastic tube (Sarstedt®) and stored at −20 °C until analysis by enzyme immunoassay [18]. During the sessions of familiarization with the collection procedures, as well as during saliva collection and the training sessions, the animals were rewarded exclusively with pieces of Gouda cheese, to control for the influence of protein in the saliva samples [24]. Saliva collection was performed 2–4 min before, and 15 min after the end of, each training session.
2.4. Procedures
Each study animal participated in 15 training sessions of 5 min each (3 sessions with each of the 5 trainers), totaling 270 sessions, 135 with dogs and 135 with wolves. The training sessions were run between May 2010 and March 2011 and were filmed with a video camera. These were conducted with the animals isolated from the pack, between 8:30 a.m. and 5:00 p.m., in a training room (63.6 m²) located close to their enclosure. The training room did not allow visual contact with other animals or humans and was almost empty except for a raised platform on one side. No animal participated in more than one training session per day. Each trainer, however, worked with four to nine animals on each training day, in a randomized order.
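The session counts stated above follow directly from the crossed design (every animal with every trainer, three times). The sketch below enumerates that design to make the arithmetic explicit. The animal identifiers and the `build_design` and `count_by` helpers are hypothetical illustrations, not part of the study's materials; the trainer labels T1 to T5 follow the labels used in the Results.

```python
from itertools import product

# Design constants taken from the text; identifiers below are hypothetical.
ANIMALS_PER_SUBSPECIES = {"dog": 9, "wolf": 9}
TRAINERS = ["T1", "T2", "T3", "T4", "T5"]
SESSIONS_PER_PAIR = 3  # sessions of each animal with each trainer


def build_design():
    """Return one record per training session in the fully crossed design."""
    design = []
    for subspecies, n_animals in ANIMALS_PER_SUBSPECIES.items():
        animal_ids = [f"{subspecies}_{i}" for i in range(1, n_animals + 1)]
        repetitions = range(1, SESSIONS_PER_PAIR + 1)
        for animal, trainer, rep in product(animal_ids, TRAINERS, repetitions):
            design.append(
                {
                    "subspecies": subspecies,
                    "animal": animal,
                    "trainer": trainer,
                    "repetition": rep,
                }
            )
    return design


def count_by(design, key):
    """Count sessions per level of the given key."""
    counts = {}
    for row in design:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


if __name__ == "__main__":
    design = build_design()
    print(len(design))                     # total number of sessions
    print(count_by(design, "subspecies"))  # sessions per subspecies
    print(count_by(design, "trainer"))     # sessions per trainer
```

Because each animal appears 15 times and each trainer appears repeatedly, the sessions are not independent observations, which is why the analysis treats animal and trainer as repeated-measures factors.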
All wolves and dogs were already familiar with the cues used during the training sessions (sitting, lying down, turning around, walking around the trainer, giving a paw, allowing the placement of a muzzle, allowing the placement of a harness, rolling, standing, staying, and looking into the trainer’s eyes). The vocal interactions during the sessions occurred in a naturalistic context. Trainers were instructed to remain standing, in a relaxed posture, giving cues in random order and rewarding the animals when the cues were correctly followed. Animals’ responses were rewarded with cheese in a continuous reinforcement regime. The study was designed to be minimally invasive, in such a way that every animal participated voluntarily in the training sessions, being invited by name to enter the test room. In case of reluctance or discomfort with the procedure, the animal would be returned to the group, where it remained until it was ready to take part in the session.
2.5. Animals’ Behaviors and the Types of Speech
We evaluated all trainer vocalizations: speeches (phrases uttered by the trainers during the interactions), animal names, and laughs produced by the trainers during the 270 five-minute training sessions, excluding the commands used for training. The commands were not analyzed because their pronunciation was standardized, always in a neutral tone of voice. The start and end of each vocalization were identified both acoustically (by listening to the recordings) and visually, using spectrograms (Figure 1).
Table 2 contains the description of seven types of trainer vocalization recorded in the videos from the sessions, whose duration or frequency of use was summed up for each training session and considered here as explanatory variables: nice, neutral and reprehensive speeches and names; and laugh. As response variables, we used the same ones evaluated by Vasconcellos et al. [
18]: correct responses, latency, visual orientation to trainer, time at less than 1 m from trainer, Non-Training Behaviors (NTB; representing counter-productive behaviors, regarding training), and cortisol variation, in addition to the animals’ tail position/movements (
Table 3).
All behavioral parameters considered here were coded from the videos by only one person (author M.G.B.F.—Melissa Gabriela Bravo Fonseca) through Focal Sampling, with Continuous Recording of behaviors, using the Solomon Coder program (Beta version 19.08.02, 2019, by András Péter). To obtain a measure of reliability of the behavior coding, 20% of all videos were re-coded (M.G.B.F.), and the scores compared with those of the first viewing through a Spearman rank correlation; the results indicated a good agreement, with correlation coefficients ranging from 0.86 to 0.99. The variables nice names, neutral names, reprehensive names, correct responses and repetitions were evaluated in terms of frequency (number of occurrences); all other variables were analyzed in terms of duration (seconds).
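The intra-observer reliability check described above compares two codings of the same videos with a Spearman rank correlation. The following is a minimal sketch of that statistic using only the Python standard library: values are converted to ranks (ties receive their average rank) and a Pearson correlation is computed on the ranks. The function names and the example scores are hypothetical and do not come from the study's data.

```python
def average_ranks(values):
    """Return 1-based ranks of values, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical durations (s) of one behavior in five videos, coded twice.
    first_coding = [12.0, 30.0, 7.0, 45.0, 30.0]
    second_coding = [14.0, 28.0, 7.0, 44.0, 31.0]
    print(spearman(first_coding, second_coding))
```

A coefficient close to 1 indicates that the two codings order the videos almost identically, which is the sense in which the reported range of 0.86 to 0.99 reflects good agreement.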
In order to contribute to the understanding about the possible effects of different types of speech on the animals’ responses, additional exploratory analyses were performed: a comparison among acoustic parameters of the three types of speech, and an analysis of the effects of these separate parameters on the animals’ responses.
2.6. Acoustic Comparison among Types of Speech
To investigate acoustic differences between the three types of speech classified in the first stage of the study, we defined six acoustic variables (
Table 4). Although all the speeches present all the studied acoustic components (minimum frequency, maximum frequency, peak frequency, average power), we separated the speeches into these components for analysis.
2.7. Animals’ Behaviors and Voice Acoustic Characteristics
We extracted the audio from the videos recorded during all 270 training sessions, in WAV format, with the Any Video Converter program. The recordings were analyzed using the Raven Pro 1.5 software (Cornell Laboratory of Ornithology, Cornell University), through the analysis of spectrograms.
The fundamental frequency (F0) of the first harmonic of each speech was selected in the spectrogram (
Figure 1), and the following parameters were considered for the analysis of F0: (a) Visualization: spectrogram 1; (b) Channel: 1; (c) Brightness: 50%; (d) Contrast: 50%; (e) Size of the spectrogram viewing window: 512; (f) Time interval (“x” axis): 200 ms; and (g) Frequency in kHz (“y” axis): 0 to 2.80. Separate spectrograms for each of the three types of speech analyzed in our study (nice, neutral, and reprehensive) can be seen in the
Supplementary Material (Figure S1). For this analysis, the category laugh was considered as a positive non-verbal emotional vocalization [
25], and therefore was included in the category “nice”.
For each training session, the averages of the six acoustic variables used to analyze the vocalizations emitted by the trainers (
Table 4) constituted the explanatory variables for this stage of the analysis. As response variables in these analyses, we used the same ones described above: correct responses, latency, visual orientation to trainer, time at less than 1 m, Non-Training Behaviors (NTB), and cortisol variation, in addition to the animals’ tail position/movements (
Table 3).
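The per-session aggregation described above can be sketched as follows: each annotated vocalization contributes one row of measurements, and the rows are reduced to one record per session. The field names, the `summarize_sessions` helper, and the example values are hypothetical; in particular, this sketch treats the number of speeches as a per-session count rather than an average, which is an assumption about how that sixth variable was derived.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for the measurements taken on each vocalization.
ACOUSTIC_FIELDS = (
    "min_freq_hz",
    "max_freq_hz",
    "peak_freq_hz",
    "avg_power_db",
    "delta_time_s",
)


def summarize_sessions(selections):
    """Reduce per-vocalization rows to one summary record per session."""
    by_session = defaultdict(list)
    for row in selections:
        by_session[row["session_id"]].append(row)

    summary = {}
    for session_id, rows in by_session.items():
        stats = {field: mean(r[field] for r in rows) for field in ACOUSTIC_FIELDS}
        stats["n_speeches"] = len(rows)  # count, not an average
        summary[session_id] = stats
    return summary


if __name__ == "__main__":
    # Illustrative values only, not measurements from the study.
    example = [
        {"session_id": "s1", "min_freq_hz": 180.0, "max_freq_hz": 420.0,
         "peak_freq_hz": 250.0, "avg_power_db": 60.0, "delta_time_s": 0.8},
        {"session_id": "s1", "min_freq_hz": 220.0, "max_freq_hz": 480.0,
         "peak_freq_hz": 350.0, "avg_power_db": 64.0, "delta_time_s": 1.2},
        {"session_id": "s2", "min_freq_hz": 150.0, "max_freq_hz": 300.0,
         "peak_freq_hz": 200.0, "avg_power_db": 58.0, "delta_time_s": 0.5},
    ]
    for session_id, stats in summarize_sessions(example).items():
        print(session_id, stats)
```

Averaging within a session discards moment-to-moment variation, so associations found with these summaries describe sessions as a whole rather than the animals' reactions to individual utterances.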
2.8. Statistical Analysis
The statistical analyses were performed using Generalized Linear Mixed Models (GLMMs) with a Poisson distribution, adjusted for repeated measures, in the R software. By adjusting for repeated measures, we could use data from all 270 training sessions (3 sessions of each animal with each of the 5 trainers), controlling for trainer and animal repeatability (random factor). We used “lme4” [
26], “MASS” [
27], “car” [
28] and “tidyverse” [
29] packages to fit the GLMMs in R statistical software, version 4.0.1. Statistical significance was assessed at α = 0.05. We used an iterative procedure (i.e., starting with the full model and removing explanatory variables with no effect on the response variable). We built two models: GLMM1 to investigate the responses of animals to the three types of speech, and GLMM2 to analyze the animals’ responses to acoustic characteristics of the trainers’ vocalizations.
In the GLMM1 we evaluated the effects of the explanatory variables subspecies, nice names and speeches, neutral names and speeches, reprehensive names and speeches, and laugh on the response variables correct responses, repetitions, latency, visual orientation to trainer, time at less than 1 m, NTBs, tail position/movements, and cortisol variations. With the GLMM2, we evaluated the effects of subspecies, as well as the acoustic variables of the trainers’ vocalizations (minimum frequency, maximum frequency, average power, delta time, peak frequency, and number of speeches) on the response variables correct responses, repetitions, latency, visual orientation to trainer, time at less than 1 m, NTBs, tail position/movements, and cortisol variations.
In both GLMMs, the first model was always built including all data from dogs and wolves, as well as possible two-way interactions between variables.
Tables S1 and S2 present the full (initial) models of GLMM1 and GLMM2, respectively. If any explanatory variable showed interaction with the variable subspecies, separate models were run for each subspecies. Pearson correlations were run between the explanatory and the response variables that did not allow the GLMMs to be run (e.g., due to data distribution).
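For readers who prefer a formula, the full models described above can be written schematically as a Poisson GLMM. The specification below is a sketch under stated assumptions: the text does not give the link function or the exact random-effects structure, so a log link and independent random intercepts for animal and trainer are assumed here, and only interactions with subspecies are written out. The symbols y, x, a, b, and the coefficient names are introduced for illustration.

```latex
% y_{ijk}: response for animal i with trainer j in session k (assumed notation)
% x_{m,ijk}: m-th explanatory variable (speech type or acoustic measure)
\begin{align}
  y_{ijk} \mid a_i, b_j &\sim \mathrm{Poisson}(\mu_{ijk}), \\
  \log \mu_{ijk} &= \beta_0 + \beta_s\,\mathrm{subspecies}_i
      + \sum_{m=1}^{p} \beta_m\, x_{m,ijk}
      + \sum_{m=1}^{p} \gamma_m \left(\mathrm{subspecies}_i \times x_{m,ijk}\right)
      + a_i + b_j, \\
  a_i &\sim \mathcal{N}(0, \sigma_a^2), \qquad b_j \sim \mathcal{N}(0, \sigma_b^2).
\end{align}
```

Under this reading, a nonzero interaction coefficient for some explanatory variable is what triggers fitting separate models for dogs and wolves.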
For the acoustic characterization of the three types of speech (nice, neutral, and reprehensive), means and standard deviations of the variables (minimum frequency, maximum frequency, average power, delta time, and peak frequency), in addition to the total number of speeches, were calculated for each type of speech. Subsequently, repeated measures ANOVA with Tukey’s post hoc (for normal distribution of data) or Friedman’s with Dunn’s post hoc (for non-normal data) were run to check for differences among the types of speech, considering each of the acoustic parameters of sound mentioned above.
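For the non-normal case, the Friedman test ranks the three speech types within each repeated-measures block and compares the rank sums across types. The sketch below computes the uncorrected Friedman chi-square statistic with the standard library only. It omits the tie correction and the p-value, and the choice of block (for example, one trainer or one session) is an assumption, since the text does not state it. The function names and example values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks of values, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    blocks: list of n rows, each holding k values, one per condition
    (here, one per speech type), measured on the same block.
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, rank in enumerate(average_ranks(block)):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


if __name__ == "__main__":
    # Hypothetical mean peak frequencies (Hz) per block for
    # (nice, neutral, reprehensive) speech; not data from the study.
    blocks = [
        [310.0, 240.0, 280.0],
        [295.0, 250.0, 300.0],
        [330.0, 260.0, 290.0],
        [305.0, 245.0, 270.0],
    ]
    print(friedman_statistic(blocks))
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; a significant result would then be followed by pairwise post hoc comparisons such as Dunn's test.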
3. Results
3.1. Animals’ Behaviors and the Types of Speech
Table S3 presents the GLMM models with the analyses of the animals’ responses to the types of speech, with both subspecies together.
Table 5 and
Table 6 present the variables that had effects only for dogs and only for wolves, respectively. The response variables retreating and tail retreated, evaluated by Pearson correlations for the two subspecies, are available in
Table 7.
For both dogs and wolves, we identified positive correlations between exploration by the animals and the emission of nice names by the trainer, and also between tail wagging and nice speeches (
Figure 2). Regarding tail wagging, while for dogs we identified positive correlations between this variable and neutral speech, for wolves this association was negative. We also recorded negative association between tail wagging and neutral names for dogs, and positive association of tail wagging and neutral names for wolves.
For dogs only, there were direct associations between (a) jumping and neutral speeches, (b) retreating and nice and neutral speeches, and (c) tail retreated and nice names. Negative associations were identified between (d) correct responses and reprehensive speeches and (e) tail wagging and reprehensive speeches.
For wolves only, we identified inverse relations between the orientation to trainer and the names emitted in the three types of voice (nice, neutral, and reprehensive). Negative correlations were also observed between the names in the three intonations and the time the wolves spent at less than 1 m from the trainer. Still regarding the proximity within the dyad, nice speeches correlated positively with the time the wolves spent within 1 m of the trainers (
Figure 3). We also found negative correlations between tail wagging and reprehensive names, and a positive correlation between tail wagging and laughing. Positive correlations were also identified between retreating versus neutral and reprehensive names, as well as between tail retreated and neutral names.
3.2. Acoustic Comparison among the Types of Speech
The acoustic comparison among the three types of speech (nice, neutral, and reprehensive) demonstrated measurable differences in acoustic characteristics (minimum frequency, maximum frequency, delta time, peak frequency, and number of speeches; Table S4). Nice speech was used more often (10,391 times), being more energetic at higher frequencies and having higher intensity than neutral speech, but lower intensity than reprehensive speech. Compared to the other categories, reprehensive speech had the most extreme values in terms of duration, intensity, sound amplitude, and frequencies. Although we recorded just 47 reprehensive speeches in the 270 sessions, their duration was longer than that of the other types of speech. Neutral speech presented intermediate characteristics between the other two types of speech, being deeper and more energetic at low frequencies.
Figure S2 shows the average acoustic characteristics (minimum frequency, maximum frequency, peak frequency, average power, delta time, and number of speeches) of the three different types of speech (nice, neutral, and reprehensive) for each of the five trainers involved in the project (T1–T5).
3.3. Animals’ Behaviors and Voice Acoustic Characteristics
Some of the acoustic characteristics of the trainers’ voices showed interaction with subspecies when analyzed for dogs and wolves together, but the effect was not significant in the analysis separated for dogs and wolves.
Table S5 presents the full final model for analysis of the animals’ responses to the acoustic characteristics of the trainers’ voices (dogs and wolves together);
Table 8 and
Table 9 show the final statistical results of the GLMMs for dogs and wolves, respectively. The response variables retreating and tail retreated were evaluated by separate Pearson correlations for each subspecies (
Table 10).
For both dogs and wolves, we identified positive correlations of tail wagging with maximum frequency and delta time, and an inverse correlation of this variable with average power and peak frequency (
Figure 4). In other words, the duration of tail wagging was associated with higher-pitched, longer, and softer speeches (with greater intensity at lower frequencies).
Correct responses, for dogs, were positively related to the measurements of minimum frequency (higher-pitched voices). This means that in sessions in which the speeches had higher pitch, more correct responses were obtained from the dogs. For wolves, correct responses were inversely associated with minimum frequency and delta time, and directly correlated with peak frequency and the number of speeches, i.e., there were more correct responses when the session had more speeches that were shorter and in a low-pitched tone.
Jumping correlated with maximum frequency and delta time in both subspecies: for dogs, jumps were positively associated with higher-pitched and longer vocalizations, whereas for wolves this association was negative. For wolves, we also identified a direct association between the number of jumps and average power. This means there was more jumping in sessions with more low-pitched, intense, and short speeches. Retraction was directly related to the number of speeches (more speeches) in sessions with dogs, and with the mean average power in sessions with wolves (more retraction in the sessions with a greater use of high pitch).
In sessions with wolves, we observed that (a) the visual orientation to trainer was negatively related to minimum frequency, average power and delta time, and directly correlated with peak frequency (greater use of low-pitched, short speeches); (b) the time spent within one meter of the trainer was positively associated with peak frequency, and negatively related to minimum frequency, maximum frequency, and delta time (low-pitched and short speeches); and (c) tail retracted was directly correlated with maximum frequency (high-pitched speeches).
3.4. Cortisol
As described by Vasconcellos and colleagues [18], the cortisol concentrations in the saliva samples taken after the training sessions were lower than those in the samples taken before the sessions in both wolves and dogs (t = −2.864, p = 0.004). The mean values for these concentrations were 1023.03 ± 75.99 ng/mL (wolves before training); 820.13 ± 64.03 ng/mL (wolves after training); 2280.87 ± 153.2 ng/mL (dogs before training); and 1851.99 ± 162.9 ng/mL (dogs after training). However, none of the explanatory variables investigated here had a measurable effect on salivary cortisol.
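To put the reported means on a common scale, the relative change from before to after training can be computed for each subspecies. The sketch below does this arithmetic on the four group means given above; the `percent_change` helper is a hypothetical illustration. A change computed from group means is not the same as the average of per-session changes, so the result is only a rough descriptive summary and not a reanalysis.

```python
# Group means (ng/mL) as reported in the text: (before, after) training.
REPORTED_MEANS_NG_ML = {
    "wolves": (1023.03, 820.13),
    "dogs": (2280.87, 1851.99),
}


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = decrease)."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for subspecies, (before, after) in REPORTED_MEANS_NG_ML.items():
        change = percent_change(before, after)
        print(f"{subspecies}: {change:.1f}% change in mean salivary cortisol")
```

On this descriptive scale the decrease is of similar size in both subspecies, roughly one fifth of the pre-training mean, even though the absolute concentrations of dogs are more than twice those of wolves.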