
## **About the Editors**

**Daniela Sammler**, PD Dr., is the leader of the research group Neural Bases of Intonation in Speech and Music at the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany. She combines behavioral, neuroimaging, and brain stimulation techniques in healthy and brain-damaged adults and professional musicians to elucidate the mechanisms of music/language perception and production and their neural overlap. Sammler received her Ph.D. (2008) and P.D. (2018) from the University of Leipzig and was a recipient of the Otto Hahn Award of the Max Planck Society.

**Stefan Elmer**, Dr., received his Ph.D. (2010) from the University of Zurich, where since 2012 he has been responsible for the Auditory Research Group Zurich at the Department of Neuropsychology. He has extensive experience in using brain imaging and neurophysiology techniques, and in the last few years, he has contributed to a better understanding of the electrophysiological, functional, and anatomical markers of music and language expertise, transfer effects, language learning, as well as plasticity in the auditory system in general.

## *Editorial* **Advances in the Neurocognition of Music and Language**

**Daniela Sammler 1,\* and Stefan Elmer 2,\***


Received: 27 July 2020; Accepted: 30 July 2020; Published: 2 August 2020

**Abstract:** Neurocomparative music and language research has seen major advances over the past two decades. The goal of this Special Issue "Advances in the Neurocognition of Music and Language" was to showcase the multiple neural analogies between musical and linguistic information processing and their entwined organization in human perception and cognition, and to infer the applicability of the combined knowledge in pedagogy and therapy. Here, we summarize the main insights provided by the contributions and integrate them into current frameworks of rhythm processing, neuronal entrainment, predictive coding and cognitive control.

**Keywords:** music; language; brain; rhythm; prosody; musical training; dyslexia; reading; oscillations; statistical learning; cognitive control

The scholarly fascination with the relationships between music and language (M&L) is as old as antiquity. To this day, continuous methodological progress and, in part, radical conceptual shifts have paved the way for new directions of research. In the 1990s, technological revolutions in neuroimaging revealed partial neural overlap between the two domains [1], despite dissociable clinical deficits in M&L [2]. Together with the known benefits of music for speech and language functions [3], this nurtured the idea that—once we understand what holds M&L together at their biological core—music interventions could constitute a bridge to prevent, alleviate, or even reverse speech and language disorders [4,5].

This Special Issue took stock of recent advances in the neurocognition of M&L to examine the current status of this vein of research. Sixteen research papers and reviews from 48 experts in linguistics, musicology, cognitive neuroscience, biological psychology and educational sciences demonstrate that research has been active on all fronts. As we will see, the studies followed two burgeoning trends in M&L research: First, they focused on common auditory processing of temporal regularities [6–9] that are thought to promote higher-level linguistic functions [8,10–14], possibly via mechanisms of neuronal entrainment [15]. Second, they explored top-down modulations of common auditory processes [16–18] by domain-general cognitive [19,20] and motor functions in both perception and production [21]. These topics were addressed using a broad toolkit of well-designed behavioral and computational approaches combined with functional magnetic resonance imaging (fMRI), near-infrared spectroscopy (NIRS) or electroencephalography (EEG) in different cohorts of participants.

The starting point for most of the included studies was that speech and music have similar acoustic [9,18] and structural features [6–8,13,15–17,19]. As argued in the review article of Reybrouck and Podlipniak [9], some of these sound features and their common preconceptual affective meanings may even reflect joint evolutionary roots of M&L that still prevail today, for example, in musical expressivity and speech prosody. Notably, a feature that was particularly central to half of all contributions is the *temporal* structure of M&L, i.e., the patterning of strong and weak syllables or beats that make up rhythm, meter and prosodic stress [6–8,10–13,15].

The rhythmic patterning of both speech and music has been proposed to draw on domain-general abilities which are required to perceive and process temporal features of sound [22,23]. Accordingly, three studies present data in line with common rhythm processing resources in M&L. First, Lagrois et al. [6] found that individuals with beat-finding deficits in music—so-called "beat-deaf" individuals—also show deficits in synchronizing their taps with speech rhythm, and more generally, in regular tapping without external rhythms. The authors argue that this pattern of deficits may arise from a basic deficiency in timekeeping mechanisms that affect rhythm perception across domains. Second, Boll-Avetisyan et al. [7] used multiple regression analyses and found that musical rhythm perception abilities predicted rhythmic grouping preferences in speech in adults with and without dyslexia. Similarly, in an EEG study, Fotidzis et al. [8] found that musical rhythmic skills predicted children's neural sensitivity to mismatches between the speech rhythm of a written word and an auditory rhythm. Interestingly, both studies further report connections between rhythm perception in music and reading skills. Hence, these findings not only speak for a common cross-domain basis of rhythmic processes in M&L but also suggest that deficient or enhanced rhythmic abilities may have an impact on higher-level language functions.

Potential downstream effects of general rhythmic processing skills on higher-order linguistic abilities are currently being extensively investigated, particularly in the context of first language acquisition (for a recent review, see [24]). Accordingly, several studies in this Special Issue probe whether the acoustic properties of speech rhythm can serve as scaffolding for the acquisition of stable phonological representations [12], for the segmentation of words from continuous speech and the construction of lexical representations [13], for the recognition of syntactic units in sentences [10] and for reading [7,8,11]. For example, Richards and Goswami [10] explain that prosody, particularly the hierarchical structuring of stressed and unstressed syllables, provides reliable cues to the syntactic structure of speech [25] and can hence facilitate learning of syntactic language organization [26]. Early perturbations at this rhythm-syntax interface may, in turn, hinder normal language acquisition, such as in developmental language disorders (DLD). The authors found that children with DLD indeed had difficulties in noticing conflicting alignments between prosodic and syntactic boundaries in rhythmic children's stories, and that these deficits coincided with elevated perceptual thresholds for acoustic cues to prosodic stress. With these data at hand, Richards and Goswami support the assertion that basic processing of rhythmic-prosodic cues may be a key foundation onto which higher aspects of language are scaffolded during development.

In a similar vein, rhythmic-prosodic sensitivity has been proposed as a fundamental stepping stone into literacy [27–29] as well as an implicit driver for skilled reading [30]. Breen et al. [11] and Fotidzis et al. [8] present converging EEG evidence for implicit rhythmic processing in silent reading of words in literate adults and children. In particular, they both found a robust fronto-central negativity in response to stress patterns in written words that mismatched the rhythm of silently read limericks [11] or auditory click trains [8]. These results suggest that rhythmic context—no matter whether implicit in written text or explicit in sound—can induce expectations of prosodic word stress that facilitate visual word recognition and reading speed.

Current neurophysiological models assume that speech and music processing as well as the catalytic role of rhythm in language development are based on the synchronization of internal neuronal oscillations with temporally regular stimuli [27,31–33]. The review article by Myers et al. [15] summarizes the current state of knowledge about neuronal entrainment to the speech envelope reflecting quasi-regular amplitude fluctuations over time. This neural tracking occurs simultaneously at multiple time scales corresponding to the rates of phonemes, syllables and phrases [34,35]. In this context, Myers and colleagues argue that the slowest rate—corresponding to prosodic stress and rhythmic pacing in the delta range (~2 Hz)—constitutes a particularly strong source of neuronal entrainment which is crucial for normal language development. Correspondingly, atypical entrainment to rhythmic prosodic cues due to deficits in fine-grained auditory perception may constitute a risk for the development of speech and language disorders such as DLD and developmental dyslexia (DD) [24,36].
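To make the notion of envelope tracking concrete, the following minimal sketch (our own illustration, not code from any of the cited studies; signal names, sampling rate and frequency band are assumptions) extracts the amplitude envelope of a speech signal and quantifies delta-band phase alignment between that envelope and a single EEG channel as a phase-locking value.

```python
# Illustrative sketch (not from the cited studies): delta-band phase alignment
# between a speech amplitude envelope and one EEG channel (phase-locking value).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, fs, lo, hi, order=4):
    """Zero-phase Butterworth band-pass filter."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def delta_plv(speech, eeg, fs, band=(1.0, 3.0)):
    """Phase-locking value between the speech envelope and EEG,
    both band-passed around the prosodic/delta rate (~2 Hz)."""
    envelope = np.abs(hilbert(speech))          # broadband amplitude envelope
    env_delta = bandpass(envelope, fs, *band)   # slow, prosody-rate fluctuations
    eeg_delta = bandpass(eeg, fs, *band)
    dphi = np.angle(hilbert(env_delta)) - np.angle(hilbert(eeg_delta))
    return np.abs(np.mean(np.exp(1j * dphi)))   # 1 = perfect phase alignment

# Toy usage with synthetic signals (hypothetical data):
fs = 250                                        # assumed sampling rate in Hz
t = np.arange(0, 30, 1 / fs)
speech = (1 + 0.5 * np.sin(2 * np.pi * 2 * t)) * np.random.randn(t.size)
eeg = np.sin(2 * np.pi * 2 * t + 0.3) + 0.5 * np.random.randn(t.size)
print(f"delta-band PLV: {delta_plv(speech, eeg, fs):.2f}")
```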

If rhythmic processing disabilities are indeed the basis of speech and language disorders, then useful avenues for prevention and intervention could lie in (i) increasing the regularity of stimuli, or (ii) strengthening individual rhythmic abilities with the aim of improving neuronal entrainment [37–39]. Several studies in this Special Issue deal directly or indirectly with these ideas, either by exploring processing benefits of rhythmically highly regular stimuli such as songs [13,14] or poems [10,11], or by discussing potential protective or curative effects of music-based rhythm training on language skills [7,8,10,12,15,16]. Even though the results are promising, they also raise a number of questions. For example, using EEG, Snijders et al. [13] found that 10-month-old infants were able to segment words in natural children's songs. However, they did equally well in infant-directed speech. Similarly, Rossi et al. [14] found no differences between speech and songs in a combined EEG-NIRS study on semantic processing in healthy adults. Taken together, these data suggest that the presentation of verbal material as song may not be sufficient to enhance vocabulary learning or language comprehension in healthy individuals (but see [40]). The longitudinal study of Frey et al. [12] zoomed in on training effects. Using EEG, the authors demonstrate that 6 months of music but not painting training positively influenced the pre-attentive processing of voice onset time in speech in children with DD. However, no effects were found in behavioral measures of phonological processing or reading ability. This raises the questions of how much training is required and which aspects the training should include to translate to behavior, both inside and outside the laboratory setting. Clearly, the identification of optimal interventions is a joint mission for future research that goes hand in hand with the development of solid conceptual [41,42] and neurophysiological frameworks [27] to identify the key variables underlying the amelioration of speech and language processing through rhythm and music [43–46].

The studies of this Special Issue introduced so far primarily focused on links between M&L that are bottom-up driven by shared acoustic features between the two domains. The remaining articles took a different approach and examined domain-general top-down modulations of M&L from the perspectives of both perception and production. Four articles illustrate the continuous interaction between bottom-up and top-down processes. In line with significant trends in predictive coding [47,48], Daikoku [16] reviews the conceptual, computational, experimental and neural similarities of statistical learning in M&L acquisition and perception with links to rehabilitation. Bidirectional interactions between perceptual (bottom-up) and predictive (top-down) processes are a core feature in the framework of statistical learning. Experimental evidence for the top-down adjustment of M&L perception is provided by the behavioral modelling study of Silva et al. [17] who found that listeners placed break patterns in ambiguous speech-song stimuli differently depending on whether they believed they were listening to speech prosody or contemporary music. Similarly, the fMRI study of Tsai and Li [18] found that the strength with which an ambiguous stimulus was perceived as song rather than speech depended not only on the acoustics of the stimulus itself, but also on the sound category of the preceding stimulus. Finally, Mathias et al. [21] show with EEG that pianists gradually anticipated the sounds of their actions during music production, similar to mechanisms of auditory feedback control during speech production [49,50]. Taken together, these studies suggest that the listening context, one's own motor plans as well as statistical and domain-specific expectations may influence the top-down anticipation and perception of acoustic features in speech and music.
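As a deliberately minimal illustration of the statistical-learning idea referenced above: segmentation models typically track transitional probabilities between adjacent elements, which drop at word or phrase boundaries. The sketch below is our own toy example with made-up syllables, not a reproduction of the computational work reviewed in [16].

```python
# Toy illustration of statistical learning via forward transitional probabilities:
# P(next | current) is high within a "word" and lower across word boundaries.
import random
from collections import Counter, defaultdict

def transitional_probabilities(stream):
    """Estimate P(B | A) for every adjacent pair A -> B in the stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    tp = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        tp[a][b] = n / first_counts[a]
    return tp

# Hypothetical syllable stream built from two tri-syllabic "words" in random order.
words = [("ba", "ku", "ti"), ("go", "la", "bu")]
stream = [syllable for _ in range(200) for syllable in random.choice(words)]
tp = transitional_probabilities(stream)
print(tp["ba"]["ku"])            # ~1.0: within-word transition
print(tp["ti"].get("go", 0.0))   # ~0.5: word boundary, next word is unpredictable
```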

Finally, the last two articles focus on the relevance of domain-general cognitive functions for M&L interactions. Lee et al. [19] argue that well-known syntax interference effects between M&L [51,52] may emerge from shared domain-general attentional resources. Accordingly, they show that the top-down allocation of attention similarly modulated EEG markers of syntax processing in M&L, particularly at late processing stages associated with cognitive reanalysis and integration. In addition, Christiner and Reiterer [20] found that links between musical aptitude and phonetic language abilities in pre-school children (i.e., imitation of foreign speech) were mediated by domain-general working memory resources. While none of these studies denies auditory-perceptual connections between M&L, they remind us that what we have seen so far is perhaps only the tip of the iceberg, with more complex entwinements still to be discovered.

To sum up, this Special Issue indicates that questions have shifted from mapping to mechanisms. Initial descriptions of M&L analogies have turned into a determined search for explanations of M&L links in human neurophysiology, general perceptual principles and cognitive computations. Accordingly, the obvious next questions are of a mechanistic nature: Can musical training enhance the neuronal entrainment to speech (and vice versa)? How exactly does entrainment promote higher-order linguistic functions? How can working memory and attention be included in the equation? These are only a few questions, but we are confident that the joint efforts of this multidisciplinary field of research will be rewarded by a better understanding of the M&L interface and the necessary tools to optimize interventions for music- and language-related dysfunctions.

**Author Contributions:** D.S. and S.E. edited this Special Issue, D.S. wrote the original draft of the editorial, S.E. and D.S. revised the editorial. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Review* **Preconceptual Spectral and Temporal Cues as a Source of Meaning in Speech and Music**

#### **Mark Reybrouck 1,2,\* and Piotr Podlipniak 3**


Received: 23 January 2019; Accepted: 26 February 2019; Published: 1 March 2019

**Abstract:** This paper explores the importance of preconceptual meaning in speech and music, stressing the role of affective vocalizations as a common ancestral instrument in communicative interactions. Speech and music are sensory-rich stimuli, both at the level of production and perception, which involve different body channels, mainly the face and the voice. However, this bimodal approach has been challenged as being too restrictive. A broader conception argues for an action-oriented embodied approach that stresses the reciprocity between multisensory processing and articulatory-motor routines. There is, however, a distinction between language and music, with the latter being largely unable to function referentially. Contrary to the centrifugal tendency of language to direct the attention of the receiver away from the text or speech proper, music is centripetal in directing the listener's attention to the auditory material itself. Sound, therefore, can be considered as the meeting point between speech and music and the question can be raised as to the shared components between the interpretation of sound in the domain of speech and music. In order to answer these questions, this paper elaborates on the following topics: (i) The relationship between speech and music with a special focus on early vocalizations in humans and non-human primates; (ii) the transition from sound to meaning in speech and music; (iii) the role of emotion and affect in early sound processing; (iv) vocalizations and nonverbal affect bursts in communicative sound comprehension; and (v) the acoustic features of affective sound with a special emphasis on temporal and spectrographic cues as parts of speech prosody and musical expressiveness.

**Keywords:** preconceptual meaning; affective vocalizations; action-oriented embodied approach; affect burst; speech prosody; musical expressiveness

#### **1. Introduction**

The problem of meaning extraction in speech and music has received considerable attention in different fields, such as infant-directed speech and singing, the origins of music perception and cognition, and the primary use of acoustic cues in emotion-driven and affect-laden preverbal communication. This kind of research saw its heyday in the 1990s with major contributions in the field of early music perception [1,2] and preference. Many efforts have been directed towards the study of *motherese* and *infant-directed speech* and *singing* [3–7], the acoustic basis of young children's preference for such kinds of vocal communication [8,9], the musical elements in early affective communication between newborns and caregivers [10–12], and the role of prosodic features in preverbal and early musical communication [13–15]. Most of this research has stressed the extreme sensitivity of young infants to acoustic features of speech and music [16] as well as the existence of early musical predispositions [17–19].

Many of these studies—and also some subsequent ones—have emphasized certain commonalities between language and music [19–21], most of them being related to the *prosodic* and *paralinguistic features* of language, which can be considered as being musical to some extent. In addition, there is ample empirical evidence that both music and language can prime the meaning of a word and that music meaning is represented in a very similar fashion to language meaning in the human brain [22,23]. This observation suggests that propositional semantics that is specific solely to language can be based on broader meaning categories, which are less precise but not language-specific.

Recent developments have provided additional evidence from the domains of *comparative* [24,25] and *evolutionary musicology* [26–33], which deal with the evolutionary origins of music by adopting a comparative approach to vocal communication in animals and an evolutionary psychological approach to the emergence of music in the hominin line [34]. These approaches make it possible to tease apart those processes that appear to be innate from those that develop with maturation or acculturation [35]. The animal research provides insights into the role of acoustic cues in nonverbal and preverbal communication [36,37], which are related to affective speech and which can be considered emotional antecedents of music and language [15]. Some of these findings seem to corroborate, to some extent, Darwin's hypothesis on musical protolanguage, which stated that speech and music originated from a common precursor that developed from "the imitation and modification of various natural sounds, the voices of other animals, and man's own instinctive cries" [38] (p. 3). Such a primitive system would have been especially useful in the expression of emotion, and music as we know it nowadays would be a behavioral remnant of this early system of communication (see also [39–41]). It is a hypothesis which has been elaborated and restated by modern researchers under the umbrella of the musical protolanguage hypothesis [20,24,40,42,43].

#### **2. Meaning Before Language and Music**

Meaning can be considered as the result of the interpretation of stimuli by the nervous system. Such interpretation is often described in terms of internal mental representations that animals have of the things, events, and situations in their environment, and it is evolutionarily older than their corresponding expressions in language [44] (pp. 4–6). From a phylogenetic perspective, there are three major kinds of meaning which have evolved over time: Meaning as a way to orient oneself in the environment [45], emotions and emotional communication as integral decision mechanisms [46] and as motivational states which are meaningful for survival by providing also a primordial way of interpretation of the external world [47], and referential meaning as the outcome of the appearance of a conceptual mind [48]. Meaning, moreover, can be considered as a basis for communication, using sound as the main source of information as is the case in primate and human vocalizations [49,50]. The latter, however, should not be identified solely with speech and singing, which can be contrasted clearly with nonverbal utterances, such as laughter, crying, and mourning. As such, there is a hierarchy in the kind of meaning that is conveyed by sound: There is, first, a distinction between the digital (speech and music) and analog (prosody, laughter, etc.) usage of sound [51]; there is, second, a hierarchical distinction between preconceptual and conceptual meaning with a first level of simple spectral and temporal cues as a source of reflexes and a second level of expressive dynamics and emotional communication [51,52]; there is, third, a level of meaning that is conveyed by means of syntax, such as, e.g., tonality in music and grammatical correctness in speech [53]; and finally, there is a fourth level of propositional semantics and associative meaning [48], such as language's lexicons and, most probably, chimpanzees' referential grunts [54].

Speech is closely related to vocal production and can be studied from a broader stance, including the articulatory, linguistic, and information-conveying point of view. The *articulatory approach* describes lexical units in terms of gestures that are characterizations of discrete, physical events that unfold during the speech production process [55]. They can be considered basic units of articulatory action, allowing phonology to be described as a set of relations among "phonological events" [56]. These basic units of articulatory routines are discrete gestures, which emerge pre-linguistically as early and gross versions of their adult use [55], calling forth the *linguistic level* of speech processing with articulatory routines that gradually develop into higher-level phonological units that can be contrasted with each other [57]. Linguistic meaning, however, is discrete-digital rather than analog-continuous. It relies on propositional knowledge without direct coupling with the speech signals—as sounding and thus sensory phenomena—and combines referential meaning with particular sound patterns that function as vehicles to convey symbolic meaning. Such a "vehicle mode" of meaning involves referential meaning, which is a representational mode of conveying information, as against the "acoustic mode", which refers merely to the local modulations in sound that are involved in expressive communication of meanings [58].

Speech, as opposed to language as a system, is articulated in real time. As such, it is a sensory-rich stimulus. It provides information across multiple modalities, combining both the auditory and visual modalities, as exemplified most typically in the facial expression of audio-visual emotional speech. The latter, together with prosody, cannot be reduced to the control of voice qualities alone, but is closely related to the integration of sensory modalities—with facial and vocal expressions reinforcing each other [13]—and even with the movements of the body [59]. Much research on emotional speech (see e.g., [60]), however, has been oriented rather narrowly to facial expressions since it has been hypothesized over a long period of time that judges are more accurate in inferring distinct emotions from facial expressions than from vocal ones. Acoustic cues, on the other hand, have been considered merely as additional features to facial expression, marking only levels of physiological arousal that are less distinctive than those expressed by the face. This conclusion, however, has proved to be erroneous since previous studies have examined only a limited number of acoustic cues, and the arousal differences within emotion families have also been largely neglected [61]. This has been shown in recent studies that used a comprehensive path model of vocal emotion communication, encompassing encoding, transmission, and decoding processes [62,63] to empirically model data sets on emotional expression and recognition from two different cultures and languages. Results of their extended Brunswikian "lens model" [64]—lens equations, hierarchical regression, and multivariate path analysis—all reflect the strong evidence from past work on the role of arousal in affective communication that vocal sounds primarily convey the arousal state of the sender. It was stated that the "voice is the privileged modality for the expression and communication of arousal and activation, whereas the face is vastly superior with respect to valence" [62] (p. 24).
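For orientation, the core of the Brunswikian framework can be written as the classic lens model equation. This is a standard textbook decomposition rather than a formula quoted from the cited studies [62,63], which use extended versions of it; it is included here only as background.

```latex
% Classic lens model equation (standard form; symbols defined below).
% r_a : achievement (correlation between the sender's state and the listener's judgment)
% R_e : ecological validity (how well the acoustic cues predict the sender's state)
% R_s : cue utilization (how consistently the listener's judgments follow those cues)
% G   : matching of the two linear models; C : correlation of their residuals
\[
  r_a \;=\; G\,R_e\,R_s \;+\; C\,\sqrt{1 - R_e^{2}}\,\sqrt{1 - R_s^{2}}
\]
```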

Additional evidence comes from studies of infants' reactions to parental communicative signals, which have stressed their outstanding discriminative abilities for timing patterns, pitch, loudness, harmonic interval, and voice quality [65]. It seems, moreover, that newborns are very sensitive also to facial expressions, vocalizations, and hand movements, which they can imitate to some extent. Such a kind of *communicative musicality*, as it has been coined [11,66], shows children's awareness of human communicative signals. It is a faculty which is comprehensive, multimodal, and coherent at birth and in the first months after birth [67]. It stresses the conflation of perceptual and motor aspects in speech recognition and vocal expression, bringing together audio-visual, visual-motor, and audio-motor integration.

Music is related to this preverbal communicative expressivity. It precedes or bypasses verbal communication by stressing the sensory richness of the stimuli. As such, it is directed primarily to itself with meaning being self-referential rather than referring to something else. Contrary to the centrifugal tendency of linguistic meaning, where the attention is directed away from the text proper to grasp the meaning of what is referred to, music has a centripetal tendency in directing the listener's attention to the auditory material of the sounding music itself [68,69]. As such, there seems to be a major distinction between language and music, though there are also some commonalities, which stress a number of shared components. This applies in particular to vocal music and its communicative possibilities.

Music, seen from an evolutionary point of view, is one of the most ancient forms of human communication, with the human voice being probably the most ancestral instrument in human music [70]. It can even be questioned in this regard whether music and speech are different, and if so, to what extent [71]. There are two ways to address this question, either by intraspecies or interspecies comparison. An example of the former is the study of para-musical elements in language and para-lingual elements in music [72], such as the use of lexical tone in tone languages and prosody (para-musical) or the use of Leitmotive in music (para-lingual) [20]. Also, the languages based on musical tone systems, such as drum and whistle languages, can be studied in this context [73]. The interspecies comparison, on the other hand, is still more challenging and embraces a very extensive body of research. It has been hypothesized, for example, that singing could have evolved from loud calls by nonhuman primates, such as the Old-World monkeys and apes, which have been considered to be possible precursors of human singing and music. Gibbons, in particular, use vocalizations that elicit emotional responses from human listeners by using acoustic characteristics, such as loudness, acceleration of note rhythm, a final slow-down in rhythm, sounds consisting of alternated exhalation and inhalation, higher pitch frequencies in the central section of the call, pure tone of notes, and frequent accompaniment with piloerection and locomotor displays [36]. All these elements, however, are also used to a different degree in speech.

As such, there is an ability to communicate by means of sounds that touches on an evolutionarily old layer of sound communication, which is older than singing and speech. This level is involved in the development of functional sensitivity to a specific class of sounds in ancestral vertebrates both as an aid in identifying and localizing predators and for capturing prey [74]. It is exemplified most typically in the use of alarm calls, which can be considered as a class of punctuate sounds that tend to be short, with sharp and abrupt signal onset, dramatic frequency and amplitude fluctuations, and a chaotic broadband spectral content. There is also a broad class of vocalizations that has been labeled "squeaks, shrieks, and screams" and which has a direct impact on animal perception [75]. Their specific designs make them stand out against background noise so as to make them easy to localize. Moreover, they may provoke immediate orienting reactions by other animals in the direction of the calls, in combination with reflexive movements that prepare for flight [76]. Such generalized startle responses are also induced in very young infants, even in the absence of significant previous experience. They are, in fact, reducible to the operation of low-level brainstem and subcortical processes, which are associated with sound localization, orienting, and autonomic responding [77,78]. These vocalizations, however, can be exemplary of an intentional, communicative use of sounds which differ functionally from simple auditory sensations, which are prelinguistic default labels of sound sources [79], such as the sensation of loudness and low pitch as a tag of a big animal.

Such vocalizations by animals are not gratuitous. They are used frequently by youngsters as an opportunity to influence the behavior of older and larger individuals by engaging their attention, arousal, and concomitant behavior, sometimes in a very compelling way [80]. It can be questioned, however, whether primates have a theory of mind or act intentionally to influence others. A tentative answer can be found in comparable research in humans into the neurocognitive mechanisms (auditory prosodic activations) that allow listeners to read the intentions of speakers from vocal prosodic patterns, and which illustrates their anchoring at the interface between auditory and social cognition, involving the cooperation of distributed auditory prosodic, sociocognitive, and cingulo-opercular brain areas [81].

These attention-capturing sounds in animals are often characterized by loud protracted bouts of harsh and variable vocalizations, which include rapidly varying combinations of loud, noisy screams and piercing high-frequency tonal cries, with dramatic amplitude and frequency modulations, which together are able to increase the arousal state of the mother, including human ones [74,82]. It has been shown, moreover, that screaming is one of the most relevant communication signals in humans for survival. By using a recently developed, neurally informed characterization of sounds (the modulation power spectrum; see [83,84]), it has been demonstrated that human screams cluster within a rather restricted portion of the acoustic space between about 30 and 150 Hz, which corresponds to the perceptual attribute of roughness. This acoustic roughness has also been found to engage subcortical structures, which are critical to the rapid appraisal of danger [85].
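As a rough, simplified illustration of what such an analysis measures (our own sketch; the cited work [83,84] computes a full joint spectro-temporal modulation power spectrum, whereas this example only considers temporal amplitude modulation), one can estimate how much of a sound's envelope-modulation energy falls into the 30–150 Hz "roughness" band:

```python
# Simplified illustration (not the full modulation power spectrum of [83,84]):
# fraction of temporal amplitude-modulation energy in the 30-150 Hz "roughness" band.
import numpy as np
from scipy.signal import hilbert

def roughness_band_fraction(x, fs, band=(30.0, 150.0)):
    """Share of envelope-modulation power falling in `band` (Hz)."""
    envelope = np.abs(hilbert(x))              # amplitude envelope of the sound
    envelope = envelope - envelope.mean()      # remove DC before the FFT
    spectrum = np.abs(np.fft.rfft(envelope)) ** 2
    freqs = np.fft.rfftfreq(envelope.size, d=1 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return spectrum[in_band].sum() / spectrum.sum()

# Toy comparison on synthetic stimuli (purely illustrative):
fs = 16000
t = np.arange(0, 1.0, 1 / fs)
scream_like = (1 + np.sign(np.sin(2 * np.pi * 70 * t))) * np.sin(2 * np.pi * 800 * t)
speech_like = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 800 * t)
print(roughness_band_fraction(scream_like, fs))   # high: strong ~70 Hz modulation
print(roughness_band_fraction(speech_like, fs))   # low: modulation mostly at ~4 Hz
```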

The vocal repertoire of most primate species, however, is not limited to these attention-capturing sounds. There is also an additional class of sounds, which are referred to as "sonants and gruffs" and which may be considered as structural opposites of these arousal-increasing sounds [74]. Instead of being unpatterned and chaotic, they are tonal and harmonically rich, with a more diffuse regularly patterned broadband spectral structure. Rather than having a direct impact on the listener's arousal and affect, they seem to carry less inherent affective force. Their richly structured spectra, moreover, make them well suited for revealing clear cues to the caller's identity since their individual idiosyncrasies impart individually distinctive voice cues that are associated either with the dynamic action of the vocal folds or with the resonance properties of the vocal tract cavities [86,87]. Chimpanzees, likewise, are able to intentionally use grunts as referential calls and to learn new calls from other individuals [54], which represents most probably an early stage of the evolution of lexical meaning (but see [88]). However, although the monkeys' vocal tract is ready to generate speech sounds [89], language and music seem to necessitate more elaborate neural processing mechanisms and vocal control [46].

#### **3. Affective Sounds and Vocalizations**

Speech—at least in its most primitive appearance—and music seem to share a common affective substrate. Studying emotional communication by means of speech and music, therefore, can benefit from a thorough investigation of their underlying mechanisms. One field of research that has been particularly fruitful in this regard has been the study of auditory affective processing that was conducted in the context of *speech prosody* [13]. It has been argued, in fact, that two separate neuroanatomic channels with different phylogenetic histories participate in human acoustic communication to support either nonverbal affective vocalization or articulate speech [90,91]. This *dual-pathway model* of human acoustic communication clearly distinguishes the propositional and emotional contents of spoken language, which rely on channels that are seated in separate brain networks that create different data structures, which are known as analogue versus digital (see below). Both channels, however, must coordinate to some extent, but the functional mechanisms and neuroanatomic pathways underlying their intertwined integration are still not totally clear [92].

Affective prosody, further, is opposed to the discrete coding of speech, which is used in the case of phonemes, words, and those aspects of music that consist of pitches and durations. Its expressive dynamics can be modelled more effectively by continuous variables, as is the case with emotional gestures that are shared not only by all humans, but also by a broader group of animals, including many taxa of mammals and even other vertebrates [51]. The same dynamics of affective prosody—as an evolutionarily old form of communication—are to be found, in fact, in the prosody of human language and in the vocal expressions of different mammalian species, which could mean that its use in human acoustic communication has deep phylogenetic roots that are present in the vocal communication systems of nonhuman animals as well. Consistent structures, in fact, can be seen in acoustic signals that communicate affective states, such as high-pitched, tonal sounds in expressions of submission and fear, and low, loud, broadband sounds in expressions of threats and aggression. Animal signals may thus have direct effects on listeners. They may not simply provide information about the caller, but may effectively manage or manipulate the behavior of listeners [93] (see also [76]). This *prehuman origin hypothesis* of affective prosody locates its grounding in innate mechanisms, which have a prehuman basis and which are used to discriminate between different emotions, both qualitatively (anger, fear, joy, sadness, boredom, etc.) and quantitatively (affect intensity) [52]. It has been shown, moreover, that there exists a functional dissociation between brain regions that process the quality of acoustically conveyed emotions (orbitofrontal cortex) and those that process the intensity of that emotion (amygdala) [94]. Current research has also revealed a high degree of acoustic flexibility in attention-attracting sounds in nonhuman mammalian species, which points in the direction of more complex acoustic signaling and processing mechanisms [95].

As such, it can be argued that the study of the faculties of language and music can benefit from a comparative approach that includes communication and cognition in humans and nonhuman animals alike [46]. The capacity to learn language, in fact, requires multiple, separable mechanisms, which include the ability to produce, perceive, and learn complex signals as well as to interpret and control them. Some of them seem to have figured already in the common ancestors of both humans and animals, while others evolved later. Relying on comparative data from living animals, therefore, may be definitely helpful to address these issues. Acoustic signaling in humans, in this view, may have roots in the vocal production, auditory perception, and cognitive processing capabilities of nonhuman mammals, and the study of affective prosody, as a shared component of human speech, music, and nonverbal acoustic communication, in particular, may shed some light on the evolutionary roots of human speech and music as well as the evolution of meaning itself. It is important, in this regard, to consider also the role of *iconicity*—the similarity between some aspects of sound and some aspects of meaning—in linking the sound to meaning in language. It should be noted, in fact, that affective prosody is considered a paralinguistic property, which accompanies the semantic meaning arising from the symbolic system of human language. The question of how meaning emerges from symbolic signs, therefore, cannot be fully understood by focusing only on prosodic features of language, which work in parallel to the semantic processing. Here, an iconic relationship between sound and the meaning of words that has traditionally been considered as only a marginal property of language (e.g., onomatopoeia, and to some extent also phonaesthemes, i.e., a phoneme or group of phonemes, which has recognizable semantic associations as the result of appearing in a number of words with similar meanings, such as, e.g., the English onset /sn-/ in snarl, snout, sniff, snuffle), has been assumed to serve as an interface for accomplishing the need to map linguistic form to human experience as a vital part of meaning making. Iconicity, thus, has been shown to play an important role for both phylogenetic language evolution (e.g., [96]) and ontogenetic language development (e.g., [97]). This holds in particular for the correspondences between the sound and meaning of words in the affective domain, termed *affective iconicity* [98], which have been supported by recent empirical results indicating that the specific sound profile of a word can be attributed to a specific affective state, which, in turn, can contribute to the perception of the affective meaning of that word, such as, e.g., whether it designates something positive/negative or arousing/calming [99]. Importantly, the affectivity in the sound of words in a language has been shown to be processed in similar brain regions that are involved in processing other types of affective sounds, such as emotional vocalization and affective prosody [100,101]. In addition, such affective potential in the sound of words is even capable of interacting with higher cognitive processes, such as affective evaluation of the words' meaning [102]. All this suggests that consciously experienced meaning is inferred from a number of cues that reflect a hierarchy of sound processing.

It is possible, further, to conceive of this hierarchy in the processing of sounds, reflecting the evolutionary history of human sound communication from early mammals, showing an extension of the perceivable spectrum of sound frequency related to the evolution of the mammalian ear [103], to primates. Non-human primates and early hominins, for example, are an especially interesting group in which to consider the potential affective influence of vocalizations on listeners. Because of their large brains and their phylogenetic proximity to humans, traditional research has focused mostly on "higher-level" cognitive processes that organize communication in higher primates. Yet they can still rely on the neurophysiological substrates for affective influence, which are very broadly conserved. It is likely, therefore, that affective influence is an important part of the vocal signals of non-human primates [74]. As such, it is possible to conceive of hierarchical levels of affective signaling, starting from the loud calls and vocalizations of early hominids, through prelinguistic affective processing of sound by neonates, to infant-directed speech, affective speech, and even music. The step via onomatopoeia and iconicity, finally, could be added as a last step from affective to referential signaling.

The loud calls of *early hominins* are exemplified most typically in a broad class of vocalizations with acoustic features that have direct impact on animal perception, as mentioned already above: Sharp signal onsets, dramatic frequency and amplitude fluctuations, and chaotic spectral structures [104]. *Neonates* are another interesting group for the study of prelinguistic affective processing of sound. They have been shown to possess complex endowments for perceiving and stimulating parental communicative signals by discriminating timing patterns, pitch, loudness, harmonic interval, and voice quality [65]. They also seem to react to the human voice and display imitations of facial expressions, vocalizations, and hand movements, showing an awareness of human signals that is already comprehensive, multimodal, and coherent at birth [67]. As a result, people all over the world have capitalized on this sensitivity by developing *infant-directed speech* or *motherese* (see below), which is obviously simpler than adult speech, and which involves exaggerated prosodic features, such as wider excursions of voice pitch, more variable amplitude, tempo, and delivery, and more varied patterns of word stress [74]. All these features have been the subject of research on auditory affective processing, which has been conducted mainly in the context of speech prosody, which has also been called the "third element of language" [105]. Vocal emotion perception in speech, further, has been studied by using test materials consisting of speech, spoken with various emotional tones by actors, and nonverbal interjections or *affect bursts*, such as laughter or screams of fear (see [106] for an overview). These vocal expressions, which usually accompany intense emotional feelings, along with the corresponding facial expressions, are closely related to *animal affect vocalizations* [107], which can be defined as short, emotional non-speech expressions, which comprise both clear non-speech sounds (e.g., laughter) and interjections with a phonemic structure (e.g., 'Wow'), but which exclude verbal interjections that can occur as a different part of speech (like 'Heaven', 'No', etc.) [108].

These nonverbal affect bursts have proven to be useful for the study of meaning. They provide an interesting class of affective sounds, which have been collected in validated sets of auditory stimuli—such as the Montreal Affective Voices (MAV) [106] and the Musical Emotional Bursts (MEB) for musical equivalents [109]. Using nonverbal sounds, moreover, presents several advantages over verbal ones: The stimuli do not contain semantic information, there are no linguistic barriers, the expression of emotion is more primitive and closer to the affect expressions of animals or human babies, and they are more similar to the Ekman faces [110] used in the visual modality than emotional speech. As such, they avoid possible interactions between affective and semantic content, they can be used for the study of cross-cultural differences, and they allow better comparisons across modalities, as well as studies of cross-modal emotional integration [106].

Affect bursts, however, are limited in their semantic content, but are able to communicate by sound [51,111]. Being evolutionarily older than singing and speech, they have been considered as their precursors to some extent. Singing is one of the interesting ways of sound expression, which goes beyond the transmission of semantic information. It can be questioned, however, whether every kind of music—as an evolved and cultural product—exploits such pre-existing perceptual sensitivities, which originally evolved thanks to a variety of auditory functions, such as navigating sonic environments and communication by means of singing. Cultural evolution, in this regard, has led to increasingly complex and cumulative musical developments through processes of sensory exploitation [112].

#### **4. Calls, Vocalizations, and Human Music: Affectively-Based Sound–Meaning Relationships**

Music has inductive power. It can move listeners emotionally and physically by means of the information-processing mechanisms it engages. The majority of these mechanisms, however, did not evolve as music-specific traits. Some of them are related to the processing of sound that is recognized as being similar to voices, objects that are approaching, or the sounds of animals. As such, this processing seems to involve cognitive processes of attraction and cultural transmission mechanisms that have cumulatively and adaptively shaped an enormous variety of signals for social relationships [112]. Music, in this view, is an inherently social phenomenon, and the same holds true for loud calls of nonhuman primates, especially those of the Old-World monkeys, which, most likely, were the substrate from which singing could evolve [36].

This brings us to the question of the origins of language and music and their mutual relationship. It has been hypothesized, e.g., that language seems to be more related to logic and the human mind, whereas music should be grounded in emotion and the human body (see [113] for an overview). This dichotomous approach has been questioned, however, in the sense that language and music could have evolved from common roots, a common musical protolanguage [24,42]. In particular, the *loud calls* in modern apes and music in modern humans seem to be derived from such a common ancestral form. The calls are believed to serve a variety of functions, such as territorial advertisement, inter-group intimidation and spacing, announcing the precise locality of specific individuals, food sources, or danger, and strengthening intra-group cohesion. The most likely function of early hominin music, on the other hand, was to display and reinforce the unity of a social group toward other groups [36]. This is obvious in vocalizing and gesturing together in time, where the ability to act musically underlies and supports human companionship. It seems likely, moreover, that the elements of communicative musicality are necessary for joint human expressiveness to arise and that they underlie all human communication [11,66].

As such, it seems that a major ancestral function of calls, protolanguage, and music may be related to several kinds of signaling, attention capturing, affective influence, and group cohesion rather than conveying propositional knowledge that is related to higher-level cognitive processes that are involved in the communication of contemporary humans. This brings us to the role of *affective semantics*, as the domain that studies semantic constructs that are grounded in the perceptual-affective impacts of sound structure [74]. Empirical grounding for that kind of signaling has been provided by a typical class of primate vocalizations, which are known as *referential emotive vocalizations* [58] and separation calls [114]. There are, in fact, a number of important affective effects of sounds and vocalizations, such as, e.g., attention-capturing mechanisms, which are used also in speech directed to young infants with the function of focusing and maintaining attention and modulating arousal by using dramatic frequency variations. As such, there is a whole domain of acoustic signals which goes beyond the lexico-semantic level of communication and which is shared between humans and non-human animals. There are, as such, acoustic attributes of aroused vocalizations which are shared across many mammalian species and which humans can use also to infer emotional content. Humans, as a rule, use multiple acoustic parameters to infer relative arousal in vocalizations, but they mainly rely on the fundamental frequency and spectral centre of gravity to identify higher arousal vocalizations across animal species, thus suggesting the existence of fundamental mechanisms of vocal expressions that are shared among vertebrates, and which could represent a homologous signaling system [115].
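To make the two acoustic parameters named above tangible, here is a minimal sketch (our own illustration with hypothetical parameter values, not taken from [115]) that computes a crude fundamental-frequency estimate and the spectral centre of gravity for a short voiced frame:

```python
# Illustrative extraction of the two cues named above (our own sketch):
# fundamental frequency (F0, via autocorrelation) and spectral centre of gravity.
import numpy as np

def spectral_centroid(x, fs):
    """Amplitude-weighted mean frequency of the magnitude spectrum (Hz)."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    return float(np.sum(freqs * mag) / np.sum(mag))

def estimate_f0(x, fs, fmin=75.0, fmax=600.0):
    """Crude autocorrelation-based F0 estimate (Hz) for a voiced frame."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[x.size - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

# Toy usage on a synthetic "vocalization" frame (hypothetical parameters):
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
frame = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 6))
print(estimate_f0(frame, fs), spectral_centroid(frame, fs))
```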

Such core affective effects of vocal signals may be functional. Yet they do not undercut the role of cognition and the possibility of more complex communicative processes and outcomes, such as speech communication in people. The latter can be seen as a refinement of phylogenetically older vocal production and perception abilities that are shared with non-human animals [91]. These abilities may scaffold, in part, an increasing communicative complexity, which means that at least some of the semantic complexity of human language might capitalize on affectively-based sound–meaning relationships. It is probable, therefore, that evolutionarily older ways of interpreting acoustical cues can be involved in the construction of more complex meaning. Such preprepared or early acquired sound–sense relationships represent a form of intrinsic or original meaning that provides a natural foundation from which increasingly complex semantic systems may be constructed, both developmentally and evolutionarily. This foundation can explain the universal tendency first observed by Köhler [116] (pp. 224-225) to associate pseudowords, such as *takete* or *kiki*, with spiky shapes whereas *malumba* or *bouba* are associated with round shapes [117]. It has been shown, moreover, that the communicative importance of the affective influence of vocal signals does not disappear when brains get larger and their potential for cognitive, evaluative control of behavior increases. It is likely, therefore, that complex communicative processes exploit and build on the phylogenetically-ancient and widespread affective effects of vocal signals [74] (p. 183).

#### **5. Sound Communication, Emotion, and Affective Speech**

Sounds can have a considerable affective effect on listeners and this holds true also for non-human animals that use many of their vocal signals precisely to exert these effects. There is, as such, a relationship between the acoustic structure in animal signals and the communicative purposes they serve [74,112]. This is obvious in vocalizations of non-human primates, which bear the mark of design for direct effects on the listener's affect and behavior, as exemplified most typically in alarm vocalizations that are produced during encounters with predators [91]. These alarm calls tend to be short, broadband calls, with an abrupt onset, standing out against background noise, thus being easy to localize. As such, they display acoustic features for capturing and manipulating the attention and arousal in listeners. They were studied as early as the 1970s in the context of agonistic vocalizations that are involved in confrontations or competitions with others. Among their most important features is a low fundamental frequency (F0) and a tendency towards aperiodicity, with a possible explanation being that low, broadband sounds with a wide frequency range are often tied to body size and hostile intent. Such sounds, presumably, can induce fear in the receivers. High-pitched sounds with a tone-like high F0, on the contrary, are related to appeasement and are often produced to reduce fear in listeners [118,119]. This illustrates again how sound is often more important than semantic meaning in animals' signals.

Similar findings have also been reported for humans. Prohibitive utterances across cultures, for example, contain similar acoustic features, such as a fast rising amplitude, lowered pitch, and small repertoires [112]. A more elaborate field of research, however, is the study of *motherese* or *infant-directed speech* [65]. Mothers, as a rule, speak in short bursts and talk in an inviting sing-song manner with the baby occasionally answering back. Young infants, moreover, stimulate their caregivers to a kind of musical or poetic speech, which can move into wordless song with imitative, rhythmic, and repetitive nonsense sounds. Such baby–mother interactions imply communicative interactions, which have also been called "communicative musicality" [11]. They suggest an awareness of human signals which is present at birth, with newborns reacting to the human voice and imitating facial expressions, vocalizations, and hand movements. It means that young infants possess complex endowments for perceiving and stimulating parental communicative signals by discriminating timing patterns, pitch, loudness, harmonic interval, and voice quality [65]. Effective communication, in this view, must be achieved by means other than lexical meaning, grammar, and syntax, with mothers and babies being highly "attuned" to the vocal and physical gestures of the mother. Both seem to explore pitch-space in a methodical manner over short and long intervals of time [11]. This has been reported extensively by the Papoušeks [6,19], who have both stressed the importance of early childhood musical behaviors as forms of play to nurture children's exploratory competence. They have intensively studied infant–caregiver interactions and focused on the musicality of these interactions, stressing the indivisibility of music and movement. It has been found, in fact, that music and movement share a dynamic structure that supports universal expressions of emotion as exemplified in particular in infants' predispositions for perceptual correspondences between music and movement. This ability, further, seems to be made possible by the existence of prototypical emotion-specific dynamic contours, but also by isomorphic structural relationships between music and movement [120].

They found out that the parent's multimodal stimulation is, so to say, tailored to the infant's early competence for perceiving information through different senses and that "regular synchronization of vocal and kinaesthetic patterns provides the infant with multimodal sensory information including tactile, kinaesthetic and visual information." [6] (p. 100). Similar findings have been reported by Trevarthen [121], who has centered on the temporal characteristics of the infant–caregiver interaction. The rhythmicity of this interaction can be described as the capacity of the infant to follow and respond to temporal regularities in vocalization and movement, and to initiate temporally regular sets of vocalizations and movements. What he proposes is a conceptual framework to explore the expression and development of communication or intersubjectivity through empirical observations and analyses of infant–caregiver interaction. It enables the sharing of patterned time with others and facilitates harmonizing the affective state and interaction [27].

As such, there seems to be an evolutionarily old layer of sound communication that exists in speech, but that arouses emotion in singing as well. This happens in a hierarchic order with the evolutionarily older elements being most basic and effective, and those which are acquired in processes of socialization being most subtle and conventional. Primitive affective vocalizations, therefore, are considered as more authentic and more truly felt information than conventional and ritual information [10,122], and a great deal of music is also designed specifically to give rise to these affective effects [74].

#### **6. Sound/Speech Understanding and the Gestural Approach**

Language and music can both be considered communication systems that use sound signals. There is, however, a distinction with respect to their semantics, which can be either lexico-semantic or action-oriented. In language, as well as in music, the vocal or acoustic characteristics may help to convey an impression, but it has been shown that the position of the eyebrows and the facial expression as a whole may have the same function [119]. Many facial gestures, in fact, are part of a multi-modal array of signals, and facial expressions may even influence the acoustic cues of the expression by vocal tract deformation [13].

This brings us to the question of bimodality and audiovisual integration of emotional expressions [123]. Even in visible emotion, for example, the auditory modality can carry strong information, which is not only related to the consequences of the facial gestures [13]. In this context, it is important to recall the musicality of infant–caregiver interactions, with synchronous stimulation that provides continuous multimodal sensory information (see above). This multimodal stimulation, further, entails processes of affective and behavioral resonance, in the sense that the neurophysiological organization of behavior depends on a reciprocal influence between systems that guides the production, perception, interpretation of, and response to the behavior of others, somewhat reminiscent of the discovery of mirror and canonical neuron systems in primate brains [124]. This means that seeing an object or an action performed by someone else can activate the same neurons as when one performs this action oneself. However, the multimodal stimulation can be even stronger. It has been shown, for example, that even if acoustic speech is the main medium for phonetic decoding, some integration with the visual modality cannot be avoided [125]. As such, there is a lot of interest in the role of the co-occurrence of sight and sound, with a special focus on research on emotion effects on voice and speech [61].

Multimodal stimulation entails interactions between individuals, which is obvious in the ability to vocalize and gesture together—as in synchronous chorusing and gesturing—both in humans and nonhuman primates [126]. The ability to act musically and to move sympathetically with each other, accordingly, seems to be the vehicle for carrying emotions from one person to another. It underlies human companionship in the sense that elements of communicative musicality are necessary for joint human expressiveness to arise [11].

Speech, as a later evolutionary development, pays tribute to this interactive, gestural approach. This is a basic claim of articulatory phonology, which states that articulatory gestures and gestural organization can be used to capture both categorical and gradient information [55]. Gestures can be described as events that unfold during speech production and whose consequences can be observed in the movements of the speech articulators. In this view, they are dynamic articulatory structures consisting of the formation and release of constrictions in the vocal tract. As such, they can be described in terms of task-dynamics, which have been used to model different kinds of coordinated multi-articulator actions, such as reaching and speaking. It also means that the same gestural structures may simultaneously characterize phonological properties of the utterance (contrastive units and syntagmatic organization) and physical properties.

#### **7. Sound Comprehension in Speech and Music: Spectral and Temporal Cues**

Articulatory gestures are situated at the productive level of vocal communication. There is, however, also the receptive level, which is related to the recognition of acoustic parameters, such as spectral cues when we discriminate pitch in music [127] and intonation patterns in speech [128]. Sound comprehension, in this view, should be related to the recognition of the acoustic profiles of vocal expression, as exemplified most typically in emotional expression. It has been stated, erroneously, that the voice might only reflect arousal. Recent research, using a larger number of parameters, has shown that spectro-temporal parameters play a major role in differentiating qualitative differences between emotions [129]. This is obvious, for example, in the vocal repertoire of most primate species, with a clear distinction between squeaks, shrieks, and screams, which have a direct impact on the listener's arousal and affect, and sonants and gruffs, whose structured spectra provide an excellent medium for revealing clear cues to the identity of the caller (see above). These highly idiosyncratic cues impart distinctive voice characteristics to the acoustic features of these calls, which are associated with the patterns of dynamic action of the vocal folds or with the resonance properties of the vocal tract cavities [74,87]. Human infants, accordingly, show an impressive acoustic sensitivity, which allows them to discriminate timing patterns, pitch, loudness, harmonic interval, and voice quality [11], with many perceptual biases being in place before articulated speech evolved [112]. Importantly, although all these features depend on acoustic parameters, they are in fact auditory phenomena [79]. This means that the discrimination of vocal cues is the interpretation of sound stimuli by the nervous system, influenced by genetic (both species-specific and shared with other taxa) and environmental (including cultural) factors.

Music as well as speech can be considered dynamic stimuli, with sounds changing continuously across the time of presentation. This means that new sensory information is added serially during sound presentation, with physiological systems that respond to simple changes in the physical stimulus being continuously active. Sounds, moreover, are dynamic and often require an accrual of information over time to be interpreted [130]. The effects of speech and music, therefore, are related in important ways to the information-processing mechanisms they engage. As a result, humans interpret speech and music sounds not only as expressive information, but also as coherent sound structures, which convey the whole package of information. Even at this level, however, both speech and music structures are auditory phenomena which rely to different degrees on acoustical cues. In the case of phoneme recognition [131] and timbre discrimination in music [132], the most important cues are spectro-temporal. Spectral cues, in contrast, are crucial in the discrimination of intonation patterns in speech and pitch class structure in music [127].

The main difference between speech and music in this regard consists in the role particular acoustic cues play in the transmission of meaning. While spectro-temporal cues are crucial for the recognition of words, they seem to be less important as far as musical structure is concerned. This means that spectro-temporal cues evolved in humans as a main source for transmitting lexical meaning. In contrast, spectral cues are important for discrete pitch class discrimination in music—one of the main elements of musical structure—which is devoid of lexical meaning. Nonetheless, spectral cues can contribute to lexical meaning in tone languages, where the relative change of pitch influences the interpretation of word meaning [133]. Even in tone languages, however, lexical meaning is conveyed mainly by means of spectro-temporal cues. Similarly, temporal cues can be used as an additional source of information which influences lexical meaning in "quantity languages", which are sensitive to the duration of segments for the assignment of their meaning [134,135]. It has been shown also that spectral and temporal cues contribute to the signaling of word meaning in non-tonal languages as well [136], with the extent to which these cues are important for the transmission of lexical meaning depending on the particular language.

#### **8. Conclusions and Perspectives**

In this paper, we described the role of preconceptual spectral and temporal cues in sound communication and in the emergence of meaning in speech and music, stressing the role of affective vocalizations as a common ancestral instrument in communicative interactions. In an attempt to search for shared components between speech and music, we have stressed their commonalities by defining speech and music as sensory-rich stimuli. Their experience, moreover, involves different body channels, such as the face and the voice, but this bimodal approach has proven to be too restrictive. It has been argued, therefore, that an action-oriented approach is better suited to describe the reciprocity between multisensory processing and articulatory-motor routines as phonological primitives. As such, a distinction should be made between language and speech, with the latter being more centripetal in directing the attention of the listener to the sounding material itself, whereas language is mainly centrifugal in directing the attention away from the text to function referentially. There are, however, commonalities as well, and the shared component between speech and music is not meaning, but sound. Therefore, to describe the transition from sound to meaning in speech and music systematically, one must stress the role of emotion and affect in early sound processing, the role of vocalizations and nonverbal affect bursts in communicative sound comprehension, and the acoustic features of affective sound, with a special emphasis on temporal and spectral cues as parts of speech prosody and musical expressiveness.

One of the major findings in this regard was a kind of hierarchy in the type of meaning that is conveyed, with a distinction between analog and digital usage of sound. The role of affective prosody seems to be especially important here. As a typical example of analog processing, it goes beyond a mere discrete coding of speech and music, stressing the wider possibilities of sound-signal communication systems rather than relying merely on semantic content and propositional knowledge. As such, there seems to be a major ancestral function of affect bursts, calls, protolanguage, and music, which are related to several kinds of signaling, attention capturing, affective influence, and group cohesion. They hold a place in a developmental continuum at both the phylogenetic and the ontogenetic level.

The view presented thus suggests that meaning in language and music is a complex phenomenon which is composed of hierarchically organized features, which are mostly related to the interpretation of acoustical cues by the nervous system. The bulk of this interpretation, moreover, is processed at an unconscious level. More studies are needed, however, to better understand the role of spectral and temporal cues as sources of information in the complex process of human communication. Inter-species and inter-cultural comparative studies are especially promising in this respect, but equally important are developmental investigations, which together with genetic research can elucidate the interconnection between the environmental and hereditary information in the process of the development of human vocal communication.

**Author Contributions:** The first draft of this article was written by M.R. The final version was prepared jointly by M.R. and P.P.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank the anonymous reviewers. Their critical remarks were very helpful in updating our summary of the currently available research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Poor Synchronization to Musical Beat Generalizes to Speech**†

#### **Marie-Élaine Lagrois 1,2,\*, Caroline Palmer 1,3 and Isabelle Peretz 1,2**


Received: 17 June 2019; Accepted: 1 July 2019; Published: 4 July 2019

**Abstract:** The rhythmic nature of speech may recruit entrainment mechanisms in a manner similar to music. In the current study, we tested the hypothesis that individuals who display a severe deficit in synchronizing their taps to a musical beat (called beat-deaf here) would also experience difficulties entraining to speech. The beat-deaf participants and their matched controls were required to align taps with the perceived regularity in the rhythm of naturally spoken, regularly spoken, and sung sentences. The results showed that beat-deaf individuals synchronized their taps less accurately than the control group across conditions. In addition, participants from both groups exhibited more inter-tap variability to natural speech than to regularly spoken and sung sentences. The findings support the idea that acoustic periodicity is a major factor in domain-general entrainment to both music and speech. Therefore, a beat-finding deficit may affect periodic auditory rhythms in general, not just those for music.

**Keywords:** beat deafness; music; speech; entrainment; sensorimotor synchronization; beat-finding impairment; brain oscillations

#### **1. Introduction**

Music is quite unique in the way it compels us to engage in rhythmic behaviors. Most people will spontaneously nod their heads, tap their feet, or clap their hands when listening to music. In early infancy, children already show spontaneous movements to music [1]. This coupling between movements and music is achieved through entrainment. Entrainment can be broadly defined as the tendency of behavioral and brain responses to synchronize with external rhythmic signals [2,3]. Currently, the predominant models of entrainment are based on the dynamic attending theory (DAT) [2,4–6]. According to this theory, alignment between internal neural oscillators and external rhythms enables listeners to anticipate recurring acoustic events in the signal, allowing for maximum attentional energy to occur at the onset of these events, thus facilitating a response to these events [2]. Multiple internal oscillators that are hierarchically organized in terms of their natural frequency or period are likely involved in this process. Interaction of these oscillators would permit the extraction of regularities in complex rhythms that are periodic or quasi-periodic in nature, such as music [7–9]. Of note, entrainment to rhythms, as modeled by oscillators, would apply not only to music but also to speech [10–18].

The periodicities contained in musical rhythms typically induce the perception of a beat, that is, the sensation of a regular pulsation, on which timed behaviors are built [19]. Simple movements in response to beat perception, like taps, are usually produced within a few tens of milliseconds of the beat onset, indicating the precision of the temporal predictions made about the timing of upcoming beats [20–22]. Listeners can extract the beat from various complex rhythms, without the need for a one-to-one correspondence between acoustic events and beat occurrences [23–27] and across a large range of tempi (~94–174 beats per minute) [20,28–31]. Beat extraction is also robust to moderate tempo fluctuations [8,32,33]. Beat induction from music has in fact been proposed as one of the fundamental and universal traits of music [34,35].

Musical meter, which corresponds to the hierarchical organization of beats, where some beats are perceived as stronger than others, leads to higher-order periodicities of strong and weak beats (for example, a march versus a waltz). Similarly, speech has a hierarchically organized temporal structure, with phonemes, syllables, and prosodic cues, each occurring at different time scales [16,36–38]. As in music, metrical hierarchy in speech may rely on the occurrence of stressed or accented acoustic events, typically associated with syllables [11,17,39–41]. Stress patterns in speech vary and depend on different acoustic cues according to language. The meter of "stress-timed" languages, such as English, is usually clearer than the meter of "syllable-timed" languages like French [14,42]. However, regardless of the language studied, temporal intervals between stressed syllables are not as regular in speech as in music [41,43–46].

Despite this variability in the regularity of stress or beat in spoken language, individuals seem to be able to entrain to speech. Initial evidence in this regard is the finding that the timing of speech can be synchronized with a metronome [11]. Speakers can not only adapt their speech rate to match another speaker [47,48], but they also entrain to each other's syllable rates in conversational turn taking [18,49]. In a prior study using a similar experimental design to the present study [14], French and English monolingual speakers and French–English bilingual speakers were invited to tap their finger along with the beat they perceived in French and English sentences spoken with natural prosody. The variability of intervocalic intervals (IVIs) in these sentences predicted the participants' inter-tap variability, suggesting that the participants were able to entrain to the speech stimuli.

While there is evidence of entrainment to speech, a puzzling difference exists between the absence of synchronous ("choral") speech and the widespread and exquisite synchronization observed in music. To address this issue, Cummins [50,51] proposed that synchronous speech should be possible because (1) speakers of the same language have mastered the association between motor actions and speech sounds of their language, and (2) they share knowledge of speech timing. He supports his claim by showing that speakers can synchronize while reading an unfamiliar text without prior practice, which the author considered an indication of aperiodic synchronization [10,52–54]. According to this perspective, entrainment to speech and music would reflect a fundamental propensity of humans to time their actions with the rhythm of an external event.

Entrainment to speech and music has rarely been compared behaviorally, with few previous studies in this regard. In one of these [55], the influence of music and speech on entrainment was assessed through interference. The main task was to synchronize finger taps to a metronome while hearing highly isochronous computer-generated music or regularly spoken poems. When the metronome tones and the musical beats or stressed syllables were perfectly aligned, higher variability in the asynchronies between taps and metronome was found with the speech distractor compared to the musical one. When misaligned, both music and speech led to synchronization interference by increasing the asynchrony between taps and metronome onsets, and music induced the largest asynchrony. In a second experiment in this study, the stimuli were better matched: songs, either sung with lyrics, sung with a single syllable, or spoken with a regular pace, were presented. In this case, misaligned stimuli had identical detrimental effects on the variability of tapping to the metronome, whether spoken or sung. Therefore, when isochrony is equalized between music and speech, entrainment appears to be very similar.

However, natural speech is typically not isochronous. In a second study comparing music and speech [56], using the same paradigm as the current study, native French and English speakers tapped along with French and English sentences in three conditions: naturally spoken, regularly spoken, and sung with a simple melody. The inter-tap intervals (ITIs) were more variable in the naturally spoken sentences than in the other conditions. The taps were also more closely aligned to the beat (the nearest implied metronome click to which the singer synchronized her renditions of the stimuli) for sung than for regularly spoken sentences. These results show an overall effect of regularity on entrainment, with music being more suitable to elicit entrainment than regular speech.

Here, we tested the same materials as those used by Lidji and collaborators [56] with individuals who have a documented deficit in tracking the beat in music. This disorder is characterized by an inability to synchronize whole-body movements, clapping, or tapping to the beat of music [57–61], to amplitude-modulated noise derived from music [60], and to metronome-like rhythms [62,63]. This beat-finding deficit occurs in the absence of intellectual disability or acquired brain damage. Study of this "beat-deaf" population provides an opportunity to test the domain specificity of entrainment mechanisms. If the beat-finding disorder initially diagnosed with music also disrupts entrainment to speech, then the association will provide evidence for the domain-general nature of entrainment mechanisms to auditory rhythms.

Beat-deaf individuals and matched control participants who did not exhibit a beat processing disorder were asked to tap to spoken and sung sentences. If entrainment abilities are domain-general, then beat-deaf participants should show deficits in adapting their tapping period to the intervocalic period between syllables for all versions of the sentences, compared to the control group. The control group was expected to replicate the findings of [56], showing the largest inter-tap interval variability for natural speech, the next largest for regularly spoken sentences, and the smallest for sung sentences, with more accurate synchronization to the intervocalic period between syllables of sung sentences than of regularly spoken sentences. Alternatively, if entrainment is domain-specific, beat-deaf participants' tapping should be most impaired for sung sentences and unimpaired (meaning similar to the control group) for speech.

#### **2. Materials and Methods**

#### *2.1. Participants*

Thirteen beat-deaf French-speaking adults (10 females) and 13 French-speaking matched control participants (11 females) took part in the study. The groups were matched for age, education, and years of music and dance training (detailed in Table 1). One beat-deaf participant was completing an undergraduate degree in contemporary dance at the time of testing. Accordingly, a trained contemporary dancer was also included in the control group. All participants were non-musicians and had no history of neurological, cognitive, hearing, or motor disorders. In addition, all had normal verbal auditory working memory and non-verbal reasoning abilities, as assessed by the Digit Span and Matrix Reasoning subtests of the WAIS-III (Wechsler Adult Intelligence Scale) [64], with no differences between groups on these measures (*p*-values > 0.34; Table 1). Participants provided written consent to take part in the study and received monetary compensation for their participation. All procedures were approved by the Research Ethics Council for the Faculty of Arts and Sciences at the University of Montreal (CERAS-2014-15-102-D).

#### Procedure Prior to Inclusion of Participants in the Study

Participants in the beat-deaf group had taken part in previous studies in our lab [63,65] and were identified as being unable to synchronize simple movements to the beat of music. Control participants had either taken part in previous studies in the lab or were recruited via online advertisements directed toward Montreal's general population or through on-campus advertisements at the University of Montreal.


**Table 1.** Characteristics of the beat-deaf and control groups.

ss—standard score. a, Scores from 12 beat-deaf and 10 control participants. Some participants did not complete the Matrix Reasoning test because they were Ph.D. students in a clinical neuropsychology program and were too familiar with the test.

Inclusion in the current study was based on performance on the Montreal Beat Alignment Test (M-BAT) [66]. In a beat production task, participants were asked to align taps to the beat of 10 song excerpts from various musical genres. Tempo varied across the excerpts from 82 beats per minute (bpm) to 170 bpm. Each song was presented twice, for a total of 20 trials. Control participants successfully matched the period of their taps to the songs' beat in at least 85% of the trials (M = 96.9%, SD = 5.2%); successful period matching was determined through evaluation of *p*-values on the Rayleigh *z* test of periodicity, with values smaller than 0.05 considered successful. In the beat-deaf group, the average percentage of trials with successful tempo matching was 39.2% (range of mean values: 10–65%, SD = 18.3%). As shown in Figure 1, there was no overlap between the groups' performance on this task, confirming that the participants in the beat-deaf group showed a deficit in synchronizing their taps to the beat of music.

**Figure 1.** Performance of participants in the control and beat-deaf groups in the beat production task of the Montreal Beat Alignment Test (M-BAT). Each dot represents a participant. Boxes correspond to a 95% confidence interval from the mean based on the standard error of the mean (SEM). The black horizontal line within each box indicates the group mean. The vertical lines represent two standard deviations from the mean.

Prior to their participation in the current study, participants completed the online test of amusia to screen for the presence of a musical pitch perception impairment [67]. The online test is composed of three tests: Scale, Off-beat, and Off-key. The Scale test requires the comparison of 30 pairs of melodies that differ by an out-of-key note in half of the trials. The Off-beat and Off-key tests consist of the detection of either an out-of-time or an out-of-key note, respectively. A score lying 2-SD below the mean of a large population on both the Scale and Off-key tests indicates the likely presence of pitch deafness (also called congenital amusia) [67,68]. Based on the data from Peretz and Vuvan [67], a cut-off score of 22 out of 30 was used for the Scale test and 16 out of 24 for the Off-key test. Table 2 indicates the individual scores of beat-deaf participants on the online test. Half of the beat-deaf group scored at or below the cut-off on both the Scale and Off-key tests. As these cases of beat-deaf participants could also be considered pitch-deaf, the influence of musical pitch perception will be taken into account in the analysis and interpretation of the results. All control participants had scores above the 2-SD cut-offs.

**Table 2.** Individual scores of the beat-deaf participants and the group average of their matched controls in the online test of amusia.


Scores in parentheses represent the cut-off scores taken from Peretz and Vuvan [67]. Participants with co-occurring pitch deafness are marked with †.

#### *2.2. Stimulus Materials*

The 12 French sentences used in this experiment were taken from Lidji et al. [56]. Each sentence contained 13 monosyllabic words and was recorded in three conditions as depicted in Figure 2. The recordings were made by a native Québec French/English female speaker in her twenties who had singing training. Recordings were made with a Neumann TLM 103 microphone in a sound-attenuated studio. In the naturally spoken condition, the speaker was asked to speak with a natural prosody (generating a non-periodic pattern of stressed syllables). In the regularly spoken condition, sentences were recorded by the speaker to align every other syllable with the beat of a metronome set to 120 bpm, heard over headphones. In the sung condition, the sentences were sung by the same speaker, again with every other syllable aligned to a metronome at 120 bpm, heard over headphones. Each sung sentence was set to a simple melody, with each syllable aligned with one note of the melody. Twelve unique melodies composed in the Western tonal style in binary meter, in major or minor modes, were taken from Lidji et al. [56]. These melodies were novel to all participants. Although each sentence was paired with two different melodies, participants only heard one melody version of each sung sentence, counterbalanced across participants.

Additional trials for all three conditions (naturally spoken, regularly spoken, sung) were then created from the same utterances at a slower rate (80% of original stimulus rate, i.e., around a tempo of 96 bpm) using the digital audio production software Reaper (v4.611, 2014; time stretch mode 2.28 SOLOIST: speech, Cockos Inc., New York, United States). This ensured that the beat-deaf participants adapted their taps to the rate of each stimulus and could comply with the task requirements. All the stimuli were edited to have a 400 ms silent period before the beginning of the sentence and a 1000 ms silent period at the end of the sentence. Stimuli amplitudes were also equalized in root mean square (RMS) intensity. Preliminary analyses indicated that all participants from both groups adapted the rate of their taps from the original stimulus rate to the slower stimulus rate, with a Group × Material (naturally spoken, regularly spoken, sung) × Tempo (original, slow) ANOVA on mean inter-tap interval (ITI) showing a main effect of Tempo, *F*(1,24) = 383.6, *p* < 0.001, η² = 0.94, with no significant Group × Tempo interaction, *F*(1,24) = 0.0004, *p* = 0.98, or Group × Material × Tempo interaction, *F*(2,48) = 1.91, *p* = 0.17 (mean ITI results are detailed in Table 3). Therefore, the data obtained for the slower stimuli are not reported here for simplicity.

**Figure 2.** Example of a sentence in the naturally spoken, regularly spoken, and sung conditions. IVI refers to the intervocalic interval between stressed syllables.


**Table 3.** Mean inter-tap interval (ITI) in ms for each group according to material type and tempo.

For the comparison of mean ITI between groups and conditions, the mean ITIs were scaled to the ITI corresponding to tapping once every two words (or stressed syllables).

Table 4 describes the features of the rhythmic structure of the stimuli in each condition. Phoneme boundaries were marked by hand using Praat [69], and were classified as vowels or consonants based on criteria defined by Ramus, Nespor, and Mehler [70]. Note that the analyses reported below include the stimuli at the original tempo only. Once the segmentation was completed, a MATLAB script was used to export the onset, offset, and duration of vocalic (a vowel or a cluster of vowels) and consonantal (a consonant or a cluster of consonants) intervals. The Normalized Pairwise Variability Index for Vocalic Intervals (V-nPVI), an indication of duration variability between successive vowels [71], was used to measure the rhythmic characteristics of the stimuli. A higher V-nPVI indicates greater differences in duration between consecutive vocalic intervals. Comparison of sentences in the naturally spoken, regularly spoken, and sung conditions showed a significant difference between conditions, *F*(2,22) = 21.6, *p* < 0.001, η² = 0.66. The V-nPVI was higher in the naturally and regularly spoken conditions than in the sung condition (Table 4). The coefficient of variation (CV, calculated as SD/mean) of IVIs (vowel onset to onset) is another indication of rhythmic variability [14]. A small CV for IVIs indicates similar time intervals between vowel onsets across the sentence. Here the CV was measured between every other syllable's vowel onset, corresponding to stressed syllables (see IVI in Figure 2). Once again, a significant difference between conditions was observed, *F*(2,22) = 64.6, *p* < 0.001, η² = 0.85. Naturally spoken sentences had the largest timing variations between vowel onsets (M = 0.21), followed by regularly spoken sentences (M = 0.08), while sung sentences showed the smallest variability (M = 0.05). To ensure that the female performer was comparably accurate in timing the sentences with the metronome in the regularly spoken and sung conditions, the relative asynchrony between each vowel onset and the closest metronome pulsation was measured. In this context, a negative mean asynchrony indicates that the vowel onset preceded the metronome tone onset, while a positive asynchrony means that the vowel onset followed the metronome tone (Table 4). There was no significant difference between conditions, indicating similar timing with the metronome in the regularly spoken and sung conditions, *t*(11) = 1.146, *p* = 0.28.


**Table 4.** Stimuli characteristics related to rhythm.

Values indicate means; standard errors appear in parentheses. IVI—intervocalic interval (in ms); V-nPVI—normalized Pairwise Variability Index for Vocalic Intervals; CV—coefficient of variation (SD IVI/Mean IVI between stressed syllables); Beat asynchrony corresponds to the average of signed values from subtracting the metronome tone onset from the closest spoken/sung vowel onset, in milliseconds. \* indicates significant differences.
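For concreteness, the sketch below shows how the two rhythm metrics described above (the V-nPVI over vocalic interval durations and the CV of intervocalic intervals between stressed syllables) could be computed once vowel onsets and durations have been extracted, e.g., from a Praat segmentation. It is an illustrative Python implementation following the standard nPVI definition, not the authors' original MATLAB script, and the example input values are hypothetical.

```python
import numpy as np

def vnpvi(vocalic_durations_ms):
    """Normalized Pairwise Variability Index for vocalic intervals (V-nPVI):
    mean normalized difference in duration between successive vocalic
    intervals, scaled by 100. Higher values indicate more durational
    contrast between neighbouring vowels."""
    d = np.asarray(vocalic_durations_ms, dtype=float)
    pair_diffs = np.abs(d[:-1] - d[1:]) / ((d[:-1] + d[1:]) / 2)
    return 100 * pair_diffs.mean()

def ivi_cv(vowel_onsets_ms, step=2):
    """Coefficient of variation (SD/mean) of intervocalic intervals, measured
    between every other vowel onset (stressed syllables), as in the text."""
    onsets = np.asarray(vowel_onsets_ms, dtype=float)[::step]
    ivis = np.diff(onsets)
    return ivis.std() / ivis.mean()

# Hypothetical vocalic durations and vowel onsets for one sentence
print(vnpvi([120, 180, 95, 150, 110]))         # V-nPVI of the sentence
print(ivi_cv([0, 230, 480, 720, 980, 1210, 1480]))  # CV of stressed-syllable IVIs
```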

#### *2.3. Design and Procedure*

Participants performed three tasks. First, they performed a spontaneous tapping task to assess their spontaneous tapping rate (mean and variance) in the absence of a pacing stimulus. They were asked to tap as regularly as possible for 30 seconds, as if they were a metronome or the "tick-tock" of a clock (as in [30]). Participants were asked to tap with the index finger of their dominant hand. Next, participants performed the tapping task with the spoken/sung sentences, as described below. Then the participants repeated the spontaneous tapping task to determine whether their spontaneous rate had changed, and finally, they tapped at a fixed rate with a metronome set to 120 bpm (inter-beat interval of 500 ms) and 96 bpm (inter-beat interval of 625 ms), chosen to match the tempi of the spoken/sung stimuli used in the experiment. The experiment had a total duration of approximately 60 minutes.

In the spoken/sung tapping blocks, each participant was presented with 12 each of naturally spoken sentences, regularly spoken sentences, and sung sentences at the original rate (120 bpm), and six sentences in each condition at the slower rate (96 bpm). These stimuli were mixed and divided into three blocks of 18 trials each. Two pseudo-random orders were created such that not more than two sentences from the same condition occurred consecutively and that the same sentence was never repeated. On each trial, participants first listened to the stimulus; then, for two additional presentations of the same stimulus, they were asked to tap along to the beat that they perceived in the stimulus (as in [56]). The action to perform (listen or tap) was prompted by instructions displayed on a computer screen. Participants pressed a key to start the next trial. Prior to commencing the task, a demonstration video was presented to participants, which showed an individual finger tapping on the sensor with one example stimulus from each condition. In the demonstration, a different sentence was used for each condition, and each was presented at a different rate (84 bpm or 108 bpm) than the ones used in the experiment. The sung sentence example was also presented with a different melody than any heard by participants in the task. After the demonstration, participants completed a practice trial for each type of sentence.

For the metronome task, there were two trials at each metronome tempo (120 bpm and 96 bpm), and the presentation order of the two metronome tempi was counterbalanced across participants. Each metronome stimulus contained sixty 50 ms 440 Hz sine tones. Each metronome trial began with seven tones at the specific tempo, during which participants were instructed to listen and prepare to tap with the metronome. A practice trial was also first performed with a metronome set to 108 bpm. As mentioned previously, since all participants could adapt their tapping rate to the stimuli at both 120 bpm and 96 bpm, only the results of tapping to the metronome at 120 bpm (rate of the original speech stimuli) are reported here.

The experiment took place in a large sound-attenuated studio. The tasks were programmed with MAX/MSP (https://cycling74.com). Taps were recorded on a square force-sensitive resistor (3.81 cm, Interlink FSR 406) connected to an Arduino UNO (R3; arduino.cc) running the Tap Arduino script (fsr\_silence\_cont.ino; [72,73]) and transmitting timing information to a PC (HP ProDesk 600 G1, Windows 7) via the serial USB port. The stimuli were delivered at a comfortable volume through closed headphones (DT 770 PRO, Beyerdynamic, Heilbronn, Germany) controlled by an audio interface (RME Fireface 800). No auditory feedback was provided for participants' tapping.

#### *2.4. Data Analyses*

#### 2.4.1. Tapping Data Preprocessing

In the spontaneous tapping task, the first five taps produced were discarded and the following 30 ITIs were used, in line with McAuley et al.'s procedure [30]. If participants produced fewer than 30 taps, the data included all taps produced (the smallest number of taps produced was 16 in this task). Due to recording problems, taps were missing from one beat-deaf participant's first spontaneous tapping trial.

Recorded taps were first pre-processed to remove ITIs smaller than 100 ms in the spontaneous tapping task, and ITIs smaller than 150 ms in the spoken/sung tapping task and the metronome task. In the three tasks, taps were also considered outliers and were removed if they were more than 50% smaller or larger than the median ITI produced by each participant (median ITI ± (median ITI × 0.5)). Pre-processing of tapping data was based on the procedure described by [74]. Accordingly, the 100 ms criterion was used at first for the spoken/sung task, but the number of outlier ITIs was high in both groups of participants. A 150 ms criterion was chosen instead, considering that it remained smaller than two standard deviations from the average time interval between consecutive syllables across stimuli (M = 245 ms, SD = 47 ms, M − 2SD = 152 ms), thus allowing the removal of more artefact taps while still limiting the risk of removing intended taps. As a result, 1.6% of the taps were removed (range: 0.0–6.4%) in the spontaneous tapping task. In the spoken/sung tapping task, 0.85% of taps per trial were removed (range: 0–36.4% taps/trial). In the metronome task, 5.27% of taps were removed on average (range: 3.4–8.1%), leaving between 54 and 76 taps per trial, of which the first 50 taps produced by each participant were used for analysis.
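A minimal Python sketch of this cleaning procedure is given below, assuming tap onset times in milliseconds as input. It illustrates the two criteria described above (a minimum ITI and a ±50%-of-median bound) rather than reproducing the authors' exact pipeline; the example tap times are hypothetical.

```python
import numpy as np

def clean_itis(tap_times_ms, min_iti=150.0, rel_bound=0.5):
    """Compute inter-tap intervals (ITIs) from tap onset times, drop ITIs
    shorter than `min_iti` ms (artefactual double hits), then drop ITIs
    deviating from the median ITI by more than `rel_bound`, i.e., outside
    median * [1 - rel_bound, 1 + rel_bound]. Returns the retained ITIs."""
    itis = np.diff(np.asarray(tap_times_ms, dtype=float))
    itis = itis[itis >= min_iti]
    med = np.median(itis)
    lo, hi = med * (1 - rel_bound), med * (1 + rel_bound)
    return itis[(itis >= lo) & (itis <= hi)]

# Hypothetical trial with one spurious double tap at 500 ms
taps = np.array([0, 480, 500, 990, 1495, 2010, 2490])
print(clean_itis(taps))  # the 20 ms interval is discarded, regular ITIs remain
```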

#### 2.4.2. Analysis of Tapping Data

The mean ITI was calculated for all tapping tasks. In the spoken/sung tapping task, since each participant tapped twice on each utterance in succession, the mean ITIs per stimulus were averaged across the two presentations. However, in 0.16% of the trials, participants did not tap at the same hierarchical level in the two presentations of the stimulus. For example, they tapped on every syllable in the first presentation, and every other syllable in the second presentation. These trials were not included in the calculations of CV, to avoid averaging together taps with differing mean ITIs. Nevertheless, at least 11 of the 12 trials at 120 bpm for each participant in each condition were included in the analyses. In the metronome task, data were also averaged across the two trials with the metronome at 120 bpm.

In the spoken/sung tapping task, inter-tap variability (CV = SD ITI/mean ITI) was computed for each condition. As Table 4 indicates, the CVs of taps to naturally spoken sentences should be larger than the CVs to regular stimuli. To assess this, we examined how produced ITIs matched the stimulus IVIs (as done by [75–77]). ITI deviation was calculated by averaging the absolute difference between each ITI and the corresponding IVI of the stimulus. To control for differences in IVI across stimuli, the ITI deviation was normalized to the mean IVI of that stimulus and converted to a percentage of deviation (% ITI deviation) with formula (1) below, where *x* indexes the current interval and *n* is the number of ITIs produced:

$$\%\ \text{ITI deviation} = \frac{\tfrac{1}{n}\sum_{x=1}^{n}\left|\,\text{ITI}_x - \text{IVI}_x\,\right|}{\text{mean IVI}} \times 100 \tag{1}$$

This measure of period deviation gives an indication of how participants' taps matched the rhythmic structure of the stimuli, whether regular or not.
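As an illustration, Equation (1) can be computed directly from matched lists of produced ITIs and stimulus IVIs. The short Python sketch below does this under the assumption that taps and vowel onsets have already been paired one-to-one (the pairing step is not shown), and the example numbers are hypothetical.

```python
import numpy as np

def iti_deviation_percent(itis_ms, ivis_ms):
    """Percentage ITI deviation as in Equation (1): mean absolute difference
    between each produced inter-tap interval (ITI) and the corresponding
    intervocalic interval (IVI), normalized by the mean IVI of the stimulus.
    Both inputs are 1-D sequences of equal length, in milliseconds."""
    itis = np.asarray(itis_ms, dtype=float)
    ivis = np.asarray(ivis_ms, dtype=float)
    return np.mean(np.abs(itis - ivis)) / ivis.mean() * 100.0

# Hypothetical trial: taps drifting slightly around a 500 ms intervocalic period
itis = [520, 515, 490, 530]
ivis = [500, 500, 500, 500]
print(iti_deviation_percent(itis, ivis))  # -> 3.75 (% deviation)
```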

Period-matching between spoken/sung sentences and taps was further assessed for the stimuli that contained regular beat periods (i.e., regularly spoken, sung, and metronome stimuli) with circular statistics using the Circular Statistics Toolbox for MATLAB [78]. With this technique, taps are transposed as angles on a circle from 0° to 360°, where a full circle corresponds to the period of the IVI of the stimulus. The position of each tap on the circle is used to compute a mean resultant vector. The length of the mean resultant vector (vector length, VL) indicates how clustered the data points are around the circle. Values of VL range from 0 to 1; the larger the value, the more the points on the circle are clustered together, indicating that the time interval between taps matches the IVI of the stimulus more consistently. For statistical analyses, since the data were skewed in the control group for the spoken/sung task (skewness: −0.635, SE: 0.144) and in the metronome tapping task for participants of both groups (skewness: −1.728, SE: 0.427), we used a logit transform of VL (logVL = −1 × log(1 − VL)), as is typically done with synchronization data (e.g., [57,58,60,61,74]). However, for simplicity, untransformed VL is reported when considering group means and individual data. The Rayleigh *z* test of periodicity was employed to assess whether a participant's taps period-matched the IVI of each stimulus consistently [79]. A significant Rayleigh *z* test (*p*-value < 0.05) demonstrates successful period matching. An advantage of the Rayleigh test is that it considers the number of taps available in determining if there is a significant direction in the data or not [78]. Using linear statistics, the accuracy of synchronization was further measured using the mean relative asynchrony between taps and beats' onset time in milliseconds. Note that this measure only included trials for which participants could successfully match the inter-beat interval of the stimuli, as assessed by the Rayleigh test, since the asynchrony would otherwise be meaningless.
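To make the circular measures concrete, here is a minimal NumPy sketch of the vector-length and Rayleigh computations described above; the study itself used the Circular Statistics Toolbox for MATLAB [78]. The period-to-angle mapping, the closed-form approximation of the Rayleigh *p*-value, and the example tap times are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def circular_period_matching(tap_times_ms, period_ms):
    """Map tap times onto a circle whose circumference equals the stimulus
    period, then compute:
      - vl:     length of the mean resultant vector (0-1), indexing how
                consistently the inter-tap timing matches the period
      - log_vl: the transform used for statistics, logVL = -log(1 - VL)
      - p:      Rayleigh test p-value (standard closed-form approximation);
                p < 0.05 indicates successful period matching
    """
    taps = np.asarray(tap_times_ms, dtype=float)
    angles = 2 * np.pi * (taps % period_ms) / period_ms
    n = len(angles)
    vl = np.abs(np.mean(np.exp(1j * angles)))   # mean resultant vector length
    big_r = n * vl
    p = np.exp(np.sqrt(1 + 4 * n + 4 * (n**2 - big_r**2)) - (1 + 2 * n))
    log_vl = -np.log(1 - vl)
    return vl, log_vl, p

# Hypothetical trial: 20 taps roughly every 500 ms (the 120 bpm beat period)
rng = np.random.default_rng(1)
taps = 500 * np.arange(20) + rng.normal(0, 25, 20)
vl, log_vl, p = circular_period_matching(taps, 500)
print(f"VL = {vl:.2f}, logVL = {log_vl:.2f}, Rayleigh p = {p:.2e}")
```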

The period used to perform the Rayleigh test was adjusted to fit the hierarchical level at which participants tapped on each trial. Since the stimuli had a tempo of 120 bpm (where one beat = two syllables), this meant that if a participant tapped to every word, the period used was 250 ms; if a participant tapped every two words, then 500 ms; and every four words, 1000 ms. This approach was chosen, as suggested by recent studies using circular statistics to assess synchronization to stimuli with multiple metric levels (or subdivisions of the beat period), in order to avoid bimodal distributions or underestimation of tapping consistency [61,80,81]. Given this adaptation, in the spoken/sung tapping task, we first looked at the closest hierarchical level at which participants tapped. This was approximated based on the tapping level that best fitted the majority of ITIs within a trial (i.e., the modal tapping level), as illustrated in the sketch below.
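The following sketch illustrates one simple way to estimate this modal tapping level, assuming candidate periods of 250, 500, and 1000 ms at 120 bpm. The assignment rule (nearest candidate period per ITI, then the majority vote) is an assumption for illustration, not necessarily the authors' exact criterion.

```python
import numpy as np

def modal_tapping_level(itis_ms, candidate_periods=(250, 500, 1000)):
    """Approximate the hierarchical level a participant tapped at: each
    produced ITI is assigned to the nearest candidate period (one tap per
    word, per two words, or per four words at 120 bpm), and the period that
    fits the majority of ITIs in the trial is returned."""
    itis = np.asarray(itis_ms, dtype=float)
    cands = np.asarray(candidate_periods, dtype=float)
    nearest = cands[np.argmin(np.abs(itis[:, None] - cands[None, :]), axis=1)]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical trial: mostly tapping every two words, with one longer interval
print(modal_tapping_level([510, 495, 530, 1005, 480]))  # -> 500.0
```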

#### 2.4.3. Correlation between Pitch Perception and Tapping to Spoken/Sung Sentences

In order to assess the contribution of musical pitch perception to synchronization with the spoken and sung sentences, the scores from the online test of amusia were correlated with measures of tapping variability (CV) and period-matching (% ITI deviation) from the spoken/sung tapping task.
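A minimal sketch of such a correlation analysis is shown below using SciPy (the reported analyses were run in SPSS, and Table 6 reports Spearman correlations). The per-participant values are invented solely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-participant values: amusia Scale scores (max 30) and
# % ITI deviation when tapping to sung sentences
scale_scores = np.array([29, 27, 22, 30, 25, 21, 28, 26, 24, 23])
iti_deviation = np.array([9.5, 11.2, 19.8, 8.1, 14.0, 22.3, 10.4, 12.6, 16.9, 18.2])

rho, p = spearmanr(scale_scores, iti_deviation)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```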

#### *2.5. Statistical Analyses*

Statistical analyses were performed in SPSS (IBM SPSS Statistics, Armonk, United States, version 24, 2016). A mixed repeated-measures ANOVA with Group as the between-subjects factor was used whenever the two groups were compared on a dependent variable with more than one condition. Because of the small group sample size, a statistical approach based on sensitivity analysis was applied, ensuring that significant effects were reliable when assumptions regarding residuals' normality distribution and homogeneity of variance were violated [82]. When these assumptions were violated, the approach employed was as follows: (1) inspect residuals to identify outliers (identified using Q—Q plot and box plot), (2) re-run the mixed-design ANOVA without the outliers and assess the consistency of the previous significant results, and (3) confirm the results with a non-parametric test of the significant comparisons [82]. If the effect was robust to this procedure, the original ANOVA was reported. Bonferroni correction was used for post-hoc comparisons. Other group comparisons were performed with Welch's test, which corrects for unequal variance. Paired *t*-tests were utilized for within-group comparisons on a repeated measure with only two conditions. Effect sizes are reported for all comparisons with *p*-values smaller than 0.50 [83]. To indicate the estimated effect sizes, partial eta-squared values are reported for repeated-measures ANOVA, and Hedge's g was computed for the other comparisons.

#### **3. Results**

#### *3.1. Spontaneous Tapping*

The mean ITI of the spontaneous tapping task ranged from 365 to 1109 ms in control participants and from 348 to 1443 ms in the beat-deaf group (Table 5). There was no significant group difference in the mean ITIs, *F*(1,23) = 1.2, *p* = 0.27, η² = 0.05, no significant effect of Time, *F*(1,23) = 0.48, *p* = 0.49, η² = 0.02, and no interaction, *F*(1,23) = 0.19, *p* = 0.66, indicating that spontaneous tapping was performed similarly before and after the spoken/sung tapping task. In contrast, a main effect of Group emerged in the CV for spontaneous tapping, *F*(1,23) = 18.2, *p* < 0.001, η² = 0.44, with no effect of Time, *F*(1,23) = 0.30, *p* = 0.59, and no interaction, *F*(1,23) = 0.19, *p* = 0.67. The CV for spontaneous tapping was higher in the beat-deaf group than in the control group (Table 5). As observed by Tranchant and Peretz [63], the beat-deaf individuals showed more inter-tap variability than control participants when trying to tap regularly without a pacing stimulus.

**Table 5.** Mean inter-tap interval (ITI) and coefficient of variation (CV) of spontaneous tapping.


Numerical values represent group means, with the standard error of the mean in parentheses. a, Values are in milliseconds. b, *n* = 12; otherwise, *n* = 13.

#### *3.2. Tapping to Speech and Song*

As expected, participants' inter-tap variability (CV) for the naturally spoken sentences was higher than the CV in the other two conditions. Figure 3a depicts the mean CV of the stimulus IVIs and Figure 3b depicts the mean CV for tapping in each condition. The CV for participants' taps was larger for the naturally spoken sentences (M = 0.13) than for the regularly spoken (M = 0.10) and sung (M = 0.10) sentences, *F*(1.5,35.4) = 15.2, *p* < 0.001, η² = 0.39. The groups did not differ significantly, *F*(1,24) = 2.8, *p* = 0.10, η² = 0.11, and there was no interaction with material type, *F*(1.5,35.4) = 0.58, *p* = 0.52. One control participant had a larger tapping CV than the rest of the group for natural speech. Three beat-deaf participants also had larger CVs across conditions. However, removing the outliers did not change the results of the analysis. Thus, the inter-tap variability only discriminated natural speech from the regularly paced stimuli for both groups.

Deviation in period matching between ITIs and IVIs of stimuli indicated that control participants exhibited better performance than beat-deaf participants, whether the stimuli were regular or not. Control participants showed a smaller percentage of deviation between the inter-tap period produced and the corresponding stimulus IVI across stimulus conditions (% ITI deviation; Figure 4), with a main effect of Group, *F*(1,24) = 8.2, *p* = 0.008, η² = 0.26, a main effect of Material, *F*(1.4,32.5) = 95.9, *p* < 0.001, η² = 0.80, and no interaction, *F*(1.4,32.5) = 0.19, *p* = 0.74. Post-hoc comparisons showed a significant difference between all conditions: the % ITI deviation was the largest for naturally spoken sentences (20.9% and 27.2% for the control and beat-deaf group, respectively), followed by regular speech (11% and 18.2%) and sung sentences (8.5% and 15.7%; see Figure 4). These results held even when outliers were removed.

**Figure 3.** (**a**) Coefficient of variation (CV) of the intervocalic interval (IVI) between stressed syllables of the stimuli. Each dot represents a sentence. (**b**) Mean CV of the inter-tap interval (ITI) produced by the beat-deaf and control group as a function of sentence type. Each dot represents a participant. Boxes correspond to a 95% confidence interval from the mean based on the standard error of the mean (SEM). The darker horizontal line within each box indicates the group mean, while the vertical lines represent two standard deviations from the mean.

**Figure 4.** Mean percentage of deviation between the inter-tap intervals (ITIs) produced by each participant and the IVIs of the sentences. Each dot represents a participant. Boxes corresponds to a 95% confidence interval from the mean based on standard error mean (SEM). The black horizontal line within each box indicates the group mean. The vertical lines represent two standard deviations from the mean.

In order to measure synchronization more precisely, we first examined the hierarchical level at which participants tapped. A chi-squared analysis of the number of participants who tapped at each hierarchical level (1, 2, or 4 words) by Condition and Group indicated a main effect of Group, χ²(2,78) = 7.4, *p* = 0.024. In both groups, participants tapped preferentially every two words (see Figure 5), although control participants were more systematic in this choice than beat-deaf participants. Both groups were consistent in the hierarchical level chosen for tapping across conditions. The hierarchical level at which a participant tapped determined the period used in the following analysis of synchronization to the regular stimuli.

**Figure 5.** Number of participants in each group who tapped at every word, every two words, or every four words, according to each sentence condition (natural, regular, or sung).

The average percentage of trials with successful period matching (using Rayleigh's z test) for the control group was 91.7% (range: 58–100%) for regularly spoken sentences and 90.4% (range: 50–100%) for sung ones. In the beat-deaf group, the mean percentage of successful period-matched trials was much lower, with 30.4% (range: 0–75%) and 23.8% (range: 0–66.7%) for regularly spoken and sung sentences, respectively. The percentage of trials with successful period matching did not differ between the regular and sung conditions, *t*(25) = 1.297, *p* = 0.21, *g* = 0.10.

We next examined whether synchronization was more consistent and accurate for sung than for regularly spoken sentences. These analyses were conducted on trials for which participants were able to synchronize successfully with the beat (i.e., Rayleigh *p*-value < 0.05). Because most beat-deaf participants failed to synchronize with the stimuli, the analyses are limited to the control group. The analyses of the log transform of the mean vector length (logVL) revealed that the control group's tapping was as consistent with regularly spoken sentences (*M* = 1.79, range: 1.18 to 3.09) as with sung ones (*M* = 1.85, range: 1.10 to 3.45), *t*(12) = −0.755, *p* = 0.46, *g* = 0.09. Accuracy of synchronization was assessed with the mean relative asynchrony between taps and beats in milliseconds. Control participants anticipated the beat onsets of sung sentences significantly earlier (M = −14 ms, range: −51 to 19 ms) than the beat onsets of regularly spoken sentences (M = 1 ms, range: −30 to 34 ms), *t*(12) = 3.802, *p* = 0.003, *g* = 0.74. This result suggests that beat onsets were better anticipated in sung sentences than in regularly spoken ones, corroborating results found by Lidji and collaborators [56]. Of note, the two beat-deaf participants (B2 and B4) who could successfully period-match the stimuli on more than 50% of trials showed consistency (logVL range: 1.08 to 1.53) and accuracy (mean asynchrony range: −3 ms to 22 ms) of synchronization similar to those of control participants.

#### *3.3. Tapping to Metronome*

All participants could successfully match their taps to the period of the metronome, as assessed by the Rayleigh z test, except for one beat-deaf participant (B10) who tapped too fast compared to the 120 bpm tempo (mean ITI = 409 ms for a metronome inter-onset interval of 500 ms). Thus, this participant and a matched control were removed from subsequent analyses in this task. As in previous analyses, control participants had smaller inter-tap variability than beat-deaf participants. This was confirmed by a group comparison with Welch's test on the CV, *t*(14.0) = 11.698, *p* = 0.004, *g* = 1.35 (control: M = 0.06, SE = 0.003; beat-deaf: M = 0.09, SE = 0.01). Period-matching consistency, using the logVL, also showed a significant group difference, *t*(22.0) = 9.314, *p* = 0.006, *g* = 1.20. The difference between groups was not significant, however, for the mean relative asynchrony between taps and metronome tones, *t*(20.4) = 0.066, *p* = 0.80 (control: M = −56 ms, range: −120 ms to 0 ms; beat-deaf: M = −53 ms, range: −104 ms to −11 ms).

#### *3.4. Contribution of Musical Pitch Perception to Entrainment to Utterances*

To assess the impact of musical pitch perception on tapping performance, we correlated the scores from the online test of amusia with tapping variability (CV) and period matching (%ITI deviation) for all conditions and participant groups (Table 6). The correlations between CV and musical pitch-related tests did not reach significance, while the % of ITI deviation did for two of the three stimulus conditions when considering participants from both groups. The significant correlation between the Scale test and the % ITI deviation was driven mostly by the beat-deaf group (*r(8)* = −0.61) rather than the control group (*r(11)* = −0.05). There was also a significant correlation between the Off-key test and % ITI deviation. None of the correlations reached significance with the Off-beat test. However, tapping variability (CV) to sentences and to music (M-BAT) were highly correlated in control but not beat-deaf participants.

**Table 6.** Spearman correlations between tapping and music perception.


CV—coefficient of variation; ITI—inter-tap interval. Outliers from the beat-deaf group were removed, with <sup>a</sup> *n* = 24, <sup>b</sup> *n* = 23, <sup>c</sup> *n* = 13, <sup>d</sup> *n* = 11, <sup>e</sup> *n* = 10. Columns in white indicate correlations with participants of both groups, light blue with control participants only, and darker blue beat-deaf participants only. Significant correlations after correcting for multiple comparisons are marked in orange (*p* ≤ 0.015).

These results raise the possibility that beat-deaf individuals with an additional deficit in pitch perception have a more severe impairment in finding the beat. If we compare beat-deaf participants with and without a co-occurring musical pitch deficit, the difference between groups does not reach significance on period-matching consistency of tapping (mean logVL) in the M-BAT beat production test, *t*(6.2) = 1.874, *p* = 0.11, *g* = 1.0. Thus, musical pitch perception seems to have little impact on synchronization to both musical (see [65]) and verbal stimuli.

#### **4. Discussion**

This study investigated the specialization of beat-based entrainment to music and to speech. We show that a deficit in beat finding initially uncovered with music can similarly affect entrainment to speech. The beat-deaf group in the current study, identified on the basis of abnormal tapping to various pre-existing songs, also showed more variable tapping to sentences, whether naturally spoken or spoken to a (silent) metronome, as compared to matched control participants. These results could argue for the domain generality of beat-based entrainment mechanisms to both music and speech. However, even tapping to a metronome or tapping at their own pace is more irregular in beat-deaf individuals than in typical non-musicians. Thus, the results point to the presence of a basic deficiency in timekeeping mechanisms that are relevant to entrainment to both music and speech and might not be specific to either domain.

Such a general deficiency in timekeeping mechanisms does not appear related to an anomalous speed of tapping. The spontaneous tapping tempo of the beat-deaf participants did not differ from that of neurotypical controls; what differed was the regularity of their tapping. This anomalous variability in spontaneous tapping has not been reported previously [57–60,62]. Two beat-deaf cases previously reported from the same lab [62], not included in the present sample, had numerically higher inter-tap variability in unpaced tapping, but the difference was not statistically significant in comparison to a control group. Our study includes one of the largest samples of individuals with a beat-finding disorder so far (compared, for example, to 10 poor synchronizers in [60]), which might explain some discrepancies with previous studies. Only recently (in our lab, [63]) has an anomalously high variability in spontaneous regular tapping in beat-deaf individuals been observed, irrespective of the tapping tempo.

A similar lack of precision was noted among beat-deaf participants compared to matched controls when tapping to a metronome, which is in line with Palmer et al. [62]. These similar findings suggest that temporal coordination (both in the presence of auditory feedback from a metronome and in its absence during spontaneous tapping) is impaired in beat-deaf individuals. These individuals also display more difficulty with adapting their tapping to temporally changing signals, such as phase and period perturbations in a metronome sequence [62]. Sowiński and Dalla Bella [60] also reported that poor beat synchronizers had more difficulty with correcting their synchronization errors when tapping to a metronome beat, as reflected in lag-1 analyses. Therefore, a deficient error correction mechanism in beat-deaf individuals may explain the generalized deficit for tapping with and without an external rhythm. This error correction mechanism may in turn result from a lack of precision in internal timekeeping mechanisms, sometimes called "intrinsic rhythmicity" [63].

However, the deficit in intrinsic rhythmicity in beat-deaf individuals is subtle. The beat-deaf participants appear sensitive to the acoustic regularity of both music and speech, albeit not as precisely as the control participants. All participants tapped more consistently to regularly spoken and sung sentences than to naturally spoken ones. All showed reduced tapping variability for the regular stimuli, with little difference between regularly spoken and sung sentences, while normal control participants also showed greater anticipation of beat onsets in the sung condition. The latter result suggests that entrainment was easier for music than speech, even when speech was artificially made regular. However, the results may simply reflect acoustic regularity, which was higher in the sung versions than in the spoken versions: the sung sentences had lower intervocalic variability and V-nPVI than regular speech, which may facilitate the prediction of beat occurrences, and, therefore, entrainment. These results corroborate previous studies proposing that acoustic regularity is the main factor supporting entrainment across domains [56].

Another factor that may account for better anticipation of beats in the sung condition is the presence of pitch variations. There is evidence that pitch can influence meter perception and entrainment in music [84–93]. The possible contribution of musical pitch in tapping to sung sentences is supported by the correlations between perception of musical pitch and period-matching performance (as measured by ITI deviation) in tapping to the sung sentences. However, the correlation was similar for the regularly spoken sentences, where pitch contributes little to the acoustic structure. Moreover, the beat-deaf participants who also had a musical-pitch deficit, corresponding to about half the group, did not perform significantly worse than those who displayed normal musical pitch processing. Altogether, the results suggest that pitch-related aspects of musical structure are not significant factors in entrainment [55,56,94,95].

Thus, a key question remains: What is the faulty mechanism that best explains the deficit exhibited by beat-deaf individuals? One useful way to conceptualize the imprecision in regular tapping that seems to characterize beat deafness, while sensitivity to external rhythm is maintained, is to posit broader tuning of self-sustained neural oscillations in the beat-impaired brain. An idea that is currently gaining strength is that auditory-motor synchronization capitalizes on the tempi of naturally occurring oscillatory brain dynamics, such that moments of heightened excitability (corresponding to particular oscillatory phases) become aligned with the timing of relevant external events (for a recent review, see [96]). In the beat-impaired brain, the alignment of internal neural oscillations to external auditory beats would still take place, as shown by these individuals' sensitivity to acoustic regularities, but it would not be sufficiently well calibrated to allow precise entrainment.

This account of beat deafness accords well with what is known about oscillatory brain responses to speech and music rhythms [12,97–104]. These oscillatory responses match the period of relevant linguistic units, such as phoneme onsets, syllable onsets, and prosodic cues, in the beta/gamma, theta, and delta rhythms, respectively [16,38,105,106]. Oscillatory responses can also entrain to the musical beat, and this oscillatory response may be modulated by the perceived beat structure [81,107–111]. Oscillatory responses may even occur in the absence of an acoustic event on every beat, and not just in response to the frequencies present in the signal envelope, indicating the contribution of oscillatory responses to beat perception [23,25,110]. Ding and Simon [98] propose that a common entrainment mechanism for speech and music could occur in the delta band (1–4 Hz). If so, we predict that, in the beat-deaf brain, oscillations in the delta band would not be as sharply aligned with the acoustic regularities present in both music and speech as in a normal brain. This prediction is currently under study in our laboratory.

One major implication of the present study is that the rhythmic disorder identified with music extends to speech. This is the first time that such an association across domains has been reported. In contrast, there are frequent reports of the reverse association between speech disorders and impaired musical rhythm [112–116]. Speech-related skills, such as phonological awareness and reading, are associated with variability in synchronization with a metronome beat [117–119]. Stutterers are also less consistent than control participants in synchronizing taps to a musical beat [120,121]. However, in none of these prior studies [112,116,118] was a deficit noted in spontaneous tapping, hence in intrinsic rhythmicity. Thus, it remains to be seen whether the speech-rhythm deficits reported in those populations are related to a poor calibration of intrinsic rhythmicity, as indicated here.

The design used in this study, which presented stimuli in the native language of the participants, limits the generalization of these findings across languages. It is possible, for example, that stress- and syllable-timed languages elicit different patterns of entrainment [14]. French is usually considered a less "rhythmic" language than English [70,71]. One's native language has also been shown to influence the perception of speech rhythm [14,56,122,123]. For example, Lidji et al. [14] found that tapping was more variable to French sentences than to English sentences, and that English speakers tapped more regularly to sentences of both languages. However, using the same protocol as the one used here, Lidji et al. [56] found that tapping was more variable to English than to French stimuli, irrespective of participants' native language. Thus, it is presently unclear whether participants' native language influences tapping to speech. This should be explored in future studies.

The tapping task used here also has limited ecological validity for entrainment to speech. A shadowing task, for example, in which the natural tendency of speakers to entrain to another speaker's speech rate is measured, could be an interesting paradigm to further investigate entrainment to speech [18,47–49] in beat-deaf individuals. The use of behavioral paradigms (tapping tasks), in the absence of neural measurements (such as electroencephalography), also leaves open the question of the order in which timing mechanisms contribute to entrainment in speech and music. For example, it is possible that entrainment with music, which typically establishes a highly regular rhythm, is processed at a faster (earlier) timescale than language, which requires syntactic and semantic processing, known to unfold over different timescales in language comprehension tasks [124,125]. These questions offer interesting avenues for future comparisons of rhythmic entrainment across speech and music.

#### **5. Conclusions**

In summary, our results indicate that beat deafness is not specific to music, but extends to any auditory rhythm, whether a metronome, speech or song. Furthermore, as proposed in previous studies [55,56], regularity or isochrony of the stimulus period seems to be the core feature through which entrainment is possible.

**Author Contributions:** Conceptualization, M.-É.L., C.P. and I.P.; data curation, M.-É.L.; formal analysis, M.-É.L.; funding acquisition, I.P.; investigation, M.-É.L.; methodology, M.-É.L., C.P. and I.P.; project administration, M.-É.L. and I.P.; resources, I.P.; software, M.-É.L. and C.P.; supervision, I.P.; validation, M.-É.L., C.P. and I.P.; visualization, M.-É.L.; writing—original draft, M.-É.L., C.P. and I.P.; writing—review and editing, M.-É.L., C.P. and I.P.

**Funding:** This research was funded by the Natural Sciences and Engineering Research Council of Canada (grant number 2014-04-068) and the Canada Research Chairs program.

**Acknowledgments:** We would like to thank Pauline Tranchant for help with recruitment and insightful comments on data analysis, and Mailis Rodrigues for help with programming the task. We also thank Dawn Merrett for help with editing.

**Conflicts of Interest:** The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Processing of Rhythm in Speech and Music in Adult Dyslexia**

#### **Natalie Boll-Avetisyan <sup>1,\*</sup>, Anjali Bhatara <sup>2</sup> and Barbara Höhle <sup>1</sup>**


Received: 23 September 2019; Accepted: 29 April 2020; Published: 30 April 2020

**Abstract:** Recent studies have suggested that musical rhythm perception ability can affect the phonological system. The most prevalent causal account for developmental dyslexia is the phonological deficit hypothesis. As rhythm is a subpart of phonology, we hypothesized that reading deficits in dyslexia are associated with rhythm processing in speech and in music. In a rhythmic grouping task, adults with diagnosed dyslexia and age-matched controls listened to speech streams with syllables alternating in intensity, duration, or neither, and indicated whether they perceived a strong-weak or weak-strong rhythm pattern. Additionally, their reading and musical rhythm abilities were measured. Results showed that adults with dyslexia had lower musical rhythm abilities than adults without dyslexia. Moreover, lower musical rhythm ability was associated with lower reading ability in dyslexia. However, speech grouping by adults with dyslexia was not impaired when musical rhythm perception ability was controlled for: like adults without dyslexia, they showed consistent preferences. Instead, rhythmic grouping was predicted by musical rhythm perception ability, irrespective of dyslexia. The results suggest associations among musical rhythm perception ability, speech rhythm perception, and reading ability. This highlights the importance of considering individual variability to better understand dyslexia and raises the possibility that musical rhythm perception ability is a key to phonological and reading acquisition.

**Keywords:** developmental dyslexia; Iambic/Trochaic Law; rhythmic grouping; musicality; speech perception; rhythm perception

#### **1. Introduction**

Developmental dyslexia (henceforth, dyslexia) affects the acquisition of reading and writing skills despite adequate cognitive and motoric abilities and appropriate access to education. Beyond literacy, dyslexia is also characterized by deficits in spoken language processing, particularly in processing phonological information. For this reason, researchers have proposed that deficits in the processing of phonological information may be the bridge connecting the deficits in spoken and written language, e.g., [1–4]. One prominent theory of dyslexia proposes that phonological processing difficulties are a consequence of impaired auditory processing abilities, in particular when processing rhythm information in speech and music [5]. The present paper aims to connect these hypotheses by investigating the processing of one specific type of phonological information, namely, rhythm information in speech, and its potential associations with literacy and the ability to perceive musical rhythms in dyslexia.

#### *1.1. The Phonological Deficit Hypothesis*

The original phonological deficit hypothesis proposes that a deficit in phonological skills underlies dyslexia, as evidenced by difficulties with tasks that tap into phoneme awareness, letter-sound knowledge, verbal short-term memory, and rapid automatized naming (for a recent review, see [6]). Research on this hypothesis has primarily concentrated on deficits regarding segmental (i.e., phoneme) information and has, for example, established that children with dyslexia do not seem to perceive phonemes in the same way as children without dyslexia. Specifically, they have a reduced sensitivity to phonemically relevant distinctions (e.g., when discriminating /p/ from /b/) and an enhanced sensitivity to allophonic variants (e.g., when discriminating different realizations of /b/) compared to listeners without dyslexia, who show clear effects of categorical perception of consonants (for a meta-analysis, see [7]). As categorical perception is assumed to result from effects of the native language phonological system on speech perception, e.g., [8], weak categorical perception may indicate that the language-specific phoneme categories are not sufficiently well established. In the case of dyslexia, less well-defined phoneme categories may create difficulties in the phoneme-grapheme mappings that are relevant for the acquisition and/or processing of written language.

#### *1.2. Rhythm Perception Deficits in Dyslexia*

More recent developments in dyslexia research have shown that the phonological deficits in dyslexia are not restricted to processing segmental information, but also affect the processing of suprasegmental (i.e., prosodic) information, and, in particular, the processing of rhythm. Rhythm is established by the regular occurrence of an element or a pattern in time. Rhythm is an important feature of languages' prosody, and can be characterized by, for example, an alternation of more prominent (i.e., strong) syllables with less prominent (i.e., weak) ones. Languages differ in their rhythmic structure as the organization of speech in alternations of strong and weak syllables is determined by language-specific "metrical stress" rules [9–11]. For example, in English and German, the basic rhythmic unit has a strong-weak (i.e., trochaic) pattern, but in other languages such as Hebrew, the basic rhythmic unit is weak-strong (i.e., iambic) [9]. Compared to groups without dyslexia, groups of individuals with dyslexia show lower performance in tasks that require perceptual sensitivity to and/or knowledge of stress rules. For example, this is the case in discrimination tasks with words or phrases pronounced with correct or incorrect stress patterns, e.g., [12–14], an effect that is even present in young children with a familial risk for dyslexia [15]. In addition, these abilities have been found to correlate with reading skills [16–18].

Goswami [5,19] has proposed that a fundamental deficit in the processing of rhythmic information is associated with dyslexia. This account focuses on the periodic modulations of amplitude (the amplitude envelope) that are crucial for establishing speech rhythm, with amplitude peaks being aligned with the strong (stressed) syllables of a speech sequence; Goswami assumes that the processing of this amplitude envelope is impeded in dyslexia. These difficulties may result from atypical basic auditory processing: numerous studies have found that individuals with dyslexia show low performance in the perception of rise time (i.e., the velocity of the amplitude increase) and that the perception of rise time is related to the discrimination of word stress patterns (for a review, see [20]). Research on the neural basis of this impairment suggests that in dyslexia, neural oscillations are not synchronized with auditory rhythms in the same way as in populations without dyslexia [21–24]. Independent of whether the basis of the impairment is perceptual or neural, according to Goswami, the problem in the processing of rhythm hinders the segmentation of speech into syllables and also the perception of subsyllabic units like rhymes and single phonemes, the latter explaining the segmental phonological problems in dyslexia. Although at this point any causal interpretation of associations between neural rhythmic entrainment and dyslexia has to be taken with care, it is relevant to note that Goswami's theory has the potential to account for a broader range of deficits that have been observed in dyslexia. Low performance in the perception of speech rhythm seems to extend to non-linguistic domains such as beat perception in music [25–27], and even to motor synchronization abilities such as rhythmic tapping [28–30], which suggests a domain-general rhythm processing deficit in dyslexia.

With its focus on rhythm processing, Goswami and colleagues' work offers a substantial approach to the potential mechanisms underlying the performance of individuals with dyslexia in different domains. In our study, we intend to broaden the view on rhythm perception in dyslexia by looking at duration and intensity as acoustic cues of speech rhythm perception. Acoustically, strong and weak syllables can be distinguished on the basis of specific cues such as intensity, duration, and pitch, with strong syllables often being louder, longer, and higher than weak ones [31,32]. Interestingly, these different cues have different effects on rhythmic grouping and segmentation: while alternations in syllables' duration lead to the perception of weak-strong patterns, alternations in intensity and pitch lead to the perception of strong-weak patterns [9,33–38]; for more details, see Section 1.3. The main goal of this paper is to investigate rhythmic grouping according to this bias in individuals with dyslexia.

Of course, not only speech is rhythmically structured. Rhythm is a domain-general phenomenon. Similar organizational rhythmic principles with regular alternations of strong and weak elements are also found in music [9,39], where the same acoustic cues (intensity, duration, and pitch) are relevant for conveying rhythm, and the same tendency to use these cues differently at the beginning or the end of a unit is often exhibited [39–41]. If rhythm perception in speech and music relies on shared perceptual mechanisms or shared rhythm representations, then individuals with better musical abilities should also show enhanced language abilities [42–44]. In line with this, [45] reported that, within a group of adults with dyslexia, musicians outperformed non-musicians on several auditory measures, including rise time, frequency, intensity, and timing perception, even reaching the same levels of performance as musicians without dyslexia. However, the advantage that musicians with dyslexia showed in the auditory perception tasks did not extend to their literacy and phonological awareness. Accordingly, other researchers doubt that dyslexia is related to poor rhythm perception, e.g., [46]. A second goal of the present paper, therefore, is to further examine whether rhythm processing deficits in speech and music are linked with reading deficits in dyslexia.

#### *1.3. Biases on Auditory Rhythmic Grouping*

In this paper, we will investigate for the first time how biases on auditory rhythm perception affect speech rhythm perception by adults with dyslexia. For this, we focus on rhythmic grouping of speech following the Iambic/Trochaic Law (ITL) [9]. According to the ITL, rhythmic perception is guided by universal biases. These biases have the effect that sequences of sounds varying in intensity tend to be perceived as trochees (strong-weak), whereas sound sequences varying in duration tend to be perceived as iambs (weak-strong; e.g., [33–36]). These biases have been attested for speakers of various languages, including English [33,34,36,47,48], German, French [38,49], Spanish [48], and Italian [37]. Since rhythmic grouping preferences are asymmetrical between the perceived acoustic cues, these biases cannot simply be accounted for by a tracking of acoustic cues to prominence in the signal. Importantly, asymmetries in rhythmic grouping are mirrored in the rhythm structures in language and music where final prominence is usually marked by a long syllable or note, and initial prominence by a loud syllable or beat, which supports the assumption of the ITL as universal [19]. More recent research, however, indicates that rhythmic grouping preferences are subject to individual variation and depend to some degree on aspects such as individuals' language background [38,47–50] and their musical abilities [51,52].

#### 1.3.1. Effects of Language Background on Rhythmic Grouping

Language background's effects on perception may relate to differences in the function of stress between the languages: [38] hypothesized that when perceiving speech, the ability to draw on abstract phonological representations of lexical stress would facilitate German speakers' rhythm processing. This is because German uses lexical stress contrastively (e.g., /'te,nor/ 'common sense' vs. /te'nor/ 'singer'), while French does not. In the rhythmic grouping experiment of [38], German and French listeners listened to syllable streams in which syllables alternated in intensity (loud-soft-loud-soft ... ), pitch (high-low-high-low ... ), duration (long-short-long-short ... ), or neither (flat control condition). Participants were asked to indicate via button presses whether they perceived strong-weak or weak-strong groupings. The result was that both groups perceived iambs and trochees as predicted by the ITL, but the German listeners were more consistent and had clearer rhythmic grouping preferences than the French listeners. Moreover, German but not French listeners experienced the illusion of hearing strong-weak groupings when listening to the control sequences that did not contain any acoustic cues to rhythm. Ref. [38] argue that this effect is likely to also be driven by the presence of abstract phonological representations of stress in German: as words in German are predominantly trochaic, German listeners might apply a default grouping to sound sequences based on their linguistic experience. Since French has no lexical stress, French listeners may have no reason for a default grouping based on their experience.

#### 1.3.2. Effects of Musical Background on Rhythmic Grouping

Musical *experience*, as defined by the number of acquired musical instruments, the duration of musical training, and the earliest age of acquiring a musical instrument, has been found to influence rhythmic grouping. However, this effect seems to be modulated by individuals' language background and has, to this point, only been found for native speakers of French and not for native speakers of German [38,49,51,52]. French speakers who are musically experienced have clearer preferences for grouping acoustically complex non-speech sounds [49] as well as for grouping speech, though only if they are also proficient speakers of German [51]. While general musical experience never predicted monolingual German speakers' grouping of speech, their ability to perceive musical rhythm, as measured by a standardized musical ability test (the Musical Ear Test, henceforth MET [53]), did (though their ability to perceive melodies did not) [52]. Musical abilities can be, but are not necessarily, correlated with musical experience [46]. Instead, they may relate to more general auditory perception abilities, which vary widely among individuals [29]. In addition, the abilities to perceive and discriminate musical rhythms do not always correlate with musical melody perception abilities [54]. Together, these results suggest a specific connection between language and music via rhythmic properties.

#### *1.4. Hypotheses and Predictions*

Individuals with dyslexia have repeatedly exhibited relatively weak stress and rhythm processing abilities, even in domains other than language (e.g., in tapping and music perception). This suggests that their rhythmic grouping preferences will also be weak, especially since rhythmic grouping depends on native language phonological knowledge. Given the findings that musical ability also influences rhythmic speech grouping, the present study set out to investigate the relations among speech rhythm processing, musical rhythm perception ability, reading ability, and dyslexia in German listeners. We aimed to investigate the following research questions:


We hypothesized the following:

(1) Based on the hypothesis that adults with dyslexia have difficulties in processing rhythm, we expect them to show weak grouping preferences. Hence, they should show less asymmetrical grouping preferences when hearing sequences varying in intensity or duration than adults without dyslexia. Further, if this rhythmic deficit hinders the establishment of phonological representations for metrical structure, adults with dyslexia should not show grouping preferences when hearing rhythmically invariant sequences.


To investigate these hypotheses, we conducted a rhythmic grouping experiment with adults with and without dyslexia and measured their musical rhythm ability by means of the MET [53], and their reading ability by means of the Salzburger Lese- und Rechtschreibtest SLRT-II [55]. In order to avoid pre-selecting or grouping participants based on their musicality and cognitive abilities, we applied regression modeling for data analysis, with musical rhythm ability, musical experience, and cognitive abilities as covariates.

#### **2. Materials and Methods**

#### *2.1. Participants*

Participants were 23 monolingually raised adult native speakers of German with dyslexia (9 women, 14 men, mean age = 24 years, age range: 17–35 years) and 23 age-matched controls (12 women, 11 men). An additional participant with dyslexia had been raised bilingually and was, hence, excluded together with the age-matched control. Participants gave informed consent before taking part.

The inclusion criterion for participants with dyslexia was that they presented a formal testimonial of their developmental dyslexia diagnosis. In Germany, there are no nation-wide standards for dyslexia diagnosis. To verify the diagnosis provided by the participants, we compared how the groups with and without dyslexia performed on a reading test (the SLRT-II [55], see below). Results of a linear regression indicated significantly lower nonword reading ability for the group with dyslexia compared to the group without (β = 41, SE = 6.06, t = 6.77, *p* < 0.001). This result allowed us to conclude that the testimonial of the dyslexia diagnosis justified the division of the participants into two groups (with vs. without dyslexia). Hence, we used group (rather than reading ability scores) as a factor to test assumptions regarding dyslexia. Other than this, there were no further constraints on recruitment. Participants of both groups were recruited in the cities of Berlin and Potsdam by means of flyers and online advertisements on social media, to make sure that our sample would not consist only of university students. For a detailed summary of the groups' background information and their average performance in the tasks described in Section 2.2, see Table 1.

The sample size is justified, as effect sizes in prior rhythmic grouping studies were large: for example, in [38], Cohen's d = 1.4 (large) for the comparison between French and German listeners in the intensity condition and Cohen's d = 1.1 (large) in the duration condition, and Cohen's d = 4.4 (large) for comparisons between conditions (duration vs. intensity, and duration vs. control) within native speakers of German. Moreover, we tested our design using the PANGEA software (https://jakewestfall.shinyapps.io/pangea/, see [56]), which revealed high power (0.91) for a study design including a four-way interaction with 23 participants per group, an alpha level of 0.05, and an assumed medium effect size of 0.45. This effect size of 0.45 is conservative given the large effect sizes found in prior studies; however, since prior studies on rhythmic grouping have not yet included adults with dyslexia, the power calculation has to be taken with caution.


**Table 1.** Summary of the results of all questions from the questionnaire as well as all musical and cognitive tests for both groups of adults, with versus without dyslexia.





#### *2.2. Task Battery*

#### 2.2.1. Rhythmic Grouping Preferences

In order to assess rhythmic speech grouping preferences, we used the stimuli and procedure from [37], Experiment 1. The stimuli were 90 speech-like streams that consisted of different simple syllables in which one consonant was always followed by one vowel (e.g., / ... zulebolilozimube ... /). The streams were text-to-speech synthesized with a German pronunciation and flat F0. There were three conditions: an intensity condition, in which every second syllable was louder than the preceding one; a duration condition, in which every second syllable was longer than the preceding one; and a control condition, in which all syllables were of equal intensity and duration. The task was to listen to each of the nonsense speech streams and to indicate by button press whether the pattern consisted of strong-weak or weak-strong disyllables. The proportion of strong-weak responses in the three conditions (intensity/duration/control) served as the dependent variable (Section 3.1/Section 3.3); for details, see [38].

#### 2.2.2. Musical Rhythm Perception Ability

Receptive musical rhythm abilities were assessed using the Musical Ear Test [53]. Participants heard 52 pairs of rhythmic sequences, each containing 4–11 wood block beats, and had to decide whether the two sequences were the same or different. The proportion of correct responses was used as a dependent measure to evaluate whether the group with dyslexia showed lower performance than the group without dyslexia (Section 3.2). Furthermore, this measure was used as an independent variable to understand its role as a predictor of rhythmic grouping (Section 3.3) and reading ability (Section 3.4).

#### 2.2.3. Questionnaire

An interview based on a questionnaire was used to collect information on the participants' musical and language background, and, if applicable, their dyslexia status and therapy experience (for details and a summary of the results, see Table 1). Questions were read out by the experimenter, who also filled out the questionnaire based on the responses. Following [49,51,52], a predictor of musical experience was extracted using the answers to questions regarding the number of acquired instruments, the age of acquiring the first instrument, and the duration of years of musical practice. In the following analyses, it was tested whether this predictor was correlated with musical rhythm ability (Section 3.2), rhythmic grouping (Section 3.3), and reading ability (Section 3.4).

#### 2.2.4. Reading Ability

Participants completed the reading fluency test of the Salzburger Lese- und Rechtschreibtests (SLRT-II) [55], a standardized test for the diagnosis of dyslexia. They were asked to read aloud lists of words and nonwords within a time limit (one minute per list). It allows for a separate diagnosis of deficits in automatic word recognition versus synthetic sound-based reading. The latter is predicted to be particularly weak in individuals with dyslexia. Note that we did not use this test to diagnose any of the participants with dyslexia. The purpose of this test was to verify that the groups defined on presence vs. absence of formal diagnosis of dyslexia truly differed in reading ability (see Section 2.1), and to test whether musical rhythm ability predicted reading ability (Section 3.4).

#### 2.2.5. Cognitive Ability

Many studies suggest that individual variability in cognitive abilities such as general verbal comprehension, short-term memory, and processing speed can influence performance in psycholinguistic experiments (for a systematic review, see [57]). In order to verify that potential differences between the groups with and without dyslexia in the experimental task were not due to differences in such general cognitive abilities, participants completed four subtests from the Wechsler Adult Intelligence Scale WAIS-IV (a version adapted for German, [58]), a standardized tool for determining the intelligence quotient. To test verbal comprehension (specifically, verbal reasoning and semantic knowledge), participants performed the subtest Similarities, in which they heard 18 pairs of words (e.g., piano & drum, or friend & enemy) and had to describe which attributes the two words share. Next, to test short-term memory, they performed subtests that measured their digit span. Specifically, they listened to sequences of orally presented numbers and, in three subsequent subtests, were required to repeat them as heard, backward, or in sequential (ascending) order.

For measuring their processing speed, we selected two subtests: Symbol search and Coding. In Symbol Search, participants were required to search for two target symbols in a row of different symbols, and to indicate whether the target symbols were present or not. In Coding, nine different numbers (1–9) are assigned a different symbol. In the task, participants are presented with a list of numbers and are required to draw the corresponding symbol next to each of the numbers. (Participants additionally completed a nonword repetition task for adults [59], which was based on the Mottier test, a standardized test for German-speaking children [60]. Because of redundancy with the digit span tests (Section 2.2.5), which also test verbal memory, we did not include the data of the nonword repetition test in the analyses.) A composite score of the results of all subtests served as a covariate in analyses of rhythmic grouping (Section 3.1/Section 3.3), musical rhythm perception ability (Section 3.2) and reading ability (Section 3.4).

#### *2.3. Data Processing and Analyses*

For the analyses, we included data from both the groups of adults with dyslexia (N = 23) and without dyslexia (N = 23). The analysis (Section 3) consisted of four parts.

First, to address hypothesis (1), we tested whether rhythmic speech grouping preferences of the two groups (with vs. without dyslexia) differed from chance in the three acoustic conditions (intensity, control, and duration) by means of generalized linear mixed-effects models (Section 3.1).

Second, to address hypothesis (2), a linear regression analysis with the MET scores as the dependent variable was performed in order to determine whether group differences existed, while controlling for general cognitive ability and musical experience (Section 3.2).

Third, to address hypotheses (1) and (3), we tested whether rhythmic grouping preferences differed between groups, and whether they depended on individuals' musical rhythm perception ability. For this, we performed a generalized linear mixed-effects model analysis. In a stepwise fashion, we incrementally increased the models' complexity to understand the effects of the factors group and musical rhythm perception ability (both of which we predicted to have an effect) on the three conditions (intensity, duration, control), while, ultimately, controlling for general cognitive ability and musical experience. Our method was to compare mixed-effects models that either included or excluded predictors to find the combination of predictors that accounted for most variance in the data, following recommended procedures [61–63] (Section 3.3).

Fourth, to address hypothesis (4), we assessed the association of musical rhythm perception ability and reading ability in both the group with and the group without dyslexia, while again controlling for musical experience and cognitive abilities. For this, we performed a linear regression analysis with nonword reading ability (i.e., SLRT nonword reading scores) as the dependent variable, and group, musical rhythm perception ability, cognitive ability and musical experience in the fixed part (Section 3.4).
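As an illustration of the stepwise comparison of mixed-effects models described in the third step, the following R sketch (using lme4, with a hypothetical long-format data frame and a simplified random-effects structure relative to the models in Appendix E) fits a group-only model and a model adding musical rhythm perception ability and compares them with a likelihood-ratio test:

```r
library(lme4)

# Minimal sketch (hypothetical data, simplified random effects; not the models in Appendix E).
set.seed(3)
dat <- expand.grid(participant = factor(1:46), item = factor(1:30))
dat$group          <- ifelse(as.integer(dat$participant) <= 23, "dyslexia", "control")
dat$condition      <- factor(sample(c("intensity", "duration", "control"),
                                    nrow(dat), replace = TRUE))
dat$rhythm_ability <- rnorm(46)[as.integer(dat$participant)]   # one MET-like score per person
dat$response       <- rbinom(nrow(dat), 1, 0.5)                # 1 = trochaic, 0 = iambic

# Group-only model vs. a model that adds musical rhythm perception ability
m_group  <- glmer(response ~ condition * group + (1 | participant) + (1 | item),
                  data = dat, family = binomial)
m_rhythm <- glmer(response ~ condition * group * rhythm_ability +
                    (1 | participant) + (1 | item),
                  data = dat, family = binomial)

# Likelihood-ratio test between the nested models
anova(m_group, m_rhythm)
```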

For the control variable "cognitive ability", a composite score was generated that combined the averaged WAIS-IV subtest scores. For the control variable "musical experience", we generated a composite score on the basis of three questions from the questionnaire representing the participants' years of musical training, their age of beginning musical training, and the number of learned musical instruments/activities. Both composite scores were created by means of Principal Component Analysis (see Appendix A, Table A1) to avoid collinearity. Collinearity occurs when a number of independent variables are correlated, which poses a problem for regression analyses. Principal component regression is a commonly used method to reduce collinearity, as it eliminates the dimensions that are causing the collinearity problem [64] (p. 446). Following the classical procedure, we included the first principal components (PCs) as independent factors in our subsequent regression analyses. The first PC reflecting cognitive ability accounted for 58% of the variance contained in the data of the four WAIS-IV subtests, which were represented by this PC to a comparable degree (see Appendix A for details). The first PC reflecting musical experience accounted for 82% of the variance of the three questions, which were equally represented by this variable (see Appendix B, Table A2).
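A minimal R sketch of this composite-score construction is given below; the three musical-experience variables are hypothetical stand-ins for the questionnaire items, and only the generic prcomp-based extraction of the first principal component is shown:

```r
# Minimal sketch (hypothetical questionnaire answers; not the study data).
set.seed(4)
experience <- data.frame(
  years_training  = rpois(46, lambda = 5),                  # years of musical practice
  age_first_instr = sample(5:18, 46, replace = TRUE),       # age of acquiring the first instrument
  n_instruments   = rpois(46, lambda = 2)                   # number of learned instruments/activities
)

# First principal component as a single, collinearity-free composite predictor
pca <- prcomp(experience, center = TRUE, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", 1]        # variance explained by PC1
musical_experience_pc1 <- pca$x[, 1]                        # scores to enter the regressions
```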

All analyses were performed in R [65] using the package lme4 [66]; graphs were generated using the package ggplot2 ([67]). For plotting modeled data, the package effects [68] was used to extract the model estimates and respective *SE*s.

#### **3. Results**

#### *3.1. Rhythmic Grouping Preferences*

Tests against chance (see Appendix C, Table A3) revealed that in both the intensity and control condition, trochaic (strong-weak) responses were above chance for both the group with dyslexia (intensity: *p* < 0.001; control: *p* = 0.03) and the group without dyslexia (both *p*'s < 0.001). In the duration condition, both groups gave more iambic (weak-strong) responses than expected by chance (both *p*'s < 0.001, see Figure 1).

**Figure 1.** Proportions of trochaic responses (back-transformed, y-axis adjusted to the logit space) in the three acoustic conditions for both groups. The graph reflects the estimates of a simple logit linear mixed-effects model (responses ~ condition \* group + (condition + 1||participants) + (1|items) ... ).

#### *3.2. Musical Rhythm Ability*

We compared how the two groups (with vs. without dyslexia) performed on the MET for rhythm, while controlling for general cognitive ability and musical experience. Control participants' average rhythm MET scores were higher (73.49% correct, SD = 9.96) than those of participants with dyslexia (61.87% correct, SD = 9.92), with a large effect size (Cohen's d = 1.17). Results of a linear regression confirmed that the difference between groups was significant (*p* = 0.03). Moreover, rhythm MET scores were significantly predicted by cognitive ability (*p* < 0.01), but not by musical experience (*p* = 0.28; for the full results, see Appendix D, Table A4). The groups did not, however, differ with regard to their musical experience (β = 0.77, SD = 0.46, t = 1.68, *p* = 0.1). This suggests that dyslexia is associated with reduced musical rhythm perception ability that is independent of musical experience.

#### *3.3. Predictors of Rhythmic Grouping Preferences*

Next, we tested whether rhythmic grouping preferences differed in strength between groups and explored the role of individual differences in musical rhythm perception ability, cognitive ability, and musical experience. For this, we report the main results of all models that entered our stepwise regression analysis. To measure the consistency of rhythmic grouping preferences (i.e., how consistent participants were in grouping duration variation as iambs and intensity as trochees), we entered contrasts between conditions into our models. In all models (see Appendix E, Tables A5–A9), significant effects were obtained in the Duration-Intensity contrast and in the Control-Duration contrast (both *p*'s < 0.001), indicating that in both the intensity and control condition, more trochaic responses were given than in the duration condition.

Model 1, serving as a basis, included only the interaction of Condition and Group in the fixed part (Formula: Response ~ Condition/(Group) + (1 + Duration-Intensity + Control-Duration || participant) + (1 | item)) to test the hypothesis that adults with versus without dyslexia differ in their rhythmic speech grouping preferences. Model results (fully reported in Table A5 and depicted in Figure 2) show significant group differences in both the intensity (*p* = 0.003) and the control condition (*p* = 0.02), with more trochaic responses by adults without dyslexia than by adults with dyslexia. In the duration condition, no group differences were found.

**Figure 2.** Linear regression lines illustrating the effects of musical rhythm perception ability (Musical Ear Test scores) on rhythmic grouping in the three acoustic conditions separated by group (left panel: with dyslexia, right panel: without dyslexia).

Model 2 included only the interaction of Condition and Musical rhythm perception ability in the fixed part, excluding group (Formula: Response ~ Condition/(Musical rhythm perception ability) + (1 + Duration-Intensity + Control-Duration || participant) + (1 | item)), to evaluate the general contribution of musical rhythm perception ability to rhythmic speech grouping preferences. Model results (see Table A6) show significant effects of musical rhythm perception ability in all conditions: higher musical rhythm perception ability was associated with more trochaic groupings in the intensity (Intensity\*Musical rhythm perception ability: *p* = 0.04) and control condition (Control\*Musical rhythm perception ability: *p* < 0.001), and more iambic groupings in the duration condition (Duration\*Musical rhythm perception ability, *p* < 0.001, see Figure 2). Model comparisons revealed that Model 2 was a better fit than Model 1 (χ<sup>2</sup> = 39.53, *p* < 0.001).

Model 3 included the three-way interaction of Condition, Group and Musical rhythm perception ability in the fixed part (Formula: Response ~ Condition/(Group \* Musical rhythm perception ability) + (1 + Duration-Intensity + Control-Duration || participant) + (1 | item)) to understand whether group and musical rhythm perception ability predict speech grouping preferences independently. Results (see Table A7 and Figure 2) suggest that this is not the case. Interactions of Group with the Intensity and Control conditions that were present in Model 1 no longer reached significance in Model 3, and the interaction of Duration\*Group did not reach significance either. However, the interactions Duration\*Musical rhythm perception ability (*p* < 0.001) and Control\*Musical rhythm perception ability (*p* < 0.001) that were present in Model 2 remained highly significant in Model 3. This suggests that group differences in the Control condition as attested in Model 1 are due to differences in musical rhythm perception ability between the groups. Moreover, the results suggest that variance in the Duration and Control condition is better captured by differences among individuals' musical rhythm perception ability than by dyslexia status. There were no three-way interactions of any of the conditions with group and musical rhythm perception ability. Model comparisons revealed that Model 3 was a better fit than Model 2 (χ<sup>2</sup> = 13.62, *p* < 0.001).

Two further models tested the potential effects of two control variables: cognitive ability (Model 4, reported in Table A8) and musical experience (Model 5, reported in Table A9). These models revealed the same effects that were also present in Model 3. However, because neither of these control variables significantly influenced participants' grouping preferences in any of the conditions, nor did an inclusion of these factors improve the model fit, we do not discuss these models further (detail and model outputs are provided in Tables A8 and A9).

To summarize, Model 3 (Table A7), which included interactions of condition with group and musical rhythm perception ability, accounted best for the data, which revealed effects of musical rhythm perception ability on the control and duration (but not the intensity) condition, but no significant effects of the group factor.

#### *3.4. Predictors of Nonword Reading Ability*

Results of a linear regression analysis (see Appendix F, Table A10 for details) revealed neither effects of cognitive ability nor of musical experience on reading ability (no main effect, no interaction). There was, however, a significant main effect of group, indicating that—as expected—the group without dyslexia had higher reading ability than the group with dyslexia (*p* < 0.001). Moreover, there was a marginal interaction of musical rhythm perception ability and group (*p* = 0.056, Cohen's f<sup>2</sup> = 0.10 (medium)). To understand the interaction, we tested the effect of musical rhythm perception ability on reading ability per group. Results were that musical rhythm perception ability positively predicted reading ability by individuals with dyslexia (β = 97.77, SE = 35.76, t = 2.73, *p* = 0.01, Cohen's f<sup>2</sup> = 0.36 (large)) but not by individuals without dyslexia (β = −11.693, SE = 48.25, t = −0.24, *p* = 0.81; see Figure 3).

**Figure 3.** Linear regression lines reflecting the association between nonword reading ability (Salzburger Lese- und Rechtschreibtests (SLRT) scores) and musical rhythm perception ability (Musical Ear Test (MET) scores) split by group, shades indicate confidence intervals, rectangles (with dyslexia) and triangles (without dyslexia) the individuals' averages.
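A regression of this kind, with a group-by-rhythm-ability interaction followed by per-group simple-effects models, can be sketched in R as follows (hypothetical variables standing in for the SLRT and MET scores; this is not the model reported in Table A10):

```r
# Minimal sketch (hypothetical scores; not the model reported in Table A10).
set.seed(5)
group   <- factor(rep(c("dyslexia", "control"), each = 23))
met     <- c(rnorm(23, 0.62, 0.10), rnorm(23, 0.73, 0.10))         # MET rhythm proportion correct
reading <- ifelse(group == "dyslexia", 40 + 80 * met, 95) + rnorm(46, sd = 8)

# Full model with the group-by-rhythm-ability interaction
summary(lm(reading ~ group * met))

# Simple effects: association between MET and reading within each group
summary(lm(reading ~ met, subset = group == "dyslexia"))
summary(lm(reading ~ met, subset = group == "control"))
```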

#### **4. Discussion**

The present study is based on the theory that there is a deficit in rhythm processing in dyslexia [1–4], which we studied by exploring the modulating effects of musical rhythm perception ability on rhythmic speech grouping by adults with and without dyslexia. Populations with dyslexia have previously been demonstrated to have difficulties with processing stress and rhythm information [12–14], which suggests that the phonological deficit affects not only segmental but also suprasegmental aspects of speech. Hence, we investigated whether adults with dyslexia have reduced abilities in rhythmic grouping of speech, an ability that has previously been found to depend on native language phonological knowledge [38,47,48]. Rhythm, however, is not only part of phonology, but is also an integral aspect of other auditory domains, such as music. Previous studies have established that there are links between individuals' musical abilities and their rhythm processing [52]. Hence, we hypothesized that we would find links between rhythm processing in speech and music and reading ability in dyslexia.

Specifically, our research was intended to provide answers to the following questions, and delivered the following central findings:


The results are discussed below.

#### *4.1. Dyslexia, Rhythmic Grouping Preferences, and Musical Rhythm Ability*

First, regarding the link between dyslexia and rhythmic speech grouping, results revealed significant preferences for groupings as predicted by the ITL in all conditions (iambic in the duration condition, trochaic in both the control and intensity condition), by native speakers of German with and without dyslexia. This result was unexpected. Our original hypothesis was that rhythmic grouping preferences would be weakened in dyslexia. This hypothesis was motivated by results from prior studies showing that individuals with dyslexia have weakened stress perception abilities, e.g., [12–14]. In prior studies, we found that native speakers of French had weakened rhythmic grouping preferences compared to native speakers of German—a result argued to relate to differences in the phonological systems of German and French (due to the lack of contrastive lexical stress in the French language). Since French speakers have, moreover, repeatedly been found to have weakened stress perception abilities, e.g., [69,70], and the same is true for individuals with dyslexia, e.g., [12–14], we drew a parallel. Unexpectedly, results attested that German speakers with dyslexia show consistent grouping preferences at the group level, just like German speakers without dyslexia. This result replicates our previous findings with native speakers of German without dyslexia and extends them to native speakers of German with dyslexia.

It is important to further explore why listeners with dyslexia generally show the same pattern of responses as those without in the rhythmic speech grouping task: At first glance, this conflicts with the assumption that individuals with dyslexia predominantly have a deficit in rhythm processing. However, the present results can be better understood by considering the results of the model comparisons that addressed the fourth research question about the association of dyslexia, rhythmic speech perception, and musical rhythm perception ability. Model 1 (Table A5, the baseline model that excluded the musical rhythm perception ability factor), suggested a detrimental impact of dyslexia on rhythmic speech perception in both the intensity and control condition. This is in line with prior studies that suggested links between speech rhythm perception and dyslexia, e.g., [12–14].

Importantly, these effects disappeared when, in Model 3 (Table A7, as well as Models 4 and 5 (Tables A8 and A9) that additionally controlled for cognitive ability and musical experience), musical rhythm perception ability was added as a predictor. This suggests that differences in individuals' musical rhythm perception ability better capture the variance in the data than the individuals' dyslexia status (i.e., there are individuals with dyslexia who have high musical rhythm perception ability with consistent grouping preferences, and individuals without dyslexia with low musical rhythm perception ability with inconsistent grouping preferences). However, even though this suggests that rhythmic speech perception is independent of dyslexia and only modulated by musical rhythm perception, adults with dyslexia had overall lower musical rhythm perception ability than adults without dyslexia, which implies an indirect effect of group on rhythmic speech perception.

Notably, as predicted, musical rhythm perception ability accounted for variance in the rhythmic grouping of duration-varied and rhythmically invariant control speech sequences, but, in contrast with our predictions, not of intensity-varied speech sequences. These findings suggest that the relation between musical rhythm perception ability and speech rhythm processing cannot simply be explained by a general deficit in perceiving the acoustic information that is relevant for perceiving rhythm (otherwise, the processing of intensity-varied speech sequences should also be related to musical rhythm perception ability). The results of the control condition, in which acoustic cues to rhythm were absent, suggest that the relation between musical rhythm perception ability and speech rhythm processing is established via phonological knowledge. In previous studies [38], it was proposed that German listeners might perceive trochees even in the absence of acoustic cues to rhythm, because German has trochaic metrical stress and their abstract knowledge about this phonological property of their language affects German listeners' perception (as commonly seen also in sound perception, which is likewise affected by native language phonemic categories). Ref. [52] also observed that musical rhythm perception ability was associated with this default grouping procedure, and it was speculated whether individual differences in basic auditory perception abilities might lead to differences in how listeners establish phonological knowledge. The fact that the present study replicates the connection between musical rhythm perception ability and default trochaic groupings with adults with dyslexia is interesting, as it offers additional support for the interpretation that default rhythm perception procedures are subject to individual variation and are associated with more general auditory rhythm perception abilities.

In order to explain why musical rhythm perception ability, contrary to our predictions, did not affect grouping of intensity-varied sequences by adults with dyslexia, we consider previous studies on the ITL. Infants have been found to use pitch and intensity cues for trochaic groupings (pitch: [37,71], intensity: [72]) more readily than duration cues for iambic groupings [37,71,72]. Based on these findings it has been proposed that the use of duration cues for grouping is acquired, while the use of other rhythmic cues for trochaic groupings is innate (more evidence for this proposal comes from studies with rats [73,74] but c.f. [75,76] for evidence that the use of duration for iambic groupings is also innate). The present finding that intensity-based grouping is unmodulated by musical rhythm perception ability is consistent with the assumption of an innate preference for trochaic groupings when intensity alternations are perceived. Speculatively, this might suggest that in dyslexia, perception of innately biased speech processing routines is unimpaired. It would be interesting if future studies followed up on this.

#### *4.2. Dyslexia and Musical Rhythm Ability*

Regarding the associations among dyslexia, reading ability and musical rhythm perception ability, results were as predicted: First, adults with versus without dyslexia differed in their musical rhythm perception ability: as a group, adults with dyslexia showed lower performance in the MET rhythm subtest than the control group. This result is in line with previous findings indicating deficits in musical rhythm processing abilities in dyslexia [25–27]. However, it must be noted that performance in the MET is associated with short-term memory performance [58]. This is in line with literature that has found short-term memory ability to be enhanced in musical people [77–79]. However, it is also well known that groups with dyslexia have less efficient short-term memory abilities than groups without dyslexia, which is also true for the present sample (see Table 1; WAIS Working Memory, digit span: with dyslexia: 24 (15–38); without dyslexia: 28.86 (22–37)), and it is debated whether this reduced short-term memory efficiency is the basis of the impairment or an effect of the phonological deficit [3,80]. In fact, we found that musical rhythm ability was predicted by cognitive ability, a composite variable that included the participants' digit span scores. In order to better understand whether musical rhythm ability is lower in dyslexia independently of related cognitive abilities, future studies should aim to control for this confounding factor.

#### *4.3. Dyslexia, Nonword Reading Ability, and Musical Rhythm Ability*

We tested whether reading ability was predicted by group and by musical rhythm perception ability. As expected, the groups differed in their reading ability. Interestingly, we furthermore found a marginally significant interaction (*p* = 0.056) of group and musical rhythm perception ability on reading ability. We explored this interaction (although results based on this non-significant interaction have to be interpreted with care) and found that, in particular, the reading ability of adults with dyslexia was predicted by musical rhythm perception ability: the lower their performance in the MET rhythm subtest, the lower their score in the nonword reading test. This finding is consistent with prior studies that found links between musical rhythm perception ability and reading ability (e.g., [81] and references therein). Both results support theories positing links between general rhythm processing abilities, dyslexia and, accordingly, reading ability.

Note that the lack of a relation between musical rhythm perception ability and reading ability in adults without dyslexia does not justify the conclusion that this association is exclusive to dyslexia. Potentially, a link between musical rhythm perception ability and reading ability could also emerge if adults without dyslexia were tested with a reading test that elicits greater variability in this group's reading ability than the SLRT, a test designed specifically for identifying dyslexia in adulthood. Moreover, the relation between musical rhythm perception ability and reading ability may be non-linear, with ceiling effects of musical rhythm perception ability at a certain level of high reading ability that adults without dyslexia typically reach. It will be interesting to address these questions in future research.

#### **5. Conclusions**

In sum, the main findings of the present study are the following: First, rhythmic grouping of speech is not predicted by dyslexia status, but by musical rhythm ability. That is, the present study does not provide direct evidence for the theory that there is a specific speech rhythm processing deficit in dyslexia. However, the fact that the group of adults with dyslexia showed lower musical rhythm perception ability than the group of adults without dyslexia, and that musical rhythm ability predicted speech rhythm grouping, indicates a link between dyslexia and rhythm processing in both music and speech. Second, musical rhythmic skills predict reading in dyslexia. The results suggest clear links between dyslexia (i.e., reading ability), musical rhythm perception ability, and speech rhythm processing, not only when rhythmic cues are available but also when a lack of cues triggers knowledge-driven default processing routines. All in all, the results point to individual differences in the group of adults with dyslexia that are explained by their musical rhythm perception ability.

The present findings cannot speak to causal relationships between musical rhythm perception ability and dyslexia. However, they raise the possibility that rhythm perception ability is a key to phonological and reading acquisition. The present results are in line with two assumptions about the underlying reasons for these links. The first assumption is that deficits connected with dyslexia can be compensated for by rhythm perception ability. The second assumption is that the deficits connected with dyslexia are a consequence of lower rhythm perception ability; that is, individuals with lower rhythm perception ability may have a higher risk of developing phonological and reading deficits.

Future studies should address the question of how musical rhythm perception ability, speech rhythm perception, and reading are causally connected. To examine the first assumption, studies should assess the potential of rhythmic interventions in dyslexia therapy, thereby following a line of research that has already been initiated, e.g., [82,83]. Ideally, future research should use pre-/post-test paradigms to test whether musical rhythm perception ability can be enhanced by training. This can then be extended to other types of rhythmic behavior, such as motor synchronization (e.g., tapping) with rhythmic beats, to pave the way for targeted rhythm-based therapeutic approaches. To examine the second assumption, future research should conduct longitudinal studies with very young infants with a familial risk for dyslexia (for a similar suggestion, see [84]), to clarify whether the ability to perceive rhythm in music (and other sensory domains) is a reliable early marker of developmental dyslexia.

**Author Contributions:** Conceptualization, N.B.-A., A.B. and B.H.; methodology, N.B.-A., A.B. and B.H.; analysis, N.B.-A.; investigation, N.B.-A.; data curation, N.B.-A.; writing—original draft preparation, N.B.-A. and B.H.; writing—review and editing, N.B.-A., A.B. and B.H.; visualization, N.B.-A.; project administration, N.B.-A.; funding acquisition, B.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by two Agence Nationale de la Recherche-Deutsche Forschungsgemeinschaft grants (# 09-FASHS-018 and HO 1960/14–1 to Barbara Höhle and Thierry Nazzi, and HO 1960/15–1 and ANR-13-FRAL-0010 to Ranka Bijeljac-Babic and Barbara Höhle), and partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)–project number 317633480–SFB 1287, Project C03.

**Acknowledgments:** We thank Sophie Gruhn, Franz Hildebrandt-Harangozó, Olivia Malotka, and Isabelle Mackuth for help with recruiting and testing participants, and Daniel Schad for consultancy regarding the statistical analysis. We acknowledge the support of the Deutsche Forschungsgemeinschaft and Open Access Publishing Fund of University of Potsdam.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

Principal Component Analysis was used to generate one composite score to reflect general cognitive ability on the basis of four tests from the WAIS. In the analyses, we used the first Principal Component as a factor to represent cognitive ability, which, as can be seen from the loadings in Table A1, represented all four variables to a comparable degree, and captured a proportion of 0.58 of their variance.
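To make this step concrete, the following R sketch illustrates how such a composite score can be derived with prcomp(); the data frame and the WAIS subtest column names are hypothetical placeholders and do not reproduce the actual variables or scores used in the study.

```r
# Minimal sketch (hypothetical data frame 'data' and WAIS subtest column names):
# derive a composite cognitive ability score as the first principal component.
wais <- data[, c("wais_subtest1", "wais_subtest2", "wais_subtest3", "wais_subtest4")]

# PCA on the standardized subtest scores
pca_cog <- prcomp(wais, center = TRUE, scale. = TRUE)

summary(pca_cog)$importance["Proportion of Variance", "PC1"]  # proportion of variance captured by PC1
pca_cog$rotation[, "PC1"]                                     # loadings of the four variables on PC1

# First principal component scores, used as the 'cognitive ability' predictor
data$cognitive_ability <- pca_cog$x[, "PC1"]
```

The same procedure applies, with three input variables, to the musical experience composite described in Appendix B.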

**Table A1.** Results of the Principal Component Analysis over the data of four variables relating to general cognitive ability.


#### **Appendix B**

Principal Component Analysis was used to generate one composite score to reflect musical experience on the basis of three questions from the questionnaire regarding the number of acquired musical instruments/activities, the earliest age of acquiring a musical instrument/activity, and the duration of musical training in years. In the analyses, we used the first Principal Component as a factor to represent musical experience, which, as can be seen from the loadings in Table A2, represented all three variables to a comparable degree, and captured a proportion of 0.82 of their variance.

**Table A2.** Results of the Principal Component Analysis over the data of three variables relating to general musical experience.


#### **Appendix C**

For each group, we calculated a generalized linear mixed effects model (using the bobyqa optimizer) with condition as fixed factor and participants and items as random factors, but no random slopes because of false convergence. The intercept was set to zero (i.e., suppressed), so that each of the three acoustic conditions (intensity, control, and duration) was compared against chance. Results are provided in Table A3.
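As an illustration, a minimal lme4 sketch of one such per-group model could look as follows; the data frame grouping_data and its column names are hypothetical, and the binary response is assumed to be coded 1 = trochaic and 0 = iambic (consistent with the sign convention described below Table A3).

```r
library(lme4)

# One model per group; shown here for the group with dyslexia
# (hypothetical data frame and column names)
m_dys <- glmer(
  Response ~ -1 + Condition + (1 | participant) + (1 | item),
  data    = subset(grouping_data, group == "dyslexia"),
  family  = binomial,
  control = glmerControl(optimizer = "bobyqa")
)

# Removing the intercept (-1) gives each condition its own estimate, so each
# Wald z-test compares that condition's log-odds of a trochaic response against 0 (chance).
summary(m_dys)
```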


**Table A3.** Results of two generalized mixed effects models (one for the group of adults with dyslexia, and one for the group of adults without dyslexia) to test the groups' preferences against chance in the three acoustic conditions reported in Section 3.3.

<sup>1</sup> Formula: Response ~ −1 + Condition + (1 | participant) + (1 | item). Each line shows the coefficients of the intercept of each of the separate models. Negative β estimates indicate more iambic responses, and positive β estimates indicate more trochaic responses. Level of significance: \* *p* < 0.05, \*\*\* *p* < 0.001.

#### **Appendix D**

The results of the linear regression analysis for testing whether musical rhythm perception ability (MET scores) is predicted by dyslexia (group factor), cognitive ability (first principal component of the WAIS scores) and musical experience (first principal component combining the number of learned musical instruments/activities, age of musical acquisition, and duration of musical training) are provided in Table A4.

**Table A4.** Parameters of the regression analysis of the effects of dyslexia, musical experience and cognitive ability on musical rhythm perception ability.


<sup>1</sup> Formula: lm(Musical rhythm perception ability ~ Group\*Musical experience + Group\*Cognitive ability). Level of significance: \* *p* < 0.05, \*\* *p* < 0.01.

#### **Appendix E**

To test the effects of the factors group and musical rhythm perception ability on the three conditions, while controlling for cognitive ability and musical experience, generalized linear mixed-effects models were built that incrementally increased the number of predictors in a stepwise fashion. Model fits were compared by means of their log-likelihood using the anova() function from the lme4 package [57]. Successive difference contrast coding was used for comparing groups and conditions. This contrast (coded with the contr.sdif() function from the MASS package [85]) assigns the grand mean to the intercept, and the beta coefficients indicate the difference scores between two compared levels. In the case of group, this contrast was specified as subtracting the group with dyslexia from the group without dyslexia. For condition, the contrasts were Duration-Intensity (β reflecting duration minus intensity) and Control-Duration (control minus duration). Both continuous predictors were centered around their mean to reduce collinearity and z-transformed (using the scale() function), as models with untransformed predictors did not converge (mathematically, z-transformation does not affect the results). The model formulas used nested terms, with condition as the outer factor and the predictors group, musical rhythm perception ability, and/or cognitive ability nested within it. This makes it possible to assess the effects of the predictors group, musical rhythm perception ability and cognitive ability on each of the conditions separately.

Random intercepts for participants and items, and random slopes for the condition contrasts by participants, were included. Correlations of the random slopes by participants were removed (using the || syntax), as not all reported models converged when including them, and model comparisons suggested that they did not significantly account for variance. Models including random slopes for cognitive ability, group, and/or musical rhythm perception ability by item did not improve the model fits.

Parameters of the tested models are reported in the tables below. To correct for multiple comparisons, adjusted *p*-values (Bonferroni correction) are reported in the last column (model 2: *p*-values multiplied by 2; model 3: *p*-values multiplied by 3; etc.). Level of significance: \* *p* < 0.05, \*\* *p* < 0.01, \*\*\* *p* < 0.001.
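Before turning to the individual models, the following R sketch outlines the model-building procedure described above; the data frame grouping_data and its column names (including the MET score variable) are hypothetical, and the bobyqa optimizer is assumed here as in Appendix C.

```r
library(lme4)
library(MASS)   # provides contr.sdif()

# Hypothetical data frame; factor levels ordered so that the successive-difference
# contrasts correspond to Duration-Intensity, Control-Duration, and without minus with dyslexia.
grouping_data$Condition <- factor(grouping_data$Condition,
                                  levels = c("intensity", "duration", "control"))
grouping_data$Group     <- factor(grouping_data$Group,
                                  levels = c("dyslexia", "no_dyslexia"))
contrasts(grouping_data$Condition) <- contr.sdif(3)
contrasts(grouping_data$Group)     <- contr.sdif(2)

# Continuous predictor centered and z-transformed
grouping_data$MET_z <- scale(grouping_data$MET)

# Numeric copies of the condition contrasts for the uncorrelated random slopes ('||'),
# matching the slope terms named in the table footnotes
cond_mat <- contr.sdif(3)[as.integer(grouping_data$Condition), ]
grouping_data$condL2v1 <- cond_mat[, 1]   # Duration-Intensity
grouping_data$condL3v2 <- cond_mat[, 2]   # Control-Duration

fit <- function(fml) {
  glmer(fml, data = grouping_data, family = binomial,
        control = glmerControl(optimizer = "bobyqa"))
}

# Models 1-3: condition-wise ('/') effects of group, rhythm ability, and their interaction
m1 <- fit(Response ~ Condition / Group +
            (1 + condL2v1 + condL3v2 || participant) + (1 | item))
m2 <- fit(Response ~ Condition / MET_z +
            (1 + condL2v1 + condL3v2 || participant) + (1 | item))
m3 <- fit(Response ~ Condition / (Group * MET_z) +
            (1 + condL2v1 + condL3v2 || participant) + (1 | item))

# Log-likelihood comparison of the incrementally extended models
anova(m1, m2, m3)
```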


**Table A5.** Output of the first model exploring the effects of group on the three conditions.

<sup>1</sup> Formula: *Response* ~ *Condition*/(*Group*) + (1 + *condL2v1* + *condL3v2* || *participant*) + (1 | *item*). Level of significance: \* *p* < 0.05, \*\* *p* < 0.01, \*\*\* *p* < 0.001.



**Table A6.** Output of the second model exploring the effects of musical rhythm perception ability on the three conditions.

<sup>1</sup> Formula: *Response* ~ *Condition*/(*Musical rhythm perception ability*) + (1 + *Duration-Intensity* + *Control-Duration* || *participant*) + (1 | *item*). Level of significance: \* *p* < 0.05, \*\*\* *p* < 0.001.


**Table A7.** Output of the third model exploring the effects of group and musical rhythm perception ability on the three conditions.

<sup>1</sup> Formula: *Response* ~ *Condition*/(*Group \* Musical rhythm perception ability*) + (1 + *Duration-Intensity* + *Control-Duration* || *participant*) + (1 | *item*). Level of significance: \* *p* < 0.05, \*\*\* *p* < 0.001.

**Table A8.** Output of the fourth model that extends model 3 by cognitive ability as control variable, which did not improve the model fit.


<sup>1</sup> Table A8 reports the results of Model 4, which extended Model 3 by adding cognitive ability as a control variable (Formula: *Response* ~ *Condition*/(*Group*\**Musical rhythm perception ability* + *Cognitive ability*) + (1 + *Duration-Intensity* + *Control-Duration* || *participant*) + (1 | *item*)). Level of significance: \* *p* < 0.05, \*\* *p* < 0.01, \*\*\* *p* < 0.001. As in Model 3, the results of Model 4 suggest significant effects of musical rhythm perception ability on the duration (*p* = 0.02) and the control condition (*p* = 0.008). Intensity was, just as in Model 3, not modulated by musical rhythm perception ability, and there were, again, no interactions of any condition with group, nor any three-way interactions of any condition with group and musical rhythm perception ability. Altogether, there were no significant effects of cognitive ability. Model comparisons revealed that Model 4 was not better than Model 3 (χ<sup>2</sup> = 8.05, *p* = 0.23).

**Table A9.** Output of the fifth model that extends model 3 by musical experience as control variable, which did not improve the model fit.


<sup>1</sup> Formula: *Response* ~ *Condition*/(*Group \* Musical rhythm perception ability* + *Musical experience*) + (1 + *Duration-Intensity* + *Control-Duration* || *participant*) + (1 | *item*). Level of significance: \* *p* < 0.05, \*\*\* *p* < 0.001. Table A9 reports the results of Model 5, which extended Model 3 by adding musical experience as a control variable. The results of Model 5 match those of Model 3, suggesting the same highly significant effects of musical rhythm perception ability on the duration (*p* = 0.003) and the control condition (*p* = 0.001). Altogether, there were no significant effects of musical experience. Model comparisons revealed that Model 5 was not better than Model 3 (χ<sup>2</sup> = 8.82, *p* = 0.18).

#### **Appendix F**

The results of the linear regression analysis for testing the effects of dyslexia (group factor), musical rhythm perception ability (MET scores), and cognitive ability (first principal component of the WAIS scores) on reading ability (SLRT nonword reading scores) are provided in Table A10.

**Table A10.** Parameters of the regression analysis of the effects of dyslexia, musical rhythm perception ability and cognitive ability on reading ability.


<sup>1</sup> Formula: lm(Reading ability ~ Group\*Musical rhythm perception ability + Group\*Cognitive ability + Group\*Musical experience). Level of significance: \*\*\* *p* < 0.001.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Communication*
