**Neurophysiological Markers of Statistical Learning in Music and Language: Hierarchy, Entropy and Uncertainty**

#### **Tatsuya Daikoku**

Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, 04103 Leipzig, Germany; daikoku@cbs.mpg.de; Tel.: +81-5052157012

Received: 10 May 2018; Accepted: 18 June 2018; Published: 19 June 2018

**Abstract:** Statistical learning (SL) is a method of learning based on the transitional probabilities embedded in sequential phenomena such as music and language. It has been considered an implicit and domain-general mechanism that is innate in the human brain and that functions independently of intention to learn and awareness of what has been learned. SL is an interdisciplinary notion that incorporates information technology, artificial intelligence, musicology, and linguistics, as well as psychology and neuroscience. A body of recent studies has suggested that SL can be reflected in neurophysiological responses based on the framework of information theory. This paper reviews a range of work on SL in adults and children that suggests overlapping and independent neural correlates in music and language, and that documents impairments of SL. Furthermore, this article discusses the relationships between the order of transitional probabilities (TPs) (i.e., hierarchy of local statistics) and entropy (i.e., global statistics) with regard to SL strategies in the human brain; argues for the importance of information-theoretical approaches to understanding domain-general, higher-order, and global SL covering both real-world music and language; and proposes promising approaches for the application of SL to therapy and pedagogy from various perspectives of psychology, neuroscience, computational studies, musicology, and linguistics.

**Keywords:** statistical learning; implicit learning; domain generality; information theory; entropy; uncertainty; order; *n*-gram; Markov model; word segmentation

#### **1. Introduction**

The brain is a learning system that adapts to multiple external phenomena existing in its living environment, including various types of input such as auditory, visual, and somatosensory stimuli, and various learning domains such as music and language. By means of this wide-ranging system, humans can comprehend structured information, express their own emotions, and communicate with other people [1]. According to linguistic [2,3] and musicological studies [4,5], music and language have domain-specific structures including universal grammar, tonal pitch spaces, and hierarchical tension. Neurophysiological studies likewise suggest that there are specific neural bases for language [6,7] and music comprehension [8,9]. Nevertheless, a body of research suggests that the brain also possesses a domain-general learning system, called statistical learning (SL), that is partially shared by music and language [10,11]. SL is a process by which the brain automatically calculates the transitional probabilities (TPs) of sequential phenomena such as music and language, grasps information dynamics without an intention to learn or awareness of what we know [12,13], and further continually updates the acquired statistical knowledge to adapt to the variable phenomena in our living environments [14]. Some researchers also indicate that the sensitivity to statistical regularities in sequences could be a by-product of chunking [15].

The SL phenomenon is partially supported by a unified brain theory [16]. This theory aims to provide a unified account of action, perception, and learning under a free-energy principle [17,18], drawing on key brain theories from the biological (e.g., neural Darwinism), physical (e.g., information theory), and neurophysiological (e.g., predictive coding) sciences. This suggests that several brain theories might be unified within a free-energy framework [19], although its capacity to unify different perspectives has yet to be established. The theory proposes that the brain models phenomena in its living environment as a hierarchy of dynamical systems that encode a causal chain structure in the sensorium to maintain low entropy [16], and predicts a future state based on the internalized model to minimize sensory reactions and optimize motor action. This prediction is in keeping with the theory of SL in the brain. That is, in SL theory, the brain models sequential phenomena based on TP distributions, grasps the entropy of whole sequences, and predicts a future state based on the internalized stochastic model in the framework of predictive coding [20] and information theory [21]. SL also occurs in action sequences [22,23], suggesting that SL could contribute to the optimization of motor action.

SL is considered an implicit and ubiquitous process that is innate in humans, yet not unique to humans, as it is also found in monkeys [24,25], songbirds [26,27], and rats [28]. The terms implicit learning and SL have been used interchangeably and are regarded as the same phenomenon [15]. A neurophysiological study [29] has suggested that conditional probabilities in the Western music corpus are reflected in the music-specific neural responses referred to as early right anterior negativity (ERAN) in event-related potential (ERP) [8,9]. The corpus study also found statistical universals in music structures across cultures [30,31]. These findings also suggest that musical knowledge may be at least partially acquired through SL. Our recent studies have also demonstrated that the brain codes the statistics of auditory sequences as relative information, such as relative distribution of pitch and formant frequencies, and that this information can be used in the comprehension of other sequential structures [10,32]. This suggests that the brain does not have to code and accumulate all received information, and thus saves some memory capacity [33]. Thus, from the perspective of information theory [21], the brain's SL is systematically efficient.

As a result of the implicit nature of SL, however, humans cannot verbalize exactly what they statistically learn. Nonetheless, a body of evidence indicates that neurophysiological and behavioural responses can unveil musical and linguistic SL effects [14,32,34–44] in the framework of predictive coding [20]. Furthermore, recent studies have detected the effects of musical training on linguistic SL of words [41,43,45–47] and the interactions between musical and linguistic SL [10] and between auditory and visual SL [44,48–50]. On the other hand, some studies have also suggested that SL is impaired in humans with domain-specific disorders such as dyslexia [51–53] and amusia [54,55], disorders that affect linguistic and music processing, respectively (though Omigie and Stewart (2011) [56] have suggested that SL is intact in congenital amusia). Thiessen et al. [57] suggested that a complete understanding of statistical learning must incorporate two interdependent processes: one is the extraction process that computes TPs (i.e., local statistics) and extracts each item, as in word segmentation, and the other is the integration process that computes distributional information (i.e., summary statistics) and integrates information across the extracted items. Entropy and uncertainty (i.e., summary statistics), as well as TPs, are used to understand the general predictability of sequences in domain-general SL that could cover music and language in the interdisciplinary realms of neuroscience, behavioral science, modeling, mathematics, and artificial intelligence. Recent studies have suggested that SL strategies in the brain depend on the hierarchy, order [14,35,58,59], entropy, and uncertainty in statistical structures [60]. Hasson et al. [61] also indicated that certain regions or networks perform specific computations of global or summary statistics (i.e., entropy), which are independent of local statistics (i.e., TP). Furthermore, neurophysiological studies have suggested that sequences with higher entropy are learned based on higher-order TPs, whereas those with lower entropy are learned based on lower-order TPs [59]. Thus, information-theoretical and neurophysiological concepts of SL are considered to be linked to each other [62,63]. The integrated approach of neurophysiology and informatics based on the notions of the order of TP and entropy can shed light on linking concepts of SL among a broad range of disciplines. Although there have been a number of studies on SL in music and language, few studies have examined the relationships between the "order" of TPs (i.e., the order of local statistics) and entropy (i.e., summary statistics) in SL. This article focuses on three themes in SL from the viewpoint of information theory, as well as neuroscience: (1) a mathematical interpretation of SL that can cover music and language and the experimental paradigms that have been used to verify SL; (2) the neural basis underlying SL in adults and children; and (3) the applicability of SL to therapy and pedagogy for humans with learning disabilities and for healthy humans.

#### **2. Mathematical Interpretation of Brain SL Process Shared by Music and Language**

#### *2.1. Local Statistics: Nth-Order Transitional Probability*

According to SL theory, the brain automatically computes TP distributions in sequential phenomena (local statistics) [35], grasps uncertainty/entropy in the whole sequences (global statistics) [61], and predicts a future state based on the internalized statistical model to minimize sensory reaction [16,20]. The TP is a conditional probability of an event B given that the latest event A has occurred, written as P(B|A). The TP distributions sampled from sequential information such as music and language are often expressed by *n*th-order Markov models [64] or *n*-gram models [21] (Figure 1). Although the terminology of *n*-gram models has frequently been used in natural language processing, it has also recently been used in music models [65,66]. They have often been applied to develop artificial intelligence that gives computers learning abilities similar to those of the human brain, thus generating systems for data mining, automatic music composition [67–69], and automatic text classification in natural language processing [70,71]. The mathematical model of SL including *n*th-order Markov and (*n* + 1)-gram models is the conditional probability of an event *en*+1, given the preceding n events based on Bayes' theorem:

$$P(e_{n+1} \mid e_n) = P(e_{n+1} \cap e_n) / P(e_n) \tag{1}$$

From the viewpoint of psychology, the formula can be interpreted as positing that the brain predicts a subsequent event *en*+1 based on the preceding events *en* in a sequence. In other words, learners expect the event with the highest TP based on the latest n states, whereas they are likely to be surprised by an event with lower TP (Figure 2).
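To make this concrete, the following minimal sketch (in Python; the toy melody and function names are illustrative assumptions, not materials from any study cited here) estimates *n*th-order TPs by counting (*n* + 1)-grams in a symbol sequence:

```python
from collections import defaultdict

def transition_probabilities(sequence, order=1):
    """Estimate nth-order transitional probabilities P(e_{n+1} | e_n)
    by counting (n+1)-grams in a symbol sequence (an illustrative sketch)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(sequence) - order):
        context = tuple(sequence[i:i + order])   # the preceding n events
        nxt = sequence[i + order]                 # the event to be predicted
        counts[context][nxt] += 1
    return {ctx: {e: c / sum(row.values()) for e, c in row.items()}
            for ctx, row in counts.items()}

# Second-order example in the spirit of Figure 1b: given the context (C4, D4),
# E4 follows with probability 1 in this toy melody.
melody = ["C4", "D4", "E4", "F4", "C4", "D4", "E4", "F4"]
print(transition_probabilities(melody, order=2)[("C4", "D4")])
```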

**Figure 1.** Example of *n*-gram and Markov models in statistical learning (SL) of language (**a**) and music (**b**) based on information theory. The top rows are examples of sequences, and the rows below explain how to calculate TPs (*P*(*en*+1|*en*)) based on zero- to second-order Markov models. They are based on the conditional probability of an event *en*+1, given the preceding *n* events, according to Bayes' theorem. For instance, in language ((**a**), "This is a sentence"), the second-order Markov model represents that "a" can be predicted based on the two preceding words "This" and "is". In music ((**b**), C4, D4, E4, F4), the second-order Markov model represents that "E" can be predicted based on the two preceding tones "C" and "D".

**Figure 2.** SL models and the sequences used in neural studies. All of the models and paradigms in sequences based on concatenation of words (**a**), Markov model of tone (**b**) and word (**c**), and concatenation of words with different TPs of the last stimuli in words (**d**) are simplified so that the characteristics of paradigms can be compared. In the example of word-segmentation paradigm (**a**), the same words do not successively appear. TP—transitional probability.

#### *2.2. Global Statistics: Entropy and Uncertainty*

SL models are sometimes evaluated in terms of entropy [72–75] in the framework of information theory, as done by Shannon [21]. Entropy can be calculated from probability distribution, interpreted as the average surprise (uncertainty) of outcomes [16,76], and used to evaluate the neurobiology of SL [60], as well as rule learning [77], decision making [78], anxiety, and curiosity [79,80] from the perspective of uncertainty. For instance, the conditional entropy (H(B|A)) in the *n*th order TP distribution (hereafter, Markov entropy) can be calculated from information contents:

$$H(X_{i+1} \mid X_i) = -\sum_{x_i} P(x_i) \sum_{x_{i+1}} P(x_{i+1} \mid x_i) \log_2 P(x_{i+1} \mid x_i) \tag{2}$$

where *H*(*Xi*+1|*Xi*) is the Markov entropy; *P*(*xi*) is the probability of event *xi* occurring; and *P*(*xi*+1|*xi*) is the probability of event *xi*+1, given that *xi* has occurred. Previous articles have suggested that the degree of Markov entropy modulates human predictability in SL [61,81]. The uncertainty (i.e., global/summary statistics), as well as the TP (i.e., local statistics), of each event is applicable to many types of sequential distributions, such as music and language, and may be used to understand the predictability of a sequence (Figure 3). Indeed, entropy and uncertainty are often used to understand domain-general SL in the interdisciplinary realms of neuroscience, behavioural science, modeling, mathematics, and artificial intelligence.
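As an illustration of Equation (2), the following sketch (assuming a user-supplied first-order transition table and marginal probabilities; not code from any cited study) computes the Markov entropy of a TP distribution:

```python
import math

def markov_entropy(tp, marginal):
    """Conditional entropy H(X_{i+1} | X_i) of a first-order TP distribution,
    following Equation (2): a weighted average, over contexts, of the entropy
    of each row of the transition table (illustrative sketch)."""
    h = 0.0
    for x_i, row in tp.items():
        for p in row.values():
            if p > 0:
                h -= marginal[x_i] * p * math.log2(p)
    return h

# A deterministic transition structure (entropy 0) vs. a maximally uncertain one (1 bit).
deterministic = {"A": {"B": 1.0}, "B": {"A": 1.0}}
uniform = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
marg = {"A": 0.5, "B": 0.5}
print(markov_entropy(deterministic, marg))  # 0.0 bits
print(markov_entropy(uniform, marg))        # 1.0 bit
```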

**Figure 3.** The entropy (uncertainty) of predictability in the framework of SL. The uncertainties depend on (**a**) TP ratios in a first-order Markov model (i.e., bigram model) and (**b**) orders of models in the TP ratio of 10% vs. 90%.

#### *2.3. Experimental Designs of SL in Neurophysiological Studies*

The word segmentation paradigm is frequently used to examine the neural basis underlying SL (e.g., [34,41,43,44,46,82–96]). This paradigm basically consists of a concatenation of pseudo-words (Figure 2a). In the pseudo-word sequence, the TP distributions based on a first-order Markov model show lower TPs for the "first" stimulus of each word (Figure 2a: P(B|A), P(C|B), and P(A|C)) than for the other stimuli of a word (Figure 2a: P(C|A), P(A|A), P(A|B), P(B|B), P(B|C), and P(C|C)). When the brain statistically learns the sequences, it can identify the boundaries between words based on first-order TPs (Figure 2a) [97,98] and segment/extract each word. The SL of word segmentation based on first-order TPs has been considered a mechanism for language acquisition in the early stages of language learning, even in infancy [12]. Recent studies have also demonstrated that SL can be performed based on within-word, as well as between-word, TPs (e.g., [40,98]; see Figure 2d). Although a number of studies have used a word segmentation paradigm consisting of words with a regular unit length (typically, three stimuli within a word), previous studies suggest that the unit length of words [99], the order of TPs [59], and the nonadjacent dependencies of TPs in sequences (e.g., [14,100–102]; see Figure 2c) can modulate the SL strategy used by the brain. Indeed, natural languages and music make use of higher-order statistics, including hierarchical, syntactical structures. To understand the brain's higher-order SL systems in a form closer to that used for natural language and music, sequential paradigms based on higher-order Markov models have also been used in neurophysiological studies (e.g., [32,35,103]; see Figure 2b). Furthermore, the *n*th-order Markov model has been applied to develop artificial intelligence that gives computers learning and decision-making abilities similar to those of the human brain, thus generating systems for automatic music composition [67–69] and natural language processing [70,71]. Information-theoretical approaches, including information content and entropy based on *n*th-order Markov models, may be useful in understanding domain-general SL as it functions in response to real-world learning phenomena in the interdisciplinary realms of brain and computational sciences.
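For illustration, a stream of the kind sketched in Figure 2a could be generated and checked as follows (a hypothetical sketch; the pseudo-word syllables are arbitrary placeholders rather than stimuli from any cited experiment):

```python
import random
from collections import defaultdict

# Three tri-syllabic pseudo-words, as in typical word-segmentation paradigms.
WORDS = [("tu", "pi", "ro"), ("go", "la", "bu"), ("da", "ko", "ti")]

def make_stream(n_words=300, seed=0):
    """Concatenate randomly ordered pseudo-words, avoiding immediate repetition,
    so that within-word TPs are 1.0 and between-word TPs are about 0.5."""
    rng = random.Random(seed)
    stream, last = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in WORDS if w is not last])
        stream.extend(word)
        last = word
    return stream

def first_order_tp(stream):
    """First-order TPs P(next | current) estimated from syllable counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(stream, stream[1:]):
        counts[cur][nxt] += 1
    return {cur: {s: c / sum(row.values()) for s, c in row.items()}
            for cur, row in counts.items()}

tp = first_order_tp(make_stream())
print(tp["pi"]["ro"])   # within-word transition: 1.0
print(tp["ro"])         # word boundary: probability split across word onsets
```

The dip in TP at word boundaries is the cue that, according to the studies reviewed above, allows listeners to segment the stream.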

#### **3. Neural Basis of Statistical Learning**

#### *3.1. Event-Related Responses and Oscillatory Activity*

The ERP and event-related magnetic field (ERF) modalities directly measure brain activity during SL and represent a more sensitive method than the observation of behavioral effects [40,41,104]. Based on predictive coding [20], when the brain encodes the TP distributions of a stimulus sequence, it expects a probable future stimulus with a high TP and inhibits the neural response to predictable external stimuli for efficiency of neural processing. Consequently, the effects of SL manifest as a difference in ERP and ERF amplitudes between stimuli with lower and higher TPs (Figure 4). Although many studies of word segmentation detected SL effects on the N400 component [43,46,88,89,93,94,105], which is generally considered to reflect semantic meaning in language and music [106–108], the auditory brainstem response (ABR) [96], P50 [41], N100 [94], mismatch negativity (MMN) [40,44,98], P200 [46,89,105], N200–250 [44,47], and P300 [83] have also been reported to reflect SL effects (Table 1). In addition, other studies using Markov models have also reported that SL is reflected in the P50 [14,36,37], N100 [10,14,32,35], and P200 components [35]. Compared with later auditory responses such as the N400, the auditory responses that peak earlier than 10 ms after stimulus presentation (e.g., ABR) and at 20–80 ms, around the P50 latency, have been attributed to parallel thalamo–cortical connections or cortico–cortical connections between the primary auditory cortex and the superior temporal gyrus [109]. Thus, the suppression of an early component of auditory responses to stimuli with a higher TP in lower cortical areas can be interpreted as the transient expression of prediction error that is suppressed by predictions from higher cortical areas in a top-down connection [96]. Therefore, top-down, as well as bottom-up, processing in SL may be reflected in ERP/ERF. On the other hand, SL effects on the N400 have been detected in word-segmentation tasks, but not with Markov models. The TPs of a word-segmentation task are calculated based on first-order models (Figure 2a). In other words, in terms of the "order" of TP, SL of word segmentation (i.e., a sequence consisting of word concatenation) and SL of a first-order Markov model involve the same hierarchy of TP. Nevertheless, SL studies using the first-order Markov model did not detect learning effects on the N400 (Table 1). The phenomenon of word segmentation itself has been considered a mechanism of language acquisition in the early stages of language learning [12]. Several papers claim that the sensitivity to statistical regularities in sequences of word concatenation could be a by-product of chunking [15]. Neurophysiological effects of word segmentation such as the N400, which reflects semantic meaning in language [106–108], may therefore be associated with the neural basis underlying linguistic functions, as well as with statistical computation itself. On the other hand, our previous study using the first-order Markov model [36] may have failed to detect the N400 because of the stimulus onset asynchrony of the sequences (i.e., 500 ms). Future studies are needed to verify SL effects on the N400 using Markov models.
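One way the contrast between low- and high-TP stimuli might be operationalized, purely as an illustration (the function, toy TP table, and event labels are assumptions, not the analysis pipeline of any study cited above), is to assign each stimulus its information content and then bin ERP/ERF epochs accordingly:

```python
import math

def surprisal(sequence, tp, order=1):
    """Information content -log2 P(event | context) for each event, given an
    nth-order TP table; high-surprisal (low-TP) events are the ones expected
    to evoke larger ERP/ERF amplitudes (illustrative sketch)."""
    values = []
    for i in range(order, len(sequence)):
        context = tuple(sequence[i - order:i])
        p = tp.get(context, {}).get(sequence[i], 0.0)
        values.append(float("inf") if p == 0 else -math.log2(p))
    return values

# Toy first-order TP table: after "A", "B" is likely (0.9) and "C" is rare (0.1).
tp = {("A",): {"B": 0.9, "C": 0.1}, ("B",): {"A": 1.0}, ("C",): {"A": 1.0}}
events = ["A", "B", "A", "C", "A", "B"]
print(surprisal(events, tp))  # small values for A->B, a large value for A->C
```

Epochs time-locked to high- versus low-surprisal events could then be averaged separately and compared, as in the SL-effect contrast shown in Figure 4b.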

**Table 1.** Overview of neurophysiological correlates of auditory statistical learning. TP—transitional probability; ABR—auditory brainstem response; MMN—mismatch negativity; STS—superior temporal sulcus; STG—superior temporal gyrus; IFG—inferior frontal gyrus; PMC—premotor cortex; PTC—posterior temporal cortex.


**Figure 4.** Representative equivalent current dipole (ECD) locations (dots) and orientations (bars) for the N100 m responses superimposed on the magnetic resonance images (**a**) (Daikoku et al., 2014 [32]; and the SL effects (**b**) (Daikoku et al., 2015 [10]) (NS = not significant). When the brain encodes the TP in a sequence, it expects a probable future stimulus with a high TP and inhibits the neural response to predictable stimuli. In the end, the SL effects manifest as a difference in amplitudes of neural responses to stimuli with lower and higher TPs (**b**).

It has been suggested that SL could also be reflected in oscillatory responses in the theta band [115,116]. Moreover, in both the human and monkey auditory cortices, a neural marker of predictability based on SL appears in the form of modulations of transient theta oscillations coupled with gamma oscillations and concomitant effects [25], suggesting that SL processes are unlikely to have evolved convergently and are not unique to humans. According to previous studies, low-frequency oscillations may play an important role in speech segmentation associated with SL [73] and in tracking the envelope of the speech signal, whereas high-frequency oscillations are fundamentally involved in tracking the fine structure of speech [117]. Furthermore, there is evidence of top-down effects in low-frequency oscillations during listening to speech (up to the beta band: 15–30 Hz), whereas bottom-up processing dominates in higher frequency bands [118]. Studies on the auditory oddball paradigm have also demonstrated that the power and/or coherence of theta oscillations to low-probability sounds is increased relative to high-probability sounds. Thus, many studies suggest that lower-frequency oscillations, including the theta band, are related to prediction error [119]. Top-down predictions also control the coupling between speech and low-frequency oscillations in the left frontal areas, most likely in the speech motor cortex [120]. Given that low-frequency oscillations could underlie the ERP components that have been suggested to reflect SL effects, studies on oscillation and prediction imply the importance of investigating SL effects on oscillatory responses, as well as on ERPs.

#### *3.2. Anatomical Mechanisms*

#### 3.2.1. Local Statistics: Transitional Probability

Neuroimaging studies have indicated that both cortical and subcortical areas play an important role in SL. For instance, the auditory association cortex, including the superior temporal sulcus (STS) [91] and superior temporal gyrus (STG) [110], contributes to auditory SL of both speech and non-speech sounds. Previous studies have also reported effects of laterality on SL. For instance, functional magnetic resonance imaging (fMRI) [121] and near-infrared spectroscopy (NIRS) [111] studies have suggested that SL is linked to the left auditory association cortex or the left inferior frontal gyrus (IFG) [112,122], which include Wernicke's and Broca's areas, respectively. Furthermore, one previous study has indicated that brain connectivity between bilateral superior temporal sources and the left IFG is important for auditory SL [45]. On the other hand, another study has shown that the right posterior temporal cortex (PTC), which represents the higher levels of the peri-Sylvian auditory hierarchy, is related to higher-order auditory SL [35] (i.e., second-order TPs). Further studies will be needed to examine the relationships between the order of TPs in sequences and the neural correlates that depend on the order of TPs and the hierarchy of SL.

Some studies have suggested that the sensory type of each stimulus modulates the neural basis underlying SL. For instance, some previous studies have suggested that the right hemisphere contributes to visual SL [123]. Paraskevopoulos and colleagues [50] revealed that the cortical network underlying audiovisual SL was partly common with and partly distinct from the unimodal networks of visual and auditory SL, comprising the right temporal and left inferior frontal sources, respectively. fMRI studies have also reported that Heschl's gyrus and the medial temporal lobe [124] contribute to auditory and visual SL, respectively [113], and that motor cortex activity also contributes to visual SL of action words [22]. Furthermore, Cunillera et al. [88] have suggested that the superior part of the ventral premotor cortex (PMC), as well as the posterior STG, are responsible for SL of word segmentation, suggesting that linguistic SL is related to an auditory–motor interface. Another study has suggested that the abstraction of acquired statistical knowledge is associated with a gradual shift from memory systems in the medial temporal lobe, including the hippocampus, to those of the striatum, and that this may be mediated by slow wave sleep [125].

#### 3.2.2. Global Statistics: Entropy

Perceptive mechanisms of summary structure (i.e., global statistics) are considered to be independent of the prediction of each stimulus with different TPs (local statistics) [57,61]. Recent studies have examined the brain systems that are responsible for encoding the uncertainty of global statistics in sequences by comparing brain activities while listening to Markov/word-concatenation and random sequences, which have lower and higher entropies, respectively. Regardless of whether music or language is assessed, the hippocampus and the lateral temporal region [88], including Wernicke's area [114], are considered to play important roles in encoding the uncertainty and conditional entropy of statistical information [60]. Bischoff-Grethe et al. have also indicated that Wernicke's area may not be exclusively associated with uncertainty of language information [114]. Furthermore, uncertainty in auditory and visual statistics is coded by modality-general, as well as modality-specific, neural mechanisms [126,127], supporting the hypothesis that the neural basis underlying the brain's perception of global statistics (i.e., uncertainty), as well as local statistics (i.e., prediction of each stimulus with different TPs), is a domain-general system. Our previous neural study also suggested that reorganization of acquired statistical knowledge requires more time than the acquisition of new statistical knowledge, even if the new and previously acquired information sets have equivalent entropy levels [14]. Furthermore, the results suggested that humans learn larger structures, such as phrases, first and subsequently extract smaller structures, such as words, from the learned phrases (a global-to-local learning strategy). To the best of our knowledge, however, no study has yet demonstrated the differences and interactions between the neural bases of global and local statistics. Further study is needed to reveal how the coding of global statistics affects that of local statistics.

#### **4. Clinical and Pedagogical Viewpoints**

#### *4.1. Disability*

Although SL is a domain-general system, some studies have reported that SL is impaired in domain-specific disabilities such as dyslexia [51–53] and amusia [54,55], which are language- and music-related disabilities, respectively. Ayotte and colleagues [128] have suggested that individuals with congenital amusia fail at musical SL but can perform linguistic SL, even if the sequences of both types have the same degree of statistical regularity [54]. Another study has suggested, in contrast, that SL is intact in amusia [56], and that individuals with amusia lack confidence in their SL ability, although they can engage in SL of music. Peretz et al. [54] stated that the input and output of the statistical computation might be domain-specific, whereas the learning mechanism might be domain-general. Furthermore, previous studies have indicated that SL ability is impaired in patients with damage to a specific area of the brain. For instance, SL is impaired in connection with hippocampal [129] and right-hemisphere damage [130]. Indeed, it has been suggested that the hippocampus plays an important role in SL [124]. One recent study indicated that auditory deprivation leads to impairment of not only auditory SL [131] but also visual SL [132]. This implies that there may be specific neural mechanisms for SL that can be shared among distinct sensory modalities. Another study [133], however, suggested that a period of early deafness is not associated with SL disability. Further study is needed to clarify whether SL disability is related to temporary auditory deprivation.

#### *4.2. Music-to-Language Transfer*

#### 4.2.1. Neural Underpinnings of SL That Overlap across Music and Language Processing

Because of the acoustic similarity [134], cortical overlap [135,136], and domain generality of SL across language and music, listeners experienced with particular spectrotemporal acoustic features, such as rhythm and pitch, in either speech or music have an advantage when perceiving similar features in the other domain [137]. According to neural studies, musical training leads to differences in gray matter concentration in the auditory cortex [138] and a larger planum temporale (PT) [139–143], the region where both language and music are processed. An ERP study has demonstrated that both the linguistic and the musical effects of SL on the N100–P200 response, which could originate in the belt and parabelt auditory regions [144,145], were larger in musicians than in non-musicians [46]. Thus, the increased PT volume associated with musical training may facilitate auditory processing in SL. A magnetoencephalographic (MEG) study also reported that the effect of SL on the P50 response was larger in musicians than in non-musicians [41], suggesting that musical training also boosts corticofugal projections in a top-down manner in terms of predictive coding [96].

Musical training could also facilitate the effects of SL on N400 [46], which is considered to be associated with IFG and PMC [88]. According to the results of a neural study, musicians have an increased gray matter density of the left IFG (i.e., Broca's area) and PMC [146]. Other studies have suggested that, during SL of word segmentation, musicians exhibit increased left-hemispheric theta coherence in the dorsal stream projecting from the posterior superior temporal (pST) and inferior parietal (IP) brain regions toward the prefrontal cortex, whereas non-musicians show stronger functional connectivity in the right hemisphere [115]. An MRI study also demonstrated that SL of word segmentation leads to pronounced left-hemisphere activity of the supratemporal plane, IP lobe, and Broca's area [147]. Thus, the left dorsal stream is considered to play an important role in SL, as well as language [7] and music learning [148].

The SL of word segmentation plays an important role in various speech abilities. Recent studies have revealed a strong link between SL of word segmentation and more general linguistic proficiency, such as expressive vocabulary [149] and foreign language skills [150]. An fMRI study [151] has suggested that, during SL of word segmentation, participants showing strong SL effects for a familiar language on which they had been pretrained exhibited decreased recruitment of fronto-subcortical and posterior parietal regions, as well as a dissociation between downstream regions and early auditory cortex, whereas participants showing strong SL effects for a novel language to which they had never been exposed showed the opposite trend. Furthermore, children with language disorders perform poorly when compared with typically developing children in tasks involving musical metrical structures [152], and have more difficulty in SL of word segmentation [153] and perception of speech rhythms [154,155]. Thus, musical training, including rhythm perception and production, is important for the development of language skills in children. Together, a body of studies indicates that musical expertise may transfer to language learning [104]. It is generally considered that the left auditory cortex is more sensitive to temporal information, such as musical beat and the voice-onset time (VOT) of consonant-vowel (CV) syllables, whereas the right auditory cortex plays a role in spectral perception, such as pitch and vowel discrimination. Recent studies have indicated relationships between rhythm perception and SL [156].

Recent neural studies have demonstrated that SL of speech, pitch, timbre, and chord sequences can be performed and reflected in ERP/ERF [10,36,37,40,46]. Furthermore, the brain codes statistics of auditory sequences as relative information, such as relative distribution of pitch and formant frequencies, which could be used for comprehension of another sequential structure [10,32], suggesting that SL is ubiquitous and domain-general. On the other hand, the relative importance of acoustic features such as rhythm, pitch, intensity, and timbre varies depending on the domain, that is, music or language [157]. For instance, unlike spoken language, music contains various pitch frequencies. Recent studies have suggested that, compared with speech sequences, sung sequences with various pitches facilitate auditory SL based on word segmentation [92] and the Markov model [10]. These results further support the advantage of musical training for language SL. In addition, Hansen and colleagues have suggested that musical training also facilitates the hippocampal perception of global statistics of entropy (i.e., uncertainty) [158], as well as local statistics of each TP. Thus, musical training contributes to the improvement of SL systems in various brain regions, including the auditory cortex. Together, the facilitation of SL may be related to enhancement of the left dorsal stream via the IFG and PMC, as well as PT, enhanced low-level auditory processing in a top-down manner, and enhanced hippocampal processing. Musical training including rhythm perception contributes to these enhancements and facilitates the involvement of SL in language skills, and thus could be an important clinical and pedagogical strategy in persons with any of a variety of language-related disorders such as dyslexia [159,160] and aphasia [161].

#### 4.2.2. Children and Adults: Critical Periods and Plasticity in the Brain

Previous studies have demonstrated that auditory SL can be performed even by sleeping neonates [85,86,162]. SL is thus ubiquitously performed from birth, showing that the human brain is innately prepared for it. An infant's SL extends to rhythms [163], visual stimuli [164], objects [165], and social learning [23,166], serving as a general mechanism by which infants form meaningful representations of the environment [167]. Furthermore, infants can also learn non-adjacent statistics [101]. This suggests that SL plays an important role in an infant's syntactic learning, as well as in the simple segmentation of words. These results may enable us to disentangle the respective contributions of nature and nurture in the acquisition of language and music. On the other hand, an MEG study has suggested that the strategies for language acquisition in infants could shift from domain-general SL to domain-specific processing of native language between 6 and 12 months [116], a "critical period" for language acquisition [168]. A comparable developmental change from domain-general to domain-specific learning strategies can also occur in music perception [169]. During the "critical period" of heightened plasticity, the brain is shaped by sensory experience [170–172]. The development of primary cortical acoustic representations can be shaped by the higher-order TP of stimulus sequences [58]. An ERP study [173] suggested that sensitivity to speech stimuli in infants gradually shifts from accentuation to repetition during a critical period. These results may suggest that cortical reorganization depending on early experience interacts with SL [174], and that fluctuations in the degree of dependence on SL for the acquisition of language and music are part of the developmental process during critical periods. On the other hand, the SL system in the brain is preserved even in adults (e.g., [32,35,40,41]). According to previous studies, neural plasticity can occur in adults through SL [175] and musical training [176]. In fact, there is no doubt that SL occurs in adults who are already beyond the critical periods, and that their SL ability can be modulated by auditory training. Recent studies have revealed that the process of reorganization of acquired statistical knowledge can be detected in neurophysiological responses [14]. Furthermore, a computational study on music suggested the possibility that the time-course variation of statistical knowledge over a composer's lifetime can be reflected in that composer's music from different life stages [177]. Thus, implicit updates of statistical knowledge could be tracked through a combined and interdisciplinary approach of brain, behavioral, and computational methodologies [178].

#### **5. General Discussion**

#### *5.1. Information-Theoretical Notions for Domain-General SL: Order of TP and Entropy*

SL is a domain-general and interdisciplinary notion in psychology, neuroscience, musicology, linguistics, information technology, and artificial intelligence. To generate SL models that are applicable to all of these various realms, the *n*th-order Markov and *n*-gram models based on information theory have frequently been used in natural language processing [70,71] and in the creation of automatic music composition systems [67–69]. Such models can verify hierarchies of SL based on various-order TPs. Natural languages and music include higher-order statistics, such as hierarchical syntactical structures and grammar. Thus, information-theoretical approaches, including information content and entropy based on *n*th-order Markov models [59,61,81], can express domain-general statistical structures closer to those of real-world language and music. SL models are often evaluated in terms of entropy [72–75]. From a psychological viewpoint, entropy is interpreted as the average surprise (uncertainty) of outcomes [16,76]. Previous studies have demonstrated that the perception of entropy and uncertainty based on SL could be reflected in neurophysiological responses [59] and activity of the hippocampus [60]. Hasson et al. [61] indicated that certain regions or networks perform specific computations of global or summary statistics (i.e., entropy), which are independent of local statistics (i.e., TP). Furthermore, Thiessen and colleagues [57] proposed that a complete understanding of statistical learning must incorporate two interdependent processes: one is the extraction process that computes TPs and extracts each item, as in word segmentation, and the other is the integration process that computes distributional information and integrates information across the extracted items. Our previous studies [59] investigated the correlations among entropy, the order of TP, and the SL effect, and found that the SL effects for sequences with higher entropy were lower than those for sequences with lower entropy, even when the TPs themselves were the same between the two sequences. This suggests that an entropy-based evaluation of computational models of sequential information in the field of informatics may partially predict learning effects in the human brain. Thus, the integrated methodology of neurophysiology and informatics based on the notion of entropy can shed light on linking the concept of SL among a broad range of disciplines. To understand the domain-general SL system that incorporates notions from both information theory and neuroscience, it is important to investigate both global and local SL.

#### *5.2. Output of Statistical Knowledge: From Learning to Using*

According to recent studies, acquired statistical knowledge contributes to the comprehension and production of complex structural information, such as music and language [179], intuitive decision-making [77,78,180–182], auditory-motor planning [183], and creativity involved in musical composition [62]. Several studies suggest that musical representation is mainly formed by tacit knowledge [184–186]. Thus, statistical knowledge is closely tied to musical and speech expression such as composition, playing, and conversation. In addition, global statistical knowledge (i.e., entropy and uncertainty), as well as local statistical knowledge (each TP), is also supposed to contribute to decision-making [78], anxiety [80], and curiosity [79]. A number of studies have reported, however, that humans cannot verbalize exactly what they have learned statistically, even when an SL effect is detected in neurophysiological responses [14,32,34–44]. Nevertheless, our previous study suggested that statistical knowledge could alternatively be expressed via an abstract medium such as musical melody [32]. In these studies, learners could behaviorally distinguish between sequences of more than eight tones containing only higher TPs and those containing only lower TPs, suggesting that humans can distinguish sequences with different TPs when they are presented with longer sequences than the three-tone words conventionally used in word-segmentation studies. These studies may also suggest that SL of auditory sequences partially interacts with Gestalt principles [5]. Furthermore, an fMRI study has suggested that the abstraction of statistical knowledge is associated with a gradual shift from the memory systems in the medial temporal lobe, including the hippocampus, to those of the striatum, and that this may be mediated by slow wave sleep [125]. Future studies are needed to examine how and when statistical learning contributes to the mental expression of music and language.

#### *5.3. Applicability in Clinical and Pedagogical Settings*

Previous studies suggest that neurophysiological correlates of SL can disclose subtle individual differences that might be underestimated at the behavioral level [34,88,89,187], although recent studies have shown individual differences in SL with behavioral tasks [188]. Some studies suggest that neurophysiological responses disclose SL effects even when no SL effects can be detected at the behavioral level [40,41]. Neurophysiological markers of SL may at least be informative when studying less accessible populations such as infants, who are unable to deliver an obvious behavioral response [86,162]. For instance, ERP/ERF could be a useful method for the evaluation of the individual ability of SL, which is linked to individual skill in language and music learning [189,190], and which is impaired in humans with language- and music-based learning impairments such as dyslexia [51–53] and amusia [54,55]. Thus, neurophysiological markers of SL may be applicable for the evaluation of therapeutic and educational effects for patients and healthy humans [191] across any domain in which the conditional probabilities of sequential events vary systematically. Francois's findings [43] suggest the possibility of music-based remediation for children with language-based SL impairments. In addition, by using information-theoretic approaches such as higher-order Markov models and entropy, SL ability can be evaluated in a form closer to that used in learning natural language and music [14,63]. The integration of neural, behavioral, and information-theoretical approaches may enhance our ability to evaluate SL ability in terms of both music and language.

#### *5.4. Challenges and Future Prospects: SL in Real-World Music and Language*

Although SL is generally considered domain-general, many studies also report that comprehension of language and music, which have domain-specific structures including universal grammar, tonal pitch spaces, and hierarchical tension [2–5], may rely on domain-specific neural bases [6–9,192]. Furthermore, current SL paradigms are not sufficient to account for all levels of the music- and language-learning process. Some studies suggest two steps of the learning process [193,194]. The first is SL, which shares a common mechanism among all the domains (domain generality). The second is domain-specific learning, which has different mechanisms in each domain (domain specificity). This learning process implies that, at least in an earlier step of the learning process, SL plays an essential role that covers music and language learning abilities [195]. On the other hand, few studies have investigated how statistically acquired knowledge is represented in real-world communication, conversation, action, and musical expression. Future studies will be needed to investigate how neural systems underlying SL contribute to comprehension and production in real-world music and language. Information-theoretical approaches based on higher-order Markov models can be used to understand SL systems in a form closer to that used for natural language and music, from the perspectives of linguistics, musicology, and a unified brain theory such as the free-energy principle [16], including the optimization of action, as well as perception and learning.

#### **6. Conclusions**

This paper reviews a body of recent neural studies on SL in music and language and discusses the possibility of therapeutic and pedagogical applications. Because of a certain degree of acoustic similarity, neural overlap, and domain generality of SL between speech and music, musical training positively affects SL-related language skills. Recent studies have also suggested that SL strategies in the brain depend on the hierarchy, order [14,35,58,59], entropy, and uncertainty in statistical structures [60], and that certain brain regions perform specific computations of entropy that are independent of those of TP [61]. Yet few studies have investigated the relationships between the order of TPs (i.e., order of local statistics) and entropy (i.e., global statistics) in terms of SL strategies of the human brain. Information-theoretical approaches based on higher-order Markov models that can express hierarchical information dynamics as they occur in real-world language and music represent a possible means of understanding domain-general, higher-order, and global SL in the interdisciplinary realms of psychology, neuroscience, computational studies, musicology, and linguistics.

**Funding:** This work was supported by Suntory Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Conflicts of Interest:** The author declares no conflicts of interest.

#### **References**


© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Domain-Specific Expectations in Music Segmentation**

#### **Susana Silva \*, Carolina Dias and São Luís Castro \***

Center for Psychology at University of Porto (CPUP), Faculty of Psychology and Education Sciences, 4200-135 Porto, Portugal

**\*** Correspondence: susanamsilva@fpce.up.pt (S.S.); slcastro@fpce.up.pt (S.L.C.); Tel.: +351-964-366-725 (S.S.); +351-22-607-97-56 (S.L.C.)

Received: 20 June 2019; Accepted: 16 July 2019; Published: 17 July 2019

**Abstract:** The acoustic cues that guide the assignment of phrase boundaries in music (pauses and pitch movements) overlap with those that are known for speech prosody. Based on this, researchers have focused on highlighting the similarities and neural resources shared between music and speech prosody segmentation. The possibility that music-specific expectations add to acoustic cues in driving the segmentation of music into phrases could weaken this bottom-up view, but it remains underexplored. We tested for domain-specific expectations in music segmentation by comparing the segmentation of the same set of ambiguous stimuli under two different instructions: stimuli were either presented as speech prosody or as music. We measured how segmentation differed, in each instruction group, from a common reference (natural speech), thus focusing on how instruction affected delexicalization effects (natural speech vs. transformed versions with no phonetic content) on segmentation. We saw interactions between delexicalization and instruction on most segmentation indices, suggesting that there is a music mode of segmentation, distinct from a speech prosody mode. Our findings highlight the importance of top-down influences in segmentation, and they contribute to rethinking the analogy between music and speech prosody.

**Keywords:** Prosody; Phrasing; Perception; Melody

#### **1. Introduction**

Most speech listeners and music listeners segment the auditory input into phrase-like units [1–4]. In both domains, listeners detect phrase boundaries as the input unfolds. This leads to the possibility of building the segmentation map of an utterance or a music piece, defining how many phrases were heard, whether they were short, long, regular or irregular in length, and how they relate to each other. Language and music users have ways of emphasizing their intended segmentation maps (phrase boundary locations) using specific graphic signs in printed versions of language and music. In written language, the intended segmentation map of an utterance is sometimes achieved by printed punctuation marks [5,6]. Printed music does not have a mandatory analogue of punctuation marks to signal the presence of intended phrase boundaries. Slurs are perhaps the most obvious sign of intended segmentation maps, although other markers such as pause signs can also be used [7,8].

In speech research, the idea of a segmentation map (a set of individual choices regarding segmentation) has been implemented mostly with pairs of syntactically ambiguous sentences [8–10], which hold more than one meaning depending on how they are parsed. The way that participants judge and understand such speech materials is then taken as an index of their segmentation choices. While traditional behavioral approaches only allowed delayed (post-exposure) judgements, more recent techniques such as eye-tracking [11,12] or EEG [8,10] popularized the online monitoring of speech segmentation, often affording the tracking of participants' revisions of their initial segmentation choices [13]. Online monitoring techniques have also increased the interest in music segmentation (e.g., [14]). In the present study, we used a simple behavioral online monitoring approach to both speech and music segmentation maps, which consisted of asking participants to press a key every time they heard a phrase ending.

The segmentation of speech into phrase-like units depends not only on linguistic content (lexico–syntactic structure), but also on the paralinguistic intonation patterns of speech prosody (e.g., [1]), which define intonational phrases. The perception of intonation patterns per se, regardless of interactions with linguistic content, is driven by low-level acoustic cues such as changes in pitch, duration, or the presence of silence. In this sense, it is possible to view the segmentation of intonation-related speech prosody (speech prosody hereafter) as a bottom-up process, i.e., as a process where extra-perceptual factors like previous expectations of what an intonational phrase should be do not play a major role (see [15] for a discussion on top-down vs. bottom-up). Pitch deflections and pauses are acoustic cues that play an important role in driving both the segmentation of music [3,14] and that of speech prosody [16–18]. Does it follow that music is segmented the same way as speech prosody? The answer depends on how we assume that music segmentation is driven: if it is driven by acoustic cues, as in speech prosody, the answer is yes. If it is driven both by acoustic cues plus music-specific expectations (i.e., an idea of what a musical phrase is), the answer is no. The literature is mixed on this matter, as we will see below.

The idea that boundary assignment in music is driven solely by acoustic cues, which we will refer to as a bottom-up view on auditory segmentation, is present in the literature on the Closure Positive Shift (CPS) event-related potential. The CPS is an electrophysiological marker of phrase boundary perception, which has been found for speech [8,10,19], delexicalized (hummed) speech [20] and music [14,21–23] with little morphological variation across the three [20,24]. The bottom line of the CPS approach is that segmentation shares neural resources across music and speech prosody, and a strong motivation for these studies has been the fact that the same type of segmentation cues (pitch deflections, pauses) can be detected in both domains [14]. CPS studies have focused on the acoustic features that characterize musical and prosodic segmentation points (music phrases vs. intonational phrases). These are expected to elicit a brain response corresponding to boundary detection, with little effects of prior knowledge or contextual aspects. An implication of this view is that the segmentation map of a music piece can be similar to the segmentation map of a sample of speech prosody, provided that both have the same acoustic boundary cues at the same time points.

The alternative view, which we will refer to as the *top-down view*, emphasizes the role of expectations in suppressing or counteracting acoustic boundary cues. For instance, it has been admitted that music-specific expectations can make the listener search for four-bar structures when judging whether a musical phrase has ended or not [22,23], possibly overriding pauses within the four-bar phrase. The top-down view also relates to the idea that music segmentation may rely on more global cues than the segmentation of speech prosody [25]; such cues extending in time beyond the limits of a local boundary mark such as a pause and requiring integration. In contrast to the bottom-up view, one should expect here that equivalent boundary cues in music and speech prosody would not lead to equivalent segmentation maps, since segmentation options would depend on additional music-specific top-down influences. To our knowledge, neither this top-down-based hypothesis nor its bottom-up alternative have been subject to testing.

In the present paper, we tested whether music segmentation into phrases is driven by music-specific expectations that add to the acoustic cues used to segment speech prosody into intonational phrases. Thus, we tested for a top-down view on music segmentation. To that end, we compared participants' segmentation maps of a single set of ambiguous auditory stimuli, which were either presented as music or as speech prosody. We manipulated only the instruction, inducing different processing modes on the very same acoustic materials: top-down expectations plus bottom-up processing for music vs. bottom-up processing only, with no additional expectations, for speech prosody.

The ambiguous stimuli were obtained by an audio-to-MIDI conversion of natural speech, resulting in pitch-and-rhythm auditory streams deprived of linguistic content. For convenience of expression, we will refer to these wordless auditory streams as delexicalized versions, even though the difference between them and natural speech lies, strictly speaking, at the phonetic level rather than just the lexical one. Due to the algorithms involved in the audio-to-MIDI conversion of natural speech (see Methods), two types of distortion of the speech prosody data were expected: first, continuous pitch would be converted into discontinuous pitch (the octave divided into 12 semitone intervals), lending a music-like character to these ambivalent streams; second, timing-related information concerning speech syllables might not be fully preserved, although an approximation was expected. The first type of speech prosody distortion (discontinuous pitch) was necessary to keep the credibility of the music instruction. The second type of distortion created additional differences between the delexicalized versions and the original speech signal, such that the former were, strictly speaking, delexicalized and modified. Nevertheless, the delexicalized versions contained the pitch-and-timing information listeners use for processing speech prosody, with pitch and timing value-ranges reflecting the ones that occur in natural language. In this sense, we considered our delexicalized versions to be representative of speech prosody, even though they were not an exact copy of the speech prosody patterns that generated them.
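As a purely illustrative sketch of the first type of distortion (pitch discretization), a continuous fundamental frequency can be quantized to the nearest equal-tempered MIDI note; this assumes standard 12-tone equal temperament with A4 = 440 Hz and is not the actual conversion algorithm used to create the stimuli:

```python
import math

def hz_to_midi(f0_hz):
    """Quantize a continuous fundamental frequency (Hz) to the nearest MIDI
    note number (12 semitones per octave, A4 = 440 Hz). This only illustrates
    how continuous pitch becomes stepwise, music-like pitch; it is not the
    pipeline used to build the delexicalized versions."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))

# Gliding speech F0 values of 218 Hz and 222 Hz both collapse onto MIDI note 57
# (A3, approximately 220 Hz), removing the continuous pitch movement.
print(hz_to_midi(218), hz_to_midi(222))
```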

In order to minimize participants' awareness of our experimental manipulation, instruction was set as a between-subjects factor. To circumvent the risk of imperfect group matching inherent in a between-subjects approach, we sought a common reference (baseline) in the two groups, against which we analyzed participants' segmentation maps of ambiguous, delexicalized stimuli. The common reference we used was natural speech. Therefore, we collected the segmentation maps of a single set of delexicalized (ambiguous) stimuli (i.e., speech without lexical content) from two groups of participants receiving different types of instruction (speech prosody – "This is prosody" vs. music – "This is music"), as well as the segmentation maps of their natural-speech counterparts, in which case the instruction for segmentation was common to both groups ("This is speech"). We then focused on determining whether delexicalization effects (natural speech vs. delexicalized versions, within-subjects factor) were equivalent under music vs. speech prosody instructions, thus probing between-subjects instruction effects with the benefit of a baseline. Similar deviations from the natural-speech baseline (similar delexicalization effects) across instruction conditions (delexicalized presented as music vs. delexicalized presented as speech prosody) would indicate that music participants adopted segmentation approaches to delexicalized versions similar to those of speech prosody participants. In this case, there would be no reason to admit that there are music-specific expectations in music segmentation. By contrast, different deviations from baseline (different delexicalization effects) would indicate that music participants adopted segmentation approaches to delexicalized versions differing from those of speech prosody participants. In this case, music-specific expectations could be considered real.

The existence of delexicalization effects was a precondition for the goal of comparing such effects across instruction conditions. Delexicalization effects were expected under the speech prosody instruction for at least one reason: it is known that lexicality – the presence vs. absence of lexical content – affects speech segmentation, in the sense that lexical information may override prosodic boundary markers in phrase boundary assignment [26–28] and the so-called *linguistic bias* ([29], see also [30] for a similar phenomenon in word segmentation) emerges (cf. [31]). For instance, Buxó-Lugo and Watson [26] found that listeners consistently reported hearing more boundaries at syntactically licensed locations than at syntactically unlicensed ones, even when the acoustic evidence for an intonational boundary was controlled. Cole, Mo and Baek [28] analyzed the predictors of phrase boundary assignment and found syntactic structure to be the strongest one, winning over prosodic cues. Meyer and colleagues [29] found that two-phrase prosodic sentences with two-phrase lexical groups lead to segmentation into two phrases, but one-phrase prosodic sentences do not necessarily lead to a single phrase when there are two lexical groups; in the latter case, an electrophysiological marker of the linguistic bias is visible. On the other hand, the existence of delexicalization effects was a precondition, not a target of this study, which is why we do not discuss delexicalization effects per se. Instead, our question was whether the delexicalization effect tested under the music instruction would, or would not, parallel the delexicalization effect under the speech prosody instruction – in other words, whether delexicalization would interact with instruction in the generation of segmentation maps.

In our approach, we characterized segmentation maps from two different viewpoints: segment length (correlated with the number of segments), and the matching with predefined segmentation models. Interactions between delexicalization and instruction on any of these measures would indicate music-specific expectations.

#### **2. Materials and Methods**

#### *2.1. Participants*

Seventy participants took part in the experiment. Half (*n* = 35) were assigned to the speech instruction (31 women), and the other half to the music instruction (27 women). There was no evidence of significant differences between the two groups concerning age (*M* ± *SD*: 20.54 ± 2.85 for speech, 20.28 ± 1.52 for music; *t*(68) = 0.47, *p* > 0.64, *d* = 0.12) or musical training (eleven participants in the speech condition had 3.27 ± 2.24 years of training, and ten in the music condition had 3.90 ± 2.46 years; *t*(68) = −0.17, *p* > 0.86, *d* = −0.04). All participants had normal hearing. None reported psychiatric or neurological disorders. Participants signed informed consent, according to the Declaration of Helsinki.

#### *2.2. Stimuli*

Stimulus materials consisted of natural speech samples and delexicalized versions of these (see Supplementary Materials). The latter were presented under two different instructions (speech prosody vs. music), but they were physically the same. We used five different samples of natural speech. In order to maximize prosodic spontaneity, we selected these samples from available personal and media recordings instead of laboratory recordings. Each sample contained an utterance combining full sentences that were semantically related (see Appendix A for transcriptions and sentence structure). Four utterances were spoken by men, and one by a woman. Stimulus 1 contained the online description of a short movie that was being watched by the speaker; stimulus 2 was a fragment of an interview; stimuli 3 and 4 were poems recorded by a famous Portuguese diseur; stimulus 5 was an excerpt from a news broadcast. Stimuli were similar in length (~60 sec., see Table 1), and they were all normalized to 70 dB rms.


**Table 1.** Pitch and timing measures for the five stimuli in natural and delexicalized versions (table content not reproduced here). <sup>a</sup> Relative SD = SD/highest SD (Stimulus 4); note that the magnitude relation across stimuli is equivalent, whether expressed in Hz or in Mel; <sup>b</sup> Mel – a measure of pitch that accounts for different sensitivity levels across the frequency range; <sup>c</sup> in natural speech, pitch change is continuous.

To create delexicalized versions, natural speech samples were converted to MIDI with software Live 9 (www.ableton.com), using a bass timbre and settings for monophonic stimuli. This audio-to-MIDI conversion software detects stable-pitch fragments preceded by transients (an attack), disregarding intensity information. When dealing with music audio, the software searches for music notes. In speech-related audio, it should detect syllable-like events.
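The conversion software itself is proprietary, but the pitch-discretization aspect it introduces can be illustrated in a few lines. The sketch below is a rough approximation only, assuming the librosa library and a placeholder file name; it extracts an F0 track and rounds it to the nearest semitone, and it is not the note-detection algorithm used by Live 9.

```python
# Minimal sketch (not the Live 9 algorithm): extract an F0 track from a speech
# recording and quantize it to semitone (MIDI) steps, keeping unvoiced frames silent.
# Assumes librosa >= 0.8; "stimulus1.wav" is a placeholder file name.
import numpy as np
import librosa

def delexicalize_pitch(wav_path, fmin="C2", fmax="C6"):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz(fmin), fmax=librosa.note_to_hz(fmax), sr=sr
    )
    midi = librosa.hz_to_midi(f0)      # continuous pitch expressed in semitone units
    quantized = np.round(midi)         # "discontinuous" pitch: nearest semitone
    quantized[~voiced] = np.nan        # unvoiced frames remain silences
    return quantized

# Example call (placeholder path): delexicalize_pitch("stimulus1.wav")
```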

As shown in Table 1, pitch mean and standard deviation were preserved after audio-to-MIDI conversion (Wilcoxon signed-rank tests: *Z* = 0, *p* = 0.059 for mean pitch; *Z* = 2, *p* > 0.56 for standard deviation of pitch). In delexicalized (discrete pitch) versions, the pitch change rate was close to the syllable rate of speech (3–4 syllables per second, see [32,33]), supporting the idea that the algorithm captured syllable-like units. As for the proportion of silences, it was apparently higher in delexicalized versions, but statistical tests did not confirm this (*Z* = 13, *p* > 0.13).
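As a minimal illustration of the paired comparisons reported above, the sketch below runs a Wilcoxon signed-rank test in Python; the five values per condition are placeholders, not the Table 1 data.

```python
# Hypothetical example of the stimulus-wise comparison (n = 5 pairs); values are placeholders.
from scipy.stats import wilcoxon

mean_pitch_natural       = [110.0, 125.0, 118.0, 132.0, 210.0]   # Hz
mean_pitch_delexicalized = [112.0, 127.0, 120.0, 135.0, 214.0]   # Hz

stat, p = wilcoxon(mean_pitch_natural, mean_pitch_delexicalized)
print(f"Wilcoxon signed-rank: W = {stat}, p = {p:.3f}")
```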

#### *2.3. Procedure*

We started the experiment with auditory reaction time measurements. Participants heard a series of beeps, among which there was a human voice pronouncing a syllable. They were asked to press a key as soon as they heard the human voice. The purpose of these measurements was to provide a participant-specific correction for reaction times (time between perception and key press) for the task of detecting phrase endings that would be requested in the experiment.

All participants were first exposed to the five delexicalized stimuli. Those under the speech instruction were told that the stimuli were derived from real speech, thus containing "the melody of speech, without the words". Participants under the music instruction were told that stimuli were "excerpts of contemporary music". All participants were asked to press the space bar of the computer keyboard every time they perceived a phrase ending. Before the experimental trials, all were given a brief explanation of the concept of phrase ("a speech/music fragment, with a beginning and an end"), followed by a demonstration of a possible way of segmenting either speech prosody (speech instruction) or music (music instruction) into phrases. In these examples, we defined segments with similar length across instructions (6 sec. for speech prosody, 7 sec. for music). Given that the concept of music phrase is not trivial among non-experts, we told music-instruction participants that music phrases "were the equivalent of speech phrases, in that they defined unitary fragments". We stressed that there were no wrong answers. Participants were given one practice trial, either with a delexicalized utterance (speech instruction) or with a music excerpt (music instruction) and then they proceeded into the experimental trials. Each trial consisted of one stimulus to be segmented. Since segmentation was made online, they were unable to go back for corrections. Therefore, we gave participants a second chance: each stimulus was presented twice in succession, and participants did the segmentation on both (5 x 2 trials). Only the second presentation of each stimulus was considered in the analyses. We presented stimuli no more than twice in order to keep the experiment short enough to avoid fatigue.

After segmenting the five delexicalized stimuli, participants were asked to do the same on the 5 × 2 natural speech counterparts. They were informed that they would listen to "normal speech" and they should, again, press the space bar whenever they sensed the phrase had just ended. Participants were not informed that delexicalized and natural speech had the same source. We created three different versions of the experiment, in order to counterbalance the order of presentation of the five stimuli (1-2-3-4-5; 1-4-5-2-3; 4-1-5-3-2). In each version, stimulus order was common to delexicalized and lexicalized sets. Thus, in version 1, participants heard 1-2-3-4-5 delexicalized and then 1-2-3-4-5 lexicalized. We did so in order to keep delexicalized and lexicalized conditions as equivalent as possible.

At the end of the experiment, participants were given a questionnaire where they rated the level of confidence in their segmentation responses for each block (delexicalized vs natural speech) on a 5-point scale and made any comments they wished to. Stimulus delivery was made with Presentation software (www.neurobs.com, v. 20). The experiment lasted about 40 minutes.

#### *2.4. Segmentation Models*

Prior to the analysis, we defined virtual segmentation points in each stimulus according to four theoretical models, each based on a different segmentation cue: Pause, Pitch break, Pitch rise and Pitch drop. The adopted models were intended to explore the idea of pauses and pitch movements as low-level acoustic cues subtending both speech prosody and music segmentation (see Introduction section). Considering the possibility that music segmentation may rely more on global cues (see Introduction section) than on local boundary marks, two models targeted local cues (Pauses and Pitch breaks), and two targeted global cues (Pitch rises and Pitch drops).

Pauses and Pitch breaks were considered as local cues, in the sense that they included a restricted number of events (silence onset/offset, sudden pitch change), which unfolded within a short time-window. Based on a preliminary inspection of our five natural speech stimuli, we defined Pauses as silent periods longer than 200 ms. The onset of the Pause was considered the segmentation point. Pitch breaks were marked if two consecutive pitch values that were separated by a silence (shorter than 200 ms, the threshold for pause) differed by more than one standard deviation of the stimulus mean pitch. The onset of the second pitch value was set as the segmentation point. Note that the perception of pitch breaks is necessarily context-dependent, since pitch is continuously changing, and we are focusing on salient pitch breaks, which depend on the overall pitch context. However, the break per se (two different pitch values, separated by a short pause) occurs in a short time window. This is the reason why we considered pitch break as a local cue.
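In the study, virtual segmentation points were marked by inspection; purely as an illustration of the two local-cue definitions, the sketch below applies the same rules automatically to a frame-wise pitch track (silent frames coded as NaN). The frame duration, the example contour and the function name are assumptions.

```python
# Sketch: mark Pause and Pitch-break points (in seconds) on a frame-wise pitch track.
# Pauses: silences longer than 200 ms (segmentation point at silence onset).
# Pitch breaks: values separated by a shorter silence that differ by > 1 SD of the
# stimulus pitch (segmentation point at the onset of the second value).
import numpy as np

def mark_local_cues(f0, frame_s=0.01, pause_s=0.2):
    f0 = np.asarray(f0, dtype=float)
    voiced = ~np.isnan(f0)
    sd = np.nanstd(f0)                         # SD of the stimulus pitch
    pauses, breaks = [], []
    run_start = None                           # index where the current silence began
    for i, v in enumerate(voiced):
        if not v and run_start is None:
            run_start = i
        elif v and run_start is not None:
            if (i - run_start) * frame_s > pause_s:
                pauses.append(run_start * frame_s)
            else:
                prev = f0[:run_start][voiced[:run_start]]
                if prev.size and abs(f0[i] - prev[-1]) > sd:
                    breaks.append(i * frame_s)
            run_start = None
    return pauses, breaks

# Hypothetical contour: 1 s at 110 Hz, 100 ms of silence, then 1 s at 220 Hz
f0 = np.concatenate([np.full(100, 110.0), np.full(10, np.nan), np.full(100, 220.0)])
print(mark_local_cues(f0))   # one pitch break around 1.1 s, no pause
```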

Pitch rises and Pitch drops were viewed as global cues, since they require the integration of multiple (pitch) values across time, and they tend to occur within larger time windows. Pitch rises and Pitch drops were defined as unidirectional pitch movements. Since Pitch drops are more common in natural speech, given the F0 decline phenomenon ([34,35]; a universal tendency for pitch to drop across sentences), we used more restrictive criteria for Pitch drops than for Pitch rises. Pitch rises and drops should be either wide in pitch range (at least one third of the global pitch range) or long-lasting (minimum 500 ms for Pitch rises, and 1000 ms for Pitch drops). For Pitch drops, we set the additional criterion that pitch should reach a low-frequency range, namely half a standard deviation below the global mean pitch. Small pitch deflections up to 250/200 ms were allowed within Pitch rise/drop segments, as well as pauses up to 200 ms. The offset of the pitch movement (rise or drop) corresponded to the segmentation point. Pitch drops or rises not complying with these criteria were not used as virtual segmentation points of any kind.

When Pauses coexisted with Pitch breaks, rises or drops, we considered these as different situations/models. Pauses combined with Pitch rises or drops were viewed as mixed cues (local plus global cues), and pauses combined with Pitch breaks (i.e., when the pause between contrasting pitch values was larger than 200 ms) were viewed as local cues. Thus, in total, we had seven models.

Virtual segmentation points (cues) were marked for delexicalized and natural speech versions separately, leading to version-specific segmentation models. There was not a complete overlap in the number of segmentation points across the two versions, which was due to the audio-to-MIDI conversion process (e.g., pause lengths became slightly different in some cases, making the number of pause points differ). However, such differences were irrelevant to our main research question, which concerned the influence of instruction on the delexicalization effect rather than the delexicalization effect itself.

#### *2.5. Preprocessing and Statistical Analysis*

We were interested in the interaction between delexicalization (delexicalized vs. natural speech, within-subjects) and instruction (speech vs. music, between-subjects) on segmentation maps. Such interactions would indicate music-specific expectations, non-overlapping with prosody-specific ones. We analyzed the effects of delexicalization and instruction; first on segment length, and then on the adherence to a number of segmentation models we created (model matching).

To compute participants' segment length, we calculated the interval between participants' key presses. Participants' metrics per stimulus (mean and standard deviation of segment length – the latter indexing segment length variability) were obtained.
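As a minimal sketch of this step, the snippet below derives per-stimulus segment lengths from a participant's key-press times (hypothetical values).

```python
# Segment lengths for one stimulus from hypothetical key-press times (ms).
import numpy as np

key_presses_ms = np.array([5600, 13200, 19800, 27100, 35900, 44200, 52600])
segment_lengths = np.diff(key_presses_ms)                     # intervals between successive presses
print(segment_lengths.mean(), segment_lengths.std(ddof=1))    # mean and variability
```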

To analyze the matching of participants' segmentations with the segmentation models, participant-specific reaction times (see Procedure; *M* ± *SD* = 287 ± 50 ms) were first subtracted from the raw times of key presses in order to obtain corrected segmentation points for each stimulus (Figure 1B). Then, also for each stimulus, we merged the time stamps of the virtual segmentation points from all seven models into one global array of time values (Figure 1A).

**Figure 1.** (A) Stimulus-specific global arrays, combining all segmentation models for each utterance (St1–5: Stimulus 1–5; lower line: delexicalized; upper line: natural). Dots represent virtual segmentation points; (B) Example of a segmentation map (Participant 01) containing actual segmentation marks for each of the five stimuli in natural speech versions (above) and delexicalized ones (below, speech prosody instruction in this case).

With reference to each stimulus-specific global array of time values (lengths ranging from 45 to 96 virtual points depending on the stimulus, see Figure 1A), we derived separate logical arrays (1, true or present vs. 0, false or absent) for each model (1 marking the points of the model in question and 0 the points of other models), and one logical array per participant (1 marking the participant's segmentation points and 0 absence of segmentation). When defining participants' logical arrays, the closest value of the global array of time values was always chosen. Maximum inter-point distances in the global arrays of time values were 2690 and 3497 ms for stimulus 1 (delexicalized and natural), 4099 and 5116 ms for stimulus 2, 2397 and 3091 ms for stimulus 3, 2520 and 4886 ms for stimulus 4, and 2075 and 1972 ms for stimulus 5. Therefore, this was the maximum error that could occur when fitting participants' marks to the available models. Finally, we computed the similarity between the logical array describing each participant's behavior and each of the seven logical arrays describing the models, using the Russell and Rao binary similarity coefficient [36]. The Russell and Rao coefficient evaluates the overlap of two data series concerning a binary attribute (present or absent). In our case, we measured how the distribution of participants' marks in time overlapped with the distribution of model-specific segmentation points, both coded as present vs. absent in reference to the global array of time values. We referred to these coefficients as model matching scores, since they described participants' level of adherence to a given segmentation model.
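For two logical arrays defined over the same global array, the Russell and Rao coefficient reduces to the proportion of points marked as present by both series. A minimal sketch with hypothetical arrays:

```python
# Russell-Rao similarity between a participant's logical array and a model's logical
# array: joint "present" points divided by the length of the global array.
import numpy as np

def russell_rao(participant, model):
    participant = np.asarray(participant, dtype=bool)
    model = np.asarray(model, dtype=bool)
    return np.sum(participant & model) / participant.size

# Hypothetical 10-point global array
model_pause    = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
participant_01 = np.array([1, 0, 0, 0, 0, 1, 0, 0, 1, 1])
print(russell_rao(participant_01, model_pause))   # 0.3
```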

For statistical analyses, we used mixed ANOVAs. We first analyzed the effects of delexicalization and instruction on the mean and standard deviation of segment length. We then considered the effects of delexicalization, instruction and model (within-subjects, seven levels/models: Pause, Pitch break, Pause plus pitch break, Pitch rise, Pitch drop, Pause plus pitch rise, Pause plus pitch drop) on model matching scores. In the presence of third-order interactions (delexicalization x instruction x model), delexicalization x instruction interactions were considered per model. Alongside the model matching analysis with seven models, we inspected whether the results fitted the higher-order classification of cues into local, global and mixed, to see whether it made sense to quantify the differences related to this triad. Mixed ANOVAs were also used to analyze questionnaire responses related to participants' confidence in their segmentation responses.
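The paper does not state which statistics package was used; purely as an illustration, the sketch below sets up the delexicalization x instruction mixed ANOVA on mean segment length with the pingouin package and simulated placeholder data.

```python
# Illustrative mixed ANOVA (within: delexicalization; between: instruction) on simulated data.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
rows = []
for s in range(70):                                      # 70 simulated participants
    instr = "speech" if s < 35 else "music"
    for delex in ("natural", "delexicalized"):
        rows.append({"subject": s, "instruction": instr, "delexicalization": delex,
                     "mean_segment_length": rng.normal(7000, 1500)})   # placeholder ms values
df = pd.DataFrame(rows)

aov = pg.mixed_anova(data=df, dv="mean_segment_length",
                     within="delexicalization", subject="subject", between="instruction")
print(aov[["Source", "F", "p-unc", "np2"]])
```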

Even though participants heard delexicalized versions prior to natural speech, natural speech was the common reference against which the segmentation maps of the two delexicalized conditions (speech vs. music instruction) were evaluated. Therefore, we refer to the concept of delexicalization throughout the results section as a logical, rather than chronological process.

#### **3. Results**

#### *3.1. Segment Length*

The overall mean segment length was around 7000 ms (Figure 2), corresponding to an average of 8.6 segments per 60-sec speech/music sample (10.4/7.8 segments for delexicalized speech under speech/music instructions; 7.5/8.6 segments for natural speech). Mean segment length showed no main effects of delexicalization (*p* > 0.17, η2p = 0.027) or instruction (*p* > 0.29, η2p = 0.016), but there was an interaction between the two (*F*(1,68) = 10.12, *p* = 0.002, η2p = 0.13, Figure 2): delexicalization led to decreased segment length under the speech instruction (*t*(34) = −4.66, *p* < 0.001, *d* = −0.66), while it caused no significant changes under the music instruction (*p* > 0.30, *d* = 0.21, Figure 2).

**Figure 2.** Delexicalization and instruction effects on the mean (left) and standard deviation (right) of segment length. Participants under the speech instruction decreased segment length in delexicalized versions, while those under the music instruction did not show any change. The standard deviation (variability) decreased in delexicalized versions for both instruction levels. Vertical bars represent the standard error of the mean.

Delexicalization decreased the standard deviation (variability) of segment length (main effect of delexicalization: *F*(1,68) = 4.60, *p* = 0.036, η2p = 0.063, Figure 2), regardless of the instruction (non-significant delexicalization x instruction interaction: *p* > 0.16, η2p = 0.029).

#### *3.2. Model Matching Scores*

We found a significant interaction between delexicalization and instruction (*F*(1,68) = 19.32, *p* < 0.001, η2p = 0.22) on model matching scores. Both instruction conditions decreased general adherence to (all) models when given delexicalized versions (speech: *F*(1,34) = 99.84, *p* < 0.001, η2p = 0.75; music: *F*(1,34) = 10.48, *p* = 0.003, η2p = 0.24), but the decrease was larger for the speech-prosody instruction. These effects came along with a significant (third-order) delexicalization x instruction x model interaction (Figure 3A, *F*(6,408) = 6.09, *p* < 0.001, η2p = 0.08), suggesting that delexicalization x instruction interactions differed across models.

When the three-way interaction was broken down into the seven models (Figure 3A,B), the pattern of effects and interactions (delexicalization x instruction) was indeed heterogeneous, and it did not overlap with the associated cue types (local, global, mixed). Pauses alone (local cue, *p* > 0.45, η2p = 0.008) and Pitch drops (global, *p* > 0.95, η2p = 0.000) showed non-significant interactions between delexicalization and instruction. For these, delexicalization increased model matching in both instruction levels (main effect of delexicalization on matching with Pauses: *F*(1,68) = 378.96, *p* < 0.001, η2p = 0.85; on matching with Pitch drop: *F*(1,68) = 19.16, *p* < 0.001, η2p = 0.22). Significant delexicalization x instruction interactions showed up for Pitch breaks (local cue, *F*(1,68) = 5.15, *p* = 0.026, η2p = 0.07), Pitch rise (global, *F*(1,68) = 48.89, *p* < 0.001, η2p = 0.42), Pause + pitch rise (mixed, *F*(1,68) = 4.56, *p* = 0.036, η2p = 0.06), and Pause + pitch drop (mixed, *F*(1,68) = 11.31, *p* = 0.001, η2p = 0.14). The interaction for Pause + pitch break was marginal (local cue, *F*(1,68) = 3.07, *p* = 0.084, η2p = 0.04). All these interactions indicate different expectations for music compared to the speech prosody instruction.

**Figure 3.** Effects of delexicalization (natural vs. delexicalized speech), model (7 models) and instruction (speech prosody vs. music) on model matching. A: Delexicalization effects per model, with five out of seven models (Pitch break, Pause + Pitch break, Pitch rise, Pause + Pitch rise, Pause + Pitch drop) showing delexicalization x instruction interactions (marked with *t*). B: Delexicalization x instruction interactions per model. Vertical bars represent the standard error of the mean.

The type of interaction was independent of cue type: we saw similar patterns for Pitch break (local cue), Pause + pitch rise and Pause + pitch drop (both mixed): for all, delexicalization had the effect of decreasing model matching scores for both speech and music, with a stronger effect in speech (Pitch break: *t*(34) = 11.53, *p* < 0.001, *d* = 2.69 speech, *t*(34) = 11.82, *p* < 0.001, *d* = 2.56 music; Pause + pitch rise: *t*(34) = 5.41, *p* < 0.001, *d* = 1.05 speech, *t*(34) = 2.04, *p* = 0.049, *d* = 0.53 music; Pause + pitch drop: *t*(34) = 27.58, *p* < 0.001, *d* = 4.87 speech, *t*(34) = 17.81, *p* < 0.001, *d* = 3.97 music). For Pause + pitch break – a local cue, just like Pitch break alone – delexicalization increased model matching for speech (*t*(34) = −3.35, *p* = 0.002, *d* = −0.71) while having no effect for music (*t*(34) = −0.72, *p* > 0.47, *d* = −0.14). Finally, for Pitch rise – a global cue, just like Pitch drop, which showed no interaction – delexicalized versions decreased model matching scores under the speech instruction (*t*(34) = 2.47, *p* = 0.019, *d* = 0.61), while increasing them under the music one (*t*(34) = −7.46, *p* < 0.001, *d* = −1.46; Figure 3B).

#### *3.3. Confidence in Segmentation*

Participants' level of confidence in their segmentation responses was higher for natural speech (*M* ± *SD*: 3.83 ± 0.55 on a 5-point scale) than for delexicalized versions (2.89 ± 0.66; *F*(1,64) = 89.61, *p* < 0.001, η2p = 0.58). Both speech prosody and music-instruction participants showed similar gains in confidence when going from delexicalized to natural speech (*p* = 0.57, η2p = 0.01).

#### **4. Discussion**

Our goal was to determine whether the segmentation of music into phrases is driven by music-specific expectations, which would indicate that segmentation processes in music do not overlap with those that occur in speech prosody. To that end, we tested whether participants' segmentations of a single set of ambiguous stimuli without lexical content differed according only to the identity assigned to these stimuli (speech prosody vs. music), under a manipulation of the instruction variable. Since the effect of instruction was obtained from two different groups that could be imperfectly matched, we created a baseline-related measure of this effect: we focused on how instruction influenced the within-subjects difference between natural speech (a baseline or common reference) and ambiguous, delexicalized stimuli subject to manipulations of instruction. This within-subjects difference was named the delexicalization effect. In our analysis, cross-group differences in the segmentation of the natural-speech baseline were indeed apparent (see Figures 2 and 3B), suggesting that participants' segmentation strategies differed a priori across groups and thus that our baseline-related measure of instruction effects was prudent.

Supporting the hypothesis of music-specific expectations, the delexicalization effect changed according to instruction in several aspects. Instruction influenced delexicalization effects on segment length: mean segment length decreased with delexicalization for the speech instruction, but it did not change for the music one. In addition, instruction changed delexicalization effects on the matching of segmentation maps with five out of seven theoretical segmentation models: for instance, the matching with Pitch rise models decreased with delexicalization for speech instruction, but it increased for music instruction (Figure 3B).

Our primary goal was to determine whether expectations in music segmentation *differed* from those in the speech-prosody domain, and thus we were interested in *any* interactions between delexicalization and instruction. Based on previous literature, we admitted the possibility that music instructions would increase reliance on global cues but, beyond that, our approach was exploratory regarding the contents of music-specific expectations. Note that, in the context of our delexicalization-effect-based approach, expectations must be framed in relative terms, i.e., how participants in each level of instruction diverged from natural speech when confronted with delexicalized versions.

The hypothesis that music segmentation would favor global cues (Pitch drop and Pitch rise) was not supported. It is true that participants under the music instruction favored Pitch rise (a global cue), while those under the speech instruction devalued it. However, the same was not true for Pitch drop, which is also a global cue. Critically, music participants favored local cues (Pitch break) and mixed cues (Pause plus pitch drop, Pause plus pitch rise) more than speech participants. Therefore, the global–local dichotomy seems irrelevant for distinguishing between music and speech prosody segmentation.

Having excluded the global-cue hypothesis on music segmentation, what were we left with? First, the music instruction seems to have preserved the mechanisms of natural speech segmentation more than the speech prosody instruction: unlike speech prosody participants, music participants did not decrease segment length with delexicalization. They also preserved Pitch breaks, Pauses plus pitch drops and Pauses plus pitch rises more than speech prosody participants. The possibility that natural speech expectations may be more similar to music-related expectations than to prosody-specific expectations themselves is an intriguing finding that deserves further discussion. Explanations for this finding may generally relate to the disturbing potential of delexicalized speech prosody stimuli. One possibility may be that speech prosody requires phonetic content to be fully decoded, while the same does not apply to music. This might relate to the phonetic advantage effect, according to which it is easier to imitate prosody-related pitch when prosody is accompanied by phonetic content [37]. Although the authors also found a phonetic advantage for music, contradictory evidence is available [38]. In the context of our study, it is possible that dissociating speech prosody from its original phonetic content (delexicalized speech prosody versions) may have disturbed prosodic segmentation to such an extent that delexicalized music versions remained closer to natural speech. Specifically, it is possible that such disturbance was caused and/or amplified by the violation of expectations that takes place when a linguistic stimulus presents itself deprived of phonetic content. An alternative possibility relates to the characteristics of our stimuli, namely the music-like character of our delexicalized versions. We tested delexicalized stimuli using discontinuous (musical) pitch, resulting from the audio-to-MIDI transformation. Although this was necessary to maximize the credibility of the music instruction while keeping the stimuli unchanged across instruction levels, it may have created a sense of strangeness in participants from the speech group. As a result, it is possible that speech participants activated neither the bottom-up approach that we expected for speech prosody nor the music-like set of expectations. So, although our results make it clear that speech prosody was approached differently from music, it is possible that speech prosody was perceived as an undetermined, unfamiliar type of auditory stream, eliciting hybrid, atypical and/or unstable expectations. In order to rule out limitations brought by the music-like character of delexicalized stimuli, it may be helpful to add control conditions in future studies, wherein participants in each instruction condition are also presented with continuous-pitch versions as delexicalized stimuli. Although these two possibilities (that delexicalized speech prosody is generally disturbing, and/or particularly disturbing when it has discontinuous pitch) make sense, we should bear in mind that participants from the two instruction conditions did not differ in their confidence level regarding segmentation responses. From this viewpoint, one might think that participants in the speech prosody instruction condition were, at least consciously, not more disturbed than those in the music condition. Still, it is possible for confidence to be maintained even when processing modes change, and this may have occurred with speech prosody participants as they went from natural speech to delexicalized versions.

A second manifestation of music-specific segmentation was the increased adherence to pitch rise cues, with the opposite trend observed for speech prosody segmentation. Pitch drop is a universal, default feature of human speech [39], possibly because it is a natural outcome of decreased air flow as one vocalizes continuously without a specific pitch plan. Differently, pitch rises require planning and resources to be executed. This type of vocal attitude is characteristic of music, and it is not surprising that we saw increased expectations for pitch rises under the music instruction. Finally, music participants were unreactive to Pauses plus pitch breaks, unlike speech prosody participants, who relied more heavily on these after delexicalization. One possibility may be that the coexistence of pauses and pitch breaks tends to be interpreted more as the ending of a musical section than as the ending of a phrase, driving music participants to ignore this type of boundary cue. Further studies could test this possibility by eliciting both types of segmentation: sections vs. phrases.

Concerning general aspects of our study that may deserve investigation in future studies, one relates to the order of presentation of natural vs. delexicalized versions, which may raise concerns over priming effects. In our study, participants were first exposed to delexicalized versions and then to the natural speech counterparts. We did this because we were concerned that music-instruction participants might form hypotheses about the origin of the delexicalized stimuli had we done the reverse and started with natural speech, and we wanted to avoid the risk of having to eliminate participants due to such awareness. Our choice for the present study may have introduced priming effects, but the reverse option would have done the same. Critically, potential priming effects of delexicalized over natural speech versions were common to both instruction levels, and this was all that we had to control for in the face of our research question (does instruction influence the delexicalization effect?). Although the order of presentation was not likely responsible for the differences between instruction levels, it may have affected the type of expectations that were observed. This is the reason why it might be useful to counterbalance the order of block (natural vs. delexicalized stimuli) presentation in future studies. Another aspect concerns the variety of segmentation models we used, which is not exhaustive and may be expanded. Specifically, future studies may benefit from considering pre-boundary lengthening phenomena [27], which are known to guide segmentation in music and speech prosody, but which we did not consider here.

Our main finding was that there are music-specific expectations, or top-down influences, in music segmentation. Our results suggest that there is a "music-segmentation mode", different from the processing mode engaged in speech prosody segmentation, which we assumed to be a bottom-up, data-driven approach. Although we found support for different modes, our findings do not inform us on whether speech prosody engages any expectations at all: it may be the case that music segmentation recruits expectations, or top-down processing, while speech prosody does not (our working assumption); alternatively, speech prosody may also engage expectations, even though different from those engaged in music. A third scenario could be that speech prosody engages expectations while music segmentation is purely bottom-up, but this would go against evidence that listeners rely on metric, structural cues, such as 4-bar phrases, to perform segmentation (see Introduction section). The best way to address these questions is to better specify the type of expectations in each domain and to look for patterns that replicate across studies.

Our main finding arose from an experimental paradigm that approached music and speech prosody in ways that may not be considered fully representative of these phenomena. To probe music segmentation, we used a series of rhythmically organized discontinuous pitches without tonal organization (pitches did not organize according to tonal harmony [40]), conveyed by a musical timbre (bass). While this approach captures basic elements of what is considered "music" (discontinuous pitch, rhythm, musical timbre), it misses important elements of common-practice music, namely tonal harmony (implicit in tonal melodies) and metric regularity. From this viewpoint, we should admit that we did not probe music segmentation in a broad sense but, rather, the segmentation of a particular music style, likely similar to contemporary jazz music (as we told our participants). So, what would happen if we used mainstream music, were that possible with our paradigm? Our guess is that music-specific expectations would be more salient, since both tonal harmony and metric regularity, both absent in speech prosody, work as music-related segmentation cues [41]. In this sense, the limitations of the stimuli we presented as music may not substantially compromise the ecological validity of our findings. As for the ways we probed speech prosody, these may have limited the generality of our conclusions, as we already discussed.

#### **5. Conclusions**

In sum, our study was novel in testing for music-specific expectations in music segmentation, and we found evidence for these within the frame of our paradigm and assumptions. The existence of music-specific expectations in segmentation had remained underacknowledged in the field of music-language comparative studies on segmentation [14,20,24]. Our findings help challenge the analogy between speech prosody and music that has remained implicit in the field, setting the stage for a "music mode" and a "speech prosody mode" in segmentation.

**Supplementary Materials:** Stimulus materials can be downloaded from: https://drive.google.com/file/d/1XGyxhRsByrcTmymHiCNnPRcCegwYYIoo/view?usp=sharing.

**Author Contributions:** Conceptualization, S.S. and S.L.C.; methodology, S.S. and S.L.C.; formal analysis, S.S. and C.D.; investigation, S.S. and C.D.; data curation, C.D.; writing—original draft preparation, S.S. and C.D.; writing—review and editing, S.S. and S.L.C.; supervision, S.S. and S.L.C.; project administration, S.S. and S.L.C.; funding acquisition, S.L.C.

**Funding:** This research was supported by Fundação para a Ciência e a Tecnologia under grant UID/PSI/00050/2013 and by the COMPETE 2020 program.

**Acknowledgments:** We are grateful to Filipa Salomé and José Batista for their help with the analysis, and to our participants.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

Pitch Structure and Linguistic Content of the Five Stimuli

In each figure, pitch structure is plotted above for natural speech stimuli, and below for delexicalized versions. Boundaries under natural speech versions indicate segmentation based on linguistic context, specifically on sentences as defined in transcription texts.

In transcriptions, // indicates clause boundaries, **bold and underlined** words indicate the main verb, and underlined words the verb (and conjunctions) in subordinate and/or coordinate clauses. Self-corrections and hums are included.

**Figure A1.** Stimulus 1—Pitch structure.

**Table A1.** Stimulus 1—Transcription (Portuguese original with English translation; content not reproduced here).

**Figure A2.** Stimulus 2—Pitch structure.

**Figure A3.** Stimulus 3—Pitch structure.

**Table A3.** Stimulus 3—Transcription (Portuguese original with English translation; content not reproduced here).

**Figure A4.** Stimulus 4—Pitch structure.

**Table A4.** Stimulus 4—Transcription (Portuguese original with English translation; content not reproduced here).

**Figure A5.** Stimulus 5—Pitch structure.

**Table A5.** Stimulus 5—Transcription (Portuguese original with English translation; content not reproduced here).

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Is It Speech or Song? Effect of Melody Priming on Pitch Perception of Modified Mandarin Speech**

**Chen-Gia Tsai 1,2 and Chia-Wei Li 3,\***


Received: 22 September 2019; Accepted: 21 October 2019; Published: 22 October 2019

**Abstract:** Tonal languages make use of pitch variation for distinguishing lexical semantics, and their melodic richness seems comparable to that of music. The present study investigated a novel priming effect of melody on the pitch processing of Mandarin speech. When a spoken Mandarin utterance is preceded by a musical melody, which mimics the melody of the utterance, the listener is likely to perceive this utterance as song. We used functional magnetic resonance imaging to examine the neural substrates of this speech-to-song transformation. Pitch contours of spoken utterances were modified so that these utterances can be perceived as either speech or song. When modified speech (target) was preceded by a musical melody (prime) that mimics the speech melody, a task of judging the melodic similarity between the target and prime was associated with increased activity in the inferior frontal gyrus (IFG) and superior/middle temporal gyrus (STG/MTG) during target perception. We suggest that the pars triangularis of the right IFG may allocate attentional resources to the multi-modal processing of speech melody, and the STG/MTG may integrate the phonological and musical (melodic) information of this stimulus. These results are discussed in relation to subvocal rehearsal, a speech-to-song illusion, and song perception.

**Keywords:** melody perception; tonal language; inferior frontal gyrus; priming effect

#### **1. Introduction**

Tonal languages are characterized by the use of lexical tones for distinguishing lexical semantics. Lexical tones include distinct level pitches and pitch-glide patterns. Owing to pitch variation, spoken utterances in tonal languages are rich in melody and sometimes comparable to music. Related to this, it has been acknowledged that tonal languages and music share similarity in the perceptual-cognitive processing of pitch. Recognition of lexical tones relies on relations between successive pitches [1–3], and thus the underlying neural substrates partially overlapped with those underlying recognition of musical pitch intervals [4]. Evidence of similarity between spoken utterances of tonal languages and music also comes from traditional music. The distinction between speech and song is blurred in many genres of Chinese musical theater. For example, it has been suggested that eight oral delivery types in Cantonese opera occupy different positions on a speech-music spectrum according to their tonal features, rhythmic features, and instrumental accompaniment [5]. In Chinese opera, a sung utterance may be perceived as somewhat speech-like because of high congruency between its musical melody and lexical tones. On the other hand, the pitches of a spoken utterance could be embedded into the musical scale provided by the accompaniment music. Using this musical scale as a tonal schema, listeners may perceive this utterance as song.

The fact that the tonal context provided by an instrumental accompaniment could perceptually transform speech of a tonal language into song raises the possibility that a listener could perceive the melody of a spoken utterance as a musical melody when he/she is primed by appropriate musical cues. In the present study, we report a novel priming effect of musical melody on the pitch perception of spoken Mandarin utterances. The target was a speech-like stimulus. The melody of this target was mimicked by a musical melody, which served as the prime. When the listener was primed by this musical melody, he/she tended to perceive the target as song.

Previous studies have reported that acoustically identical English utterances can be perceived as either speech or song. Deutsch and colleagues found that when a spoken English phrase was repeated several times, listeners were likely to perceive this phrase as song [6]. The authors hypothesized that exposure to repetition of a speech fragment may be associated with greater activity in the neural substrates of pitch processing, relative to the condition in which a spoken phrase was presented once. Moreover, this repetition effect may result in a re-evaluation of prosodic features of this spoken phrase [7]. These two hypotheses for the speech-to-song illusion were supported by previous neuroimaging studies demonstrating that (1) the effect of perceiving a spoken phrase as song via repetition localized to the right mid-posterior superior temporal sulcus (STS) and middle temporal gyrus (MTG) implicated in pitch processing [8,9], and (2) the subjective vividness of the speech-to-song illusion was positively correlated with activity in a left frontotemporal loop implicated in evaluation of linguistic prosody. This left frontotemporal loop comprises the inferior frontal gyrus (IFG), frontal pole, and temporal pole [9].

Using functional magnetic resonance imaging (fMRI), the present study aimed at specifying the neural underpinnings of the perceptual transformation of Mandarin speech-like utterances into song via a musical prime that mimics the melody of the speech. In light of the aforementioned studies of the speech-to-song illusion in English, we hypothesized that the effect of melody priming on the pitch processing of Mandarin speech-like utterances would be associated with increased activity in the IFG, which may contribute to the cognitive processes for attending to the melodic features of speech and for comparing speech with the musical prime. Specifically, the pars triangularis of the right IFG (IFGtri) seems to play a prominent role in the evaluation of prosodic information in speech [10–12]. We expected to observe greater activity within the right IFGtri during listening to Mandarin speech-like utterances preceded by a melody prime, compared to listening to the same stimuli without melody priming. In addition, we hypothesized that the anterior insula and supplementary motor area (SMA) implicated in subvocal rehearsal [13] may co-activate with the right IFGtri, because participants may engage subvocal rehearsal strategies for encoding the melodic features of Mandarin speech-like utterances.

In addition to attention control and sensorimotor mechanisms, the present study was also expected to shed new light on the perceptual processing of song. Sammler and colleagues employed a functional magnetic resonance adaptation paradigm to identify the neural correlates of binding lyrics and tunes in unfamiliar song [14]. Results revealed that the left mid-posterior STS showed an interaction of the adaptation effects for lyrics and tunes. The authors suggested that this region may contribute to an integrative processing of lyrics and tunes. Alonso and colleagues reported that binding lyrics and tunes for the encoding of new songs was associated with the involvement of the bilateral mid-posterior MTG [15]. In the present study, a Mandarin speech-like utterance could be perceived as song when it was preceded by a musical melody mimicking the melody of the utterance. We hypothesized that melody priming would lead to increased activity in the STS/MTG implicated in binding lyrics and tunes during song perception.

#### **2. Materials and Methods**

#### *2.1. Participants*

Twenty native Mandarin speakers (age range 20–43 years; six males) participated in the fMRI experiment, in which two fMRI scanning runs for the present linguistic study (focusing on a tonal language) alternated with three fMRI scanning runs for a musical study (focusing on symphonies and concertos). This design was used to minimize affective habituation that could occur with repeated exposure to the same emotional music. The selection and recruitment of participants are described in the next paragraph. Other methods and results of the musical study are not mentioned further in this paper.

Participants were recruited via a public announcement on the internet, which stated the requirement of high familiarity with Western classical music. In a pre-scan test, volunteers were asked to write down their feelings in response to the musical stimuli of the musical study. Eight musical stimuli with a duration of 30 s were presented in a fixed order. After listening to each 30-s stimulus, the volunteers were asked to write down their feelings in response to the passage just before the theme recurrence and their feelings in response to the theme recurrence itself. They were also asked to explain their feelings in terms of musical features. The first author of this article (a musicologist) selected participants for the fMRI experiment according to the following inclusion criteria: (1) more than five of the passages just prior to a theme recurrence evoked anticipation; (2) more than five of the theme recurrences evoked a feeling of resolution; and (3) the volunteer's feelings were appropriately explained in terms of musical features for more than five excerpts. Thirty-one adult volunteers completed this questionnaire. Twenty-seven volunteers met our screening criteria. Twenty of them completed the fMRI experiment. They were free from neurological, psychiatric, or auditory problems. Fifteen participants had studied musical instruments for six years or more. The participants were compensated with approximately 16 USD after completion of the fMRI scan.

In the present linguistic study, participants were excluded from analyses of fMRI data if they were unable to discriminate between matched and mismatched trials at better than chance levels (see 2.5. Data Analysis). A female participant was excluded in this way. The data of another female participant were discarded because of incomplete behavioral data acquisition during the fMRI session. Thus, the final sample included in fMRI analyses consisted of 18 adults (mean age = 27.1 years; SD = 6.5 years; mean experience in most experienced instrument = 8.9 years; SD = 4.2 years; six males). Written informed consent was obtained from each participant prior to participation in the study. All research procedures were performed in accordance with a protocol approved by the Institutional Review Board of National Taiwan University (201611HM008). This study was conducted in accordance with the Declaration of Helsinki.

#### *2.2. Stimuli*

Auditory stimuli were noise, linguistic stimuli, and musical stimuli. The noise stimulus was white noise with a duration of 1.6 s. The duration of each linguistic and musical stimulus was 1.9–2.4 s. The linguistic stimuli were spoken sentences in Mandarin. Each sentence contained six characters. A female broadcaster was invited to recite 65 sentences with flat affect and natural prosody. These materials were recorded and saved as digital sound files.

To generate stimuli that can be perceived as either speech or song, three steps of pitch adjustment were applied to the spoken utterances. These steps are similar to the auto-tune processes used to "songify" news reports or any normal speech. The first step was "quantizing"; the pitches were adjusted to match the nearest note in the chromatic musical scale. The second step was "flattening"; pitch glides of these spoken utterances were flattened by approximately 95%. Figure 1 illustrates the effects of quantizing and flattening on a spoken Mandarin sentence. Third, the first author adjusted the melody of each utterance to match the C major or B-flat major scales by transposing some pitches by a semitone. These three steps were carried out using Cubase (Steinberg Media Technologies GmbH, Hamburg, Germany). The modified spoken utterances can be perceived as speech because of the high congruency between their pitch contours and lexical tones. They can also be perceived as song because their pitches can be embedded into the musical scale. Among the 65 utterances, 50 utterances for the scanning session and 6 utterances for the training session were selected by the first author.

**Figure 1.** The first and second steps (quantizing and flattening) of pitch adjustment of spoken utterances using Cubase (Steinberg Media Technologies GmbH, Hamburg, Germany). The pitches were adjusted to match the nearest note in the chromatic musical scale (quantizing). All pitch glides of the spoken utterances were flattened by approximately 95% (flattening).
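Cubase performs these steps interactively; purely as an illustration of the first two steps, the sketch below quantizes and flattens a hypothetical per-syllable F0 contour (syllable boundaries, values and function names are assumptions, and the third step of fitting a major scale is omitted).

```python
# Sketch of "quantizing" (nearest chromatic note per syllable) and "flattening"
# (shrinking within-syllable glides by ~95%) on an F0 contour in Hz.
import numpy as np

def hz_to_midi(f):  return 69 + 12 * np.log2(np.asarray(f, float) / 440.0)
def midi_to_hz(m):  return 440.0 * 2 ** ((np.asarray(m, float) - 69) / 12)

def songify(f0_hz, syllable_slices, flatten=0.95):
    midi = hz_to_midi(f0_hz)
    out = midi.copy()
    for sl in syllable_slices:                        # one slice per syllable
        target = np.round(np.nanmedian(midi[sl]))     # quantize to nearest semitone
        out[sl] = target + (midi[sl] - target) * (1 - flatten)   # flatten the glide
    return midi_to_hz(out)

# Two hypothetical syllables, one rising and one falling glide
f0 = np.concatenate([np.linspace(180, 220, 20), np.linspace(260, 200, 20)])
print(songify(f0, [slice(0, 20), slice(20, 40)]).round(1))
```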

All musical stimuli were melodies containing 6–10 notes, with pitches ranging from E3 to F4 (fundamental frequency 164.8–349.2 Hz). There were three types of musical stimuli: Match, Mismatch, and Melody-Variation types. The musical stimuli of the Match type were melodies extracted from the linguistic stimuli using "MIDI extraction" in Cubase. As a result, each melody of the Match type closely resembled the melody of a linguistic stimulus. Each melody of the Mismatch type was generated from a melody of the Match type by elevating or lowering the pitches of 4–7 notes by 2–9 semitones while keeping the rhythm and tonality unchanged. The musical stimuli of the Melody-Variation type were paired melodies; the first melody (prime) was composed by the first author of this article, and the second melody (target) was generated from the first melody by elevating or lowering the pitches of 4–7 notes by 2–9 semitones while keeping the rhythm and tonality unchanged. All musical stimuli were in the C major or B-flat major tonalities and were generated with a virtual musical instrument named "oboe" using Reason 7.0 (Propellerhead Inc., Stockholm, Sweden).
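The Mismatch construction can be read as a constrained perturbation of a Match melody. The sketch below is a hypothetical implementation of that description for MIDI note numbers (it is not necessarily how the authors produced their stimuli): it shifts 4–7 randomly chosen notes by 2–9 semitones and snaps them back to the C-major scale, so rhythm and tonality are preserved.

```python
# Hypothetical Mismatch-melody generator for a list of MIDI note numbers.
import numpy as np

C_MAJOR_PCS = {0, 2, 4, 5, 7, 9, 11}                   # pitch classes of C major

def snap_to_scale(note):
    candidates = [n for n in range(note - 2, note + 3) if n % 12 in C_MAJOR_PCS]
    return min(candidates, key=lambda n: abs(n - note))

def make_mismatch(melody, seed=1):
    rng = np.random.default_rng(seed)
    melody = np.array(melody, dtype=int)
    n_change = min(int(rng.integers(4, 8)), len(melody))          # 4-7 notes
    idx = rng.choice(len(melody), size=n_change, replace=False)
    for i in idx:
        shift = int(rng.integers(2, 10)) * int(rng.choice([-1, 1]))   # 2-9 semitones, up or down
        melody[i] = snap_to_scale(int(melody[i]) + shift)             # keep the C-major tonality
    return melody

print(make_mismatch([60, 62, 64, 65, 67, 69, 71, 72]))   # a C-major melody of 8 notes
```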

There were five conditions in this study, as depicted in Figure 2. The experimental condition was the melody-language-match (ML-match) condition, in which the prime was a musical stimulus mimicking the melody of the target linguistic stimulus. In the noise-language (NL) condition, the prime was noise and the target was a linguistic stimulus. In the melody-melody-match (MM-match) condition, the prime and the target were the same musical melody. The NL and MM-match conditions were the two control conditions of this study. Compared to NL, ML-match additionally demanded attention to the melodic features of speech, pitch processing, and tonal working memory. Both ML-match and MM-match demanded pitch processing and tonal working memory, as participants were asked to judge the melodic similarity between the prime and target. Compared to MM-match, ML-match additionally demanded selective allocation of attention to the melodic features of speech. For audio examples of the stimuli for NL and ML-match, see Supplementary Materials.

The stimuli of the Mismatch and Melody-Variation types were used in the melody-language-mismatch (ML-mismatch) and melody-melody-mismatch (MM-mismatch) conditions, respectively. In the ML-mismatch condition, the prime was a musical melody that mismatched the melody of the target linguistic stimulus. In the MM-mismatch condition, the prime was a musical melody that mismatched the target musical melody. There were 20 trials in each of the experimental and control conditions, whereas there were 10 trials in each of the two mismatch conditions. The fMRI data for the two mismatch conditions were not analyzed. The target stimuli for ML-match and NL were counterbalanced across participants; the spoken utterances used as targets in ML-match for half of the participants were used as the target stimuli in NL for the other 10 participants, and vice versa.

**Figure 2.** Examples of spectrograms of the stimuli in the five conditions. (ML-match: melody-language-match; ML-mismatch: melody-language-mismatch; MM-match: melody-melody-match; MM-mismatch: melody-melody-mismatch; NL: noise-language).

#### *2.3. Procedure*

The current study included a training session and a scanning session, which were separated by 5–10 min. In the training session, the participants were trained outside the MRI scanner room to familiarize themselves with the tasks. The experimenter explained to each participant with a PowerPoint presentation and sound examples that the melody of a Mandarin utterance can be compared to a musical melody. For demonstration, the experimenter rated the melodic similarity between a target and a prime on a 4-point Likert scale for five trials (one trial for each condition) by pressing a button (a right-most button for "very similar", a right button for "slightly similar", a left button for "slightly dissimilar", and a left-most button for "very dissimilar"). In a similar manner, the participant practiced six trials, including two trials for ML-match and one trial for each of the other four conditions. This training session lasted approximately 15 min. None of the stimuli used in the training session were presented in the scanning session.

Schematic description of the procedure of the fMRI experiment is illustrated in Figure 3. There were five runs for the whole fMRI design, with three musical runs alternating with two linguistic (speech–melody) runs. The duration of each run was approximately 450 s. In the musical runs, participants were instructed to listen to famous symphonies and concertos or atonal random sequences. Methods and results for these musical runs will be detailed in another article.

**Figure 3.** Schematic description of the procedure of the functional magnetic resonance imaging (fMRI) experiment.

Auditory stimuli in the linguistic runs were delivered through scanner-compatible headphones at a volume sufficiently loud that participants could readily perceive the stimuli over the scanner noise. Eighty trials of the five conditions were presented in two linguistic runs in a pseudorandom order. Each trial began with a warning tone (2.8 kHz, 0.3 s). Then, the prime and target stimuli were presented sequentially. The participants were instructed to listen to them and to rate the similarity of their melodies by pressing a button (the right-most button for "very similar" and a similarity score of 4, the right button for "slightly similar" and a score of 3, the left button for "slightly dissimilar" and a score of 2, and the left-most button for "very dissimilar" and a score of 1). This task was used to assess whether participants were attending to the stimuli. In total, the pre-scan training session and the scanning session took approximately 85 min for each participant.

Twenty-four to twenty-seven months after the fMRI experiment, the participants were asked to fill out a short online questionnaire. Fourteen participants completed it. In this online questionnaire, auditory stimuli from five trials of the ML-match condition and five trials of the NL condition were randomly selected and presented in a random order. The participants rated each utterance (target stimulus) on a sliding scale indicating how speech-like or song-like it was. Then, they rated on a sliding scale how much they agreed or disagreed with each of two statements: 'I paid more attention to the musical melodies of the utterances that were preceded by matched melodies, compared to those preceded by noise'; 'I tended more to covertly imitate the utterances that were preceded by matched melodies, compared to those preceded by noise.' The results of this questionnaire were expected to reveal how the melody prime affected the processing of the target utterance.

#### *2.4. MRI Data Acquisition*

For imaging data collection, participants were scanned using a 3T MR system (MAGNETOM Prisma, Siemens, Erlangen, Germany) and a 20-channel array head coil at the Imaging Center for Integrated Body, Mind, and Culture Research, National Taiwan University. In the functional scanning, axial slices of approximately 2.5 mm thickness were acquired using a gradient echo planar imaging (EPI) sequence with the following parameters: time to repetition = 2500 ms, echo time = 30 ms, flip angle = 87°, in-plane field of view = 192 × 192 mm, and acquisition matrix = 78 × 78 × 45, covering the whole cerebrum. For spatial individual-to-template normalization in preprocessing, a Magnetization Prepared Rapid Gradient Echo (MPRAGE) T1-weighted image with a spatial resolution of 0.9 mm isotropic was acquired for each participant.
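
For orientation only, the following back-of-the-envelope sketch relates the acquisition parameters above to the run length reported earlier; the values are taken from the text, and the derived volume counts are approximations rather than the authors' exact numbers.

```python
# Rough bookkeeping for one functional run (assumed values from the text).
TR_S = 2.5              # repetition time in seconds
RUN_DURATION_S = 450    # approximate duration of one run
N_DUMMY = 4             # volumes later discarded for magnetic saturation (see Section 2.5)

n_volumes_per_run = int(RUN_DURATION_S / TR_S)     # ~180 volumes acquired per run
n_analyzed_per_run = n_volumes_per_run - N_DUMMY   # ~176 volumes entering analysis
print(n_volumes_per_run, n_analyzed_per_run)
```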

#### *2.5. Data Analyses*

One-sample one-tailed *t*-tests were used to determine whether each participant's ratings of prime-target similarity for the 20 trials in ML-match were significantly higher than the chance-level score of 2.5. Participants were excluded from analyses of fMRI data if their ratings for ML-match were not significantly higher than this chance-level score. For the final sample included in fMRI analyses, paired-sample *t*-tests were performed to assess differences in the similarity ratings between ML-match and ML-mismatch, as well as between MM-match and MM-mismatch. For the ratings of speech-like or song-like traits of the target stimuli in ML-match and NL, a paired-sample two-tailed *t*-test was used to assess the effect of the melody prime. For the ratings of participants' agreement with the two statements about auditory attention and subvocal imitation, one-sample two-tailed *t*-tests were used to determine whether these ratings were significantly greater than the neutral midpoint of the scale (neither agreed nor disagreed).
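
As an illustration of the behavioral statistics described above, the following minimal sketch uses SciPy; the rating arrays are hypothetical placeholders, not the authors' data.

```python
# Minimal sketch of the behavioural statistics described above, using SciPy.
import numpy as np
from scipy import stats

# One participant's 20 ML-match similarity ratings (1-4 scale), tested one-tailed
# against the chance-level score of 2.5 (hypothetical values)
ml_match_trials = np.array([4, 3, 4, 3, 3, 4, 2, 4, 3, 3, 4, 4, 3, 2, 4, 3, 4, 3, 3, 4])
t, p_two_sided = stats.ttest_1samp(ml_match_trials, popmean=2.5)
p_one_tailed = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2

# Paired-sample test across participants, e.g., ML-match vs. ML-mismatch mean ratings
ml_match_means = np.array([3.4, 3.1, 3.6, 2.9, 3.2])      # hypothetical
ml_mismatch_means = np.array([1.8, 2.0, 1.7, 2.1, 1.9])   # hypothetical
t_paired, p_paired = stats.ttest_rel(ml_match_means, ml_mismatch_means)
```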

Preprocessing and analyses of the fMRI data were performed using SPM12 (Wellcome Trust Centre for Neuroimaging, London, United Kingdom). The first four volumes of each run were discarded to allow for magnetic saturation effects. The remaining functional images were corrected for head movement artifacts and timing differences in slice acquisitions. Preprocessed functional images were coregistered to the individual's anatomical image, normalized to the standard Montreal Neurological Institute (MNI) brain template, and resampled to a 2-mm isotropic voxel size. Normalized images were spatially smoothed using a Gaussian kernel of 6-mm full width at half maximum to accommodate any anatomical variability across participants.
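
For readers who do not use SPM, the following partial sketch re-creates two of the steps above (dummy-volume removal and 6-mm smoothing) in nilearn; realignment, slice-timing correction, coregistration, and MNI normalization are not shown, and the file name is hypothetical.

```python
# Partial sketch of the preprocessing described above, using nilearn rather than SPM12.
from nilearn import image

run_img = image.load_img("sub-01_run-1_bold.nii.gz")   # hypothetical 4D functional run
run_img = image.index_img(run_img, slice(4, None))     # discard the first four volumes
smoothed = image.smooth_img(run_img, fwhm=6)           # 6-mm FWHM Gaussian kernel
smoothed.to_filename("sub-01_run-1_bold_trimmed_smoothed.nii.gz")
```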

We performed an event-related analysis to recover the response evoked by each target stimulus. Statistical inference was based on a random effect approach at two levels. The data of each participant were analyzed using the general linear model via fitting the time series data with the canonical hemodynamic response function (HRF) modeled at the event (target). Linear contrasts were computed to characterize responses of interest, averaging across fMRI runs. The group-level analysis consisted of two paired *t*-tests for (1) the contrast of ML-match minus NL, and (2) the contrast of ML-match minus MM-match. We then identified regions that were significantly active for both the ML-match minus NL contrast and the ML-match minus MM-match contrast. This was done because both ML-match and MM-match involved pitch processing and tonal working memory processing, as the melody of the prime needed to be stored and compared to that of the target. To reveal activation clusters related to the perceptual transformation from speech to song across a musical prime, we applied an inclusive mask of the ML-match minus MM-match contrast on the ML-match minus NL contrast. In the fMRI analyses, statistical significance was thresholded at FDR-corrected *p* < 0.05 with a minimum cluster size of 10 voxels.
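
The inclusive-masking (intersection) step can be illustrated with the following sketch, which assumes two already-thresholded group-level maps (FDR *p* < 0.05) saved as NIfTI files; the file names are hypothetical, this is not the authors' SPM code, and the 10-voxel cluster-extent criterion is not implemented.

```python
# Illustrative intersection of two thresholded contrast maps.
import numpy as np
import nibabel as nib

map_ml_vs_nl = nib.load("MLmatch_minus_NL_thresh.nii.gz")
map_ml_vs_mm = nib.load("MLmatch_minus_MMmatch_thresh.nii.gz")

# Keep only voxels that survive the threshold in BOTH contrasts
both = (map_ml_vs_nl.get_fdata() > 0) & (map_ml_vs_mm.get_fdata() > 0)
intersection = nib.Nifti1Image(both.astype(np.uint8), affine=map_ml_vs_nl.affine)
intersection.to_filename("MLmatch_intersection.nii.gz")
```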

#### **3. Results**

Analysis of the subjective ratings of prime-target similarity showed that one participant's ratings for ML-match were not significantly higher than chance level (*p* = 0.44). This participant was excluded from the analyses of fMRI data because she was unable to discriminate between matched and mismatched trials at better than chance levels. The data of another participant were discarded because of incomplete behavioral data acquisition during the fMRI session. The final sample for fMRI analyses was therefore 18 participants, whose similarity ratings for ML-match were significantly higher than the chance-level score (*p* < 0.001). Figure 4 displays their rating data for five conditions. The similarity ratings for ML-match were significantly higher than ML-mismatch (*p* < 0.0001), and those for MM-match were significantly higher than MM-mismatch (*p* < 0.0001).

Analysis of the ratings of speech-like or song-like traits of the target stimuli in ML-match and NL showed that the target stimuli in ML-match were perceived as significantly more song-like than those in NL (*p* < 0.001). Analysis of the ratings of participants' agreement with the two statements showed that the participants paid significantly more attention to the musical melodies of the utterances that were preceded by matched melodies, compared to those preceded by noise (*p* < 0.005). The participants also reported a significantly greater tendency to covertly imitate the utterances that were preceded by matched melodies, compared to those preceded by noise (*p* < 0.01), as shown in Figure 5.

**Figure 4.** Participants' ratings of prime-target similarity for five conditions. Error bars indicate standard deviation.

**Figure 5.** Results of the online questionnaire after the fMRI experiment. (**a**) Rating scores of speech-like or song-like traits of the target stimuli in NL and ML-match showed that the target stimuli in ML-match were perceived as more song-like than those in NL. (**b**) Rating scores of participants' agreement with the two statements showed that the participants paid more attention to the musical melodies of the utterances that were preceded by matched melodies and showed a greater tendency to covertly imitate these utterances, compared to the target utterances that were preceded by noise. Error bars indicate standard deviation. Note: \* *p* < 0.01, \*\* *p* < 0.005, \*\*\* *p* < 0.001.

Results of the whole-brain analyses of fMRI data are summarized in Table 1, Table 2, and Figure 6. Compared to the NL condition, the ML-match condition was associated with significantly increased activity in a number of regions, including the motor/premotor cortex, superior parietal areas, Rolandic operculum, temporal pole, anterior insula, IFG, superior/middle temporal gyrus (STG/MTG), SMA, caudate, thalamus, and cerebellum. Compared to the MM-match condition, the ML-match condition was associated with significantly increased activity in STG/MTG, temporal pole, IFG, anterior insula, superior parietal areas, hippocampus, SMA, putamen, caudate, and cerebellum. The intersection of ML-match minus NL and ML-match minus MM-match yielded activity in IFG, STG/MTG, dorsal premotor cortex, temporal pole, anterior insula, SMA, caudate, and thalamus.

**Table 1.** Activation clusters for the contrasts of ML-match minus NL and ML-match minus MM-match. (MNI: Montreal Neurological Institute; ML-match: melody-language-match; ML-mismatch: melody-language-mismatch; MM-match: melody-melody-match; MM-mismatch: melody-melody-mismatch; NL: noise-language).


**Figure 6.** Group-level activation maps for ML-match minus NL (red), ML-match minus MM-match (blue), and their intersection (yellow).

**Table 2.** Activation clusters for the intersection of ML-match minus NL and ML-match minus MM-match.




#### **4. Discussion**

Spoken utterances in tonal languages are intrinsically rich in melodic content, and therefore the differentiation between speech and song in tonal languages is sometimes difficult to make. When a Mandarin speech-like utterance is preceded by a musical melody that mimics the speech melody, the listener may perceive this utterance as if it were being sung. In the present study, we used fMRI to explore this melody priming effect on pitch processing of Mandarin speech. Pitch contours of spoken utterances were modified so that the utterances could be perceived as either speech or song. Participants were asked to rate the melodic similarity between the prime and target. Analyses of fMRI data revealed increased activity in a number of regions for the intersection of speech preceded by matched music minus speech preceded by noise (ML-match > NL) and speech preceded by matched music minus music preceded by identical music (ML-match > MM-match), including the bilateral IFG, anterior insula, SMA, and STG/MTG. This finding echoes previous hypotheses and results of the speech-to-song illusion that exposure to repetition of a speech fragment is associated with greater activity in the neural substrates of pitch processing and re-evaluation of melodic features of this speech fragment [6,7,9].

The task of judging the melodic similarity between the prime and target in ML-match demanded the processing of melodic features of the target. Based on prior research on the neural correlates of prosody processing, we speculate that the right IFGtri, which showed activity for the intersection of ML-match minus NL and ML-match minus MM-match, may support the melodic processing of the speech-like target in ML-match. It has been reported that listening to "prosodic" speech (speech with no linguistic meaning, but retaining the slow prosodic modulations of speech) was associated with enhanced activity in the right IFGtri (extending into the pars opercularis of IFG) compared to normal speech [10]. The right IFGtri also responded to pitch patterns in song [16]. Moreover, a study of sarcasm comprehension in the auditory modality demonstrated that negative prosody incongruent with positive semantic content activated the right anterior insula extending into the IFGtri [12]. During perception of neutral, sad, and happy prosody, individuals with autism spectrum disorder displayed reduced activity in the right IFGtri compared to normal controls [10]. Taken together, we suggest that the right IFGtri may allocate attentional resources for the melodic processing of the target stimulus in ML-match. This view is supported by participants stating that they paid more attention to the melodic features of the target stimuli in ML-match, compared to those in NL.

One may speculate that the right IFGtri activity for ML-match minus MM-match reflects its role in working memory. However, both ML-match and MM-match involved a comparison of two melodies, a task demanding tonal working memory. One interpretation of increased activity in the right IFGtri for ML-match is that the task of melody comparison in ML-match preferentially relied on action-related sensorimotor coding of tonal information, whereas this coding played a lesser role in MM-match. There has been evidence indicating that the right IFGtri is engaged in the multi-modal processing of tonal or verbal information. For example, McCormick and colleagues investigated the neural
basis of the crossmodal correspondence between auditory pitch and visuospatial elevation, finding a modulatory effect of pitch-elevation congruency on activity in the IFGtri and anterior insula [17]. Golfinopoulos and colleagues demonstrated that the right IFGtri exhibited increased activity when speech production was perturbed by unpredictably blocking subjects' jaw movements [18]. Moreover, a study of sensory feedback to vocal motor control also reported that trained singers showed increased activation in the right IFGtri, anterior insula, and SMA in response to noise-masking [19]. This finding is especially relevant to our study, as we found co-activation of the right IFGtri, anterior insula, and SMA for the intersection of ML-match minus NL and ML-match minus MM-match. We suggest that the participants may use subvocal rehearsal to facilitate the task of melody comparison in ML-match. Indeed, our participants reported that they more tended to covertly imitate the target stimuli in ML-match compared to those in NL. During covert vocal imitation of the target stimulus, the anterior insula may be responsible for the laryngeal somatosensory functions and voice pitch control [20–22], the SMA may support motor planning and monitoring/evaluation of this plan [23–27], and the right IFGtri may allocate cognitive resources to integrate the auditory coding and action-related sensorimotor coding of the melodic pattern of the target.

Besides speech perception, the finding of the involvement of the bilateral STG/MTG in the melody priming effect on the pitch processing of Mandarin speech also provides an enriched perspective on song perception. Results of the online questionnaire showed that the target stimuli in ML-match were perceived as more song-like than those in NL. We found that the left mid-posterior STG/MTG was activated for the intersection of ML-match minus NL and ML-match minus MM-match. This cluster was effectively identical to that described by two previous neuroimaging studies on song perception. Sammler and colleagues reported mid-posterior STS activation for the interaction effect of lyrics and tunes during passive listening to unfamiliar songs, suggesting its role in the integrative processing of lyrics and tunes at prelexical, phonemic levels [14]. Alonso and colleagues reported that binding lyrics and tunes for the encoding of new songs was associated with the involvement of the bilateral mid-posterior MTG [15]. We suggest that the right STG/MTG may integrate the musical (melodic) and phonological information of the targets in ML-match. This view parallels an earlier report finding that STG/MTG was activated for the intersection of listening to sung words minus listening to "vocalize" (i.e., singing without words) and listening to sung words minus listening to speech [28].

A study of the speech-to-song illusion [9] showed a positive correlation between the subjective vividness of this illusion and activity in the pars orbitalis of the bilateral IFG, which also exhibited activation for the intersection of ML-match minus NL and ML-match minus MM-match in the present study. These regions have been implicated in a broad range of cognitive processes, such as response inhibition [29,30], response selection [31], working memory [32,33], semantic processing [34,35], and prosody processing [36]. The pars orbitalis of the bilateral IFG appeared to contribute to certain high-level cognitive processes necessary for the melody-similarity-judgment task. Its exact role remains to be specified by future research.

A few limitations of the present study should be noted. First, in the speech-to-song illusion [6] a spoken phrase was repeated without modification, whereas we modified the pitch contours of spoken Mandarin utterances so that they differed from normal speech. Caution should therefore be exercised when comparing the results of this study with those of the speech-to-song illusion. Future research could explore how the manipulations of pitch flattening and the clarity of tonality of spoken utterances impact the melody priming effect on the pitch processing of Mandarin speech. It would also be interesting to examine whether native speakers, non-native speakers (second language speakers), and non-speakers differ in the pitch processing of "songified" speech. A previous study compared speech-to-song illusions in tonal and non-tonal language speakers, finding that both non-tonal native language and inability to understand the speech stream as a verbal message predicted the speech-to-song illusion [37]. Second, the final sample included in fMRI analyses mainly consisted of amateur musicians. We cannot ascertain whether this melody priming effect can also be observed in
non-musicians. Specific musical or cognitive abilities may correlate with the tendency of perceptual transformation from Mandarin speech to song. However, this idea remains to be tested in future studies.

#### **5. Conclusions**

The present study has examined the neural underpinnings of the perceptual transformation from modified Mandarin speech to song across a musical prime that mimics the melody of speech. Based on our fMRI data and previous literature, we suggest that the right IFGtri may play a role in allocation of attentional resources to the multi-modal processing of the melodic pattern of this stimulus. Moreover, the STG/MTG may integrate its phonological and musical (melodic) information. While these findings corroborate and extend previous studies on the speech-to-song illusion, we believe that further exploration of the melodic characteristics of tonal and non-tonal languages would significantly advance our understanding of the relationship between speech and song.

**Supplementary Materials:** Audio examples of the stimuli for NL and ML-match are available online at http://www.mdpi.com/2076-3425/9/10/286/s1.

**Author Contributions:** Conceptualization, C.-G.T.; methodology, C.-G.T. and C.-W.L.; software, C.-W.L.; validation, C.-G.T. and C.-W.L.; formal analysis, C.-G.T. and C.-W.L.; investigation, C.-G.T. and C.-W.L.; resources, C.-G.T. and C.-W.L.; data curation, C.-G.T. and C.-W.L.; writing—original draft preparation, C.-G.T. and C.-W.L.; writing—review and editing, C.-G.T. and C.-W.L.; visualization, C.-W.L.; supervision, C.-G.T.; project administration, C.-G.T.; funding acquisition, C.-G.T.

**Funding:** This research was funded by grant projects (MOST 106-2420-H-002-009 and MOST 108-2410-H-002-216) from the Ministry of Science and Technology, Taiwan.

**Acknowledgments:** The authors would like to express their gratitude to Prof. Tai-Li Chou for helpful discussion. We also thank Chao-Ju Chen for data collection.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Electrical Brain Responses Reveal Sequential Constraints on Planning during Music Performance**

**Brian Mathias 1,2,\*, William J. Gehring 3 and Caroline Palmer 1,\***


Received: 11 January 2019; Accepted: 26 January 2019; Published: 28 January 2019

**Abstract:** Elements in speech and music unfold sequentially over time. To produce sentences and melodies quickly and accurately, individuals must plan upcoming sequence events, as well as monitor outcomes via auditory feedback. We investigated the neural correlates of sequential planning and monitoring processes by manipulating auditory feedback during music performance. Pianists performed isochronous melodies from memory at an initially cued rate while their electroencephalogram was recorded. Pitch feedback was occasionally altered to match either an immediately upcoming Near-Future pitch (next sequence event) or a more distant Far-Future pitch (two events ahead of the current event). Near-Future, but not Far-Future altered feedback perturbed the timing of pianists' performances, suggesting greater interference of Near-Future sequential events with current planning processes. Near-Future feedback triggered a greater reduction in auditory sensory suppression (enhanced response) than Far-Future feedback, reflected in the P2 component elicited by the pitch event following the unexpected pitch change. Greater timing perturbations were associated with enhanced cortical sensory processing of the pitch event following the Near-Future altered feedback. Both types of feedback alterations elicited feedback-related negativity (FRN) and P3a potentials and amplified spectral power in the theta frequency range. These findings suggest similar constraints on producers' sequential planning to those reported in speech production.

**Keywords:** sensorimotor learning; sequence production; sequence planning; feedback monitoring; EEG; N1; FRN; music performance; music cognition; altered auditory feedback

#### **1. Introduction**

Many everyday behaviors, such as having a conversation, writing a note, and driving a car, involve the production of action sequences. A core tenet of theories of sequential behavior is that, in order to produce action sequences quickly and accurately, individuals must plan appropriate movements prior to their execution [1]. Evidence for future-oriented planning during sequential tasks comes from anticipatory ordering errors, in which upcoming sequence events are produced earlier in the sequence than intended. Documented in both speech production [2,3] and music performance [4–6], anticipatory errors suggest that producers have access to a range of upcoming events in a sequence at any given time during production. The span of sequence positions between an event's correct (intended) position and its incorrect (error-produced) position is taken to indicate a producer's range of planning, due to the items' simultaneous accessibility [5]. Serial ordering errors during music performance tend to arise more often from closer sequence distances than from farther distances [7–9]. This tendency suggests that producers have increased access to events intended for nearer in the future compared to events that are intended for farther ahead in the future [8]. These proximity constraints on planning have
been attributed to interference of future events with current events during memory retrieval, decay of item information over time, and individual differences in producers' working memory spans [8,10,11]. Sequential models of sequence planning in both speech [12] and music [8] use the term "planning gradient" to refer to a decrease in memory activation of upcoming sequence events as the distance from the current event increases.

In addition to planning upcoming units of speech and music during production, speakers and musicians monitor the perceptual outcomes of their previous productions. In order to monitor perceptual outcomes during auditory-motor tasks, producers compare perceived auditory feedback with an intended auditory outcome [10]. Theoretical approaches to feedback monitoring have focused heavily on the concept of internal models, or representations that simulate a response to estimate an outcome (for a review, see [13]). Internal models are thought to arise from interactions between bottom-up, incoming sensory information and top-down expectations or predictions formed by the motor system during production [14]. A framework known as "predictive coding" assumes that the goal of production is to minimize prediction error (i.e., mismatches between predictions that are generated by an internal model and sensory information originating in the environment) [15,16]. Musicians possess strong associations between musical actions and their sensory outcomes [17], which may explain why the perception of inaccurate auditory feedback during the production of auditory-motor sequences can disrupt production [18]. Mismatches between auditory feedback and musicians' planned movements [7], as well as nonmusicians' planned movements, can generate prediction errors, as evidenced by an increased error-related negativity [19]. Experimentally altering the contents of pitch feedback during music performance can disrupt the regular timing of key presses [20] and increase pitch error rates [21,22]. The computer-controlled removal of auditory feedback in laboratory environments does not disrupt well-learned performance; musicians can continue performing well-learned music when auditory feedback is removed [23,24], and altering feedback so that it is highly different from expected feedback has little effect on a previously learned performance [25]. Performance is disrupted when the altered auditory feedback is similar to the planned events [25,26]. Thus, current evidence suggests that disruption caused by altered auditory feedback may depend on similarity-based interference with planned sequence representations, leading to a novel prediction: If the planning of future events occurs in a graded fashion (higher activation for immediately upcoming events compared to more distant future events), then altered feedback that matches immediately upcoming events should disrupt performance more than altered feedback matching sequentially distant events.

To our knowledge, no studies in the domain of speech production have tested neural effects of hearing upcoming linguistic content that is presented sooner than expected while speaking. This may be due to the difficulties of independently manipulating auditory feedback during concurrent speech production. Auditory feedback during speech production can be electronically delayed, so that instead of hearing current feedback, one hears feedback that matches previous utterances. Presenting linguistic content that matches upcoming utterances is more difficult, however, because it requires the presentation of speech content that has not yet been produced by the speaker. In electronic music performance, one can present auditory feedback that matches future keypresses, due to a simpler sound production apparatus. One study examined musicians' neural responses to future-oriented altered auditory feedback as they performed tone sequences on a piano [7]. Occasional alterations in auditory feedback were presented that matched upcoming (future) events as pianists performed melodies from memory. As an example of future-oriented feedback, if a pianist was currently producing tone A and planning to produce tone B later in the sequence, tone B would be presented auditorily when the pianist's hand struck the A-key on the keyboard. Future-oriented feedback pitches elicited larger event-related potential (ERP) responses than altered feedback that matched previous (past) events, and amplitudes of the ERPs elicited by the altered feedback pitches correlated with the amount of temporal disruption elicited in the pianists' key presses [7]. It is unknown, however, whether disruptive effects of altered auditory feedback that match future events depend on an individual's
planning gradient: If producers' plans are biased toward the activation of immediately upcoming events compared to events planned for the distant future, then we would expect pitch feedback that matches immediate future events to generate greater similarity-based interference, and in turn greater performance disruption, than future-oriented feedback that matches distant future events. Thus, future-oriented theories of planning predict greater performance disruption for altered feedback that matches near future events compared to far future events.

Several studies have suggested that sensory processing of altered auditory feedback during production is marked by early- to middle-latency ERP responses to tone onsets [27–29]. N1 and P2 ERP components in particular are sensitive to whether speech or tones are generated by oneself versus others [30–32]. The N1 is a negative-going ERP component that peaks at about 100 ms following sound onsets and is followed by the positive-going P2 component [33,34]. Amplitudes of these components are more negative when sounds are generated by others than when they are self-generated, which is thought to reflect motor-induced suppression of auditory cortical processing [35,36]. Perceptual studies have demonstrated that N1 and P2 amplitudes also become more negative (larger N1 and smaller P2) in response to tones that are selectively attended to, compared to unattended tones, suggesting a role of these components in early auditory sensory processing [37–40]. N1 and P2 waves occur in quick succession about 50–150 ms following sound onsets, and arise from several temporally overlapping, spatially-distributed sources, with primary generators in the auditory cortices and planum temporale [33,34,41–43]. Thus, N1 and P2 amplitudes may serve as a proxy for the degree to which sensory processing of auditory feedback is suppressed during sound production: A negative-going shift in amplitudes occurs when processing of auditory feedback is enhanced, and a positive-going shift occurs when processing of auditory feedback is suppressed. Combined with the notion that future-oriented feedback that matches near future events may generate greater similarity-based interference than feedback matching far future events, this principle leads to the prediction that altered auditory feedback that matches near future events should decrease the expected N1 and P2 suppression compared to altered feedback that matches far future events.

Additional ERP components linked to action-related expectations are elicited when sensory feedback indicates that an action has resulted in an unexpected outcome. Frontally maximal feedback-related negativities (FRNs) are elicited roughly 150–250 ms following the unexpected outcome in music performance tasks [44–47], as well as during other tasks, such as reward prediction and monetary gambling tasks [48]. FRN amplitudes may be associated with the degree to which unexpected feedback violates a producer's feedback-related expectations [49–53]. The FRN component often co-occurs with neural oscillations in the theta frequency range (4–8 Hz), thought to reflect the implementation of cognitive control [54–56]. The FRN is typically followed by a frontally-maximal P3a component, which peaks around 300–500 ms following the onset of unexpected feedback. The P3a may reflect the updating of stimulus memory representations [57,58], decision-making processes [59,60], and voluntary shifts of attention to unexpected stimuli [61,62]. If altered auditory feedback during music performance triggers the emergence of a more cognitively-controlled (e.g., deliberative, goal-directed, model-based, prefrontal) state, as opposed to a habitual (e.g., automatic, model-free, striatal) performance state [63], then we would expect theta frequency activity to be enhanced following any feedback alterations, and to be accompanied by FRN and P3a potentials. A benefit of extracting theta band activity related to the FRN is that it can account for potential overlap of neighboring FRN and P3a potentials in the ERP waveform [56,64].

The current study investigated the relationship between performers' planning and feedback monitoring processes by presenting altered auditory feedback corresponding to upcoming (future) sequence events during music performance. The timing of pianists' key presses in response to altered auditory feedback pitches was measured. Pianists memorized and performed isochronous melodic sequences on an electronic keyboard while hearing feedback triggered by their key presses over headphones. Altered pitch feedback was manipulated in four conditions: Future +1 ("near future"), future +2 ("far future"), noncontextual, and baseline. In the future +1 condition, participants heard an
altered pitch presented at the current location that matched the intended (memorized) pitch at the next location in the sequence. In the future +2 condition, participants heard an altered pitch presented at the current location that matched the intended (memorized) pitch at the location two events ahead of the current location. In the noncontextual condition, participants heard a pitch that was not present in the sequence; this control condition tested effects of hearing an altered feedback pitch that was unrelated to performers' planning processes. Finally, in the baseline condition, participants heard the expected auditory feedback with no pitch alterations.

We tested three predictions: First, near future (future +1) altered auditory feedback was expected to induce greater interference with the production of currently planned events than far future (future +2) altered auditory feedback. This prediction is based on producers' use of planning gradients, in which plans are weighted toward near compared to distant sequence events [8,12]. Pianists were therefore expected to show greater temporal disruption following future +1 altered auditory feedback compared to future +2 altered feedback. Second, we expected performance disruption to be associated with decreased N1 and P2 suppression following future +1 feedback compared to future +2 feedback. Third, future +1, future +2, and noncontextual altered feedback pitches were expected to elicit FRN and P3a ERP components (relative to the baseline condition), as well as corresponding theta oscillations within the timeframe of the FRN.

#### **2. Materials and Methods**

#### *2.1. Participants*

Twenty-eight right-handed adult pianists with at least 6 years of private piano instruction were recruited from the Montreal community. Four participants were excluded from analysis because insufficient data remained after removal of memorized-performance trials containing pitch errors (*n* = 3) or EEG artifacts (*n* = 1). The remaining 24 pianists (15 women, age *M* = 21.1 years, SD = 2.7 years) had between 6 and 20 years of piano lessons (*M* = 11.5 years, SD = 3.9 years). Participants reported having no hearing problems. Two of the pianists reported possessing absolute pitch. Participants provided written informed consent, and the study was reviewed by the McGill University Research Ethics Board.

#### *2.2. Stimulus Materials*

Four novel melodies that were notated in a binary meter (2/4 time signature), conforming to conventions of Western tonal music, were used in the study. An example of a melody is shown in Figure 1. All melodies were isochronous (containing only 8 quarter notes), were notated for the right hand, and were designed to be repeated without stopping 3 times in each trial (totaling 24 quarter-note events). Each stimulus melody was composed to have no repeating pitches within two sequence positions. The 4 melodies were composed in the keys of G major, D minor, C major, and B minor. Suggested fingering instructions were also notated.

During the experiment, auditory feedback pitches triggered by participants' key presses while performing the melodies were occasionally replaced by an altered pitch. The altered pitches were chosen from the same diatonic key as the original melody to maintain the melodic contour of the original melody, and to avoid tritone intervals. Altered feedback pitches occurred in one of 8 possible locations within each trial. As metrical accent strength has been found to influence both correct (error-free) music performance and the likelihood of performance errors [8,65,66] among performing musicians, half of the altered feedback locations occurred at odd-numbered serial positions in the tone sequence (aligning with strong metrical accents in the melody's binary time signature), and the other half occurred at even-numbered serial positions (aligning with weak metrical accents in the melody's time signature).

Examples of potential altered feedback pitches for one stimulus melody are shown in Figure 1. In the future +1 condition, participants heard the pitch that corresponded to the next intended (memorized) pitch in the melodic sequence when they pressed the piano key. In the future +2
condition, participants heard the pitch that corresponded to the intended pitch that was 2 events ahead in the melodic sequence. In the noncontextual condition, participants heard a pitch from the melody's diatonic key (determined by the key signature of each stimulus melody) that was not present in the melodic sequence. Noncontextual pitches were chosen to match the contour and interval size as closely as possible to that of the intended pitch. The noncontextual condition was intended to serve as a control condition, to test effects of hearing an altered feedback pitch that was unrelated to performers' planning processes. Finally, in a baseline condition, no auditory feedback pitches were altered (participants heard the intended auditory feedback).

Each stimulus trial contained three and a half continuous iterations (without pausing) of a repeated melody (described below in Procedure). Each trial began with a 12-beat metronome sounded every 500 ms; the first four beats indicated the intended pace and the remaining eight beats coincided with the pianists' first iteration of the melody, forming the synchronization phase of the trial (see Figure 2). The metronome then stopped and the pianists continued performing for two and a half more iterations of the melody, forming the continuation phase of the trial. Altered feedback pitches could occur during the continuation phase only. A minimum of zero and a maximum of two pitches were altered within a single trial, with a maximum of one altered pitch per melody iteration. When two altered pitches occurred in a single trial, they were always separated by at least three unaltered pitch events. No alterations occurred on the first pitch of any iteration or on the last four pitches of any trial.
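
The placement rules above can be made concrete with the following hypothetical helper, which encodes one reading of those constraints (continuation phase only, at most one alteration per iteration and two per trial, at least three unaltered events between alterations, never on the first pitch of an iteration or the last four pitches of a trial); it is not the authors' stimulus-generation code.

```python
# Hypothetical sampler for altered-feedback positions under the stated constraints.
import random

EVENTS_PER_ITERATION = 8     # each melody contains 8 quarter-note events
CONTINUATION_EVENTS = 20     # 2.5 continuation iterations = 20 events

def sample_altered_positions(n_alterations=None):
    candidates = [
        i for i in range(CONTINUATION_EVENTS)
        if i % EVENTS_PER_ITERATION != 0         # not the first pitch of an iteration
        and i < CONTINUATION_EVENTS - 4          # not the last four pitches of the trial
    ]
    if n_alterations is None:
        n_alterations = random.choice([0, 1, 2])  # minimum zero, maximum two per trial
    random.shuffle(candidates)
    chosen = []
    for pos in candidates:
        if len(chosen) == n_alterations:
            break
        same_iteration = any(pos // EVENTS_PER_ITERATION == c // EVENTS_PER_ITERATION
                             for c in chosen)
        too_close = any(abs(pos - c) < 4 for c in chosen)  # >= 3 unaltered events between
        if not same_iteration and not too_close:
            chosen.append(pos)
    return sorted(chosen)

print(sample_altered_positions())
```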

**Figure 1.** Example of a notated stimulus melody. Sample altered feedback pitches for the four auditory feedback conditions (baseline, noncontextual, future +1, and future +2), and the three target event positions (−1, 0, +1) over which interonset intervals (IOIs) and event-related potentials (ERPs) were analyzed are shown. Target event positions are numbered with respect to the distance of the altered feedback from its intended sequence position. Arrows show the location at which the altered feedback pitches occurred, and dashed lines indicate the origin of the altered feedback pitches.

**Figure 2.** Synchronization-continuation trial. Participants synchronized the first iteration of each melody with a metronome ('Synchronization'), and then performed two and a half additional melody iterations without the metronome ('Continuation'). Four initial metronome beats set the performance tempo. The metronome sounded every 500 ms.

#### *2.3. Equipment*

Participants performed the stimulus melodies on a Roland RD-700SX musical instrument digital interface (MIDI) digital piano keyboard (Roland Corporation, Ontario, CA, USA) in a sound- and electrically-attenuated chamber while EEG was recorded. As pianists performed, sound was emitted from a Roland Edirol SD-50 system (Roland Corporation, Ontario, CA, USA) and delivered through EEG-compatible air-delivery earphones (ER1-14B, Etymotic Research). Two channels were used for auditory feedback: "GMT piano 002" for piano key press auditory feedback, and "Rhy 001" for the metronome that signaled the performance rate at the start of each trial. Auditory feedback pitches were controlled using FTAP version 2.1.06 [67]. FTAP presented pre-programmed pitches at the time that pianists pressed each key, and measured key press timing information with 1-ms resolution.

#### *2.4. Design*

The study used a repeated measures within-participant design in which altered auditory feedback pitches were manipulated in four conditions: Future +1, future +2, noncontextual, and baseline. Participants completed trials in three blocks, each corresponding to an altered auditory feedback type (future +1, future +2, and noncontextual). Each block contained 32 trials, 50% of which contained no altered auditory feedback (baseline condition) and 50% of which contained an altered feedback pitch (future +1, future +2, or noncontextual). Each trial containing altered auditory feedback was unique across the entire experiment and therefore was heard only once by participants. Block and melody orders were counterbalanced across the 24 participants. Participants performed a total of 96 (3 blocks × 32) trials, equivalent to 192 continuation iterations (32 future +1, 32 future +2, 32 noncontextual, and 96 baseline), over the course of the entire experiment. The dependent variables of the tone interonset interval (IOI), ERP component amplitudes, and theta band power were analyzed at sequential positions −1, 0, and +1 relative to the altered tone location (as shown in Figure 1).

#### *2.5. Procedure*

Participants first completed a musical background questionnaire, followed by a piano performance memory test. Participants were then presented with a short novel right-hand melody (not included in the experiment) to practice and memorize; those who were able to memorize and perform it to a note-perfect criterion within three attempts, after up to three minutes of practice with the music notation, were invited to participate in the experiment. All pianists met this criterion. Following completion of the memory test, participants were outfitted with EEG caps and electrodes.

Participants were then asked to complete three practice trials in order to become familiar with the task. At the start of the practice trials, the participants were again presented the music notation of the single-hand melody that they had previously performed in the memory test. They were asked to indicate when they had memorized the melody. The music notation was then removed and replaced with a fixation cross. Participants were then asked to perform the melody from memory at the rate indicated by four clicks of a metronome cue (500 ms per quarter note beat). They were told that they would sometimes hear a tone that did not match the key that they pressed, but that they should keep performing at the rate cued by the metronome and try not to stop or slow down. Participants were also instructed to view the fixation cross while they were performing. The purpose of the fixation cross was to inhibit large eye movements and control participants' gaze locations during the performance task, following other EEG studies [68,69]. During each of the three practice trials, a single feedback pitch was altered to correspond to the future +1, future +2, and noncontextual experimental conditions. The order of the three practice trials was counterbalanced across participants.

Following the three practice trials, participants were presented with the music notation of one of the four experimental stimulus melodies. They were asked to practice the melody for a maximum of three minutes, using the notated fingering, with the goal of performing it from memory. Following memorization, the notation was removed and replaced with a fixation cross. Participants
then performed the melody from memory in the synchronization-continuation trials. The first three synchronization-continuation trials contained no altered feedback, so that the experimenters could verify that participants had successfully memorized the melody; all participants were able to perform at least one of the three verification trials without producing any pitch errors.

In each synchronization-continuation trial, participants were instructed to perform the melody from memory at the rate indicated by the metronome (500 ms per quarter-note beat), to not stop or slow down if they heard a tone that did not match the key that they pressed, and to continuously repeat the melody until they stopped hearing auditory feedback from their key presses. The metronome stopped when the participant began the second iteration of the melody. Participants were asked to refrain from moving their head or body while performing in order to minimize movement-related EEG artifacts. Eyeblinks typically create artifacts in the EEG signal, which can be addressed using a variety of artifact rejection procedures (for a review, see [70]). In order to minimize eyeblink-related artifacts, participants in some studies may be asked to refrain from blinking during certain parts of EEG trials. Since the duration of each synchronization-continuation trial in the current study exceeded 15 s, participants were not asked to refrain from blinking during the trial. Following each trial, participants indicated when they were ready to proceed to the next trial. This procedure was repeated for each of the 4 stimulus melodies and for each of the 3 feedback blocks. The synchronization-continuation trials lasted approximately 45 min. At the end of the experiment, participants were asked if they noticed any specific aspects of the altered feedback or its manipulation across the experiment; none of the participants reported an awareness of any relationship between the altered feedback and performance.

#### *2.6. Data Recording and Analysis*

#### 2.6.1. Behavioral Data

Behavioral disruption associated with the presentation of altered auditory feedback was evaluated by analyzing IOIs from the time of one key press to the next key press (in ms) for pitches that occurred before (position −1), during (position 0), and after (position +1) the altered auditory feedback pitch (see Figure 1). Errors in pitch accuracy were identified by computer comparison of pianists' performances with the information in the notated musical score (Large, 1993). Pitch errors were defined as pitch additions, deletions, and corrections (errors in which pianists stopped after an error and corrected their performance). A mean of 7.9% of trials (SD = 7.3%) across subjects and conditions contained pitch errors; these trials were excluded from analyses, since any error that added or subtracted a tone from the melodic sequence changed the relationship between the participants' key presses and the pre-programmed auditory feedback.
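
The IOI measure can be sketched as follows; the key-press onset times are hypothetical (in practice they would come from the FTAP log), and assigning each IOI to the key press that initiates it is one possible convention, assumed here rather than taken from the authors' analysis code.

```python
# Sketch of the IOI computation around an altered-feedback key press.
import numpy as np

onsets_ms = np.array([0, 498, 992, 1490, 1985, 2440, 2975, 3473])  # hypothetical onsets
iois = np.diff(onsets_ms)        # interonset intervals between successive key presses

altered_index = 4                # hypothetical index of the altered-feedback key press
iois_by_position = {
    -1: iois[altered_index - 1],
    0: iois[altered_index],      # interval during which the altered tone was heard
    +1: iois[altered_index + 1],
}
print(iois_by_position)
```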

#### 2.6.2. EEG Data

Electrical activity was recorded at the scalp using a 64-channel Ag/AgCl electrode BioSemi ActiveTwo System (BioSemi, Inc., Amsterdam, The Netherlands). A sampling rate of 1024 Hz, recording bandwidth of 0 to 205 Hz, and resolution of 24 bits were used. Electrode locations were prescribed by the 10–20 international electrode configuration system. Horizontal and vertical eye movements were monitored by electrodes placed adjacent to the outer canthi of the eyes and above and below the right eye, respectively.

EEG data were analyzed using BrainVision Analyzer 2.0.2 (Brain Products GmbH, Gilching, Germany). Activity was re-referenced off-line to the average of all scalp electrodes, and signals were bandpass-filtered between 0.1 and 30 Hz. The EEG data were then segmented into 500 ms epochs beginning 100 ms prior to and continuing 400 ms after pitch onsets at positions −1, 0, and +1. Activity during the 100 ms prior to pitch onsets served as a baseline. An epoch duration of 500 ms was selected since it included activity that was shorter than three standard deviations below the mean IOI (=487 ms) of key presses recorded during the continuation period, and therefore avoided contamination of the observed waveforms with ERPs related to the subsequent pitch onset. Artifact rejection was performed
automatically using a ±50 μV rejection threshold at the 64 scalp electrodes, as well as the horizontal and vertical right eye electrodes. Artifacts were considered excessive for a given subject when more than half of the epochs from a given condition of the experiment exceeded the ±50 μV rejection threshold at one of the 64 scalp electrodes or at the horizontal or vertical eye electrodes. Trials that contained pitch errors were also excluded from EEG analyses, resulting in the inclusion of 30.4/32 epochs (SD = 3.2) in the future +1 condition, 28.2/32 epochs (SD = 3.3) in the future +2 condition, 28.2/32 epochs (SD = 2.3) in the noncontextual condition, and 85.3/96 epochs (SD = 6.8) in the baseline condition (which contained three times as many stimuli, as it was matched to each of the other three conditions).
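
The preprocessing steps above can be approximated in MNE-Python (the authors used BrainVision Analyzer); the file name and the assumption that pitch onsets are available as trigger events are hypothetical, while the numeric parameters follow the text.

```python
# Approximate re-creation of the EEG preprocessing described above, in MNE-Python.
import mne

raw = mne.io.read_raw_bdf("pianist01.bdf", preload=True)  # BioSemi recording (hypothetical file)
raw.set_eeg_reference("average")                           # re-reference to the scalp average
raw.filter(l_freq=0.1, h_freq=30.0)                        # 0.1-30 Hz band-pass

events = mne.find_events(raw)                              # pitch-onset triggers (assumed present)
epochs = mne.Epochs(raw, events,
                    tmin=-0.1, tmax=0.4,                   # 500-ms epochs around pitch onsets
                    baseline=(-0.1, 0.0),                  # 100-ms pre-onset baseline
                    reject=dict(eeg=50e-6),                # reject epochs exceeding +/-50 uV
                    preload=True)
```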

Average ERPs by participant and experimental condition were then computed for the 500-ms window time-locked to pitch onsets, beginning 100 ms before onset. Mean ERP amplitudes were statistically evaluated at 3 topographical regions of interest (ROIs), based on related findings [7]: Anterior (electrodes Fz and FCz), central (electrodes Cz and CPz), and posterior (electrodes Pz and POz). ERP amplitudes were statistically evaluated over 40-ms time windows selected based on previous findings [7] as follows: 80–120 ms (labeled N1), 120–160 ms (labeled P2), 180–220 ms (labeled FRN), and 250–290 ms (labeled P3a). All of the ERP components were maximal at the anterior ROI; results are therefore reported for the anterior ROI only, following previous work [7,56,71]. Repeated-measures analyses of variance (ANOVAs) were conducted on ERP component amplitudes to analyze the effects of feedback type (future +1, future +2, noncontextual, and baseline). Scalp topographic maps showing ERP component distributions were generated by plotting amplitude values on the scalp. Activity was averaged across the time window used for the analysis of each component. Within-participant correlations between mean ERP amplitudes and behavioral measures for each participant were computed using simple linear regression.
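
The windowed mean-amplitude measure can be sketched as follows, continuing from the `epochs` object of the preprocessing sketch; the ROI electrodes and time windows follow the text, while the per-condition averaging (one evoked response per feedback type) is implied but not spelled out here.

```python
# Sketch of the 40-ms windowed mean amplitudes at the anterior ROI.
import numpy as np

anterior_roi = ["Fz", "FCz"]
windows = {"N1": (0.080, 0.120), "P2": (0.120, 0.160),
           "FRN": (0.180, 0.220), "P3a": (0.250, 0.290)}

evoked = epochs.average(picks=anterior_roi)   # average ERP over the anterior ROI channels

mean_amplitudes = {}
for component, (t_start, t_stop) in windows.items():
    mask = (evoked.times >= t_start) & (evoked.times <= t_stop)
    # mean over ROI channels and over the 40-ms window (values are in volts)
    mean_amplitudes[component] = evoked.data[:, mask].mean()
print(mean_amplitudes)
```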

Because increases in spectral power in the theta frequency range (4–8 Hz) typically accompany the FRN [63], we analyzed theta power at the anterior ROI within the 200–300 ms that followed pitch onsets at the three event positions [7,56,72]. To allow for the specification of a temporal baseline period as well as a temporal buffer, with the purpose of preventing edge artifacts within the 100-ms epoch of interest, time-frequency decompositions were calculated for each participant in a −1000 to +1000 ms time window centered on pitch onsets [73]. Our goal in using time-frequency analysis was to ensure that any potential ERP component overlap in the average ERP waveforms did not provide an alternative interpretation of our results. In order to eliminate influences of faster or slower components overlapping the FRN in the average ERP waveforms, decompositions were computed using a Morlet wavelet transform based on each participant's average ERP waveforms for each experimental condition [56,64,74]. To achieve sufficient temporal resolution for the theta frequency range, the number of Morlet wavelet cycles used for analysis of the theta band was set to *n* = 7 [75,76]. Mean power in a pre-stimulus baseline period of −100 to 0 ms was subtracted from the 2-s time-frequency analysis window to permit the assessment of event-related changes in theta activity [77]. Repeated-measures ANOVAs on mean theta power within the 200–300 ms following pitch onsets, with the factors feedback type (future +1, future +2, noncontextual, baseline) and event position (0, +1), were conducted to analyze the effects of feedback conditions on theta power. Post-hoc pairwise comparisons were made using Tukey's honestly significant difference (HSD) test for both behavioral and neural measures. *η*<sup>2</sup><sub>p</sub> was used as a measure of effect size.
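
A minimal sketch of this theta-band analysis in MNE-Python follows, continuing from the `raw` and `events` objects of the preprocessing sketch. As in the text, a 2-s window is epoched so that the 7-cycle wavelets fit; applying the transform to single epochs rather than to per-condition average ERP waveforms is a simplification, not the authors' procedure.

```python
# Morlet-wavelet theta power (4-8 Hz, n_cycles = 7) around pitch onsets.
import numpy as np
import mne
from mne.time_frequency import tfr_morlet

tfr_epochs = mne.Epochs(raw, events, tmin=-1.0, tmax=1.0,
                        baseline=None, preload=True)      # 2-s windows around pitch onsets
freqs = np.arange(4.0, 9.0)                               # theta range, 4-8 Hz
power = tfr_morlet(tfr_epochs, freqs=freqs, n_cycles=7,
                   return_itc=False)                       # averaged time-frequency power
power.apply_baseline(baseline=(-0.1, 0.0), mode="mean")    # subtract pre-stimulus mean power
theta_200_300_ms = power.copy().crop(tmin=0.2, tmax=0.3).data.mean()
```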

#### **3. Results**

#### *3.1. Future +1 Altered Feedback Disrupts Key Press Timing*

The mean performance rate, indicated by the mean IOI per trial, during the continuation phase of the synchronization-continuation trials was 486.5 ms (SE = 0.3 ms), slightly faster than the metronome-indicated rate of 500 ms from the earlier synchronization phase. An ANOVA on mean IOIs per trial within the continuation phase by feedback condition yielded no main effect of feedback, F (3, 69) = 1.78, *p* = 0.16, suggesting that performance rates did not differ across the four conditions (future +1 *M* = 485.8 ms, SE = 0.5; future +2 *M* = 486.2 ms, SE = 0.6; noncontextual *M* = 486.4, SE = 0.5; baseline *M* = 487.7, SE = 0.5). Thus, performers successfully maintained the same tempo for all feedback conditions, with slightly faster rates than the prescribed rate overall, consistent with similar previous studies [7,78].

Figure 3 shows IOIs at melody positions preceding, at, and following the altered feedback pitches and the same positions in the unchanged baseline pitches. An ANOVA on mean IOIs by feedback condition (future +1, future +2, noncontextual, baseline) and event position (−1, 0, +1) revealed a significant interaction of feedback condition with event position, F (6, 138) = 3.60, *p* < 0.005, *η*<sup>2</sup><sub>p</sub> = 0.14. IOIs at position 0 were significantly shorter than IOIs at positions −1 and +1 for the future +1 feedback condition only (Tukey HSD = 3.81, *p* < 0.05). IOIs did not significantly differ between positions −1, 0, and +1 for any other condition. There were no main effects of feedback type or position on IOIs. Thus, the only condition in which the altered auditory feedback temporally disrupted performance was the future +1 feedback condition, in which performers shortened the time interval during which they heard the altered feedback tone.

**Figure 3.** Pianists' mean interonset intervals (IOIs) by altered feedback condition (baseline, future +1, future +2, and noncontextual) by target event position (−1, 0, +1). Error bars represent one standard error. \* *p* < 0.05.

We next analyzed participants' key press errors (7.9% of all trials) by feedback condition. There was no significant main effect of feedback condition on the mean proportion of trials that contained errors, F (3, 69) = 0.37, *p* = 0.78. Pitch errors occurred at roughly equivalent rates across trials in all four feedback conditions (future +1 *M* = 7.8%, SE = 1.7%; future +2 *M* = 9.0%, SE = 1.5%; noncontextual *M* = 7.5%, SE = 1.2%; baseline *M* = 7.1%, SE = 1.6%).

#### *3.2. EEG Results*

#### 3.2.1. Event-Related Potentials

Figure 4 shows grand averaged ERP waveforms time-locked to key press onsets, averaged across error-free trials. ERP components are time-locked to key presses corresponding to the feedback pitch onset at position 0, as well as to the key presses at melody positions −1 (preceding location) and +1 (following location). N1 components and P2 ERP components, labeled in Figure 4, were observed at positions −1, 0, and +1 for all feedback conditions. Additionally, FRN and P3a components were observed at position 0 for the three altered feedback conditions. Scalp topographies corresponding to the N1 and P2 components at positions −1, 0, and +1 by feedback condition are shown in Figure 5. Topographies corresponding to the FRN and P3a components at position 0 are shown in Figure 6. Analyses of each ERP component are reported in turn.

**Figure 4.** Grand average event-related potentials (ERPs) elicited by the four experimental conditions relative to target event positions −1, 0, and +1. Activity shown is averaged across all electrodes contained within the anterior region of interest (ROI). Negative is plotted upward.

**Figure 5.** Voltage (in μV) scalp topographies of N1 and P2 components relative to target event positions −1, 0, and +1 by feedback condition. Activity averaged over 40 ms surrounding each component's grand average peak is shown.

**Figure 6.** Voltage (in μV) scalp topographies of feedback-related negativity (FRN) and P3a components elicited by pitches at target event position 0 by feedback condition. Activity averaged over 40 ms surrounding each component's grand average peak is shown.

*N1 component (80–120 ms).* We first evaluated whether mean amplitudes within the N1 time window differed across auditory feedback conditions. We conducted one-way ANOVAs on N1 amplitudes at each event position with the factor feedback type. N1 amplitudes did not significantly differ across feedback conditions at position −1, F (3, 69) = 0.18, *p* = 0.91. N1 amplitudes also did not significantly differ across feedback conditions at position 0, F (3, 69) = 1.47, *p* = 0.23. Analysis of N1 amplitudes at position +1 yielded a significant main effect of feedback type, F (3, 69) = 7.42, *p* < 0.001, *η*<sup>2</sup><sub>p</sub> = 0.24. All three altered feedback types elicited a significantly more negative N1 than did baseline feedback pitches (Tukey HSD = 1.73, *p* < 0.05). Thus, N1 amplitudes at event position +1 were sensitive to whether altered auditory feedback was presented one tone earlier (altered feedback conditions) or not (baseline condition). Specifically, N1 amplitudes were more negative following altered compared to baseline feedback.

*P2 component (120–160 ms).* We next evaluated whether mean amplitudes within the P2 time window differed across auditory feedback conditions. We conducted one-way ANOVAs on P2 amplitudes at each event position with the factor feedback type. P2 amplitudes did not significantly differ across feedback conditions at position −1, F (3, 69) = 0.25, *p* = 0.86. P2 amplitudes also did not significantly differ across feedback conditions at position 0, F (3, 69) = 1.04, *p* = 0.38. Analysis of P2 amplitudes at position +1 yielded a significant main effect of feedback type, F (3, 69) = 13.95, *p* < 0.001, *η*<sup>2</sup><sub>p</sub> = 0.38. All three altered feedback types elicited a significantly less positive P2 than baseline feedback pitches (Tukey HSD = 2.12, *p* < 0.01). Furthermore, the P2 elicited by future +1 feedback was significantly less positive than the P2 elicited by future +2 and noncontextual altered feedback (Tukey HSD = 1.73, *p* < 0.05). Thus, like the N1 component, the P2 was sensitive to whether altered auditory feedback was presented one tone earlier or not. Critically, P2 amplitudes were more negative following future +1 feedback compared to future +2 feedback.

*Correlation of N1 and P2 amplitudes.* The temporal proximity of N1 and P2 components as well as their co-occurrence following both altered and unaltered feedback is consistent with their interpretation as joint indices of auditory sensory processing [34]. To test the relationship between N1 and P2 components, mean amplitudes within the N1 time window (80–120 ms) were compared with amplitudes within the adjacent P2 time window (120–160 ms) for each position and feedback condition. As shown in Table 1, amplitudes within the time windows of the N1 and P2 were significantly correlated for all feedback conditions at positions −1, 0, and +1 (all *p*s < 0.001).
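As an illustration of how adjacent window means can be extracted and correlated across participants, the following sketch computes per-participant means over the N1 and P2 windows and their Pearson correlation; the ERP array here is synthetic placeholder data rather than the recorded waveforms.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder data: ROI-averaged ERPs for 24 participants in one feedback
# condition and position, sampled at 250 Hz from -0.1 to 0.5 s.
srate = 250
times = np.arange(-0.1, 0.5, 1.0 / srate)
erps = np.random.randn(24, times.size)

def window_mean(erps, times, t_start, t_end):
    """Mean amplitude per participant within [t_start, t_end) seconds."""
    mask = (times >= t_start) & (times < t_end)
    return erps[:, mask].mean(axis=1)

n1 = window_mean(erps, times, 0.080, 0.120)   # N1 window
p2 = window_mean(erps, times, 0.120, 0.160)   # P2 window
r, p = pearsonr(n1, p2)                       # correlation across participants
print(f"r({len(n1) - 2}) = {r:.2f}, p = {p:.3f}")
```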

*FRN component (180–220 ms).* Analysis of mean amplitudes within the FRN time window at position 0 yielded a significant main effect of feedback type, F (3, 69) = 31.53, *p* < 0.001, *η*<sup>2</sup><sub>p</sub> = 0.58. All three altered feedback types elicited a significantly more negative FRN compared to the baseline condition (Tukey HSD = 2.58, *p* < 0.05). No other comparisons reached significance. Thus, all three altered auditory feedback types elicited an FRN response.

*P3a component (250–290 ms).* Analysis of mean amplitudes within the P3a time window at position 0 yielded a significant main effect of feedback type, F (3, 69) = 7.70, *p* < 0.001, *η*<sup>2</sup><sub>p</sub> = 0.25. All three altered feedback types elicited a significantly more positive P3a compared to the baseline condition (Tukey HSD = 2.44, *p* < 0.05). No other comparisons reached significance. Thus, as predicted, all three altered auditory feedback types elicited a P3a response.

**Table 1.** Correlations of mean N1 and P2 amplitudes at target event positions −1, 0, and +1 for each feedback condition. \* *df* = 22, *p* < 0.001.


#### 3.2.2. Evoked Oscillatory Responses

To assess whether altered auditory feedback influenced spectral power within the theta frequency range, we computed spectral power in the 4–8 Hz frequency range within the anterior ROI at each event position, as shown in Figure 7. Analysis of theta spectral power during the 200–300 ms following pitch onsets by feedback condition (future +1, future +2, noncontextual, and baseline) and position (−1, 0, and +1) yielded main effects of both feedback condition, F (3, 69) = 6.49, *p* = 0.001, *η*<sup>2</sup><sub>p</sub> = 0.22, and position, F (2, 46) = 8.24, *p* = 0.001, *η*<sup>2</sup><sub>p</sub> = 0.26. There was also a significant interaction between feedback condition and position, F (6, 138) = 7.68, *p* < 0.001, *η*<sup>2</sup><sub>p</sub> = 0.25. Theta power was greater for each of the three altered feedback conditions compared to the baseline feedback condition at position 0 (Tukey HSD = 157.2, *p* < 0.01). Theta power was also greater at position 0 compared to position +1 within each of the three altered feedback conditions (Tukey HSD = 157.2, *p* < 0.01). In sum, theta power increased only following altered feedback pitches that occurred at position 0, and not following (unaltered) feedback pitches that occurred at position +1. Thus, changes in theta power depended on whether the feedback was altered or not, and not on whether the feedback contents were repeated (future +1) or not (future +2).

**Figure 7.** Evoked spectral power within the 4–8 Hz (theta) frequency range following pitch onsets at target event positions −1, 0, and +1. Brighter colors indicate greater spectral power.
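One common way to approximate this kind of evoked (phase-locked) theta measure is to band-pass the trial-averaged signal at 4–8 Hz, take the squared Hilbert envelope as instantaneous power, and average it over the 200–300 ms window. The sketch below illustrates that approach on a placeholder signal; the study's own time–frequency method may differ in its details.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def evoked_theta_power(evoked, times, srate, t_start=0.200, t_end=0.300):
    """Evoked theta power of a trial-averaged ROI signal in a post-onset window.

    evoked : 1-D array, trial-averaged signal for one condition and position.
    times  : 1-D array of sample times in seconds relative to pitch onset.
    """
    b, a = butter(4, [4.0, 8.0], btype="bandpass", fs=srate)  # 4-8 Hz band
    theta = filtfilt(b, a, evoked)                 # zero-phase band-pass filter
    power = np.abs(hilbert(theta)) ** 2            # squared envelope = power
    mask = (times >= t_start) & (times < t_end)
    return power[mask].mean()

# Placeholder signal: 1 s of noise sampled at 500 Hz, starting 0.2 s pre-onset.
srate = 500
times = np.arange(-0.2, 0.8, 1.0 / srate)
print(evoked_theta_power(np.random.randn(times.size), times, srate))
```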

#### *3.3. Correlations of Neural and Behavioral Measures*

#### ERP Amplitudes and IOIs

To examine the relationship between the temporal disruption to key press timing and the ERP components, we first tested whether the temporal disruption arising from future +1 auditory feedback—the shortening of the position 0 IOI—correlated with mean amplitudes of ERP components at position +1 that immediately followed the disrupted timing. As shown in Figure 8, the shortened mean IOIs at position 0 correlated significantly with mean amplitudes of the subsequent N1 in the future +1 condition, *r* (22) = 0.47, *p* < 0.05. Shorter IOIs at position 0 were associated with a larger N1 response to the pitch that followed the altered feedback. The correlation of mean IOIs at position 0 with amplitudes of the P2 at position +1 yielded a similar pattern of association, but the correlation did not reach significance, *r* (22) = 0.27, *p* = 0.22. Mean N1 and P2 amplitudes did not correlate with mean IOIs at position 0 for any other feedback condition (future +2, noncontextual, and baseline feedback). Thus, auditory sensory processing of the tone following the altered feedback, reflected in the N1, was associated with temporal disruption only when near future altered feedback was presented.

**Figure 8.** Correlation of mean IOIs at target event position 0 in the future +1 altered feedback condition with mean N1 amplitudes elicited by the tone that followed the altered auditory feedback pitch in the future +1 condition. Each dot represents one participant.

To examine the relation between temporal disruption and FRN responses to altered auditory feedback, we computed the interonset change (in ms) from the IOI at position 0 to the IOI at position +1. Participants' mean difference in IOIs between positions 0 and +1 across all three altered feedback conditions correlated significantly with mean FRN amplitudes time-locked to the altered feedback pitch (position 0), *r* (21) = 0.41, *p* < 0.05, shown in Figure 9. Mean amplitudes within the time window of the FRN were not correlated with the difference in IOIs across positions 0 and +1 for the baseline condition, *r* (21) = 0.27, *p* = 0.24. Therefore, amplitudes of the FRN elicited by altered auditory feedback were associated with changes in the performance rate that succeeded the altered feedback: More negative FRNs were associated with increases in the performance rate. No other ERP component amplitudes correlated significantly with IOIs or with IOI differences at event positions preceding or following altered auditory feedback.

**Figure 9.** Correlation of mean IOI differences (target event position +1 minus position 0) from the three altered feedback conditions (future +1, future +2, and noncontextual) with mean FRN amplitudes elicited by altered feedback (target event position 0) across the three altered feedback conditions (future +1, future +2, and noncontextual). Each dot represents one participant.
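To make the behavioral side of this correlation concrete, the sketch below computes the interonset change from position 0 to position +1 and correlates it with per-participant FRN amplitude. All values are synthetic placeholders; the real inputs would come from the MIDI key-press logs and the position 0 ERP measures.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
ioi_pos0 = rng.normal(500, 40, size=23)   # mean IOI (ms) at position 0
ioi_pos1 = rng.normal(510, 40, size=23)   # mean IOI (ms) at position +1
frn = rng.normal(-2.0, 1.5, size=23)      # mean FRN amplitude (microvolts)

delta_ioi = ioi_pos1 - ioi_pos0           # interonset change from position 0 to +1
r, p = pearsonr(delta_ioi, frn)           # association across participants
print(f"r({delta_ioi.size - 2}) = {r:.2f}, p = {p:.3f}")
```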

#### **4. Discussion**

We examined the relationship between future-oriented planning processes and feedback monitoring during music performance. Skilled pianists performed short melodies from memory. Perceived auditory feedback was occasionally altered to match immediately upcoming sequence events (future +1), later future events (future +2), or unrelated pitches that were not contained within the performed sequences (noncontextual). There were several novel findings. First, only future +1 altered feedback—not future +2 or noncontextual altered feedback—perturbed the timing of pianists' key presses. Second, the length of time it took performers to initiate the pitch following the future +1 altered feedback pitch was associated with larger auditory sensory potentials to the post-altered feedback pitch. Third, all types of altered feedback elicited FRN and P3a potentials. Fourth, FRN amplitudes increased as performers sped up following the altered feedback pitch, in response to all types of altered auditory feedback. Together, these findings suggest that future-oriented planning during production influences how performers monitor their auditory feedback. The range of sequential planning may be constrained by distance: Events at nearby sequence positions had a greater influence on planning and monitoring processes than did events at farther positions, consistent with theories of sequence production in which planned events are activated along a gradient that is defined by sequential distance [8,9,12]. According to a predictive coding model [15], a cascade of forward models for upcoming movements may generate an error signal in response to altered auditory feedback that is stronger when the feedback matches nearby sequence positions than when it matches farther positions.

#### *4.1. Behavioral Findings*

The timing of pianists' performances was disrupted following the perception of altered auditory feedback that corresponded to near future, but not far future, events. According to future-oriented theories of planning during music and speech production, immediately upcoming events receive stronger activation than events that are farther ahead in a melody or utterance [8,12]. When pianists heard an altered feedback pitch that matched an event that was already strongly activated in memory, the altered pitch may have generated similarity-based interference with the event that was currently being produced. Thus, temporal perturbations observed in the future +1 condition may reflect the greater interference of near future altered feedback with currently planned pitch events compared to far future altered feedback. This interpretation is consistent with theories of sensorimotor production in which actions and their auditory effects share common cognitive representations [79,80], as well as theories in which actions are planned in terms of their sensory effects [81–83]. We previously demonstrated that future-oriented, but not past-oriented, altered auditory feedback induced compensatory adjustments in keystroke timing [7]. The current results extend this finding by suggesting that future-oriented interference interacts with graded planning and monitoring processes during music performance.

Another important factor that constrains memory retrieval of sequence events is the similarity between sequence elements. Evidence from production errors and priming paradigms has indicated that grammatical and phonological similarity influence lexical retrieval [84,85], and tonal and metrical accent relationships influence event retrieval during music performance [8,65]. For example, musicians are more likely to produce pitch errors in metrically weak than in metrically strong accent positions [65]; sequence events that align with greater metrical accent strength tend to be produced with greater intensity [86]. The melodies used in the current study were designed so that metrical accents of the future +2 altered feedback pitch were more similar to the currently planned pitch event than were the metrical accents of future +1 feedback [87]. This metrical similarity approach would predict that the future +2 altered feedback should generate greater interference and performance disruption than future +1 feedback. This prediction was not supported by the current results: Instead, altered feedback that contained serially proximal pitches was more disruptive to performance than altered feedback that contained metrically similar pitches. This suggests that serial proximity may play a greater role than metrical accent strength in generating interference with planned representations for the short sequences used in the current study. One explanation for the lesser contribution of metrical accent to the disruptive effects of altered auditory feedback could be that metrical relationships between sequence events tend to span longer timeframes than the timespans between serially proximal events [65].

Serially-shifted feedback, like the future-oriented altered auditory feedback presented in our study, is known to increase performers' overall key press error rates [21]. We observed heightened error rates in all altered auditory feedback conditions compared to baseline (unchanged) feedback. Error rates were relatively low compared to rates as high as 40% observed in other studies employing serially-shifted auditory feedback [22]. A likely explanation for this difference is that single pitches were altered at random sequence locations in the current study, which prevented performers from anticipating the alterations, unlike in previous studies, in which auditory feedback was continuously and consistently altered. When auditory feedback is predictably altered, performers can develop strategies to compensate for predictable deviations from expected feedback. Even under conditions in which every feedback tone is altered during music performance, pitch errors begin to occur only after several melody repetitions [88]. Future studies could further investigate interactions between hierarchical and distance constraints on sequence planning using musical materials that amplify differences between strongly and weakly accented events.

#### *4.2. EEG Findings*

Altered auditory feedback attenuated cortical sensory suppression compared to baseline feedback, reflected in amplitude-shifted N1 and P2 ERP components. Sensory suppression is widely believed to result from the congruence between sensory consequences of actions and sensory predictions generated by forward models of motor commands (for a review, see [29]). Theories of motor control have proposed that efference copies of motor commands are used to predict sensory outcomes of those commands, and that sensory suppression results from the subtraction of an efference copy from actual sensory input [14,89]. Sensory suppression is often used as an implicit measure of agency, as actions must be volitional in order to generate predictive models of motor commands [90,91]. Increased auditory sensory processing following altered feedback pitches could therefore indicate that altered feedback disrupted pianists' sense of agency or control over the sounds that they were producing (cf. [7]). This interpretation also fits with the proposal that sensory suppression during production may serve the purpose of allowing producers to differentiate self-generated from externally-generated sensations [36].

We observed a greater reduction of sensory suppression following future +1 altered feedback compared to future +2 altered feedback, reflected in the P2. This finding suggests that the post-altered feedback pitch received enhanced cortical sensory processing in the future +1 condition compared to the future +2 condition. Further, reduced sensory suppression in the future +1 condition was associated with a quicker initiation of the tone following the future +1 feedback pitch. Together, these findings suggest that enhanced cortical sensory processing following the future +1 altered auditory feedback may have aided the recovery from perturbations caused by the unexpected feedback. Indeed, expectancy violations tend to receive enhanced neural processing compared to events that fulfill expectations, in line with a predictive coding view of cortical responses [92]. It is unlikely that differences in P2 amplitudes for future +1 and future +2 conditions were driven by differences in selective attention between altered feedback conditions, since sensory suppression during auditory production appears to be uninfluenced by whether attention is directed toward or away from one's own actions or their auditory effects [93]. It is also unlikely that this amplitude difference is due to differences between future +1 and future +2 conditions in terms of pitch repetition. From a repetition suppression perspective, we would expect decreased—not increased—cortical processing of the tone that followed the future +1 altered feedback tone, since this tone was repeated and stimulus repetition classically results in a decreased brain response due to sensory adaptation [94,95]. The fact that theta power did not distinguish future +1 from future +2 responses supports this interpretation. We propose that sensory suppression depended on the differences in interference generated by the future +1 and future +2 altered feedback pitches with concurrent planning processes. Amplitudes may indicate the degree of conflict or mismatch between perceived altered auditory feedback and concurrent planning processes, which are biased towards the immediate future.

Both N1 and P2 components are sensitive to a variety of acoustic features of incoming auditory signals, highlighting a role of these components in early auditory sensory processing. For example, pitch changes in vocal stimuli during active vocalization elicit larger N1 and P2 responses than pitch changes in non-voice complex stimuli, which in turn elicit larger amplitudes than pure tones [96]. Acoustic spectral complexity [42], pitch discrimination and speech-sound training [43,97], and the rate of speech formant transition [98] have all been shown to modulate N1 and P2 responses. The current results extend these findings by demonstrating that N1 and P2 amplitudes also take into account the relationship between pitch changes and planned events in an auditory sequence. Speech sounds are generally more spectrally complex than musical sounds [99]. An open question for future research is therefore whether alterations of auditory feedback during speech production are better detected by the auditory system than feedback alterations during music performance.

FRN and P3a ERP components were elicited by all altered auditory feedback (future +1, future +2, and noncontextual) pitches. ERP amplitudes were equivalent across all altered feedback conditions. FRN and P3a components have been elicited by altered auditory feedback during music performance in previous studies [45–47]. None of these studies compared neural responses to different types of altered auditory feedback, with the exception of Katahira and colleagues [45], who manipulated the diatonicity of altered feedback tones. The current finding suggests that performers identified and subsequently oriented toward all types of unexpected feedback. This finding fits with the principle that any alteration of feedback during auditory-motor tasks creates a mismatch between movements and expected auditory outcomes, which creates larger violations for producers with higher skill levels [19] or with greater sequence familiarity [100]. Studies using flanker gambling tasks have demonstrated that the FRN is sensitive to the perceptual distinctiveness of unexpected stimuli [101–103]. The noncontextual control condition presented diatonically-related altered feedback pitches that were more distinct from the pitch set of the produced melodies than were the altered pitches in the future +1 and future +2 melodies. Yet, the FRN elicited by noncontextual altered feedback did not differ from that elicited by contextual (future +1 and future +2) feedback. The association between FRN amplitudes and the change in speed following the altered feedback pitch for all three altered feedback conditions further supports this interpretation. Thus, FRN responses may be less affected by perceptual distinctiveness or by performers' planning processes and more dependent on action-related expectations. Future studies may address this possibility directly with manipulations of perceptual distinctiveness.

Theta power increases were also observed following all types of altered feedback tones, about 200–300 ms after the altered pitch onsets. The lack of differences in theta power across feedback conditions confirms the FRN results, and suggests that amplitudes of the FRN elicited by altered feedback were unaltered by overlapping ERP components. Increases in theta power within the approximate timeframe of the FRN component suggest that identification of expectancy violating pitches coincided with the emergence of a more cognitively controlled, deliberative mental state, as opposed to a mental state relying primarily on habit or performance routines [63]. Just as the FRN has been suggested to reflect surprising action-based outcomes [104], theta has been referred to as a "surprise signal" that leads to task-specific adjustments in cognitive control [63]. Theta frequency oscillations may coordinate the excitability of populations of mid-frontal neurons, thereby providing a temporal window in which cognitive control can be instantiated [105]. Orienting to altered auditory feedback during music performance may therefore involve a switch from a state of relatively automatic performance to performance that is more deliberative and goal-directed, characterized by transitory changes in the production rate. Finally, increases in theta power did not differentiate between altered feedback types. Similar to the notion that the FRN may depend more on action-related expectations than on performers' planning processes, equivalent increases in theta across feedback conditions suggest that any violation of any action-sound association is sufficient for invoking the need for cognitive control.

#### **5. Conclusions**

This study provides the first neural support for the finding in speech production and music performance that planning of upcoming events in a sequence is influenced by the serial proximity of the future events. Feedback monitoring processes interacted with planning processes: Performers' perception of altered feedback tones that matched immediately upcoming future events resulted in behavioral and neural adaptations, including temporal disruption (speeded IOIs), enhanced cortical sensory processing following the altered feedback (amplitude-shifted N1 and P2 responses), and increased theta frequency activity. These findings support models of sequence production in which the planning of future events is modulated by their serial distance from the current event [8,12], and contribute to our understanding of the link between sensory suppression and action planning during the performance of complex action sequences. The N1-P2 complex may serve as a neural marker for disruptive effects of altered auditory feedback in sensorimotor tasks.

**Author Contributions:** B.M., W.J.G. and C.P. conceived and designed the experiment; B.M. performed the experiment; B.M., W.J.G. and C.P. analyzed the data; B.M., W.J.G. and C.P. wrote the paper.

**Funding:** National Science Foundation Graduate Research Fellowship to B.M. Canada Research Chairs grant and Natural Sciences and Engineering Research Council of Canada grant 298173 to C.P.

**Acknowledgments:** We thank Pierre Gianferrara, Erik Koopmans, and Frances Spidle of the Sequence Production Lab for their assistance.

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Attention Modulates Electrophysiological Responses to Simultaneous Music and Language Syntax Processing**

#### **Daniel J. Lee <sup>1</sup>, Harim Jung <sup>1</sup> and Psyche Loui <sup>1,2,\*</sup>**


Received: 18 September 2019; Accepted: 29 October 2019; Published: 1 November 2019

**Abstract:** Music and language are hypothesized to engage the same neural resources, particularly at the level of syntax processing. Recent reports suggest that attention modulates the shared processing of music and language, but the time-course of the effects of attention on music and language syntax processing remains unclear. In this EEG study we vary top-down attention to language and music, while manipulating the syntactic structure of simultaneously presented musical chord progressions and garden-path sentences in a modified rapid serial visual presentation paradigm. The Early Right Anterior Negativity (ERAN) was observed in response to both attended and unattended musical syntax violations. In contrast, an N400 was only observed in response to attended linguistic syntax violations, and a P3/P600 only in response to attended musical syntax violations. Results suggest that early processing of musical syntax, as indexed by the ERAN, is relatively automatic; however, top-down allocation of attention changes the processing of syntax in both music and language at later stages of cognitive processing.

**Keywords:** music; language; syntax; attention; comprehension; electroencephalography; event-related potentials

#### **1. Introduction**

Music and language are both fundamental to human experience. The two domains, while apparently different, share several notable similarities: both exhibit syntactic structure, and both rely on the sensory, cognitive, and vocal-motor apparatus of the central nervous system. The nature of this relationship between syntactic structure in language and music, and their underlying neural substrates, is a topic of intense interest to the cognitive and brain sciences community.

The Shared Syntactic Integration Resource Hypothesis (SSIRH [1]) is an influential theoretical account of similarities and differences between cognitive processing for music and language. The SSIRH posits that neural resources for music and language overlap at the level of syntax; in other words, processing of music and language should interact at the syntactic level, but not at other levels such as semantics or acoustic or phonemic structure.

Support for the SSIRH comes from a variety of behavioral and neural studies. Several studies have presented music and language simultaneously with and without syntactic violations, to test for effects of separate and simultaneous syntax violations on behavioral and neural measures [2–6]. Strong support for the SSIRH comes from a self-paced reading paradigm [2], where sentence segments were presented concurrently with musical chord progressions. One subset of the trials contained syntactic violations in language (garden path sentences), and another subset contained syntactic violations in music (out-of-key chords); a third subset contained simultaneous syntactic violations in both domains. Reaction time results showed that participants were slower to respond to simultaneous violations of music and language than to each violation alone. This superadditive effect was not observed in a control experiment which manipulated the timbre of the music and the semantics of the language.

Although these results seem to offer convincing support for SSIRH, Perruchet and Poulin-Charronnat (2013) [7] showed that under semantic garden path manipulations (as opposed to syntactic garden path manipulations), violations of semantics can also yield the same pattern. Based on these results, Perruchet and Poulin-Charronnat (2013) suggested that increased attentional resources, rather than syntax processing per se, could lead to these statistical interactions.

The idea that attention can influence the pattern of interaction between music and language processing has since received more support. More recent work has argued that the processing resources of syntax for language and music might both rely on domain-general attentional resources, especially when simultaneously processing music and language in a dual-task situation [3,8]. In that regard, classic theories of attention distinguish between the endogenous, voluntary maintenance of a vigilant state, and the exogenous, involuntary orienting to stimulus events [9]. These attentional systems both affect reaction time, but involve different neural resources and unfold differentially over time [10]. Since reaction time during simultaneous linguistic and musical syntax processing may not readily differentiate between overlapping syntax-specific resources and the engagement of attentional resources, we turned to more time-sensitive measures of neural activity during the processing of music and language. This enables direct comparisons between neural responses to musical syntax violations and neural responses to language syntax violations at multiple time windows throughout the temporal cascade of attentional processes that are triggered during music and language processing. By comparing neural markers of syntax violations in the two domains of music and language, and testing for the interaction between attention and violations in each domain, we can clarify the roles that attentional resources might play in the processing of syntax in music and language.

The Early Right Anterior Negativity (ERAN) and the Early Left Anterior Negativity (ELAN) are reliably elicited event-related potential (ERP) markers of syntax processing in music and language respectively [5,11–14]. The ERAN is a frontally-generated negative waveform around 200 ms after the onset of musical syntax violations, whereas the ELAN is an analogous frontally-generated negativity after violations in linguistic syntax, such as violations in word category [15] or phrase structure [16]. Musical syntax processing has been localized to the inferior frontal gyrus (IFG) in magnetoencephalographic (MEG) and fMRI studies [17–20]. Additional results from ERAN of lesioned patients [21] and in children with Specific Language Impairment [22] have also provided evidence for the reliance of musical syntax processing on classic language-related areas such as the inferior frontal gyrus. The ERAN is also posited as an index of predictive processes in the brain, especially in the case of music, due to its reliance on the formation and subsequent violation of predictions that are learned from exposure to musical sound sequences [23]. Impaired ERAN is observed in adults with lesions in the left inferior frontal gyrus (Broca's area), which provides additional support for the SSIRH. Importantly, the ERAN is also sensitive to top-down task demands, such as attentional resources devoted to the musical stimuli in contrast to a concurrent, non-musical task [24]. When music and speech were simultaneously presented from different locations, irregular chords elicited an ERAN whereas irregular sentences elicited an ELAN; moreover, the ERAN was slightly reduced when irregular sentences were presented, but only when music was ignored, suggesting that the processing of musical syntax is partially automatic [25].

While ERAN and ELAN are markers of syntax processing, semantic processing is indicated by the N400, a negative-going centroparietal waveform beginning around 400–500 ms after a semantic anomaly [26,27]. In addition to being sensitive to semantic content of words, the N400 effect reflects the semantic associations between words and the expectancy for them more generally, showing a larger waveform as an incoming word is unexpected or semantically incongruous with the previous context. In response to ambiguities in linguistic syntax that violate the ongoing context, the P600 is another effect that has also been observed [28]. The P600 is a positive waveform centered around the parietal channels and has been observed during garden path sentences, which are syntactically ambiguous sentences in which a newly presented word or words require a reinterpretation of the preceding context [29]. Patel et al. (1998) tested the language-specificity of the P600 by presenting chord progressions in music and garden path sentences in language in separate experiments. Their results showed statistically indistinguishable positive waveforms in the P600 range; in addition, they observed an ERAN-like waveform specifically for music [30].

The P600 is similar in topography and latency to the P3, a complex of positive-going event-related potentials elicited from 300 ms and onwards following an unexpected and task-relevant event. The P3 is separable into two components: the P3a, a fronto-central ERP largest around FCz that reflects novelty processing, and the P3b, a later parietally-centered ERP largest around Pz that is more sensitive to motivation and task demands [31]. Patients with frontal lobe lesions show altered habituation of the P3, suggesting that the amplitude of the P3 is subject to frontally-mediated processes [32]. The P3a and P3b have both been observed during top-down attention to syntactically incongruous events in music, and these waveforms are sensitive to different levels and genres of expertise [11,33]. Taken together, the literature suggests two main classes of ERPs during unexpectedness in music and language processing: one class within a relatively early time window of approximately 200 ms (ERAN) and another class during the later time window of 500–600 ms (P3, P600, N4). The earlier class of waveforms is thought to be partially automatic; that is, these waveforms are elicited even without top-down attention, but their amplitude is modulated by attention. The later class of waveforms is highly sensitive to top-down demands including attention.

In this study we compare ERPs elicited by violations in musical and linguistic syntax, while attention was directed separately towards language or music. We used the stimulus materials from Slevc et al. (2009) [2], but extended that study by adding a musical analog of the language comprehension task. Thus, across two experiments we were able to compare behavioral results during task-manipulated attention to language and music, while independently manipulating syntax in each domain at a finer timescale in order to test for effects in ERPs that are known markers of syntax processing and attention.

#### **2. Materials and Methods**

#### *2.1. Subjects*

Thirty-five undergraduate students from Wesleyan University participated in return for course credit. All participants reported normal hearing. Informed consent was obtained from all subjects as approved by the Ethics Board of Psychology at Wesleyan University. Sixteen students (11 males and 5 females, mean age = 19.63, SD = 2.03) were assigned to the Attend-language group: 15/16 participants in this group reported English as their first language, and 9/16 participants reported prior music training (total mean of training in years = 2.23, SD = 3.42). Nineteen students (8 males and 11 females, mean age = 19.40, SD = 2.03) were assigned to the Attend-music group. The background survey and baseline tests of one participant in this group were missing as the result of a technical error. Of the remaining 18 participants, 12/18 reported English as their first language, and 11/18 participants reported having prior music training (total mean of training in years = 3.11, SD = 4.01).

The two groups of subjects did not differ in terms of general intellectual ability, as measured by the Shipley Institute of Living scale for measuring intellectual impairment and deterioration [34]. Nor did they differ in low-level pitch discrimination abilities as assessed by a pitch-discrimination task [35]. (Two-up-one-down staircase procedure around the center frequency of 500 Hz showed similar thresholds between the two groups.) They also did not differ in musical ability as assessed using the Montreal Battery for Evaluation of Amusia [36], or in duration of musical training (years of musical training was not different between the two groups, χ<sup>2</sup> = 0.0215, *p* = 0.88). Table 1 shows the demographics and baseline test performance of the participants in both conditions.


**Table 1.** Demographics and baseline test performance of the participants. Data are shown as mean (SD), range, or proportion. SD: Standard Deviation. *n*: Count in proportion.

#### *2.2. Stimuli*

The stimuli were adapted from Slevc, Rosenberg [2]. There were 144 trials in the study, including 48 congruent trials, 48 musically-incongruent trials, and 48 language-incongruent trials. In each trial, an English sentence was presented in segments simultaneously with a musical chord progression. Each segment of a sentence was paired with a chord that followed the rules of Western tonal harmony in the key of C major, played in a grand piano timbre. Linguistic syntax expectancy was manipulated through syntactic garden-path sentences, whereas musical syntax expectancy was manipulated through chords that were either in-key or out-of-key at the highlighted critical region (Figure 1). The chords and sentence segments were presented at the regular inter-onset interval of 1200 ms. At the end of each sentence and chord progression, a yes/no comprehension question was presented on the screen: In the Attend-Language group, this question was about the content of the sentence to direct participants' attention to language (e.g., "Did the attorney think that the defendant was guilty?"). For the Attend-Music group, the question at the end of the trial asked about the content of the music (e.g., "Did the music sound good?") to direct participants' attention to the music. Participants were randomly assigned to Attend-Language and Attend-Music groups. Participants' task, in both Attend-Language and Attend-Music groups, was always to respond to the question at the end of each trial by choosing "Yes" or "No" on the screen.

**Figure 1.** Example trials for Attend-language and Attend-music conditions.

#### *2.3. Procedure*

Participants first gave informed consent and filled out a background survey on their musical training, as well as a battery of behavioral tasks including the Shipley Institute of Living Scale to screen for impairments in intellectual functioning [34], the Montreal Battery for Evaluation of Amusia (MBEA) to screen for impairments in musical functioning [36], and a pitch discrimination test as a three-up-one-down psychophysical staircase procedure around the center frequency of 500 Hz to assess pitch discrimination accuracy [35]. The experiment was run on a Macbook Pro laptop computer using Max/MSP [37]. At the start of the experiment, participants were told to pay attention to every trial, and to answer a yes-or-no comprehension question about the language (Attend-Language condition) or about the music (Attend-Music condition) at the end of each trial. They were given a short practice run of 5 trials of the experiment in order to familiarize themselves with the task before EEG recording began. EEG was recorded using PyCorder software from a 64-channel BrainVision actiCHamp setup with electrodes corresponding to the international 10–20 EEG system. Impedance was kept below 10 kOhms. The recording was continuous with a raw sampling rate of 1000 Hz. EEG recording took place in a sound attenuated, electrically shielded chamber.

#### *2.4. Behavioral Data Analysis*

Behavioral data from Max/MSP were imported to Excel to compute the accuracy of each participant. Accuracy was evaluated against 50% chance-level in one-sample two-tailed t-tests in SPSS. For the Attend-Music condition, two subjects' behavioral data were lost due to technical error.
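A minimal sketch of the chance-level test described above, using placeholder accuracy values in place of the Max/MSP response logs:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Placeholder per-participant proportions correct for one group.
acc = np.array([0.85, 0.78, 0.92, 0.66, 0.81, 0.74, 0.88, 0.90,
                0.79, 0.83, 0.70, 0.95, 0.87, 0.76, 0.82, 0.91])

t, p = ttest_1samp(acc, popmean=0.5)   # two-tailed test against 50% chance
print(f"t({acc.size - 1}) = {t:.2f}, p = {p:.4g}")
```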

#### *2.5. EEG Preprocessing*

BrainVision Analyzer 2.1 software (Brain Products GmbH) was used to preprocess raw data. EEG data were first re-referenced to TP9 and TP10 mastoid electrodes, and filtered with a high-pass cutoff of 0.5 Hz, a low-pass cutoff of 30 Hz, a roll-off of 24 dB/oct, and a notch filter of 60 Hz. These filter settings were chosen based on previous work that looked at target ERPs similar to the current study [33,38], since filter settings can introduce artifacts in ERP data [39]. Ocular correction ICA was applied to remove eye artifacts for each subject. Raw data inspection was done semi-automatically by first setting the maximal allowed voltage step as 200 μV/ms, the maximal difference of values over a 200 ms interval as 400 μV, and the maximal absolute amplitude as 400 μV. Then, manual data inspection was performed to remove segments with noise due to physical movements.
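Preprocessing was performed in BrainVision Analyzer; for readers working in Python, a roughly comparable pipeline can be sketched with MNE-Python as below. The file name, the number of ICA components, and the excluded component indices are placeholders, and the amplitude-based artifact inspection is omitted.

```python
import mne

# Placeholder file name; preload to allow filtering in memory.
raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)
raw.set_eeg_reference(ref_channels=["TP9", "TP10"])   # mastoid re-reference
raw.filter(l_freq=0.5, h_freq=30.0)                   # band-pass 0.5-30 Hz
raw.notch_filter(freqs=60.0)                          # 60 Hz line-noise filter

# Ocular-artifact correction via ICA; components to exclude would be chosen
# per subject by inspection (index 0 here is an arbitrary placeholder).
ica = mne.preprocessing.ICA(n_components=20, random_state=97)
ica.fit(raw)
ica.exclude = [0]
ica.apply(raw)
```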

#### *2.6. Event-Related Potential Analysis*

The preprocessed data were segmented into four conditions: music congruent, music incongruent, language congruent, and language incongruent. Each segment was 1200 ms long, spanning from a 200 ms baseline before the onset of the stimulus to 1000 ms after stimulus onset. The segments were averaged across trials, baseline-corrected, and grand-averaged across the subjects. To identify effects specific to syntax violations in each modality, a difference wave was created for each violation condition by subtracting ERPs for congruent conditions from ERPs for incongruent conditions, resulting in a Music-specific difference wave (Music violation minus no violation) and a Language-specific difference wave (Language violation minus no violation). From these difference waves we isolated ERP amplitudes at two recording sites, one at each time window of interest: E(L/R)AN from site FCz at 180–280 ms, and the N4 and P3 at site Pz at 500–600 ms. The mean amplitude of each ERP was exported for each participant from BrainVision Analyzer into SPSS for analysis.
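Continuing the MNE-Python illustration from the preprocessing step, the segmentation, difference-wave, and window-amplitude steps might look like the following sketch; the trigger codes are placeholders, and only the music-specific difference wave at FCz is shown.

```python
import mne

# Placeholder trigger codes; real codes would come from the recording's annotations.
events, _ = mne.events_from_annotations(raw)
event_id = {"music_congruent": 1, "music_incongruent": 2,
            "language_congruent": 3, "language_incongruent": 4}
epochs = mne.Epochs(raw, events, event_id, tmin=-0.2, tmax=1.0,
                    baseline=(-0.2, 0.0), preload=True)

# Music-specific difference wave: violation minus no violation.
music_diff = mne.combine_evoked(
    [epochs["music_incongruent"].average(), epochs["music_congruent"].average()],
    weights=[1, -1])

# Mean amplitude of the difference wave at FCz in the 180-280 ms (ERAN) window.
fcz = music_diff.ch_names.index("FCz")
eran_window = music_diff.copy().crop(tmin=0.18, tmax=0.28)
eran_amplitude = eran_window.data[fcz].mean() * 1e6   # volts -> microvolts
print(eran_amplitude)
```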

Because both groups of participants experienced both types of syntactic violations (music and language), but each group of participants attended to only one modality (music or language), we used a mixed-effects analysis of variance (ANOVA) with the within-subjects factor of Violation (two levels: music and language) and the between-subjects factor of Attention (two levels: attend-music and attend-language). This was separately tested for the two time windows: 1) the early ERAN/ELAN time window of 180–280 ms, and 2) the later N4/P3 time window of 500–600 ms.
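A sketch of this mixed-design test with the pingouin package, assuming a placeholder long-format table of window amplitudes (one row per subject and violation modality, with the subject's attention group attached):

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(1)
n_music, n_language = 19, 16
subjects = np.arange(n_music + n_language)
group = np.array(["attend-music"] * n_music + ["attend-language"] * n_language)
df = pd.DataFrame({
    "subject": np.repeat(subjects, 2),
    "attention": np.repeat(group, 2),
    "violation": np.tile(["music", "language"], subjects.size),
    "amplitude": rng.normal(size=2 * subjects.size),   # placeholder amplitudes
})

# Mixed-design ANOVA: Violation within subjects, Attention between subjects,
# run separately for the 180-280 ms and 500-600 ms window amplitudes.
aov = pg.mixed_anova(data=df, dv="amplitude", within="violation",
                     between="attention", subject="subject")
print(aov.round(3))
```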

#### **3. Results**

#### *3.1. Behavioral Results*

Participants performed well above the 50% chance level on language comprehension questions during the Attend-Language condition (M = 0.8457, SD = 0.0703, two-tailed t-test against chance level of 50% correct: *t*(15) = 19.661, *p* < 0.001), and on music comprehension questions during the Attend-Music condition (M = 0.6631, SD = 0.1253, two-tailed t-test against chance level of 50% correct: *t*(16) = 5.371, *p* < 0.001). This confirms that participants successfully attended to both language and music stimuli. Participants performed better on the Attend-Language than on the Attend-Music questions (*t*(31) = 5.12, *p* < 0.001).

#### *3.2. Event-Related Potentials*

Figure 2 shows each ERP and scalp topographies of difference waves. Figure 3 shows the specific effects for each ERP; statistics are shown in Table 2. A right anterior negative waveform was observed 180–210 ms following music violations, consistent with the ERAN. This ERAN was observed during both Attend-Language and Attend-Music conditions. During the Attend-Language condition, a centroparietal negative waveform was observed 500–600 ms following language violations, consistent with an N400 effect. This N400 effect was not observed during the Attend-Music condition. Instead, a posterior positive waveform was observed 500–600 ms after music violations during the Attend-Music condition, consistent with the Late Positive Complex or the P3 or P600 effect. These selective effects are tested in a mixed-model ANOVA for each time-point, with a between-subjects factor of attention (two levels: attend-music vs. attend-language) and a within-subjects factor of modality of syntax violation (two levels: music and language), as described below.

**Figure 2.** Overlays of ERPs from each condition with topographic maps of the difference wave between violation and no-violation conditions. Music syntax violation condition is shown in red and linguistic syntax violation condition is shown in blue. Black represents a condition when neither stimulus was violated. Topographic plots show difference waves between music violation and no-violation, or between language violation and no-violation. (**A**) When attending to language. (**B**) When attending to music.

**Figure 3.** ERP effects of violation (amplitude of difference waves) across different conditions.


**Table 2.** ERP statistics.

*180–280 ms:* A significant negative waveform was observed for the music violation but not for the language violation. The within-subjects effect of Violation showed a significant difference between music and language violations (F(1,33) = 33.198, *p* < 0.001). The between-subjects effect of Attention was not significant (F(1,33) = 1.381, *p* = 0.248). There was no significant interaction between the Violation and Attention factors (Figure 2). Tests of between-subjects effects showed no significant difference between the Attend-Music and the Attend-Language conditions (Figure 3).

*500–600 ms.* For the late time window, the within-subjects effect of Violation was significant (F(1,33) = 31.317, *p* < 0.001), and the between-subjects effect of Attention was significant (F(1,33) = 9.763, *p* = 0.004). Here, an Attention by Violation interaction was also significant (F(1,33) = 9.951, *p* = 0.003). This interaction is visible in the ERP traces and topographic plots in Figure 2 as well as in the amplitude results plotted in Figure 3: in the Attend-Language condition, only language violations elicited a negative waveform resembling an N400, whereas music violations were no different from the no-violation condition. The N400 shows a latency of 400–800 ms and a centro-parietal topography (Figure 2A), consistent with classic reports of the N400 effect (Kutas and Hillyard, 1984). In contrast, during the Attend-Music condition, only music violations elicited a large positive P3 waveform, whereas language violations showed no difference from the no-violation condition. The P3 shows a latency of 400–1000 ms and a centro-parietal topography (Figure 2B), consistent with the P3b subcomponent of the P3 complex (Polich, 2007). The P3 was only observed for music violations when attention was directed to music, and the N400 was only observed for language violations when attention was directed to language. This attention-dependent double dissociation between P3 and N400 is visible in Figures 2 and 3.

While musical violations elicited an ERAN in the early time window and a P3 in the late time window, language violations only showed an N400 in the late time window, and no effect in the early time window. One potential explanation is that a minority of participants were not first-language English speakers; these participants may have been less sensitive to syntax violations as manipulated by the garden path sentences. Removing the subjects whose first language was not English resulted in a smaller sample size of all first-language English speakers: *n* = 15 in the Attend-Language condition, and *n* = 12 in the Attend-Music condition. Repeating the above behavioral and ERP analyses on these smaller samples showed the same pattern of results. Behavioral results showed significantly above-chance performance on both Attend-language condition (*t*(14) = 28.66, *p* < 0.001) and Attend-music conditions (*t*(11) = 4.93, *p* = 0.002). ERP statistics for first-language English speakers are shown in Table 3.


**Table 3.** ERP statistics for first-language English speakers only.

#### **4. Discussion**

By separately manipulating linguistic syntax, musical syntax, and attention via task demands during simultaneous music and language processing, we were able to disentangle the effects of top-down attention on bottom-up processing of syntax and syntactic violations. Three main findings come from the current results: 1) For both music and language, syntactic violation processing activates a cascade of neural events, indexed by early and late ERP components as seen using time-sensitive methods. This replicates prior work (Koelsch et al., 2000 [11,12], Hahne and Friederici, 1999 [13,14], and many others). 2) Early components are less sensitive to attentional manipulation than late components, also replicating prior work [40,41]. 3) Attention affects musical and linguistic syntax processing differently at late time windows. This finding is novel as it extends previous work that identified early and late components in music and language syntax processing, by showing that the late components are most affected by attention, whereas the earlier stages of processing are less so. Taken together, the results expand on the SSIRH by showing that top-down manipulations of attention differently affect the bottom-up processing of music and language, with effects of attention becoming more prominent throughout the temporal cascade of neural events that is engaged during music and language processing. We posit that the early stages of processing include mismatch detection between the perceived and the expected events, with the expectation being core to syntactical knowledge in both language and music. In contrast, the late attention-dependent processes may include cognitive reanalysis, integration, and/or updating processes, which may require general attentional resources but are not specific to linguistic or musical syntax.

In some respects, the present results add to a modern revision of the classic debate on early- vs. late-selection theories of attention. While early-selection theories (Broadbent, 1958) posited that attention functions as a perceptual filter to select for task-relevant features in the stimulus stream, late-selection theories have provided evidence for relatively intact feature processing until semantic processing [42,43] or until feature integration [44]. Due to their fine temporal resolution, ERP studies provide an ideal window into this debate, allowing researchers to quantify the temporal cascade of neural events that subserve perceptual-cognitive events such as pitch and phoneme perception, and syntax and semantics processing. ERP results from dual-task paradigms such as dichotic listening have
shown that attention modulates a broad array of neural processes from early sensory events [45,46] to late cognitive events [47,48]. Here we observe the ERAN in response to musical syntax violations regardless of whether attention was directed to language or to music. The ERAN was elicited for music violations even when in the attend-language condition; furthermore its amplitude was not significantly larger during the attend-music condition. This result differs from previous work showing that the ERAN is larger during attended than during unattended conditions [24]. The difference likely stems from the fact that while in the previous study the visual task and the musical task were temporally uncorrelated, in the present study the language stimuli (sentence segments) and musical stimuli (chords) were simultaneously presented, with each language-music pair appearing in a time-locked fashion. Thus, when in the attend-language condition, the onset of musical chords became predictably coupled with the onset of task-relevant stimuli (sentence segments), even though the musical chords themselves were not task-relevant. This predictable coupling of task-irrelevant musical onsets with task-relevant linguistic stimulus onsets meant that it became more advantageous for subjects to allocate some bottom-up attentional resources to the music, or to allocate attentional resources to all incoming sensory stimuli at precisely those moments in time when stimuli were expected [49], as one modality could help predict the other. The fact that the ERAN was observed even when only slightly attended provides some support for a partially automatic processing of musical syntax, as posited in previous work [24]. When musical syntax violations were not task-relevant but were temporally correlated with task-relevant stimuli, they elicited intact early anterior negativity but no late differences from no-violation conditions. This early-intact and late-attenuated pattern of ERP results is also consistent with the relative attenuation model of attention, which posits that unselected stimulus features are processed with decreasing intensity [50].

One remaining question concerns whether the ERAN is driven by music-syntax violations, or whether the effects may be due to sensory violations alone. Indeed, musical syntax violations often co-occur with low-level sensory violations, such as changes in roughness or sensory dissonance. In that regard, the musical syntax violations used in the present study are carefully constructed to avoid sensory dissonance and roughness (see supplementary materials of Slevc et al., 2009 for a list of chord stimuli used). Thus the effects cannot be explained by sensory violations. Furthermore, Koelsch et al. (2007) had shown that ERAN is elicited even when irregular chords are not detectable based on sensory violations, which supports the role of ERAN in music-syntax violations. Given our stimuli as well as previous evidence, we believe that the currently observed ERAN reflects music-syntax violations rather than sensory violations. In contrast, no ELAN was observed in response to language violations. This may be because we used garden path stimuli for language violations, while previous studies that elicited early negative-going ERPs used word category violations [14] and phrase structure violations [16] rather than garden path sentences. The introduction of the linguistic garden path requires that participants re-parse the syntactic tree structure during the critical region of the trial; this effort to re-parse the tree likely elicited the N4 at the later time window, but lacks the more perceptual aspect of the violation that likely elicited the ELAN in prior studies (Hahne et al., 1999). Thus, the garden-path sentences and music-syntactic violations used in the present study may have tapped into distinct sub-processes of syntax processing.

It is remarkable that linguistic syntax violations only elicited a significant N400 effect, and no significant effects over any other time windows, even when language was attended. In contrast, musical syntax violations elicited the ERAN as well as the P3 in the attended condition, with the ERAN being observed even when musical syntax was unattended. Note that the P3 effect in this experiment is similar in topography and latency to the P600, which has been observed for semantic processing during garden path sentences. It could also be the Central-Parietal Positivity (CPP), which reflects accumulating evidence for perceptual decisions [51], which can resemble the P3 [52]. During the attend-music condition, linguistic syntax violations elicited no significant ERP components compared to no-violation conditions. This suggests a strong effect of attention on language processing. It is also worth noting that we saw a clear N400 and not a P600 or a P3 in response to garden path sentences in
language. The relationship between experimental conditions and N400 vs. P600 or P3 is an ongoing debate in neurolinguistics: Kuperberg (2007) posits that the N400 reflects semantic memory-based mechanisms whereas the P600 reflects prolonged processing of the combinatorial mechanism involved in resolving ambiguities [28]. Others argue that whether an N400 or a P600 is observed may in fact depend on the same latent component structure; in other words, the presence and absence of N400 and P600 may reflect two sides of the same cognitive continuum, rather than two different processes per se [53–55]. If the N400 and P600 are indeed two sides of the same coin, then this could mean that language and music processing are also more related than the different effects would otherwise suggest.

#### **5. Limitations**

One caveat is that, similar to the original paradigm from which we borrow in this study [2], music was always presented auditorily, whereas language was always presented visually. Thus, the differences we observe between musical and linguistic syntax violation processing could also be due to differences in the modality of presentation. In future studies it may be possible to reverse the modality of presentation, such as by visually presenting musical notation or images of hand positions on a piano [56] with spoken sentence segments. Although doing so would require a more musically trained subject pool who can read musical notation or understand the images of hand positions, prior ERP studies suggest that visually presented musical-syntactic anomalies would still elicit ERP effects of musical syntax violation, albeit with different topography and latency [56]. Furthermore, although participants performed above chance on both attend-language and attend-music comprehension questions, they did perform better on the attend-language task; this imposes a behavioral confound that may affect these results. Future testing on expert musicians may address this behavioral confound. Future studies may also work to increase the sample size, and to validate and match the samples with sensitive baseline measures in both behavioral and EEG testing in order to minimize confounding factors arising from potential differences between participant groups. Importantly, garden path sentences are only one type of syntactic violation; it remains to be seen how other types of violations in linguistic syntax, such as word category violations, may affect the results. Finally, it is yet unclear how syntax and semantics could be independently manipulated in music, or indeed the degree to which syntax and semantics are fully independent, in music as well as in language [57]. In fact, changing musical syntax most likely affects the meaning participants derive from the music; however, specifically composed pieces with target words in mind might be a way to get at a musical semantics task without overtly manipulating syntax [58]. Nevertheless, by separately manipulating music and language during their simultaneous processing, and crossing these manipulations experimentally with top-down manipulations of attention via task demands, we observe a progressive influence of attention on the temporal cascade of neural events for the processing of music and language.

**Author Contributions:** Conceptualization, H.J. and P.L.; Data curation, D.J.L., H.J. and P.L.; Formal analysis, D.J.L. and P.L.; Funding acquisition P.L.; Investigation, P.L.; Methodology, H.J. and P.L.; Project administration, D.J.L. and H.J.; Resources, P.L.; Software, D.J.L. and P.L.; Visualization, D.J.L. and P.L.; Writing – original draft, D.J.L.; Writing – review & editing, P.L.

**Funding:** This research was funded by grants from NSF-STTR #1720698, Imagination Institute RFP-15-15, and the Grammy Foundation to P.L. D.J.L. and H.J. acknowledge support from the Ronald E. McNair Post-Baccalaureate Achievement Program.

**Acknowledgments:** We thank C.J. Mathew and Emily Przysinda for assistance with data collection, and all the participants of this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

1. Patel, A.D. Language, music, syntax and the brain. *Nat. Neurosci.* **2003**, *6*, 674–681. [CrossRef]


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Early Influence of Musical Abilities and Working Memory on Speech Imitation Abilities: Study with Pre-School Children**

#### **Markus Christiner 1,\*,† and Susanne Maria Reiterer 2**


Received: 25 May 2018; Accepted: 29 August 2018; Published: 1 September 2018

**Abstract:** Musical aptitude and language talent are highly intertwined when it comes to phonetic language ability. Research on pre-school children's musical abilities and foreign language abilities is rare but gives further insights into the relationship between language and musical aptitude. We tested pre-school children's abilities to imitate unknown languages, to remember strings of digits, to sing, to discriminate musical statements, and their intrinsic (spontaneous) singing behavior ("singing-lovers versus singing nerds"). The findings revealed that having an ear for music is linked to phonetic language abilities. The results of this investigation show that working memory capacity and phonetic aptitude are linked to high musical perception and production ability already at around the age of 5. This suggests that music and (foreign) language learning capacity may be linked from childhood on. Furthermore, the findings put emphasis on the possibility that early developed abilities may be responsible for individual differences in both linguistic and musical performances.

**Keywords:** phonetic language aptitude; intrinsic singing; singing ability; musical aptitude; working memory

#### **1. Introduction**

Musical abilities and their link to language functions have gained considerable scientific interest in the past decade. Music and language are highly intertwined but, despite their similarity, remarkably different in many respects. Music, song, and language are all to a large degree acoustic and sensory-motor phenomena, perceived and executed similarly, which might be one of the reasons why investigations have started to compare the three faculties intensively [1–4]. In language research, understanding positive transfer effects from music to language, which might be induced by musical input/training or may stem from enhanced musical abilities/aptitude, has been of remarkable interest [5–10]. Interdisciplinary research comparing and trying to account for the differences and commonalities between music, song and language functions ranges from brain to behavioral and from evolutionary to ethological research [2,11–14]. Comparisons of musical abilities with language functions often focus on foreign language learning rather than on first language acquisition [15–17]. This allows us to observe individual differences more effectively, especially when it comes to the link between music and foreign language learning, by analyzing the acoustic levels of speech, such as phonetics and pronunciation. New language material, which is unfamiliar to and imitated by the participants, provides information about individual differences in pronunciation performance, illustrating how fast and accurately individuals can adapt to new languages to which they have not been exposed [18–20]. Foreign language learning capacity is also influenced by musical training, and musicians seem to detect speech incongruities much faster and more accurately than non-musicians [21]. Furthermore, musical training partly influences novel speech processing and learning [7–9,22].

Even though language experts have increasingly provided evidence for individual differences in native speakers' language proficiency [23,24], inter-learner variation in phonetic abilities is more difficult to observe within the mother tongue than in other domains like grammar or vocabulary knowledge. In this research we use the terms "phonetic language abilities", "phonetic aptitude" and "speech imitation ability" interchangeably, meaning the capacity to imitate, mimic and pronounce spoken speech based on holistic judgments of human native speaker raters, who judge imitated prosody as well as phonetic (segmental) aspects [18]. Treating aptitude as a relatively stable "trait" demands tasks that are untrained, or testing that minimizes educational/training/experiential influence. This is best achieved by choosing participants who lack experience in what they are tested for. In language aptitude research, pre-school children tested for foreign language capacity and musical abilities are ideal participants because they fulfill both of the above-mentioned criteria. Interdisciplinary research on language aptitude and musical abilities has mostly focused on adults and pupils [3,4,15,16]. Research on pre-school children has largely been neglected so far, despite the fact that this group is informative in terms of (phonetic and musical) aptitude, since younger children are still less influenced by education and training (environmental/social influences). Education might be one of the important driving forces supporting children's progress in cognitive abilities, linguistic development and musical abilities, which in turn are related to the social environment in which children grow up. The input children receive correlates to some degree with the output they produce [25,26], suggesting that the less formal educational input children receive, the more factors other than training might impact their performance in foreign languages and music. Even though individual differences will also depend on the input given by caretakers and parents, pre-school children may be most naïve in terms of educational influence compared to older participants and rely on their aptitude while solving problems or learning new skills. In psychology, aptitude has been described as raw material allowing individuals to acquire new abilities or to adapt behavior faster and more accurately than their peers [27,28]. Aptitude is often considered a domain-specific skill, and individuals with particular aptitudes show genuine, exceptional and outstanding abilities compared to the general population [29]. This suggests that people who demonstrate certain aptitudes have at least the potential to outperform their less talented peers. Research on aptitude is diverse, ranging from giftedness in sports to chess, composition or writing, to name but a few domains. Language aptitude has been studied behaviorally [3,4,15,16,28] and less intensively neuro-physiologically, but recent research is accumulating knowledge about individual differences in brain structure and function at different levels of linguistic expertise for phonetics and grammar learning [30–34]. More general studies on the bilingual brain, investigating polyglots, multi-linguals and bilinguals, have revealed evidence of individual differences in language learning and brain processing based on working memory capacity, intelligence, different musical abilities, language exposure and age of acquisition [3,4,18–20,35,36].

Similar to language aptitude studies, well-known cases of exceptional musicians such as Mozart or Bach leave no doubt about the role of aptitude in individual differences in abilities. First evidence that aptitude may also be gene-related and contribute to individual differences in either language or musical abilities has been provided recently. Genetic differences in the auditory pathway have been found to be responsible for differences in music perception [37]. Another longitudinal study has found that the basic forms and shapes of Heschl's gyri, which are seen as markers for high musical aptitude in brain research, do not change over time [5,38], and differences in the brain structure of musicians have been reported in multiple investigations [39–41]. Auditory stimulation evokes enhanced activity in a number of originally non-auditory regions in musicians' brains, such as the sensorimotor, parietal and dorsolateral prefrontal cortices, as well as pre-motor and supplementary motor areas [42–48]. Furthermore, musical training induces plastic changes and influences the complexity of the white matter architecture of the cortico-spinal tract [49] and the arcuate fasciculus [50]. Generally speaking, individual differences in musical abilities are said to be based on both nature- and nurture-related influences. The earlier infants find themselves in a music-rich environment, the better their musical abilities may develop [51]. Regarding nurture-related effects, it has been reported that musical training during childhood has a significant effect on motor and auditory skills and may lead to structural brain differences in a relatively short period of time [52]. Music training can explain structural brain differences in adult musicians [52] but also directly improves speech segmentation [7], duration perception [6] and pitch perception ability in children [9]. This suggests that both linguistic and musical skills are based on shared neural mechanisms [2,17]. In this interdisciplinary context, there is a rarely mentioned analogy to language acquisition processes. Infants can undoubtedly learn virtually all languages and, for instance, growing up bilingually leads to similar language skills in two languages, which children learn without difficulty as a matter of environmental contact [7]. It is therefore an accepted notion that the earlier someone is exposed to a language, the better the achievement [53,54]. Exactly the same seems to be the case for music acquisition processes, even though most investigations that compared music and language focused on different analyses and did not directly compare the acquisition processes per se.

Music and speech are recognized as separate capacities but are perceived by the same auditory system and require similar cognitive skills [1,2,5]. Achievement in musical abilities improves working memory (WM) capacity, which is in turn important for multiple cognitive abilities (being neither language- nor music-specific) and has been found to be trainable, with transfer effects to general intelligence, executive control and problem solving [55]. WM capacity and its link to musical aptitude have been observed among adults and children [56], and suggestions that verbal and tonal processing and execution show large overlaps (plus subtle differences as a function of musical expertise) have already been made in a series of neurocognitive investigations, e.g., [57–60]. In addition, research on WM training programs has noted remarkable improvements after training sessions with children suffering from attention deficits, not only in what was trained, but also in new, unrelated tasks [61]. WM ability is age-related: strings of three digits in forward order are recalled at around five years of age [62]. Following language aptitude research, it has often been argued that WM has some potential to replace the idea of aptitude, and indeed many investigations have been able to show that WM capacity is related to processing, retaining and repeating unfamiliar language material [18–20,63], placing WM amongst the strongest predictors of linguistic success. As already mentioned, WM is age-related, following developmental steps from simpler to more complex. Likewise, this could be similar when learning a new language. Previous research on adults has illustrated that 9- to 11-syllable-long language material allows us to observe individual differences [18,19]. For pre-school children it can be suggested that 5- to 6-syllable-long unfamiliar language material might be appropriate to test their phonetic aptitude.

Apart from the WM overlaps between music and speech, song represents a transitional or hybrid faculty which comprises both linguistic and musical features. Studies focusing on singing capacity and language functions are still underrepresented in the recent literature. Some investigations have focused on comparing language learning and singing ability [18] and on singing as a learning tool [4,64]. Research has also demonstrated the effects of long-term vocal training [50] and found improved connectivity between the kinesthetic and auditory feedback system and the anterior insular cortex [65], which also contributes to voice motor/somatosensory control and expertise in singers [66]. Furthermore, structural adaptations in singers lead to changes in the complexity and volume of white matter tracts [50].

Comparisons between speech and song [67] concluded that vocalization of speech and song largely shares the same neural network, with bilateral activation in the superior temporal sulcus, the inferior pre- and post-central gyrus and the superior temporal gyrus. Speaking and singing draw on common ground, as body posture, emission, resonance and articulation are based on the same principles [68]. Singing is slower in production than speech and trains motor ability and the vocal apparatus. This is one fundamental reason why singing (intoned word production) is often used for therapeutic purposes to regain the motor ability to vocalize, and indeed children's language progress develops alongside motor control [69,70].

This investigation focuses on the aptitude for acquiring phonetic patterns of unfamiliar languages and its relationship to musical abilities. We sought to uncover the link between phonetic aptitude, singing and musical abilities in pre-school children to better understand music and language acquisition processes from a developmental perspective. We hypothesized that if pre-school children already performed differently in music, singing and phonetic language tasks, as adults do when tested behaviorally, this would be evidence that language or speech imitation aptitude either develops very early or is at least a very stable trait. This could open new discussions and accumulate evidence on the distribution of language aptitude within the general population, and thus suggest how aptitude and individual differences can be detected, used and integrated in learning settings to support language acquisition processes. Understanding learners' needs and individual differences in aptitude may eventually change educational programs, which could improve language learning as well as positively affect other related cognitive abilities. Thirty-five pre-school children were tested for their musical abilities (music perception and singing), their ability to imitate Turkish, Tagalog, Russian, and Chinese (phonetic aptitude, speech imitation), their WM (digit span) and social-environmental variables, such as the influence of caretakers and caretakers' musical activities. This was done with a view to ascertaining whether the foreign language and musical abilities of pre-school children were comparable to what had been found in adults.

#### **2. Methods**

#### *2.1. Participants*

In this investigation we selected 35 (16 female and 19 male) pre-school children aged 5 to 6 (mean age = 5.66, *SD* = 0.48). All of them attended a private kindergarten, were German monolinguals, and were naïve in formal language and musical training, apart from counting numbers in English and simple singing activities like Happy Birthday. None of them grew up bilingually or participated in a specific language program. A questionnaire revealed that neither the participants nor the parents had had contact with the language material (stimuli) which was tested. The parents belonged to a higher socio-economic background, gave informed consent and agreed to the participation of their children in this investigation. The children were also asked orally whether they would like to participate in the study, and all happily agreed because they had already been familiarized with the experimenter on several music teaching occasions. They were instructed that they could stop at any time if they felt uncomfortable or if they wished to withdraw their consent. Testing took place within two weeks, and all tests were performed separately at different times within this time window. The experimenter was well integrated into the kindergarten and had started working there half a year before testing to make sure that the children knew him well. For analyzing the background information of the children and the parents, a questionnaire had been designed which focused on musical profiling, singing behavior and language contact.

#### *2.2. Speech Imitation*

For testing the children's ability to remember and repeat unfamiliar language material, we selected three phrases for each of the 4 languages used (Turkish, Tagalog, Russian, and Chinese). The language material was five and six syllables long. The original phrases had been spoken by native speakers and recorded in a sound-proof room. The children were tested in a separate room in the kindergarten, where they had to listen to the phrases three times before they repeated the language stimuli, which were recorded and rated by six native speakers for Tagalog and Turkish, by four raters for Russian and by seven raters for Chinese. Ratings were performed on a scale between 0 and 10, where 10 was the highest and 0 the lowest score. Native speakers are said to make judgments comparable to those of phonetic experts [53,54], and multiple investigations have used the same methodology to analyze individual differences in phonetic abilities [18–20,63]. We instructed the raters to rate the overall performance immediately (spontaneous global judgment) and to use headphones while rating the files.

The analysis of the participants' singing ability was based on two criteria. The first was how well the children sang according to four music teachers who regularly visited the kindergarten and analyzed the children's singing ability. They rated singing ability on a scale between 0 and 10, where 0 was the lowest and 10 the highest score. This measurement focused on accuracy, intonation, timing and how well the children sang. The same scale was used for the ratings of the second criterion: the kindergarten teachers were instructed to observe, over a period of 14 days, how intuitively the children started singing without having been instructed to do so (intrinsic motivation). This aimed at isolating the children's inner needs and intrinsic motivation to sing, and should reveal whether those who sing without being instructed perform differently in language and musicality tasks compared to those who show less motivation to sing. Additionally, the parents were asked to estimate how many hours their children sang during the week, as well as to indicate how many hours they themselves sang with their children or played a musical instrument.

#### *2.3. Music Perception*

The music perception abilities of the children were tested with the Primary Measures of Music Audiation (PMMA) [71], a test still widely used in research to measure musicality. This test measures the ability to discriminate tonal and rhythmical changes between paired musical statements and has been designed for children from kindergarten to third grade. The test is subdivided into two sections. While the first analyzes children's ability to detect tonal changes, the second focuses on their ability to discriminate rhythmical changes. Even though this test is widely used for measuring musical aptitude, there are some limitations regarding its validity. For instance, studies have reported inconsistent results for the two subtests, which show deviations from the published norms [72,73]. Another investigation noted that especially the internal reliability of the rhythm subtest should be treated with caution for grade 1 students and kindergarten children [73].

#### *2.4. Working Memory*

For testing the working memory abilities of the pre-school children we used strings of numbers (digit span) [74]. The numbers were recorded, and the children had to listen to them and repeat them in the same order. As a familiarization task, a string of two numbers was given to test whether the children understood the task. The strings of numbers increased in length, and the test stopped after the children could not accurately repeat a string of numbers a second time.
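
As a rough illustration only, the sketch below shows one plausible way to implement forward digit-span administration and scoring in Python. It is not the authors' procedure: the function name and the "stop after two failed attempts at a length" stopping rule are assumptions made for the example, since the original description leaves the exact rule open.

```python
import random

def forward_digit_span(recall, start_len=2, max_len=9, seed=0):
    """Forward digit span: list length grows until two failed attempts at one length.
    `recall(digits)` returns the child's (here: simulated) attempt at repeating the list.
    Assumed stopping rule; the original task may have been administered differently."""
    rng = random.Random(seed)
    best = 0
    for length in range(start_len, max_len + 1):
        failures = 0
        while failures < 2:
            digits = [rng.randint(0, 9) for _ in range(length)]
            if recall(digits) == digits:
                best = length  # longest string repeated correctly so far
                break
            failures += 1
        else:
            return best  # two failed attempts at this length: stop the test
    return best

# Simulated child who reliably recalls up to three digits (illustrative only).
print(forward_digit_span(lambda d: d if len(d) <= 3 else []))  # -> 3
```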

#### *2.5. Ethical Approval*

All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of the University Hospital and the Faculty of Medicine Tübingen, Project identification code 529/2009BO2.

#### **3. Results**

#### *3.1. Descriptives and Correlations*

For illustration of the relationships between the individual variables, tables are shown in the following sections. Table 1 contains the descriptive statistics of the variables under consideration. The units are the actual scores of the variables measured. Table 2 shows the correlations between the variables.



[Tables 1 and 2 not reproduced here. Table note (fragment): false discovery rate *p* ≤ 0.05. PMMA: Primary Measures of Music Audiation.]
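
For readers who wish to reproduce this style of analysis, the following is a minimal sketch of pairwise Spearman correlations with a Benjamini–Hochberg false-discovery-rate correction at *p* ≤ 0.05. It is not the authors' analysis script; the variable names and randomly generated scores are illustrative placeholders.

```python
# Sketch of a Spearman correlation matrix with FDR correction (illustrative data only).
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

# Hypothetical scores for 35 children; columns loosely mirror the measures in Tables 1-2.
df = pd.DataFrame({
    "pmma_total": np.random.randint(40, 80, 35),
    "digit_span": np.random.randint(2, 6, 35),
    "speech_imitation": np.random.uniform(0, 10, 35),
    "singing_ability": np.random.uniform(0, 10, 35),
})

cols = df.columns
pairs, rhos, pvals = [], [], []
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        rho, p = spearmanr(df[cols[i]], df[cols[j]])  # two-tailed by default
        pairs.append((cols[i], cols[j]))
        rhos.append(rho)
        pvals.append(p)

# Benjamini-Hochberg false discovery rate correction at q = 0.05
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for (a, b), rho, p, sig in zip(pairs, rhos, p_adj, reject):
    print(f"{a} ~ {b}: rs = {rho:.2f}, FDR-adjusted p = {p:.3f}, significant = {sig}")
```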

#### *3.2. Musicality Test PMMA*

The PMMA total score was significantly correlated with the working memory test (strings of numbers repeated in forward order), *rs* = 0.53, *p* (two-tailed) < 0.01, and with the singing parameter singing ability, *rs* = 0.44, *p* (two-tailed) < 0.01. The PMMA total score was also significantly correlated with all language imitation tasks, Tagalog (*rs* = 0.45, *p* (two-tailed) < 0.01), Chinese (*rs* = 0.40, *p* (two-tailed) < 0.05), Turkish (*rs* = 0.39, *p* (two-tailed) < 0.05), Russian (*rs* = 0.37, *p* (two-tailed) < 0.05), and with the overall speech imitation ability, which comprises all language imitation tasks (*rs* = 0.44, *p* (two-tailed) < 0.01).

#### *3.3. Speech Imitation (Comprises Tagalog, Chinese, Turkish and Russian)*

Speech imitation showed a significant correlation with the PMMA total score, *rs* = 0.44, *p* (two-tailed) < 0.01, and with the tonal subtest, *rs* = 0.38, *p* (two-tailed) < 0.05. Speech imitation was also significantly related to how accurately the children repeated strings of numbers in forward order (working memory), *rs* = 0.56, *p* (two-tailed) < 0.01, and to the singing parameter "singing behavior", which revealed how intuitively the children sang without being instructed to sing, *rs* = 0.39, *p* (two-tailed) < 0.05.

#### *3.4. Working Memory*

Working memory ability correlated with the musicality test PMMA: the total score, *rs* = 0.53, *p* (two-tailed) < 0.01, and the tonal subtest, *rs* = 0.58, *p* (two-tailed) < 0.01. Further correlations of working memory capacity with all language imitation tasks, Tagalog (*rs* = 0.42, *p* (two-tailed) < 0.05), Chinese (*rs* = 0.41, *p* (two-tailed) < 0.05), Turkish (*rs* = 0.58, *p* (two-tailed) < 0.01), Russian (*rs* = 0.38, *p* (two-tailed) < 0.05), and with the overall speech imitation ability (*rs* = 0.56, *p* (two-tailed) < 0.01) were also observed.

#### *3.5. Singing Ability*

Singing ability, which measures how accurately the children sang, was correlated with the PMMA total score (*rs* = 0.44, *p* (two-tailed) < 0.01), the tonal PMMA subtest (*rs* = 0.39, *p* (two-tailed) < 0.05) and the rhythm PMMA subtest (*rs* = 0.34, *p* (two-tailed) < 0.05). Singing ability also showed a significant relationship to singing behavior (*rs* = 0.80, *p* (two-tailed) < 0.01) and correlated with the language imitation task Tagalog (*rs* = 0.44, *p* (two-tailed) < 0.01).

#### *3.6. Singing Behavior*

Singing behavior, which should reveal the children's intrinsic motivation to sing, correlated with the rhythm PMMA subtest (*rs* = 0.36, *p* (two-tailed) < 0.05) and with singing ability (*rs* = 0.80, *p* (two-tailed) < 0.01), and also showed a significant relationship to Tagalog (*rs* = 0.49, *p* (two-tailed) < 0.01) and to the global speech imitation ability (*rs* = 0.36, *p* (two-tailed) < 0.05).

#### *3.7. Inter-Rater Reliability*

Inter-rater reliability (Cronbach's alpha) was 0.88 for Tagalog, 0.85 for Chinese, 0.94 for Turkish and 0.86 for Russian, all in the acceptable range above 0.70.
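
Inter-rater consistency of this kind can be computed as Cronbach's alpha over a children-by-raters score matrix. The sketch below is a generic implementation of the standard formula, not the authors' code; the simulated ratings are placeholders.

```python
# Generic Cronbach's alpha over a children x raters matrix (illustrative data only).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of per-rater variances / variance of totals)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of raters
    item_var = scores.var(axis=0, ddof=1).sum()  # sum of per-rater variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed ratings per child
    return k / (k - 1) * (1 - item_var / total_var)

# Example: 35 children rated by 6 hypothetical Tagalog raters on a 0-10 scale.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 11, size=(35, 6))
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```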

#### *3.8. Mann–Whitney Test (Group Comparisons)*

We divided our group into high and low musical aptitude based on the music perception task PMMA (tonal and rhythmic discrimination ability). The high musical aptitude group (*Median* = 4) performed significantly better than the low musical aptitude group (*Median* = 3) in the working memory task, U = 51.00, z = −3.58, *p* < 0.001, r = −0.61. The high musical aptitude group (*Median* = 3.25) also performed significantly better than the low musical aptitude group (*Median* = 2.75) in speech imitation, U = 83.00, z = −2.28, *p* < 0.005, r = −0.39. The high musical aptitude group (*Median* = 7.88) performed significantly better than the low musical aptitude group (*Median* = 4) in singing (singing ability), U = 91.50, z = −2.01, *p* < 0.005, r = −0.34. All three group comparisons remained significant at *p* < 0.05 after Bonferroni–Holm correction for the three variables working memory, singing ability and speech imitation. For illustration see Figure 1 below.

**Figure 1.** Working memory, singing ability and speech imitation were higher in children with high (compared to low) musical aptitude.
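
A minimal sketch of this type of group comparison is given below: a median split on the PMMA total score, Mann–Whitney U tests on each outcome, an approximate effect size r = z/√N derived from the normal approximation to U (without tie correction), and a Bonferroni–Holm correction across the three comparisons. The variable names and simulated data are assumptions for illustration, not the authors' original analysis.

```python
# Median split + Mann-Whitney U tests with Holm correction (illustrative data only).
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def mwu_with_effect_size(x, y):
    """Mann-Whitney U plus approximate z and r = z / sqrt(N) (normal approximation, no tie correction)."""
    u, p = mannwhitneyu(x, y, alternative="two-sided")
    n1, n2 = len(x), len(y)
    mu, sigma = n1 * n2 / 2, np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, p, z, z / np.sqrt(n1 + n2)

rng = np.random.default_rng(1)
pmma = rng.integers(40, 80, 35)                     # hypothetical PMMA total scores
outcomes = {
    "working_memory": rng.integers(2, 6, 35),
    "speech_imitation": rng.uniform(0, 10, 35),
    "singing_ability": rng.uniform(0, 10, 35),
}
high = pmma >= np.median(pmma)                      # median split into high/low aptitude

pvals = []
for name, scores in outcomes.items():
    u, p, z, r = mwu_with_effect_size(scores[high], scores[~high])
    pvals.append(p)
    print(f"{name}: U = {u:.1f}, z = {z:.2f}, p = {p:.3f}, r = {r:.2f}")

# Bonferroni-Holm correction across the three comparisons
reject, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print("Holm-adjusted p:", np.round(p_holm, 3), "significant:", reject)
```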

#### **4. Discussion**

The music perception task (PMMA) was used to create the children's group membership and consists of rhythm and tonal discrimination tasks. Even though this test is widely used for measuring musical aptitude, there are some limitations concerning its validity. According to Stamou et al. [73], the results of the rhythm subtest should be regarded with caution for pre-school children and grade 1 students as a result of cross-cultural issues, which may also be relevant for this investigation as the children were German native speakers. However, in this study, similarly to research on adults, effects of individual differences in music perception, working memory capacity, speech imitation and singing have been found. For statistical analysis we split the sample based on music perception abilities, measured by the PMMA, and created two groups of high and low musical aptitude. Working memory, speech imitation and singing ability were found to be significantly different between the two musical groups, while singing behavior failed to reach statistical significance within the model, even though there was a tendency.

#### *4.1. Musical Expertise, Plasticity, Musical Abilities and Working Memory*

Several investigations have found that music perception abilities improve the ability to remember, imitate and retrieve unfamiliar language material [5,15]. As shown in this investigation, the same seems to hold true for pre-school children: the higher their music perception ability, the better their ability to memorize and imitate new, unfamiliar language material. This puts emphasis on the importance of addressing the relationship between musical and linguistic abilities. The building of expertise in the areas of linguistic or musical abilities is, in essence, the recurrent problem of the relationship between "nature and nurture", which is difficult to investigate experimentally given a lack of testable definitions of the "talent-ability" terms. There is considerable debate as to whether differences in behavior are due to "innate talent" or to the quantity and quality of practice in a given domain. Indeed, the viewpoint that genetically and epi-genetically driven abilities interact with and are modified by experience is perhaps dependent on the skill domain, but is generally widely accepted in the domains of language and music acquisition.

Musicians' neurophysiological and auditory enhancements and the beneficial transfer effects on language functions are often related to the years of training [50,52,75–78], and, ontogenetically speaking, discrimination of rhythm is an early developed mechanism that starts prenatally and continues during infancy [77]. The effect of musical training and the impact of culture on music acquisition is undeniable, even though individual differences in high achievement may be based on other aspects as well. In this study the children were naïve in terms of individual musical training, which might suggest that factors other than proper music training, such as very early developmental or pre-, peri- or post-natal influences, contribute to individual differences in their musical abilities. Auditory models have already proposed that primary capacities influence musical aptitude, while secondary musical skills are environmentally shaped by the culture and individual training received [79]. First evidence that primary capacities for musical aptitude may also be gene-related, like the processing of auditory signals, which alters the auditory pathway crucial for discriminating musical input, has already been reported [37]. Multigenerational family studies have also revealed that several predisposing genes or variants contribute to musical aptitude, e.g., [80], and evidence for alterations of the brain structure of musicians, which improve music perception and performance, is diverse, e.g., [5,20,40,49,51]. Inter-individual changes and structural differences in the auditory cortex cannot be ascribed to training effects only, and particular brain areas seen as markers for high musical abilities (e.g., Heschl's gyrus) seem to be rather stable in shape [5,38,40,41]. For instance, duplications of Heschl's gyrus occur more often in musicians than in non-musicians [38]. Anatomical alterations of the gray matter [33,34] or volume and complexity differences in white matter tracts of singers [50] have been reported. While increasingly more evidence suggests that musical aptitude could be gene-related or at least a stable trait over the life-time, studies that identify biological markers, such as genetic or neuroanatomical markers for phonetic aptitude, have been largely neglected so far within the area of second language acquisition or language learning research.

The underlying reasons for the link between musical and phonetic aptitude are also based on shared cognitive functions as well as on shared mechanisms in the execution and processing of music and speech [5–10]. Music and language functions require the recruitment of similar cognitive processes, attention control, and anatomical and neuroanatomical endowment [5]. WM capacity, an elaborate cognitive skill crucial for multiple abilities, is related to phonetic aptitude and musical abilities. The ability to remember rapid and temporary information is important for the learning of new language material which is poor in linguistic content [81]. There is an analogy to remembering melodies or discriminating different musical pieces. Language learners in the beginning phase will benefit most from higher WM capacity. The basic acoustic signals of musical sounds (pitch, timing and timbre) play a key role in both speech and music, especially in conveying information [82]. Pitch, the property of sounds that is organized by a scale, can be judged as lower or higher. Timing refers to the temporal events of sounds, and timbre to the perceived sound quality, also referred to as tone quality [82]. Evidence has been provided that the improved processing of timbre and pitch in musicians is based on functional adaptations, but seems to play a general role in musical development from infancy onwards [83]. Furthermore, musicians' improvements in detecting pitch and timbre cues largely rely on plastic adaptations [49,82,84]. Language acquisition processes are also based on differentiating timbres of speech that are meaningful and/or meaningless [84], and for musicians it is also necessary to discriminate timbre differences between various instruments, which in turn seems to have an effect on speech sound discrimination as well [82]. Evidence for an overlap between processing tonal and verbal material comes especially from brain research, e.g., [51,53]. Higher WM abilities are also associated with musical aptitude [18], but are also age-related [85]. Children (aged 4 to 6) can remember around two words in forward order, and strings of around three digits in forward order [62]. The results of this as well as earlier research corroborate the findings of previous investigations and suggest that WM is highly important for learning new languages. Thus, it can be assumed that people with high WM capacity are faster learners [86], showing that WM predicts not only overall language aptitude scores, but is also related to the language analysis components of the aptitude construct. Therefore, the link between WM, language acquisition and musical aptitude is a very promising research area, and the overlap of WM capacity for music and language functions may lead, or has led, researchers to argue that WM has the potential to replace the whole language aptitude construct (equating language aptitude with working memory capacity). This claim, however, should be treated with caution, since one limitation of direct imitation tasks, like the ones used in this investigation, is that they always require high working memory loads. Indirect imitation tasks, retrieved from long-term memory, may reduce the influence of WM and inform about phonemic awareness. Future research on phonetic aptitude should include both direct and indirect imitation tasks to get a multidimensional impression of phonetic language aptitude. The measurements used in this investigation provided evidence on how fast and accurately someone imitates, retrieves and memorizes unfamiliar speech material on an "ad-hoc" basis, while indirect imitation tasks would also yield information about achievements in meta-cognitive awareness (phonemic awareness).

#### *4.2. Singing Behavior, Singing Ability and Speech Imitation*

Singing behavior was not significantly different between the music groups created, which is in line with previous research in which the music perception ability of singers and instrumentalists did not differ, while the ability to reproduce new language material was significantly better in the singer group [19]. The result of this investigation, however, could also be due to the fact that pre-school children do not yet have comprehensive musical skills. Even though sensory consonance and pitch discrimination ability develop relatively early during infancy, harmonic knowledge, for instance, develops significantly later, between 6 and 12 years [87]. Furthermore, limitations of the study design must be mentioned here as well. Singing behavior was rated based on mere observations by the caretakers in the kindergarten. Our reasons for choosing this design lay in keeping testing time appropriately short for pre-school children and in rendering the testing situation as natural as possible.

Individual differences in the performances of children were also based on the intrinsic motivation to produce vocalizations, which might be driven by their inner need to express their feelings to the outer world. The variable singing behavior correlated more strongly with the global speech imitation score than mere singing ability or singing accuracy did. The reason for this might be that children's fine motor abilities are still under development, and this affects both the singing and the articulatory skills [3,4]. First language acquisition develops alongside motor control development [88], and singing as vocal behavior largely shares the same mechanisms. Evidence that vocal motor commands can alter and influence speech perception has already been provided, e.g., [89]. Intrinsic singing behavior and the desire to sing, reflecting intrinsic motivation, may be more important than accuracy for playfully expanding vocal flexibility in this period of life. However, since the correlations of singing ability and singing behavior differed only slightly from one another, these results should be interpreted with caution.

#### *4.3. Typologically Different Languages and Musical Abilities*

Tone languages and non-tone languages are different in many respects, and even though causality cannot be inferred from the correlations, there seems to be a tendency towards differences between non-tone languages and tone languages in relation to singing. Future investigations may need to consider typologically different languages and their relationship to musical aspects, which cannot be explained within this limited research design. Recent research has isolated different transfer effects from music to language, where pitch discrimination contributed to tone languages, while rhythmic discrimination contributed to non-tone languages [63]. Goswami and colleagues also showed that "novel remediation strategies on the basis of rhythm and music may offer benefits for phonological and linguistic development" [90], and early musical training during childhood supports foreign language perception, memory and later foreign language acquisition processes [91]. As a general rule, it has been accepted that during infancy, basically between six and twelve months, language-specific phonetic contrasts are perceived, followed by a decline in the ability to perceive non-native contrasts [92]. The first two years of development therefore seem to leave a deep cognitive imprint on children's language performance also later in life [93]. Twelve-month-old infants' speech segmentation and speech processing abilities seem to be predictive measurements for observing individual differences in language development between four and six years of age [94]. The children in this investigation were five- to six-year-olds and had already acquired the native phonetic contrasts of German, a non-tone language. Singing ability and singing behavior did not seem to relate to the Chinese performances. Chinese only correlated with the PMMA total and tonal scores and working memory capacity, but did not show correlations with the rhythm PMMA subtest or with any of the singing criteria. Although correlations do not inform about causality, tone language imitation and its relation to musical measurements should be investigated in more detail in further research. This could include various musical measurements focusing on pitch, timing and timbre perception tasks, musical instrument playing or singing, to better understand the impact of musical abilities on second language learning by non-tone language speakers.

An analysis of the Chinese imitation performances of school children at the age of 9 has revealed that around 40% of the variance in the Chinese imitations of non-tone (German) language native speakers could be explained by singing ability and tonal perception ability together with WM capacity [63]. Learning Chinese as a second language may require precise musical knowledge, an ability which is developed at around the age of 9 [95]. Research has shown that tone language speakers are better at discriminating tone contrasts in any language compared to non-tone language speakers [96], and evidence that musicians and tone-language speakers largely share cognitive and perceptual skills for pitch discrimination has also been reported recently, illustrating a bidirectional path between music and language [97]. Evidence has been provided that high melodic ability in non-tone language speakers leads to better performance in detecting tonal variations in Mandarin [98]. This shows that musical ability may be highly relevant for learning tone languages as a non-tone language speaker. This may be the reason why Chinese native speakers who learn a non-tone language during adulthood face pronunciation difficulties and, vice versa, non-tone language speakers have difficulties discriminating Chinese sounds. Singing should be more closely integrated in future research designs to isolate the influence of singing on the acquisition of tone languages as a second language by non-tone language speakers.

#### **5. Conclusions**

A musical ear, singing ability/behavior and working memory capacity are linked to speech imitation abilities already at a very early stage in development. Comparable to research on adults, we have found similar effects and links in pre-school children, who are naïve in terms of education and music training, suggesting that individual differences might also be based on very early developed factors. Group comparisons of children with high versus low music perception abilities reveal that the high musicality group performs better in novel speech imitation, sings better and shows enhanced WM capacity. Children at this particular age are less influenced by educational input than adults, which hints at early developmental factors contributing to individual differences in musical abilities and (novel) speech learning abilities. This shows that music and language capacities are ultimately linked in children and adults. Singing behavior did not yield statistically significant differences between groups, which could show that this behavioral measure has lower reliability. On the other hand, it is in line with research on adults showing that singing ability, singing motivation and music perception are different, non-overlapping entities, sometimes leading to similar and sometimes to different transfer effects [19].

**Author Contributions:** M.C.: data curation, formal analysis, investigation, methodology, project idea, project administration, software, writing—original draft, resources. S.M.R.: resources, supervision, conceptualisation, methodology, discussion, writing—reviewing and editing.

**Funding:** This research received no external funding.

**Acknowledgments:** We want to thank the study participants.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
