**Pushing the Envelope: Developments in Neural Entrainment to Speech and the Biological Underpinnings of Prosody Perception**

#### **Brett R. Myers 1,2,\*, Miriam D. Lense 1,3,4,5 and Reyna L. Gordon 1,4,5,6,\***


Received: 31 December 2018; Accepted: 15 March 2019; Published: 22 March 2019

**Abstract:** Prosodic cues in speech are indispensable for comprehending a speaker's message, recognizing emphasis and emotion, parsing segmental units, and disambiguating syntactic structures. While it is commonly accepted that prosody provides a fundamental service to higher-level features of speech, the neural underpinnings of prosody processing are not clearly defined in the cognitive neuroscience literature. Many recent electrophysiological studies have examined speech comprehension by measuring neural entrainment to the speech amplitude envelope, using a variety of methods including phase-locking algorithms and stimulus reconstruction. Here we review recent evidence for neural tracking of the speech envelope and demonstrate the importance of prosodic contributions to the neural tracking of speech. Prosodic cues may offer a foundation for supporting neural synchronization to the speech envelope, which scaffolds linguistic processing. We argue that prosody has an inherent role in speech perception, and future research should fill the gap in our knowledge of how prosody contributes to speech envelope entrainment.

**Keywords:** prosody; speech envelope; neural entrainment; rhythm; EEG

"In a house constructed of speech, the bricks are phonemes, and the mortar is prosody. Without the latter, we'd simply live under a pile of rocks". —B.R.M.

#### **1. Prosody Perception**

Prosody is the stress, intonation, and rhythm of speech, which provides suprasegmental linguistic features across phonemes, syllables, and phrases [1–3]. Prosodic cues contribute affect and intent to an utterance [4] as well as emphasis, sarcasm, and more nuanced emotional states [5,6]. Certain prosodic cues are universal and can be interpreted cross-culturally even in an unfamiliar language [7,8]. Prosody also provides valuable markers for parsing a continuous speech stream into meaningful segments; these markers include intonational phrase boundaries [9], dynamic pitch changes [10], and metrical information [11]. Parsing speech units based on prosodic perception is an imperative early stage in language acquisition, and it is considered a precursor to vocabulary and grammar development [12–14]. In addition, prosody can convey semantic information for context in a message [15,16]. Deficits in prosody perception have a negative downstream impact on linguistic abilities, literacy, and social interactions, e.g., [17–20].

Prosodic fluctuations are responsible for communicating a wealth of information, primarily through acoustic correlates such as duration, amplitude, and fundamental frequency. As any of these parameters changes, it influences the expression of stress, intonation, and rhythm of the spoken message [21,22]. One illustration of the dynamic and multidimensional nature of prosody is "motherese" or infant-directed speech, which is characterized by exaggerations in duration and fundamental frequency [23]. The exaggerated speech signal creates louder, longer, and higher-pitched stressed syllables [24], which facilitates segmenting the speech into syllable components and disentangling word boundaries [25,26]. The modified prosodic qualities of infant-directed speech make the signal acoustically salient and engaging for infants [23,27], which yields later linguistic benefits such as boosts in vocabulary acquisition [28] and access to syntactic structures [29]. This is one example of how prosody plays an important role in speech communication.

The importance of prosody to speech perception is widely acknowledged, yet it has been underrated in many studies examining neural entrainment to the speech envelope. The purpose of the current review is to demonstrate that prosodic processing is ingrained in investigations of neural entrainment to speech and to encourage researchers to explicitly consider the effects of prosody in future investigations. We will review the speech envelope and its relation to the prosodic features of duration, amplitude, and fundamental frequency, and we will discuss electrophysiological methods for measuring speech envelope entrainment in neural oscillations. We will then highlight some previous research using these methods in typical and atypical populations with an emphasis on how the findings may be connected to prosody. Finally, we propose directions for future research in this field. It is our hope to draw attention to the role of prosody processing in neural entrainment to speech and to encourage researchers to examine the neural underpinnings of prosodic processing.

#### **2. Amplitude Modulation**

Prosody is determined by a series of acoustic correlates—duration, amplitude, and fundamental frequency—which can be represented in a number of ways, including the amplitude modulation (AM) envelope (also known as the temporal envelope) [30,31]. It is important to mention that a temporal waveform is composed of a "fine structure" and an "envelope". Fine structure consists of fast-moving spectral content (e.g., frequency characteristics of phonemes), while the envelope captures the broad contour of pressure variations in the signal (e.g., amplitude over time) [32]. In other words, the envelope is superimposed over the more rapidly oscillating fine structure. Both envelope and spectral components are important for speech comprehension—i.e., to "recognize speech" rather than "wreck a nice beach" (Figure 1) (see [33] but also [34]).
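To make the envelope/fine structure distinction concrete, the two components can be separated with a Hilbert transform. The sketch below is a minimal illustration rather than the procedure of any cited study; the waveform `x`, sampling rate `fs`, and the 30 Hz smoothing cutoff are assumptions chosen for demonstration.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def envelope_and_fine_structure(x, fs, env_cutoff_hz=30.0):
    """Split a waveform into a slow amplitude envelope and a unit-amplitude fine structure."""
    analytic = hilbert(x)                         # analytic signal of the waveform
    envelope = np.abs(analytic)                   # instantaneous amplitude (broad contour)
    fine_structure = np.cos(np.angle(analytic))   # rapidly oscillating carrier
    # Low-pass the envelope so that only slow, suprasegmental fluctuations remain.
    b, a = butter(4, env_cutoff_hz / (fs / 2), btype="low")
    smoothed_envelope = filtfilt(b, a, envelope)
    return smoothed_envelope, fine_structure
```

Multiplying the envelope back onto the fine structure approximately recovers the original waveform, which illustrates that the two representations carry complementary information.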

It has been suggested that the extraction of AM information is a fundamental procedure within the neural architecture of the auditory system [35]. The auditory cortex is particularly adept at rapidly processing spectro-temporal changes in the temporal fine structure [36], and it is possible that this processing is aided by the amplitude envelope first laying the foundation for finer-grained linguistic structure [37]. For example, fine structure cues play an important role in speech processing, yet normal-hearing listeners are able to detect these cues from envelope information alone [38]. Even when spectral qualities are severely degraded, speech processing can be achieved with primarily envelope information [39], as the envelope provides helpful cues for parsing meaningful segments in speech [40,41]. Additionally, the temporal characteristics of an auditory object allow us to focus attention on the source and segregate it from competing sources [42], which makes detection of envelope cues essential in speech communication. Because the amplitude envelope captures suprasegmental features across the speech signal, it lends itself to being an excellent proxy for prosodic information, and we argue that studies that use the speech envelope are inherently targeting a response to prosody.

**Figure 1.** Two representations of an acoustic speech signal: Amplitude envelope (**A**) and spectrogram (**B**). Subtle differences between the phrases "recognize speech" and "wreck a nice beach" can be detected in both representations.

The speech amplitude envelope provides a linear representation of AM fluctuations over time. Acoustic stimuli are constructed of multiple temporal dimensions [31], and modulation energy varies based upon the selected band of carrier frequencies in the signal [35]. Speech can be portrayed through a hierarchical series of AM frequency scales [43]; that is, stress placement occurs at a rate of ~2 Hz [44], syllable rate occurs around 3–5 Hz [24], and phonemic structure has a faster rate of 8–50 Hz [31] (see Figure 2). Liss et al. [45] found that energy in the frequency bands below 4 Hz was intercorrelated, and energy above 4 Hz (up to 10 Hz) was separately intercorrelated. The frequency range between 4 and 16 Hz primarily affects speech intelligibility [46], while frequencies below 4 Hz strictly reflect prosodic variations, such as stress and syllable rate [47]. These multiple timescales of modulation energy within the speech envelope have been shown to elicit corresponding modulations in cortical activity during speech processing [48]. This correspondence appears to play a role in potentially challenging listening situations, such as speech in noise [49], multiple speakers [50], complex auditory scenes [51], conflicting visual information [52], and divided attention [53,54]. Each of these situations (discussed in more detail later) requires the listener to exploit the natural timing of speech using prosodic cues, which are provided in the amplitude envelope [55].
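One way to make these timescales explicit is to band-pass a downsampled version of the amplitude envelope at each modulation rate. The sketch below assumes the envelope has already been resampled to a low rate (e.g., 128 Hz); the band edges simply mirror the approximate rates cited above and are illustrative rather than a standard from any one study.

```python
from scipy.signal import butter, sosfiltfilt

# Approximate modulation-rate bands discussed above (Hz); edges are illustrative choices.
MOD_BANDS = {"stress": (0.5, 2.0), "syllable": (3.0, 5.0), "phoneme": (8.0, 50.0)}

def modulation_rate_bands(envelope, env_fs, bands=MOD_BANDS):
    """Band-pass a downsampled amplitude envelope at each modulation timescale."""
    filtered = {}
    for name, (lo, hi) in bands.items():
        sos = butter(3, [lo, hi], btype="band", fs=env_fs, output="sos")
        filtered[name] = sosfiltfilt(sos, envelope)
    return filtered
```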


**Figure 2.** Acoustic waveform with its segmentation into phrases, words, syllables, and phonemes. Figure reproduced from [48].

#### **3. Neural Entrainment to the Speech Envelope**

Neural entrainment to the speech envelope has been a notoriously complex topic of study for several decades. In this section we will provide a broad overview of some investigative strides in this area. It is well known that neural oscillatory activity occurs in a constant stream of peaks and troughs while at rest and during cognitive processes. This stream becomes an adaptive spike train in response to environmental stimuli, such as the acoustic signal of speech. Numerous studies have shown that neural oscillatory activity in specific frequency bands is related to specific linguistic functions; for example, lower-level linguistic processing, such as detection of stress and syllable segmentation, occurs in lower frequencies (<4 Hz) [47], and semantic/syntactic processing may occur in higher frequencies (13–50 Hz) [56].

Traditional EEG approaches to prosody perception include analyzing event-related potential (ERP) activity at key events in the speech signal [57], such as stressed syllables [58], metric structure violations [59], pitch violations [60], and duration violations [11]. While these techniques are important for determining brain responses to prosodic features, they do not provide a comprehensive measure of how the brain tracks and encodes the multidimensional aspects of prosody over time. Because prosody refers to suprasegmental features (duration, amplitude, fundamental frequency), which vary throughout an utterance, it is useful to analyze prosody across the temporal domain rather than at one point in time. For this we turn to the speech amplitude envelope as a representation of suprasegmental information.

Recent developments in the literature have explored ways to measure continuous neural entrainment, which is a phenomenon where neuronal activity synchronizes to the periodic qualities of the incoming stimuli [61]. The oscillations of the auditory cortex reset their phase to the rhythm of the speech signal, which is an essential process for speech comprehension [33]. This is known as phase-locking, which can be measured with a cross-correlation procedure between the speech stimulus and the resultant M/EEG signal [62]. Cross-correlation uncovers similarities between two time series using a range of lag windows [63]. This is an efficient method for observing the response to continuous speech without requiring a large number of stimulus repetitions, since this analysis inherently increases the signal-to-noise ratio [64].
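As a schematic illustration of the cross-correlation idea (not the exact pipeline of the cited studies), the envelope and a single EEG channel can be correlated over a range of stimulus-to-brain lags, assuming both signals have been resampled to a common rate `fs` and aligned in time:

```python
import numpy as np

def lagged_crosscorrelation(envelope, eeg_channel, fs, max_lag_ms=500):
    """Pearson correlation between the speech envelope and one EEG channel at each lag."""
    env = (envelope - envelope.mean()) / envelope.std()
    eeg = (eeg_channel - eeg_channel.mean()) / eeg_channel.std()
    lags = np.arange(0, int(fs * max_lag_ms / 1000) + 1)   # EEG lags behind the stimulus
    r = np.array([np.corrcoef(env[:len(env) - lag], eeg[lag:])[0, 1] for lag in lags])
    return lags * 1000.0 / fs, r                            # lag axis in ms, correlation per lag
```

The lag at which the correlation peaks gives a coarse estimate of the delay between the acoustic envelope and its cortical representation.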

Speech processing occurs through a large network of cortical sources [65], and phase-locking can be measured to locate functionally independent sources [66]. These sources may occur bilaterally depending on the timescale [67], such that the left hemisphere favors rapid temporal features of speech, and the right hemisphere tracks slower features. The right hemisphere generally shows stronger tracking of the speech envelope [68,69]; however, envelope tracking has also been shown to be a bilateral process [62,70].

When measuring how the speech envelope is represented in neural data, one issue with a simple cross-correlation between envelope and neural response is that temporal smearing (from averaging across time points) will create noise in the correlation function [71]. A solution to this is to use a modeling approach, known as a temporal response function (TRF) [68], to describe the linear mapping between stimulus and response. This approach stems from a system identification technique [72] that models the human brain as a linear time-invariant system. Of course, the brain does not operate on a linear or time-invariant schedule, but these assumptions are commonly accepted in neurophysiology research for characterizing the system by its impulse response [73,74].

The modeling approach can operate in either the forward or backward direction. Forward modeling describes the mapping of a speech stimulus to a neural network [68,75,76] using a TRF that represents the linear transformation that generated the observed neural signal [77] (Figure 3). When using the envelope representation of speech, the forward model treats the stimulus as a univariate input affecting each recording channel separately. However, since the speech signal is transformed in the auditory pathway into multiple frequency bands [78], the forward modeling procedure may benefit from a multivariate temporal response function (mTRF) [71], which uses the spectrogram representation to evaluate speech encoding. Even in the multivariate domain, forward modeling still maps the stimulus to each response channel independently [79].
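A bare-bones version of the forward model can be written as regularized (ridge) least squares on time-lagged copies of the envelope. Dedicated implementations exist (e.g., the mTRF framework cited above), so the sketch below is only meant to show the shape of the computation; the array names, lag window, and ridge parameter are illustrative assumptions.

```python
import numpy as np

def fit_forward_trf(envelope, eeg, fs, tmin_ms=-100, tmax_ms=400, ridge=1.0):
    """Estimate a forward TRF mapping the lagged envelope onto every EEG channel.

    envelope: (n_samples,) stimulus envelope; eeg: (n_samples, n_channels) response.
    """
    lags = np.arange(int(round(tmin_ms * fs / 1000)), int(round(tmax_ms * fs / 1000)) + 1)
    n = len(envelope)
    X = np.zeros((n, len(lags)))                   # design matrix: one column per lag
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = envelope[:n - lag]
        else:
            X[:n + lag, j] = envelope[-lag:]
    # Ridge-regularized least squares; one weight per (lag, channel) pair.
    trf = np.linalg.solve(X.T @ X + ridge * np.eye(len(lags)), X.T @ eeg)
    return lags * 1000.0 / fs, trf                 # lag axis in ms, TRF of shape (n_lags, n_channels)
```

The same design-matrix logic, with the roles of stimulus and response swapped and all channels stacked, yields the backward (decoding) model discussed next.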

**Figure 3.** The temporal response function (TRF)—calculated with a linear least squares approach—represents the mapping from acoustic envelope onto each channel of EEG data (forward modeling). A multivariate reconstruction filter can be applied to data from all channels to estimate the acoustic envelope (backward modeling). Reconstruction accuracy can be measured by Pearson correlation between original and reconstructed envelopes. Figure reproduced from [75].

Backward modeling is a mathematical representation of the linear mapping from the multivariate neural response back to the stimulus [71]. This modeling approach yields a decoder that attempts to reconstruct a univariate stimulus feature, such as the speech envelope. As described in [71], this decoder function is derived by minimizing the mean squared error between the stimulus and reconstruction. In the backward direction, recording channels are weighted based on the information that they provide for the reconstruction [77], which removes inter-channel redundancies—an advantage over forward modeling. By modeling in the backward direction, researchers are able to compare stimulus reconstructions to the original stimulus, for instance with a correlation coefficient as a marker of reconstruction accuracy [80]. This provides a reliable index for the degree to which the envelope is encoded in the neural network. While other methods—such as cross-correlations and inter-trial phase coherence—are adequate for measuring phase-locking in speech comprehension, the modeling approach has been gaining attention as an attractive analysis method in recent years. Regardless of the method used, measuring neural entrainment to the speech envelope is an excellent way to target prosodic processing, yet this has been underutilized in the literature.
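Continuing the sketch above, a minimal backward model regresses the envelope on time-lagged, multichannel EEG and scores the result with a Pearson correlation between the original and reconstructed envelopes. In practice the decoder would be trained and evaluated on separate data (e.g., by cross-validation); this illustration fits and scores on the same segment purely to keep the example short, and all parameter values are assumptions.

```python
import numpy as np

def backward_reconstruction(eeg, envelope, fs, max_lag_ms=400, ridge=1.0):
    """Reconstruct the speech envelope from time-lagged multichannel EEG."""
    lags = np.arange(0, int(round(max_lag_ms * fs / 1000)) + 1)
    n, n_ch = eeg.shape
    X = np.zeros((n, n_ch * len(lags)))            # every channel at every lag, stacked column-wise
    for j, lag in enumerate(lags):
        X[:n - lag, j * n_ch:(j + 1) * n_ch] = eeg[lag:, :]
    decoder = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ envelope)
    reconstruction = X @ decoder
    accuracy = np.corrcoef(reconstruction, envelope)[0, 1]   # Pearson r as the accuracy index
    return reconstruction, accuracy
```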

#### **4. Selected Findings in Envelope Entrainment**

Many questions about speech processing can be investigated by looking at neural entrainment to the speech envelope, though we must be careful about how we interpret the results (see [34]; Table 1 provides a summary of selected studies). Peelle et al. [32] compared intelligible speech with unintelligible noise-vocoded speech, and they found that cortical oscillations in the theta (4–7 Hz) band are more closely phase-locked to intelligible speech. This may suggest that linguistic information and contextual associations enhance phase-locking to the envelope. However, others have measured envelope tracking in the auditory cortex even when the signal is devoid of communicative value. For example, Nourski et al. [81] found envelope entrainment even when speech rate was compressed to an unintelligible degree; Howard and Poeppel [82] found envelope entrainment to time-reversed speech stimuli; Mai et al. [56] found envelope entrainment to pseudo-word utterances. We acknowledge that envelope entrainment is often enhanced by intelligibility [49], but given the conflicting results described here, it is difficult to say whether intelligibility predicts entrainment or vice versa. What we can take away from these findings is that acoustic features of the stimulus—such as suprasegmental cues—seem to contribute to the neural entrainment effect and that the effect of neural entrainment on speech intelligibility warrants further investigation.

Attention has also been shown to influence envelope entrainment, and selective attention in a multi-speaker environment can be observed by the degree to which neural oscillations entrain to a given speech envelope [54,61,83]. The classic cocktail party situation has been studied for decades [84] and continues to be of interest today, e.g., [85]. In a natural auditory environment, many sounds are merged together and presented to the ear simultaneously, and the listener is tasked with segregating the sounds and attending to a particular source while ignoring the others [86]. By analyzing speech envelope representations, we can determine how the neural circuit parses and segregates these auditory objects. Ding and Simon [54] demonstrated that when a listener hears two speakers simultaneously, the neural decoding process is able to reconstruct the stimulus envelopes of both speech streams. The stimulus reconstruction is more strongly correlated to the envelope of the attended speaker (also [61,62,83]). Similar results have also been shown with invasive electrodes in electrocorticography (ECoG) research [53,87]. Regardless of the methodology used, these studies have suggested that neural encoding of an auditory scene involves selective phase-locking to specific auditory objects that are presented concurrently in a single auditory mixture. As mentioned previously, prosodic features of a speech stream help a listener to parse speech and attend to it, so prosody likely plays an important role in multi-speaker envelope entrainment, yet manipulations of the prosodic features of speech are rarely included as a variable in multi-speaker entrainment studies.

Speech envelopes are also of interest in studies examining audiovisual presentation of speech. Visual speech provides critical information regarding the timing and content of the acoustic signal [88]. It has long been acknowledged that listeners perceive speech better when they can both see and hear the individual speaking [89]. Articulatory and facial movements provide visual temporal cues that complement meaningful markers in the auditory stream. Visual rhythmic movements help parse syllabic boundaries [90], a wider mouth opening indicates louder amplitude [91], and seeing a conversational partner assists in segregating a speech stream from overlapping speakers [92]. Visual cues and gestures are tightly linked to speech prosody [93–95], and this alignment emphasizes suprasegmental features of the speech signal.

When auditory and visual information are incongruous, speech perception may be hindered and even lead the listener to falsely perceive a sound that was not presented in either modality (à la "The McGurk Effect") [96]. Congruent audiovisual speech enhances envelope tracking compared to incongruent information and also shows greater envelope encoding than auditory-only speech, visual-only speech, or the combination of the two unisensory modalities [97]. Audiovisual speech also has marked benefits for neural tracking when presented in noisy conditions [98] (also [99,100]). This is indicative of multisensory enhancement during speech envelope encoding. At the same time, there appears to be a similar mechanism for visual entrainment in which cortical oscillations entrain to salient lip movements even when they are incongruent with the acoustic stream [101]. These studies of envelope responses to speech incongruence support an emerging model of correlated auditory and visual signals dynamically interacting in a discrete process of multisensory integration [88,102]. Prosody is a major factor in this integration, as it aligns a stable framework of temporal and acoustic–phonetic cues to be used in speech processing; however, the contribution of prosodic dimensions of the speech stimuli to neural entrainment in multisensory processing in these studies has not been explicitly considered.

Prosody shares a number of features with music, so an area for potential exploration is the connection between neural entrainment to speech and to music. Envelope entrainment is influenced by speech rhythm [103]. Because rhythm and temporal cues provide a common link between music and speech perception (e.g., [104,105]), several studies demonstrate associations between musical rhythm aptitude, speech perception, and literacy skills in children [106,107]. Some have hypothesized that entrainment to music leads to increased timing precision in the auditory system, which leads to increased perception of the timing of speech sounds [108,109]. Doelling and Poeppel [110] found that the accuracy of cortical entrainment to musical stimuli is contingent upon musical expertise, suggesting individual differences in cortical oscillations related to experience. However, musical expertise does not necessarily predict stronger entrainment to the speech envelope [111]. Additional work on individual differences between speech and music may help to target the neural mechanisms behind prosodic processing.


**Table 1.** List of papers investigating speech envelope tracking using various analysis approaches, data collection procedures, and topics of interest. Analysis abbreviations: CC—cross-correlation; PC—phase coherence; TRF—temporal response function; SR—stimulus reconstruction.

In summary, neural entrainment to the speech envelope likely reflects, at least in part, prosody perception. Prosodic fluctuations and prosody perception likely contribute to experimental findings linking envelope entrainment to intelligibility, selective attention, and audiovisual integration. Findings discussed in this section are highlighted in Table 1.

#### **5. Developmental and Clinical Relevance of Envelope Entrainment**

Children show a reliance on prosody processing from early infancy [122,123], so the envelope appears to be a critical tool for early language acquisition. The speech amplitude envelope contributes to the perception of linguistic stress, providing essential information for speech intelligibility and comprehension, e.g., [124]. Infant-directed speech is a manner of speaking that exaggerates prosodic cues, and infants show stronger cortical tracking of the infant-directed speech envelope compared to tracking of adult-directed speech [125]. Individuals who have difficulties with processing cues related to the speech amplitude envelope may demonstrate language-processing deficits [126].

Neuronal oscillatory activity in healthy adults entrains to adult-directed speech at various timescales, e.g., [33]. Frequencies in the delta band range (1–4 Hz) involve slower oscillations and track suprasegmental features of speech, such as phrase patterns, intonation, and stress [33,61]. Prosodic cues are particularly salient in the delta band and may be of particular relevance for envelope entrainment and language acquisition in children. Child-directed speech appears to bolster entrainment at the delta band specifically by amplifying these prosodic features [127]. The accuracy of delta band entrainment may also be indicative of higher-level linguistic abilities, as entrainment at the 0–2 Hz band is positively correlated with literacy [115,128]. The delta band may be crucially important because it provides the foundation for hierarchical linguistic structures of the incoming speech signal [33]. This could, in turn, affect cross-frequency neural synchronization, which may be particularly informative for the development of speech comprehension [129].
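One common way to quantify this kind of low-frequency tracking is cerebro-acoustic phase coherence: the envelope and the EEG are both band-pass filtered into the delta band, their instantaneous phases are extracted, and the consistency of the phase difference is summarized as a value between 0 (no consistent phase relation) and 1 (perfect locking). The sketch below is a generic illustration rather than the pipeline of any study cited here, and it assumes the envelope and a single EEG channel share a common sampling rate.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def delta_phase_coherence(envelope, eeg_channel, fs, band=(1.0, 4.0)):
    """Cerebro-acoustic phase coherence between the speech envelope and one EEG channel."""
    sos = butter(3, band, btype="band", fs=fs, output="sos")
    phase_env = np.angle(hilbert(sosfiltfilt(sos, envelope)))
    phase_eeg = np.angle(hilbert(sosfiltfilt(sos, eeg_channel)))
    # Length of the mean resultant vector of the phase difference across time.
    return np.abs(np.mean(np.exp(1j * (phase_eeg - phase_env))))
```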

Autism spectrum disorders (ASD) are associated with atypical processing of various sensory modalities [130]. Individuals with ASD show less efficient neural integration of audio and visual information in non-speech [131] and speech input [132]. This is related to the temporal binding hypothesis in ASD, which suggests that these individuals have a deficit in synchronization across neural networks [133]. Jochaut et al. [134] showed deficient speech envelope tracking using fMRI and EEG when individuals with ASD perceive congruent audiovisual information. Possible impairment in coupling rhythms into oscillatory hierarchies could contribute to these results [135], and examining language deficits in ASD as oscillopathic traits may be a promising step forward in understanding these disorders [136,137].

Developmental dyslexia is a disorder characterized by reading and spelling difficulties not associated with cognitive deficits or overt neurological conditions, and it is often considered a disorder of phonological processing skills [138]. Dyslexia is believed to affect temporal coding in the auditory and visual modalities [139,140], and individuals with dyslexia often have difficulty identifying syllable structure or rhyme schemes (see [141]). The speech envelope is important to study in dyslexia because it carries syllable pattern information, and Abrams et al. [142] reported delayed phase-locking to the envelope in individuals with dyslexia. Specifically, the delta band in neuronal oscillations can reveal anomalies such as atypical phase of entrainment [43,143] and poor envelope reconstructions [115], which may ultimately have a downstream effect on establishing phonological representations [25]. Because the delta band reflects prosodic fluctuations, the atypical entrainment in this range suggests that individuals with dyslexia may have impaired encoding at the prosodic linguistic level [61].

Developmental language disorder (DLD) affects language abilities while leaving other cognitive skills intact, and it is sometimes studied in parallel with dyslexia due to similar deficits in phonological and auditory processing [144,145]. The prosodic phrasing hypothesis [146] suggests that children with DLD have difficulty detecting rhythmic patterns in speech, particularly related to impaired sensitivity to amplitude rise time [147] and sound duration [126], and difficulties in processing accelerated speech rate [148]. Given the growing behavioral evidence suggesting that children with DLD have deficits in prosody perception (see [149]), it stands to reason that they would show poor speech envelope entrainment, particularly in the delta frequency band [33]. To our knowledge, there has not been an electrophysiological study looking at neural entrainment to the speech envelope in children with DLD, but this would be an illuminating endeavor.

#### **6. Directions for Future Research**

There have been many recent advances related to speech envelope entrainment, and we argue that prosody has had a substantial—though at times underrated—role in many studies. It is well accepted that prosodic cues facilitate speech processing, e.g., [3], and these suprasegmental features are represented in the amplitude envelope, e.g., [64]. Therefore, studies investigating speech envelope entrainment inherently capture a response to prosody to some degree, yet the underlying mechanisms of prosody perception, and their effect on speech processing, remain something of a mystery. We suggest that including experimental manipulations of the prosodic dimensions of speech in future studies may inform the findings of previous works, and it may shed light on the future interpretation of entrainment, particularly in the low-frequency range. Ding et al. [118] have shown that removing prosodic cues from speech weakens envelope entrainment, which suggests that synchrony between neural oscillations and the speech envelope reflects perception of the acoustic manifestations of prosody, and future work should continue testing this relationship. More broadly, we present a series of potential future directions in Table 2.


**Table 2.** Potential future directions including key points and methodological considerations.

Synchronization occurs when internal oscillators adjust their phase and period to rate changes of speech rhythm, e.g., [150]. According to the dynamic attending theory [151,152], attentional effort is not uniformly distributed over time, but rather, it occurs periodically with salient sensory input. Prosody offers meaningful information through stressed syllables, which gives attentional rhythms a structure for scaffolding speech processing mechanisms [108]. Suprasegmental elements are present in a wide array of stimuli that demonstrate neural entrainment to the speech envelope (e.g., intelligible and unintelligible speech; attended and unattended speech; audiovisual and audio only and visual only speech). The suprasegmental cues may be one reason why stimuli of varying salience continue to reveal entrainment. Future empirical investigations may consider how prosody supports neural entrainment under these different experimental conditions.

Of course, prosodic fluctuations alone cannot fully explain neural entrainment to speech or speech comprehension [34]. When Ding and colleagues [118,153] removed prosodic cues from connected speech stimuli, they did find some low-frequency entrainment (<10 Hz), which they attribute to syntactic processing. However, they pointed out that neural tracking would likely be more prominent in natural speech with the addition of rich prosodic information. In spoken English, syntax can exist without prosody, but the inclusion of prosody certainly facilitates syntactic processing (with phrase segmentation, pitch inflection, etc.). Therefore, further study of prosodic versus syntactical manipulations will shed light on their respective contributions—and their interaction—to neural entrainment to speech, including when examined together with behavioral measures of speech comprehension. Studies have shown that phonetic [117] and semantic [120] levels of processing also contribute to neural activity at different hierarchical timescales. It may be informative to consider how prosodic cues organize and facilitate processing at these different levels. Future work should attempt to isolate prosodic cues from phonetic and semantic details to specify the contributions of prosody to these other structures in continuous speech. This could be accomplished by restricting prosodic cues (using monotone pitch and constant word durations, as in [118]) or by creating stimuli with a prosodic mismatch (using unpredictable changes in amplitude, pitch, and duration). These manipulations would allow researchers to more directly target the role of prosody in entrainment.

Examining the links between prosody and neural encoding of the speech envelope may also have relevance for additional topics and clinical populations. For example, it has been shown that features of prosody are directly linked to emotional expressiveness in speech [6], and one novel area of research would be to connect patterns of envelope entrainment with perception of emotional states. This would likely have implications for the clinical populations discussed above, as well as typical emotional development. Other recent work has investigated rhythmic cueing and temporal dynamics of speech in patients with Parkinson's disease [154], aphasia [155], and even blindness [156]. Because these populations show difficulty with prosodic cues in speech, a next step could be to examine speech envelope entrainment in these individuals to examine if there is a neural deficit in prosody encoding.

As conveyed in this review, low-frequency neural oscillations likely reflect in part a response to prosodic cues in speech. Future research can investigate how prosody impacts neural envelope entrainment and scaffolds higher-level speech processing, as well as examine individual differences in prosody perception and neural entrainment. Future research in speech entrainment ought to search for connections to prosody perception and determine what it takes to get the speech envelope signed, sealed, and delivered to the cortex.

**Funding:** The authors were supported by the National Institute on Deafness and Other Communication Disorders and the Office of Behavioral and Social Sciences Research of the National Institutes of Health under Award Numbers R03DC014802, 1R21DC016710-01, and K18DC017383. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was additionally supported by the Program for Music, Mind and Society at Vanderbilt (with funding from the Trans-Institutional Programs Initiative), the VUMC Faculty Research Scholars Program, and the Department of Otolaryngology at Vanderbilt University Medical Center.

**Acknowledgments:** The authors would like to thank Duane Watson, Cyrille Magne, Stephen Camarata, and anonymous reviewers for invaluable theoretical and conceptual insight to this review, as well as Edmund Lalor, Giovanni Di Liberto, Fleur Bouwer, and Andrew Lotto for thoughtful methodological considerations herein.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **Infants Segment Words from Songs—An EEG Study**

**Tineke M. Snijders 1,2,\*,**†**, Titia Benders 3,\*,**† **and Paula Fikkert 2,4**


Received: 25 November 2019; Accepted: 6 January 2020; Published: 9 January 2020

**Abstract:** Children's songs are omnipresent and highly attractive stimuli in infants' input. Previous work suggests that infants process linguistic–phonetic information from simplified sung melodies. The present study investigated whether infants learn words from ecologically valid children's songs. Testing 40 Dutch-learning 10-month-olds in a familiarization-then-test electroencephalography (EEG) paradigm, this study asked whether infants can segment repeated target words embedded in songs during familiarization and subsequently recognize those words in continuous speech in the test phase. To replicate previous speech work and compare segmentation across modalities, infants participated in both song and speech sessions. Results showed a positive event-related potential (ERP) familiarity effect to the final compared to the first target occurrences during both song and speech familiarization. No evidence was found for word recognition in the test phase following either song or speech. Comparisons across the stimuli of the present and a comparable previous study suggested that acoustic prominence and speech rate may have contributed to the polarity of the ERP familiarity effect and its absence in the test phase. Overall, the present study provides evidence that 10-month-old infants can segment words embedded in songs, and it raises questions about the acoustic and other factors that enable or hinder infant word segmentation from songs and speech.

**Keywords:** word segmentation; infant; speech; song; EEG; ERP; familiarity; recognition; polarity

#### **1. Introduction**

Parents across cultures sing songs, words sung to a tune, for their infants. They sing lullabies to soothe and comfort, and they often sing play songs to try and make their babies laugh [1]. While parents initially sing for affect regulation and social engagement, they add didactic reasons around their infant's 10th month [2]. In fact, vocabulary acquisition is one of the primary areas in which mothers expect to see progress when they participate with their one-year-old in a musical education program [3]. Could songs indeed be beneficial for vocabulary learning?

There is evidence that infants preferentially process linguistic–phonetic information from songs compared to speech: Infants of 7 and 11 months old detect changes to syllable sequences when the syllables are sung rather than spoken [4,5], and neonates can already detect syllable co-occurrences in a continuous stream if these syllables are sung rather than produced in a flat speech register [6]. However, research so far has not yet convincingly shown that infants can use actual children's songs to learn actual language. Firstly, the songs used in previous experiments had lyrics of only four or five words and consistently paired syllables with a single pitch pattern throughout the song, thus not reflecting the lyrical and musical complexity of actual children's songs. Secondly, the skills assessed were only partially relevant to language acquisition: While the detection of syllable co-occurrences, as tested by [6], is important to linguistic word segmentation and was associated with the participants' vocabulary size at 18 months of age, the ability to detect changes to syllable sequences, as assessed by [4,5], may be less critical to infants' concurrent or later language acquisition. Finally, only [4] provided the critical evidence that children could transfer the material learned from song to recognize words in speech, which is ultimately the primary modality of spoken language communication.

Therefore, the present study aims to test whether infants are able to learn linguistically relevant units from ecologically valid children's songs, and then also transfer these units to recognition in the spoken register. Specifically, we will assess infant word segmentation from children's songs with full lyrical and musical complexity, asking whether infants can segment word forms within songs, and subsequently recognize those word forms in speech. Moreover, we directly compare infants' segmentation across songs and the same materials presented in speech to assess whether songs present an advantage compared to speech.

As most research on the role of input in infant language acquisition has focused on the role of speech (for reviews: [7,8]), we will first contextualize the present study by discussing the potentially beneficial and hindering effects of songs for general language acquisition in adults, children, and infants. Then, we will review the literature on infant word segmentation, the fundamental ability to extract word forms from continuous speech input, which infants acquire in their first year of life. The present study, which assesses infant word segmentation from songs, is detailed in the final section of this introduction.

Songs can be expected to provide a good source for infant language learning, considering the beneficial effects of songs as well as music more generally on later language acquisition and processing. Songs directly aid memory for verbal material in both adults [9,10] and children [11]. Those findings have inspired research into the efficacy of songs for foreign- or second-language vocabulary acquisition, with the benefits, in particular for vocabulary acquisition, extending to children in the foreign language classroom (for reviews: [12,13]). Musical training can also enhance general auditory encoding, which indirectly improves a range of language skills (for a review: [14]), including children's speech segmentation [15], phonological abilities [16], as well as the perception of speech prosody [17] and durational speech cues [18].

The beneficial effects of songs and music for language acquisition are generally understood in terms of both emotional–attentional and cognitive mechanisms. Musical expertise fine-tunes and enhances sensitivity to the acoustic features shared by music and speech, and it also enhances auditory attention and working memory [19–29]. These explanations can be extended to hypothesize that songs also provide useful linguistic input for infants. Firstly, songs grab infants' attention at least as effectively as infant-directed speech [30–34], and they are more effective than speech in delaying and ameliorating distress [35,36]. Secondly, songs employ many features that infants are sensitive to in their early language acquisition, including phrasing [37,38] and rhythm [39,40]. Finally, it has been proposed that some of the beneficial effects of song on infants' well-being are a direct result of internal rhythmic entrainment [35], which is a mechanism that has also been hypothesized to be responsible for the improved encoding of linguistic material [27–29]. These three effects of song on infants render it likely that infants can effectively engage their speech encoding networks to learn from songs.

Nevertheless, it is not trivial that infants pick up linguistic information from songs. Firstly, infants' speech-honed language-learning skills may not be successful when applied to the acoustic signal of songs: lyrics sung to a melody are produced with different acoustic features than regular speech [41], including a more compressed and less consistently produced acoustic vowel space [42], cf. [43]. Secondly, even adults at times mishear words in songs, both in their non-native and native language [44–46]. Finally, even if infants learn words in songs, they may not be able to recognize these words in speech, the modality that is overwhelmingly used for spoken language communication. For example, the developmental literature on word segmentation shows that infants' ability to transfer the recognition of a learned word to a new type of acoustic stimulus slowly emerges in the second half of infants' first year. The ability to generalize across speakers, genders, or emotions emerges around 10.5 months of age [47,48], with evidence of infants' ability to generalize across accents emerging around their first birthday [49,50]. Related work on infant word recognition suggests that infants between 8 and 10 months old might be particularly negatively impacted by speech variation [51]. Thus, it is conceivable that infants up to one year of age are not yet able to transfer words learned from song to speech. This would pose clear boundaries to the effectiveness of songs for language acquisition in the first year of life.

The present study assesses the efficacy of songs for infant language learning through a word segmentation paradigm, testing word segmentation within songs and speech as well as subsequent generalization to speech. Segmentation, i.e., extracting individual word forms from the continuous speech stream, in which word boundaries are not typically marked by pauses, presents a sensible starting point for this research agenda, as adults find songs easier to segment than speech [52]; musical expertise is associated with better and faster segmentation in adults [53–57]; and musical training facilitates word segmentation in children [15]. Moreover, segmentation is critical to successful language acquisition, as the vast majority of words spoken to infants appear in continuous speech [58,59], even if parents are instructed to teach their infant a word [60,61]. Once infants have extracted word forms from the speech stream, they can more easily associate these with their meaning and thus start building a lexicon [62–65]. Word segmentation is also important in developing the language-ready brain, with word segmentation skills in infancy predicting language ability in the toddler years [6,66–70], although possibly not beyond [70]. Although the group-level effects of word segmentation are not always replicated [71,72], which is a topic that we will return to in the discussion, infants' ability to segment words from continuous speech is well established (see [73] for a meta-analysis). In the present study, we ask whether songs provide one source of information in infants' input from which they could segment words and start building their lexicon.

Infants rely heavily on language-specific rhythmic cues for word segmentation, with English-learning 7.5-month-olds relying on the strong–weak trochaic word stress, a rhythmic property, to segment words [74–78]. This metrical segmentation strategy is also developed by infants learning Dutch, another language with trochaic lexical stress [79], albeit at a slower rate compared to their English-learning peers [76], but not by infants learning French, a language without lexical stress [80–82].

Infant word segmentation is facilitated by the exaggeration of prosodic cues on the target word and across the entire speech stream. Prosodic accentuation on the target word is essential for segmentation by 6-month-olds and facilitates segmentation for 9-month-olds, although it may become less important when infants are 12 months of age [83]. Moreover, exact alignment of the accentuated pitch peak with the stressed syllable of the word appears to be critical [84]. General prosodic exaggeration across the speech stream, as observed in infant-directed speech (IDS), facilitates segmentation on the basis of transitional probabilities in 8-month-olds [85] and possibly even newborns [86]. In addition, word segmentation is easier from natural speech than from prosodically exaggerated speech [72], although the extent of the beneficial effect of IDS is still under investigation [87].

Considering that infants strongly rely on rhythmic cues and that prosodic exaggeration facilitates segmentation, it is conceivable that the clear musical rhythm and melodic cues of songs will enable and possibly facilitate infants' speech segmentation. The aforementioned study by François and colleagues [6] has provided the first support for this hypothesis by showing that newborns are only able to extract words from an artificial speech stream that is musically enriched. However, every syllable in these "songs" was paired with a consistent tone, resulting in a song that presented each of the four tri-syllabic words with its own unique tune throughout. Therefore, it is still an open question whether infants can segment words from songs with the full melodic and lyrical complexity of actual children's songs. Moreover, infants' aforementioned difficulties generalizing segmented words across speakers, accents, and emotions raise the question of whether they will be able to recognize words segmented from song in speech. The present study aims to address these issues by testing whether infants can segment words from realistic children's songs and subsequently recognize those words in continuous speech. In addition, it will assess how infants' segmentation from song compares to their segmentation from continuous speech.

The present study employed an electroencephalography (EEG) familiarization paradigm for word segmentation [88,89]. This procedure is adapted from the behavioral two-step familiarization-then-test procedure, which first familiarizes infants with words and then tests their word recognition by comparing the (head turn) preference for speech with the familiarized target versus a novel control [90]. The EEG version of this paradigm exposes infants to a series of familiarization-then-test blocks and assesses word recognition on each block by comparing event-related potentials (ERPs) to familiarized targets and matched novel control words in the test phase. The EEG paradigm was preferred, as previous research has found it to be more sensitive than the behavioral method to (emerging) segmentation abilities [69,80,91,92]. Moreover, the EEG paradigm can uniquely reveal the time-course of the developing word recognition by comparing ERPs to the first and last target occurrences within the familiarization phase [89]. For example, tracking this temporal development of recognition in the EEG has revealed faster segmentation by newborns from a musically enriched compared to a monotonous speech stream [6].

The setup of the present study, as illustrated in Figure 1, was adapted from Junge and colleagues [89], who presented continuous speech in both the familiarization and test phase. Each block in the current study familiarized infants with a word embedded eight times within a sung or spoken fragment. Within this familiarization phase, the comparison of ERPs to the first two and last two occurrences of the target word enabled us to assess infants' ability to segment words from songs and contrast it with their ability to segment words from speech. After each sung or spoken familiarization phase, infants were presented with a spoken test phase consisting of two spoken phrases with the familiarized target word, and two others with a matched novel control word. The difference in ERP response to familiarized target and novel control words after song familiarization would index infants' ability to transfer words that are segmented from song to recognition in speech.
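Analytically, the familiarity contrast reduces to a difference wave between averaged epochs for the last and first target occurrences. The numpy sketch below is only a schematic of that comparison, assuming baseline-corrected epochs time-locked to target-word onset have already been extracted; the array names and analysis window are placeholders rather than the study's actual parameters.

```python
import numpy as np

def familiarity_effect(first_epochs, last_epochs, times, window=(0.2, 0.5)):
    """Mean amplitude of the difference wave (last minus first target occurrences).

    first_epochs, last_epochs: (n_trials, n_channels, n_times) baseline-corrected epochs.
    times: (n_times,) epoch time axis in seconds; window: analysis window in seconds.
    """
    difference_wave = last_epochs.mean(axis=0) - first_epochs.mean(axis=0)
    in_window = (times >= window[0]) & (times <= window[1])
    # One value per channel; a negative value over left-frontal sites would correspond
    # to the negative-going familiarity effect described above.
    return difference_wave[:, in_window].mean(axis=1)
```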

**Figure 1.** Setup of one experimental block of the study design.

ERP responses to familiarized compared to novel words are generally largest over left-frontal electrode sites, and can be either positive- or negative-going, with a negativity being considered a more mature response (see review and discussion by [71,83]). The polarity of infants' ERP familiarity response partly depends on stimulus difficulty, with 7-month-olds displaying positive-going responses to words embedded in speech after having negative-going responses to those same words presented in isolation [69]. The polarity of the response to stimuli of the same difficulty also changes developmentally, shifting from a positive-going response typically displayed by 6- to 7-month-olds [69,83] to a negative-going response after 8 months of age [67,79,83,88,92,93]. However, group effects in polarity are not consistently observed across studies due to large individual differences within age bands [71]. This variation between infants appears to be significant to early language development, as the negative responders have more robust neural segmentation responses across various stages of the procedure [70,93], better concurrent vocabulary size [71,93], better and faster vocabulary development into toddlerhood [67,69,93], and better general language skills [69].

The general developmental shift from an initial positivity to a later negativity for word recognition responses has been ascribed to cortical maturation [83] (see also [94] for a similar reasoning on the polarity of the infant MMN for auditory discrimination), as the auditory cortex undergoes tremendous changes, specifically around 6 months of age [95,96], which can influence the polarity of ERP components [97]. However, the stimulus-dependent polarity shift within a single group of infants [69] reveals that a more functional explanation is required. The polarity of an ERP depends on the location and orientation of the underlying brain activation [98]. Männel and Friederici have proposed slightly different origins of the positive and negative ERP familiarity effects (secondary auditory cortices versus superior temporal cortex, respectively), with more lexical processing due to the infants' advancing linguistic experience resulting in the shift to superior temporal cortex activation [83]. In a similar vein, Kidd and colleagues have proposed that the negativity reflects the emergence of a lexicon [71]. We would argue similarly that the negative ERP familiarity effect reflects active lexical learning, and we interpret the negative familiarity effect as a 'repetition enhancement' effect (see below). When adults hear repetitions of words within sentences, a positive ERP repetition effect is elicited [99]. This can be interpreted as 'repetition suppression', which is a reduced neural response when a stimulus is repeated [100,101]. The positive infant ERP familiarity effect might reflect a similar repetition suppression response but now for low-level acoustic properties of the stimulus. However, the negative ERP familiarity effect might reflect 'repetition enhancement', which is enhanced processing when a stimulus is repeated [102]. Repetition enhancement effects are thought to reflect a neural learning mechanism for building or strengthening novel neural representations [103,104]. This reasoning would support the notion of the negative infant ERP familiarity effect reflecting the active building of a lexicon, which is in accordance with the proposals of [71,83].

The present study tested word segmentation in 10-month-old Dutch infants, for whom a negative-going ERP familiarity response can generally be expected in speech [67,79,88,89]. For the speech sessions, we expect to replicate the negative ERP familiarity response seen in the work by Junge and colleagues [89]. Within the song familiarization, a left frontal negative-going response to the last two compared to the first two target occurrences would be taken as evidence that word segmentation from a song is unproblematic for infants. Both negative and positive ERP familiarity responses indicate that the repetition of the word form has been identified within the continuous speech stream. However, given the previous literature, a positive response would be interpreted as indicating difficulties with song segmentation. Within the subsequent spoken test phase, a negative-going response would similarly be interpreted as automatic generalization from song to speech, with a positivity indicating a more challenging transfer process.

#### **2. Materials and Methods**

#### *2.1. Participants*

Forty Dutch 10-month-old infants participated in two experimental sessions, resulting in eighty datasets (40 song, 40 speech). The number of participants tested was based on [89]. All infants were born at term (37–42 weeks gestational age), normally developing, without a history of neurological or language impairments in the immediate family. Twenty-one datasets were excluded from analysis because of too few artefact-free EEG trials (see below). One participant was excluded because he was raised bilingually. The remaining 57 included datasets came from 32 subjects, with 25 of them contributing good data in both the speech and the song session. The 32 included subjects (16 female) were all monolingual Dutch infants (session 1: mean age 299 days, range 288–313 days; session 2: mean age 306 days, range 293–321 days). Infants were recruited from the Nijmegen Baby and Child Research Center Database. The study was approved by the local ethics committee, and parent(s) gave written informed consent for their infants prior to the experiment, in accordance with the Declaration of Helsinki.

#### *2.2. Materials*

The familiarization materials were 20 verses of eight phrases each. Each verse contained one repeating target word in every phrase (see Table 1 and Figure 2 for an example). The verses were recorded in a sung and a spoken version, for the "song familiarization" and "speech familiarization", respectively. The "song" and "speech" stimuli used identical verses/lyrics, but only the "song" versions were recorded with the designated melodies. Each song and speech version of a verse was recorded with two different target words, for a total of four recordings per verse. The reader is referred to Supplementary Table S3 for the full set of materials.

**Table 1.** Example of a familiarization +test block with target word pair bellers-piefen. Target words are underlined. Materials were in Dutch. English word-for-word and semantic translations are given. The first two (first/second) target occurrences are indicated in blue, the last two (seventh/eighth) target occurrences of the familiarization phase are indicated in red. Familiarized target words in the test phase are indicated in purple, and novel control words are indicated in green.


**Figure 2.** Example of score for the song that was used for target word 'bellers'.

The 20 melodies for the "song" versions all consisted of eight phrases, or of four melodic phrases that were repeated twice (with different lyrics). The melodies were (variations on) melodies of German, English, French, Norwegian, and Dutch children's songs and unknown to a sample of 22 native Dutch parents with a 10-month-old infant (see Supplementary Table S2). The 20 original target words in [89] were supplemented with three further target words from [79,88] and 17 new target words. New target words were added to avoid the repetition of target words across blocks. All 40 target words were low-frequency trochees (see Table 2), each with a CELEX frequency lower than 19 per million [105]. The 40 target words were combined into 20 word pairs that shared a semantic category (e.g., "emoe" and "hinde"; English: emu and doe; see Table 2). Yoked pairs were created between the 20 melodies and the 20 target word pairs (Table 2). The verses that were written to each melody were made with both target words of the word pair.

The verses had a mean phrase length of 5.71 words (range: 3–10) and 7.82 syllables (range: 4–14). All eight phrases of a verse contained the target word. The target word was never the first word of a phrase, and it occurred at most twice per verse in phrase-final position. With one exception due to experimenter error, target words were never the last word of the first, second, seventh, or eighth phrase. The word preceding the target word was unique across the eight phrases. The main word stress of the target word consistently matched the meter of the melody in the phrase, and the text setting of the phrases to the melodies was correct, a condition considered critical for learning from songs [24]. We also created four test sentences per target word pair. The target word was never the first or last word of a test sentence; its position was otherwise variable.

Stimuli were recorded in a sound-attenuated booth using Adobe Audition. The stimuli were annotated and further processed in Praat [106]. All stimuli were recorded by a trained singer (mezzo-soprano) and were sung and spoken in a child-directed manner. The verses for the familiarization phase were typically recorded in one take per verse. Those original recordings were kept intact in terms of the order of the phrases, the duration of the phrase intervals, and the speaker's breathing in those intervals. Three speech stimuli and one song stimulus were created by combining multiple takes to obtain stimuli without disturbing noises. Recording in one take was required to render natural-sounding song versions. In this respect, our stimulus creation differs from that of [89], in which sentences were recorded in a randomized order and combined afterwards.

The spoken test sentences for the test phase were recorded in one take per target word pair, extracted individually from the original recordings, and played back in a randomized order in the experiment.

Finally, the attentional phrase "Luister eens!" ("Listen to this!"), which was used as a precursor to each training and test stimulus in the experiment, was recorded in both song and speech versions.

Supplementary information about the materials is provided online, with acoustic properties (duration, pitch, and loudness measures) given in Supplementary Table S1 (for phrases) and Supplementary Table S2 (for target words). Supplementary Table S2 also reports the mean 'focus', a measure of acoustic prominence that approaches 1 if the target word is always the highest or loudest word in the phrase. The acoustic properties of the target words in phrases one and two were matched to those in phrases seven and eight of the familiarization stimuli (see Supplementary Table S3).
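The paper does not spell out the formula behind the 'focus' measure. As a rough illustration only, the R sketch below shows one plausible operationalization, in which the target word's peak value (pitch or intensity) is expressed relative to the maximum peak over all words in the phrase and then averaged over phrases; the function name and the example values are hypothetical, not taken from the authors' scripts.

```r
# Hypothetical operationalization of the per-phrase 'focus' measure:
# the target word's peak value (pitch or intensity) relative to the
# maximum peak over all words in the phrase. Averaged over phrases,
# this approaches 1 when the target word is always the most prominent.
focus_for_phrase <- function(word_peaks, target_index) {
  word_peaks[target_index] / max(word_peaks)
}

# Example: peak F0 (Hz) of the six words in one phrase; the target is word 3.
phrase_peaks <- c(210, 245, 290, 230, 220, 205)
focus_for_phrase(phrase_peaks, target_index = 3)  # 1.0: target carries the peak
```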


**Table 2.** The 20 target word pairs (English translation in parentheses), and songs that were the base for the melody of that specific word pair (language of lyrics of original song in parentheses).

#### *2.3. Procedure*

Infants participated in a separate song and speech session. The session order was counterbalanced across infants, with on average 7.6 days between sessions (range 5–14 days). Before the experiment started, the child could play on a play mat to get accustomed to the lab environment while the experimental procedure was explained to the parent. The EEG cap was pregelled to minimize the setup time. Then, the cap was fitted, electrode impedances were checked, and some extra gel was added where necessary. Next, the infants were seated on their parent's lap in a sound-attenuated booth with a Faraday cage, and data collection was initiated. Sung and spoken sentences were presented to the infant over two loudspeakers at 65 dB. While listening to the sentences, the infant watched a silent screen-saver (not linked to auditory input) or played with silent toys. One experimenter sat next to the screen to maintain the engagement of the infant with silent toys or soap bubbles if necessary. Both parent and experimenter listened to masking music over closed headphones. A second experimenter ran the EEG acquisition from outside the experimental booth and monitored the infant via closed-circuit video. The experiment was stopped if the infant became distressed. One full experimental session (including preparations and breaks) took about one hour, with the experiment proper taking about 20 min.

Stimuli were presented using Presentation software [107]. In each session, the infants listened to 20 blocks of familiarization-and-test trials, with a different melody and target word pair in every block. Each block consisted of a familiarization phase immediately followed by the corresponding test phase (see Figure 1; see Figure 2 and Table 1 for an example). The eight phrases of the verse in the familiarization phase, all containing the target word, were spoken (speech session) or sung (song session). The four sentences in the test phase were always spoken: two sentences contained the 'familiarized' word and two contained the second 'control' word of the word pair, presented in a randomized order. To reduce the effects of modality switching, the attentional phrase "Luister eens!" (English: Listen to this!) was played to the infants in the modality of the session before each familiarization block. The words "Luister eens!" were always presented in the spoken modality before each test phase.

The order of the blocks was counterbalanced across subjects. Within each session, every target word was the 'familiarized' word for half of the infants and the 'control' word for the other half, and this assignment of words was counterbalanced across subjects. For each infant, the familiarized words in the spoken session were the control words in the sung session and vice versa. Note that in [89], this repetition of critical words already occurred in the second half of the experiment, which was why we accepted a repetition in the second session (after 5–14 days). The order of the blocks was such that the target words never started more than twice in a row with vowels or with the same consonantal manner or place of articulation.

#### *2.4. EEG Recordings*

EEG was recorded from 32 electrodes placed according to the International 10–20 system, using active Ag/AgCl electrodes (ActiCAP), Brain Amp DC, and Brain Vision Recorder software (Brain Products GmbH, Germany). The FCz electrode was used as the on-line reference. Electro-oculogram (EOG) was recorded from electrodes above (Fp1) and below the eye, and at the outer canthi of the eyes (F9, F10). The recorded EEG electrodes were F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, T8, TP9, CP5, CP1, CP6, TP10, P7, P3, Pz, P4, P8, PO9, and Oz. The data were recorded with a sampling rate of 500 Hz and were filtered on-line with a time constant of 10 s and a high cutoff at 1000 Hz. Electrode impedances were typically kept below 25 kΩ.

#### *2.5. Data Processing*

EEG data were analyzed using FieldTrip [108], an open-source toolbox for EEG and MEG analyses that runs in MATLAB (The MathWorks, Natick, MA, USA).

First, eye movement components and noise components in the EEG data were identified using independent component analysis (ICA, [109]). In order to identify components based on as much data as possible, prior to the ICA analysis, all the EEG data of the whole session were filtered from 0.1 to 30 Hz and cut into 1 s segments. Bad channels were removed, as were data segments with flat channels or large artifacts (>150 μV for EEG channels, >250 μV for EOG channels). Then, we applied infomax ICA [110] as implemented in EEGLAB [111]. A trained observer (T.M.S.) identified components reflecting eye movements or noise on individual electrodes. Subsequently, time-locked data were created from the original EEG data by cutting the raw data into trials from 200 ms before to 900 ms after the onset of the critical words. Again, these data were filtered from 0.1 to 30 Hz, and the bad channels were removed, as well as trials with flat channels. Then, the identified eye and noise components were removed from the time-locked data. For the included datasets (see the end of this subsection), the mean number of removed eye and noise components was 2.8 and 2.7, respectively (range: 1–5 for eye components, 0–6 for noise components). After ICA component rejection, the EEG channels were re-referenced to the linked mastoids. Electrodes PO10, Oz, and PO9 were discarded, because these were bad channels for too many infants. A baseline correction was applied in which the waveforms were normalized relative to the 200 ms epoch preceding the onset of the critical word, and trials containing EEG exceeding ±150 μV were removed. Six datasets were discarded because of too many (>4) bad channels, and 15 datasets were discarded because fewer than 10 trials per condition (<25%) remained after artifact rejection. For the remaining datasets (32 subjects, 57 datasets; 31 speech, 26 song; 29 first session, 28 second session; 25 subjects with good data in both sessions), bad channels were repaired using spherical spline interpolation ([112]; mean of 0.9 channels repaired, range 0–3). Finally, ERPs were computed by averaging over the relevant trials. For the analyses on the combined datasets of the speech and song sessions (see below), the trials were concatenated across sessions before averaging. The combined datasets had an average of 48 included trials per condition (range 16–69), while for the single sessions, this was 26 (range 12–36) for song and 28 (range 13–36) for speech.
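As a simple illustration of two of the steps above (baseline correction against the 200 ms pre-stimulus window and rejection of trials exceeding ±150 μV), the R sketch below operates on a hypothetical trials × channels × samples array. It is not the FieldTrip pipeline itself, and the function and variable names are assumptions introduced for illustration.

```r
# Minimal sketch of baseline correction and threshold-based trial rejection.
# Assumes 'epochs' is a trials x channels x samples array sampled at 500 Hz,
# with epoch onset at -200 ms (so samples 1-100 form the baseline window).
baseline_correct_and_reject <- function(epochs, baseline_samples = 1:100,
                                        threshold_uv = 150) {
  n_trials <- dim(epochs)[1]
  keep <- logical(n_trials)
  for (i in seq_len(n_trials)) {
    trial <- epochs[i, , ]                           # channels x samples
    baseline <- rowMeans(trial[, baseline_samples])  # per-channel baseline mean
    trial <- sweep(trial, 1, baseline)               # subtract the baseline
    epochs[i, , ] <- trial
    keep[i] <- all(abs(trial) <= threshold_uv)       # +/-150 uV artifact criterion
  }
  list(epochs = epochs[keep, , , drop = FALSE], kept = keep)
}

# Example with simulated data: 5 trials x 3 channels x 550 samples.
eps <- array(rnorm(5 * 3 * 550, sd = 20), dim = c(5, 3, 550))
res <- baseline_correct_and_reject(eps)
sum(res$kept)   # number of trials surviving the rejection criterion
```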

#### *2.6. Planned ERP Analyses*

For the familiarization phase, the ERP familiarity effect was assessed by comparing the ERP in response to the last two (seventh/eighth) versus the first two (first/second) target occurrences. For the test phase, the ERP familiarity effect was assessed by comparing the ERP in response to familiarized target words versus novel control words.

Analyses were performed first for the combined song and speech sessions (32 subjects). In a second step, differences between song and speech sessions were assessed, this time only including the 25 subjects that had >10 trials per condition in both sessions.

Analyses to assess the ERP familiarity effect were performed both on predefined time windows and left-frontal electrodes (based on previous literature) and on all time points or electrodes in a single test (to assess possible deviations from previous literature, also considering the novel inclusion of the song modality).

Time windows of interest (250–500 ms and 600–800 ms) were defined based on previous literature reporting the infant ERP familiarity effect (see [71]). The left-frontal region of interest was defined as electrodes F7, F3, and F5, which are consistently included in the calculation of average amplitudes in previous literature [67,70,71,83,88,89].

The average ERPs of the left-frontal region for the two time windows were assessed using SPSS, with a 2 × 2 repeated measures ANOVA on mean left-frontal ERP amplitude, with familiarity (familiar, novel) and time window (250–500 ms, 600–800 ms) as within-subject factors. For the modality comparison (on the 25 subjects who had good data in both sessions), modality (song, speech) was added as a within-subjects factor. For the ANOVAs, we used the Huynh–Feldt epsilon correction and reported the original degrees of freedom, adjusted *p*-values, and adjusted effect sizes (partial eta-squared, ηp²).
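The ANOVAs were run in SPSS; for readers working in R, an equivalent analysis could be set up with the afex package (assumed available), as sketched below. The data frame and column names ('erp', 'subject', 'amplitude', 'familiarity', 'window') are assumptions for illustration, and placeholder amplitudes are simulated only so the call runs.

```r
# Sketch of the 2 x 2 repeated measures ANOVA with Huynh-Feldt correction,
# using the afex package (an assumption; the authors used SPSS).
library(afex)

# Hypothetical long-format data: one row per subject x familiarity x window
# cell, holding the mean left-frontal amplitude (in microvolts).
erp <- expand.grid(subject = factor(1:32),
                   familiarity = c("familiar", "novel"),
                   window = c("250-500", "600-800"))
erp$amplitude <- rnorm(nrow(erp))          # placeholder amplitudes

rm_anova <- aov_ez(id = "subject", dv = "amplitude", data = erp,
                   within = c("familiarity", "window"),
                   anova_table = list(correction = "HF",  # Huynh-Feldt epsilon
                                      es = "pes"))        # partial eta-squared
rm_anova
```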

Additionally, to explore the possible effects outside the regions and time windows of interest, ERP familiarity effects were assessed using cluster randomization tests [113]. Cluster randomization tests use the clustering of neighboring significant electrodes or time points to effectively control for multiple comparisons, while taking the electrophysiological properties of EEG into account. In these tests, first, all the electrodes or all time points are identified that exceed some prior threshold (in this case, a dependent samples *t-*test was conducted and the *p*-value was compared to a threshold alpha level of 0.05, uncorrected for multiple comparisons). In a second step, clusters are made in which neighboring electrodes or time points that exceed the threshold are grouped. For every identified cluster, a cluster-level statistic is calculated, summing the *t-*statistics of all the included electrodes/time points. A reference randomization null distribution of the maximum cluster-level statistic is obtained by randomly pairing data with conditions (e.g., familiarized and novel conditions) within every participant. This reference distribution is created from 1000 random draws, and the observed cluster *p-*value can then be estimated as the proportion from this randomization null distribution with a maximum cluster-level test statistic exceeding the observed cluster-level test statistic (Monte Carlo *p-*value). Thus, the cluster randomization *p-*value denotes the chance that such a large summed cluster-level statistic will be observed in the absence of an effect (see [113]). Note that although the observed clusters provide some information about latency and location, the cluster statistic does not define the actual spatial and temporal extent of the effect, as time points and electrodes are included based on uncorrected statistics (see [114]). Thus, precise cluster onset and extent should not be over-interpreted.
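To make the procedure concrete, the following R sketch implements the core logic for the simplest case used here, a single region of interest tested over time points: per-time-point paired *t*-tests, clustering of contiguous suprathreshold points, summed cluster statistics, and a permutation null distribution built by randomly exchanging condition labels within participants (equivalent to flipping the sign of each participant's difference). It is a simplified illustration of the general approach of [113], tracking only the maximum summed-|t| cluster, not the FieldTrip implementation used in the study; all names are assumptions.

```r
# Cluster-based permutation test for one ROI over time points.
# cond_a, cond_b: subjects x time matrices of condition averages.
cluster_perm_test <- function(cond_a, cond_b, n_perm = 1000, alpha = 0.05) {
  diffs <- cond_a - cond_b                      # per-subject differences
  n_sub <- nrow(diffs)
  t_crit <- qt(1 - alpha / 2, df = n_sub - 1)   # two-sided cluster-forming threshold

  # Paired t-value at every time point.
  tvals <- function(d) colMeans(d) / (apply(d, 2, sd) / sqrt(nrow(d)))

  # Largest summed |t| over contiguous suprathreshold time points.
  max_cluster <- function(tv) {
    supra <- abs(tv) > t_crit
    runs <- rle(supra)                          # contiguous runs of TRUE/FALSE
    ends <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1
    sums <- mapply(function(s, e, ok) if (ok) sum(abs(tv[s:e])) else 0,
                   starts, ends, runs$values)
    max(sums)
  }

  observed <- max_cluster(tvals(diffs))
  # Null distribution: randomly flip each subject's condition assignment.
  null_max <- replicate(n_perm, {
    flips <- sample(c(-1, 1), n_sub, replace = TRUE)
    max_cluster(tvals(diffs * flips))
  })
  mean(null_max >= observed)                    # Monte Carlo cluster p-value
}

# Example with simulated data: 20 subjects x 400 time samples per condition.
set.seed(1)
a <- matrix(rnorm(20 * 400), 20, 400)
b <- matrix(rnorm(20 * 400), 20, 400)
cluster_perm_test(a, b)
```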

For our purposes, three exploratory cluster randomization tests were performed: two assessing all the electrodes in the two time windows (for both time windows comparing the condition averages of the time window for all electrodes and then clustering over electrodes), and one assessing all the time points from 100 to 900 ms in the left-frontal region of interest (comparing the condition averages of the three left-frontal electrodes for all time points, and then clustering over time points, starting from 100 ms after stimulus onset, as no effects were expected to occur before this time).

#### **3. Results**

#### *3.1. Planned Analyses—Segmentation from Song and Speech*

3.1.1. ERP Familiarity Effect in the Familiarization Phase, Song and Speech Combined

At the left-frontal region of interest, the repeated measures ANOVA on the 32 combined datasets showed an ERP familiarity effect (*M(SD)*unfamiliar12 = −1.44 (0.67), *M(SD)*familiarized78 = 0.97 (0.64); *F*familiarity(1,31) = 11.8, *p* = 0.002, ηp² = 0.28), with no detected difference across time windows (250–500 ms: *M(SD)*unfamiliar12 = −1.41 (0.69), *M(SD)*familiarized78 = 0.93 (0.65); 600–800 ms: *M(SD)*unfamiliar12 = −1.47 (0.74), *M(SD)*familiarized78 = 1.00 (0.76); *F*familiarity × time window(1,31) = 0.05, *p* = 0.83, ηp² = 0.002). As can be seen in Figures 3 and 4, the ERP was more positive for the last two (seventh/eighth) than for the first two (first/second) target occurrences.

**Figure 3.** Familiarity effect in the familiarization phase for the combined sessions (32 subjects). Topographic isovoltage maps of the difference between the last two target occurrences (7th/8th) and the first two target occurrences (1st/2nd), in the (**a**) 250–500 ms and (**b**) 600–800 ms latency ranges. Electrodes that are part of output clusters in the cluster randomization test are shown with stars. The red circles indicate electrodes that are part of the left-frontal region of interest.

Subsequently, all the electrodes were assessed using a cluster-randomization test, comparing the average ERP amplitudes for the last two (seventh/eighth) and first two (first/second) target occurrences for both time windows of interest (see Figure 3). For the 250–500 ms time window, this resulted in a significant cluster (*p* = 0.034) of left-frontal electrodes (Fp1, F7, F3, FC5, and T7). A second cluster over the right hemisphere (F4) did not survive multiple comparison correction (cluster *p* = 0.52). For the 600–800 ms time window, a positive cluster over the left-frontal electrodes was marginally significant (cluster *p* = 0.06, electrodes Fp1, F7, F3, and FC5). Two other clusters did not survive multiple comparisons (cluster 2: F4 and F8, *p* = 0.10; cluster 3: CP1, *p* = 0.48).

To assess all time points between 100 and 900 ms after target onset, a cluster randomization was performed on the mean of the left-frontal region (electrodes F7, F3, and F5, see Figure 4). This resulted in two significant clusters, the first ranging from 268 to 594 ms (cluster *p* = 0.004) and the second ranging from 612 to 792 ms (cluster *p* = 0.028).

The clusters identified in the cluster randomization tests match the timing and topography of the previously reported infant ERP familiarity effect. However, note that the present ERP familiarity effect is a positive shift, while previous literature on segmentation by infants of the same age has most often reported a negative shift [79,88,89]. This discrepancy will be discussed in detail in the Discussion section.

**Figure 4.** Familiarity effect in the familiarization phase. Event-related potentials (ERP) averaged over left-frontal electrodes for (**a**) the combined sessions (32 subjects), (**b**) the speech sessions (31 subjects), and (**c**) the song sessions (26 subjects). The solid lines are the ERPs from the first two target occurrences (1 and 2, in blue) and the last two target occurrences (7 and 8, in red). The means ±1 SD are given as dotted lines. The shaded areas indicate the clusters identified in the cluster randomization test.

3.1.2. ERP Familiarity Effect in the Familiarization Phase, Comparing Song to Speech

As can be seen in Figure 4, the ERP familiarity effects in the familiarization phase were similar across the song and speech modalities, although the effect possibly occurred somewhat later in song. For the 25 subjects who contributed enough data for both the speech and song sessions, the left-frontal ERP familiarity effect during the familiarization phase was compared between modalities. The repeated measures ANOVA showed a main effect of familiarity (*M(SD)*unfamiliar12 = −1.23 (0.76), *M(SD)*familiarized78 = 0.32 (0.71); *F*familiarity(1,24) = 4.14, *p* = 0.053, ηp² = 0.15). There was no significant difference in the ERP familiarity effect between modalities (speech: *M(SD)*unfamiliar12 = −0.74 (1.19), *M(SD)*familiarized78 = 0.42 (0.80); song: *M(SD)*unfamiliar12 = −1.72 (1.18), *M(SD)*familiarized78 = 0.23 (1.10); *F*familiarity × modality(1,24) = 0.13, *p* = 0.72, ηp² = 0.005), and no significant interaction between familiarity, modality, and time window (*F*familiarity × modality × time window(1,24) = 0.14, *p* = 0.71, ηp² = 0.006).

Cluster-randomization tests were performed to explore the possible modality differences in ERP familiarity effects outside the regions and time windows of interest. First, all electrodes were assessed in the 250–500 ms as well as the 600–800 ms time window, comparing the difference between the first two (first/second) and last two (seventh/eighth) target occurrences across the speech and the song modality. No clusters were identified, meaning that not one electrode showed a significant interaction between modality and familiarity, even prior to corrections for multiple comparisons. Then, all time points were assessed in the left-frontal region of interest, comparing the familiarity effect across modalities. Again, no clusters were identified, meaning that not one time point showed a significant interaction between modality and familiarity, even when uncorrected for multiple comparisons. In sum, no differences could be identified in the ERP familiarity effect between the song and the speech modality, and there was no evidence for an earlier start of the ERP familiarity effect in speech.

#### *3.2. Planned Analyses—Effects in Test Phase (Transfer to Speech)*

#### 3.2.1. ERP Familiarity Effect in the Test Phase, Song and Speech Combined

Figure 5 displays the ERPs for familiarized target words and novel control words in the test phase, as well as the topography of their difference. For the left-frontal region of interest, the repeated measures ANOVA showed no difference between the ERP to the familiarized and the novel test items (*M(SD)*novelTest = −1.32 (0.73), *M(SD)*familiarizedTest = −0.51 (0.80); *F*familiarity(1,31) = 0.80, *p* = 0.38, ηp² = 0.025), with no difference in the ERP familiarity effect between time windows (250–500 ms: *M(SD)*novelTest = −1.49 (0.68), *M(SD)*familiarizedTest = −0.07 (0.84); 600–800 ms: *M(SD)*novelTest = −1.16 (0.89), *M(SD)*familiarizedTest = −0.95 (0.92); *F*familiarity × time window(1,31) = 1.69, *p* = 0.20, ηp² = 0.05).

**Figure 5.** Familiarity effect in the test phase for the combined sessions (32 subjects). (**a**) Event-related potentials (ERP) averaged over left-frontal electrodes (red circles in (**b**) and (**c**)). The solid lines are the ERPs from the novel control words in the test phase (in green) and the familiarized target words in the test phase (in purple). The means ±1 SD are given as dotted lines. Right: Topographic isovoltage maps of the difference between the familiarized target words and novel control words during test, in the (**b**) 250–500 ms and (**c**) 600–800 ms latency ranges. The red circles indicate electrodes that are part of the left-frontal region of interest.

When assessing the ERP familiarity effect during test on all electrodes using the cluster randomization test, one effect was identified at T7 in the 250–500 ms time window, which did not survive multiple comparisons correction (cluster *p* = 0.32). No clusters were identified in the 600–800 ms time window. When assessing all time points between 100 and 900 ms for the left-frontal region of interest, one cluster was identified from 330 to 350 ms, which did not survive multiple comparisons (cluster *p* = 0.26). Thus, no significant ERP familiarity effect could be identified in the test phase.

#### 3.2.2. ERP Familiarity Effect in the Test Phase, Comparing Song to Speech

To assess the possible differences in ERP familiarity effect in the test phase across modalities, a repeated measures ANOVA was performed on the 25 subjects that contributed enough data in both speech and song sessions. There was no significant difference between modalities in the left-frontal ERP familiarity effect (speech: *M(SD)*novelTest = −2.28 (1.17), *M(SD)*familiarizedTest = −1.15 (1.26); song: *M(SD)*novelTest = −0.86 (1.20), *M(SD)*familiarizedTest = 0.75 (1.28); *F*familiarity × modality(1,24) = 0.045, *p* = 0.83, ηp² = 0.002), and no significant interaction between familiarity, modality, and time window (*F*familiarity × modality × time window(1,24) = 0.42, *p* = 0.52, ηp² = 0.017).

When comparing the ERP familiarity effect across modalities for all electrodes using cluster randomization, one cluster was identified (FC2, CP2) in the 250–500 ms time window, which did not survive multiple comparisons correction (cluster *p* = 0.17). An effect at CP2 was also identified in the 600–800 ms time window, again not surviving multiple comparisons correction (cluster *p* = 0.31). When comparing the ERP familiarity effect across modalities on all time points for the left-frontal region of interest, no clusters were identified. Thus, no differences in the ERP familiarity effect were identified for song compared to speech in the test phase.

To summarize, in the familiarization phase, we identified a positive ERP familiarity effect over the left-frontal electrodes in both the 250–500 and 600–800 ms time windows (see Figures 3 and 4). This effect did not differ significantly between the song and speech modalities (see Figure 4b,c). In the test phase, there was no ERP familiarity effect in either song or speech.

#### *3.3. Follow-Up Analyses—Motivation and Methods*

The planned analyses did not yield all predicted effects, most notably a positive-going instead of a negative-going familiarity response in the familiarization phase and the absence of any group-level effects in the test phase. As reviewed in the Introduction, the polarity of infants' responses to familiar words has been associated with stimulus difficulty as well as developmental maturity [71,83], and the absence of group-level effects in previous work has been ascribed to large individual variation in development, even within narrow age bands [71]. Therefore, we conducted a series of follow-up analyses to explore individual differences and to assess whether the ERP polarity to the target word shifted with target occurrence during the familiarization phase.

The first follow-up analysis considered the familiarization phase in more detail. Expanding on the planned comparison of the first two (first/second) versus last two (seventh/eighth) target occurrences, it assessed how the familiarity response developed across all eight occurrences of the target word in the familiarization phase. Additional follow-up analyses are reported in Supplementary Table S4, asking whether the lack of a group-level effect in the test phase might be due to a mix of positive and negative responders and relating responder type to the responses across the familiarization phase.

#### 3.3.1. Follow-Up Analysis #1—Development Over Eight Familiarization Occurrences

The first set of follow-up analyses was conducted to scrutinize the identified positive-going response in the last two (seventh/eighth) compared to the first two (first/second) occurrences of the target word in the familiarization passage. Building on the work by Junge and colleagues [89], this analysis targeted the development of the word recognition response across all eight target word occurrences. Figure 6 displays this development, averaged over participants, separated by time window (250–500 ms versus 600–800 ms) and modality (speech versus song). As can be seen in Figure 6 as well as from the model results described below, the word recognition response increased in positivity over the first four occurrences of the target word for both speech and song passages. Then, the response became more negative on occurrences five and six when children listened to speech, but it remained stable when they listened to song. On the final occurrences seven and eight, the word recognition response became more positive again when children listened to speech, whereas it now became more negative when children listened to song.

The statistical analyses were conducted using mixed effects regression models as implemented in the *lmer* function in the *lme4* package [115] in the *R* statistical programming environment [116]. The dependent variable was the per-occurrence average EEG amplitude (in μV) over the left-frontal electrodes (F7, F3, and F5) in the two time windows of interest (250–500 ms and 600–800 ms after stimulus onset), thus adhering to the same electrodes and time windows of interest as in the planned analyses. The random-effects structures were selected to be parsimonious, i.e., containing only those random-effects parameters required to account for the variance in the data, and were determined following an iterative procedure adapted from [117]. The resulting *p*-values were interpreted against the conventional alpha level of 0.05, while *p*-values between 0.05 and 0.07 were interpreted as marginally significant and also worthy of discussion.

The first model assessed the word recognition effect over the eight familiarization occurrences (captured by first- to third-order polynomials, to assess linear, quadratic, and cubic trends over occurrences), comparing across the two modalities (speech = −1; song = 1), the two sessions (first session = −1; second session = 1), and the two time windows (250–500 ms = −1; 600–800 ms = 1). The three fixed factors were fully crossed (i.e., all the possible two-way and three-way interactions were included), and each interacted with the three polynomials. In the first stage of determining the random-effects structure, the dimensionality of the by-subject random effects was reduced to include the (non-correlated) random intercept and slope for modality. In the second stage, the by-word random effects nested within word pairs were added and systematically reduced to by-word random intercepts and by-word slopes for modality and session.
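The following R sketch shows how a model of this form could be specified with lme4, following the description above. The data frame 'd' is simulated only so that the call runs; the column names and the exact random-effects syntax are assumptions for illustration, not taken from the authors' analysis scripts.

```r
# Sketch of the first mixed model: orthogonal polynomials over the eight
# occurrences, fully crossed with modality, session, and time window (all
# coded -1/1), uncorrelated by-subject terms, and by-word terms nested in pairs.
library(lme4)

# Placeholder data with assumed column names (not the authors' variables).
d <- expand.grid(subject = factor(1:8), wordpair = factor(1:5), word = factor(1:2),
                 occ = 1:8, modality = c(-1, 1), session = c(-1, 1), window = c(-1, 1))
d$amplitude <- rnorm(nrow(d))                   # placeholder amplitudes (uV)

op <- poly(d$occ, degree = 3)                   # orthogonal linear/quadratic/cubic terms
d$ot1 <- op[, 1]; d$ot2 <- op[, 2]; d$ot3 <- op[, 3]

m1 <- lmer(
  amplitude ~ (ot1 + ot2 + ot3) * modality * session * window +
    (1 + modality || subject) +                 # uncorrelated by-subject intercept/slope
    (1 + modality + session || wordpair:word),  # by-word terms nested within pairs
  data = d, REML = FALSE
)
summary(m1)
```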

**Figure 6.** Average electroencephalography (EEG) amplitude in μV (averaged over blocks) in the early (250–500 ms) time window across the eight target occurrences in the familiarization phase in song (left panel) and speech (right panel). Gray lines connect individual participants' averages, indicated by gray points. The black lines provide a locally estimated scatterplot smoothing (LOESS)-smoothed development averaged over participants.

The follow-up models assessed the development of the word recognition effect separately in speech and in song, with as predictors the three polynomials for the eight occurrences, the contrasts for session and time window, the fully crossed fixed effects, and their interactions with the polynomials. To maintain consistency across analyses, we aimed for the random-effects structure of these analyses to contain a by-subject random intercept as well as the by-word random intercepts and slope for session. A model with this random-effects structure converged successfully for the song data, but it had to be reduced to by-subject and by-word random intercepts for the speech data.

The first model, which included both song and speech, revealed a marginally significant positive linear trend over occurrences (β = 51.31, *t* = 1.862, *p* = 0.063), a significant quadratic trend (β = −78.66, *t* = −2.858, *p* = 0.004), and a significant cubic trend (β = 81.74, *t* = 2.968, *p* = 0.003). The linear trend was modulated by a significant positive interaction with session (β = 5.634, *t* = 2.044, *p* = 0.041). The quadratic trend was modulated by a significant interaction with modality (β = −82.44, *t* = −2.996, *p* = 0.003). No other effects were statistically significant or marginally significant. Due to the interactions with session and, especially, modality, the interpretation of these effects requires analyzing subsets of the data. Since the effect of modality was of primary interest, these subset analyses separately analyzed the speech and the song data.

The model on only the speech data revealed a significant cubic trend over occurrences (β = 124.16, *t* = 3.367, *p* = 0.0007) and a significant interaction between session and the linear effect of occurrence (β = 96.52, *t* = 2.619, *p* = 0.009). No other effects were statistically significant or marginally significant. These findings support the observations regarding the speech sessions made from Figure 6: the EEG amplitude developed non-linearly over the course of a speech block, with two changes in the direction of the developing response (the initial increase was followed by a decrease and then another increase).

The model on only the song data revealed a significant linear trend over occurrences (β = −94.47, *t* = 2.295, *p* = 0.022) as well as a quadratic trend over occurrences (β = −161.61, *t* = −3.952, *p* < 0.001). There was a marginally significant effect of session (β = −1.393, *t* = −1.904, *p* = 0.065). No other effects were statistically significant or marginally significant. These findings support the observations regarding the song sessions as made from Figure 6: the EEG amplitude developed non-linearly over the course of a song block, with one change in direction of the developing response (the initial increase was followed by a decrease).

These effects were further teased apart in by-occurrence analyses reported in the Supplementary Table S4 (D1: "By-Occurrence analyses"), which confirmed that the difference between modalities was only apparent on occurrences five and six. A final series of models reported in the Supplementary Table S4 (D2: "Occurrence–session interaction") explored the interaction between the linear trend over occurrences and session, which was apparent in the analyses comparing song and speech as well as the speech-only data. Across the separate analyses of Sessions 1 and 2, the linear effect was only statistically significant in Session 2 of both song and speech data, but it was not statistically significant in Session 1. These results had no implications for the conclusions regarding the development of the familiarity response in speech and song as outlined above.

#### 3.3.2. Additional Follow-up Analyses: Responder Types in the Test Trials

Supplementary Table S4 (D3–5) reports additional exploratory follow-up analyses, which suggest that the group-level null effect in the test phase may mask a split between negative and positive responders and thus a robust segmentation effect (D3: "Formally establishing responder types"). Using these groups in further analyses seemed warranted, as there was no strong indication that the positive and negative responders were unequally distributed across the versions (D4: "Experimental control and responder types"). Comparisons of negative and positive responders suggest a more pronounced sinusoid-shaped development across the speech familiarization for negative than for positive responders (D5: "Comparing development in the familiarization phase across responder types in the test phase").

#### **4. Discussion**

The present study set out to test whether infants are able to segment words from ecologically valid children's songs and then transfer these units to recognition in the spoken register. This was done through an EEG familiarization paradigm for word segmentation [88,89], which presented infants with a series of familiarization-then-test blocks in separate song and speech sessions. Each block commenced with a familiarization phase in which one word was embedded eight times within a (unique) eight-phrase song or spoken fragment with the same words. The comparison between ERPs to the first two and last two target word occurrences was taken as an index of recognition in the familiarization phase, and speech and song sessions were compared to assess the potential beneficial or hindering effects of song compared to speech. Then, blocks continued with a spoken test phase consisting of two spoken phrases with the familiarized target word and two other spoken phrases containing a matched novel control word to which infants had not been familiarized. The difference in ERP response to the familiar target and control words would index word recognition in the test phase. For the speech sessions, we sought to replicate segmentation from speech as found in the study by Junge and colleagues [89]; for the song sessions, this familiarity effect in the test phase could indicate infants' ability to transfer words that are segmented from song to recognition in speech.

The planned analyses of the familiarization phase revealed that infants are able to segment words from songs as well as from speech. Specifically, infants' ERPs showed an increased positivity to the last two compared to the first two target word tokens in both song and speech. This finding shows that infants identify the repetition of word forms within songs with the lyrical and musical complexity of actual children's songs, thus extending previous results on infants' abilities to process syllables in segmentally and musically much simpler songs than those employed here [4–6]. However, in contrast to this previous work, the present findings provide no clear indication that the segmentation response is stronger in song than speech. Factors that may have contributed to the lack of an observed song benefit are discussed below.

The planned analysis of the test phase provided no evidence of segmentation from either song or speech. This means that the status of the elements that children segment from songs is currently unclear. The interpretation of the absence of song-to-speech transfer is complicated by the lack of evidence for segmentation in the speech test phase following the speech familiarization (thus not replicating [89]). This leaves us unable to provide a definitive answer as to whether infants transfer words that are segmented from song to recognition in speech. Infants' difficulties with the test phase will be addressed in more detail below.

In addition to these planned analyses, we also conducted a set of follow-up analyses to better understand the unexpected aspects of the results: the positive-going instead of negative-going response in the familiarization phase and the absence of a group-level effect in the test phase. The follow-up analyses of the familiarization data suggested that the response to target words develops non-linearly over a passage and develops differently for song compared to speech. The response to target words in songs showed an inverted U-shape: the positivity increased incrementally over the first four occurrences; then, it remained relatively stable in occurrences five and six and reduced slightly in amplitude in occurrences seven and eight. The response to target words in speech developed with a sinusoid-like shape: as in song, the positivity increased over the first four occurrences. However, the amplitude then attenuated in occurrences five and six, to increase again in the last two occurrences. This was the only difference observed between infants' responses to song and speech. One could very tentatively consider the inverted U in the development of song responses as a slower version of the first half of the sinusoid-shaped development of responses to speech. However, as this analysis was exploratory and the modulation over target occurrences differed from the previously found linear trajectory followed by a plateau at maximum negative amplitude [89], these patterns, and thus the potential difference between infants' processing of speech and songs, should be interpreted cautiously until replication.

The exploratory follow-up analyses of the test phase data indicate tentative support for a binary split between infants that displayed a positive versus a negative response to the familiar target versus novel control words. Moreover, infants with a more mature negative response in the test phase showed a stronger modulation over the familiarization phase than the infants with a less mature positive response, which is reminiscent of previous suggestions that infants with negative-going responses have more robust neural segmentation responses across various stages of the procedure [70,93]. While this binary split may explain the absence of a group-level effect in the test phase and offers a way to interpret the modulation in the familiarization phase, these results await replication, considering the exploratory nature of the binary split between infants with negative-going and positive-going responses (however, see precedents in [69–71,93]), as well as the unexpected shape of the modulation of the response over the familiarization phase.

Although the timing and topography of the ERP familiarity effect were comparable to previous reports, the present results differ from previous studies in two critical respects. Firstly, the familiarity response was positive-going for both song and speech, instead of the negative-going response that was predicted for segmentation by these 10-month-old infants [67,79,83,88,92,93]. A positive-going response as observed in the present study is generally considered less mature (see reviews by [71,83]). In the context of speech segmentation, a positivity is associated with segmentation by younger infants [69,83], with individual infants' poorer concurrent and later language outcomes [67,69,71,93], as well as with stimulus materials that are more difficult to process [69].

A second discrepancy from most of these previous studies is that we found no evidence of segmentation of either speech or song in the test phase. Thus, this (null) finding may qualify claims about the sensitivity of the ERP familiarity response compared to behavioral familiarization-then-test methods [69,80,91,92]. However, the apparent lack of a group-level segmentation response in the present study is in line with the aforementioned recently published results from over 100 9-month-old participants [71] as well as a large-scale failure to replicate speech segmentation in a behavioral task from 8 to 10.5 months [72]. While Kidd and colleagues explain the lack of a group-level response with reference to the large individual differences between infants of roughly the same age and find that the response polarity relates to vocabulary size [71], Floccia and colleagues refer to the acoustic properties of the stimuli [72]. This latter explanation is in keeping with the literature suggesting that the prosodic exaggeration of infant-directed speech facilitates segmentation by infants [85,86]; cf. [87].

The two discrepancies from the literature, a positive-going response and a lack of a segmentation effect in the test phase, could thus be explained with reference to one of two factors: the (linguistic) maturity of the infants, or at least a subgroup of them, and the properties of the stimuli. As we did not collect data about the participants' concurrent or later language outcomes, the relationship between the EEG responses and (linguistic) maturity cannot be further addressed. However, the role of the stimulus properties can be addressed by directly comparing the acoustic features of the present stimulus set to the stimuli from Junge and colleagues [89], who found a negative-going response in the familiarization as well as the test phase. For a valid comparison between the present stimuli and those of [89], we re-analyzed the average pitch and pitch range of the stimuli of Junge and colleagues and added the acoustic focus measure, using the same procedure as reported in the present Methods section. The duration values to quantify speaking rate are taken directly from [89].

This comparison between the stimulus sets will focus on three features that have been suggested in the literature as facilitating infant speech processing. First, an overall higher pitch with a more expanded range is often cited as the prime difference between IDS and adult-directed speech (ADS) (for reviews: [7,118]), facilitating segmentation on the basis of transitional probabilities [85,86] as well as possibly segmentation from natural speech [72]; cf. [87]. However, exaggerated pitch properties have not been found to enhance infant word recognition [119]. Second, pitch changes may also provide a specific acoustic focus on the target word. This focus appears essential for segmentation by 6-month-olds, and it still aids segmentation by 9-month-olds [83]. Third, a slower speaking rate is often cited as a key difference between IDS and ADS (cf. [120]), which benefits infants' word-processing efficiency [119]. A fast speaking rate is disruptive to infants' segmentation abilities, eliminating a group-level response in 11-month-olds and shifting the behavioral response to a less mature familiarity preference in 14-month-olds [121].

Of the three features scrutinized, overall pitch characteristics probably do not affect the maturity of the speech segmentation response observed in the present study: the average pitch was in fact higher in the present speech stimuli compared to those used by [89] (234 Hz versus 201 Hz), and the pitch range was highly comparable between the two datasets (10.90 semitones versus 10.28 semitones). Unless a lower-pitched voice is easier to segment, the present data support the conclusion from [119] that overall pitch characteristics probably do not facilitate infant word processing, extending the conclusion to segmentation. The specific acoustic emphasis on the word could be a contributing factor to the relatively immature speech segmentation response in the present study, as the acoustic focus on the target word was somewhat smaller in the present stimuli compared to the [89] stimuli (0.896 versus 0.930, respectively). This suggests that prosodic focus may still aid segmentation at 10 months, extending the age range for which prosodic focus is found to be beneficial from 9 to 10 months [83]. Finally, the speaking rate of the stimuli could (partially) account for the positive-going response. The speaking rate in our speech stimuli was faster than in [89], as evidenced by shorter target words and phrases (target words: 515 ms versus 694 ms, respectively; phrases: 2068 ms versus 2719 ms, respectively), despite both studies using disyllabic trochees as target words and having a very similar number of words per phrase (5.71 versus 5.75). This comparison agrees with the possible beneficial effect of a slow speaking rate for word recognition [119] and the hindering effects of a fast speaking rate for word segmentation [121].

The suggestion that acoustic prominence and speaking rate facilitate segmentation is also tentatively supported by the properties of the test phase stimuli. Recall that no group-level segmentation effect was observed in the test phase, but that about half of the participants displayed a (mature) negative-going response. Interestingly, the test stimuli were spoken with more acoustic focus on the target words than the familiarization stimuli (0.942 versus 0.896, respectively), which is comparable to the focus in the stimuli from Junge and colleagues ([89]; 0.930). Moreover, the test stimuli were somewhat slower than the familiarization stimuli, although not as slow as the [89] stimuli (target word duration: 538 ms; phrase duration: 2216 ms). In other words, the test stimuli might have been somewhat easier to segment than the familiarization stimuli. This might have facilitated a negative-going response in a larger subset of participants (or trials), possibly resulting in the disappearance of the group-level positive-going effect from the familiarization phase.

The present stimuli may have emphasized the target words occurring in the familiarization phase less than previous studies, because our stimuli were recorded in short stories rather than in individual sentences. Target words may be emphasized less in narratives, as the emphasis on repeated words diminishes over subsequent utterances, even in infant-directed speech [122]. If the recording of narratives and associated reduced prosodic emphasis is indeed responsible for the present study's less mature segmentation response, this raises questions about infants' ability to segment words from the speech that they are presented with during their daily interactions.

However, the acoustic properties of the stimuli do not explain all aspects of the present data, which becomes apparent when the song stimuli are considered. The song stimuli provided more prosodic focus on the target words (0.959) and were slower in speaking rate (target word duration: 794 ms; phrase duration: 3181 ms), even compared to the stimuli of Junge and colleagues [89]. If acoustic focus and/or a slow speaking rate were necessary and sufficient for a mature segmentation response, the song familiarization should have elicited a group-level negative-going response or enabled more children to display this response. Thus, the observed positivity in the song familiarization could reflect additional challenges in segmenting words from songs, such as the atypical acoustic signal of sung compared to spoken language ([41,42]; cf. [43]) or the increased risk of mis-segmentations from songs [44,45,123]. Alternatively, the acoustic focus and speaking rate might not explain the positive-going response in the present speech familiarization either, in which case we need to look to other factors to explain the different result patterns across studies.

One possible deviation in procedure compared to some previous studies (e.g., [88,89]) is that, to maintain the infant's engagement, besides showing desynchronized visual stimuli on a computer screen, in the present study not only the parent but also an experimenter sat with the infant and showed silent toys when necessary. Note that other experimenters (e.g., [71], who did not find a group-level effect in 9-month-olds) have also used such an entertainer. The additional visual distraction might have resulted in less selective attention to the auditory stimuli for at least some of the children or trials. Note that the allocation and maintenance of attentional focus undergo major developmental changes in infancy, with individual differences herein possibly being related to lexical development [124]. The precise effect of visual distraction (live or on screen) on the ERP word familiarity effect should be established in future studies. If live visual distraction were found to reduce or eliminate a mature negative-going response at the group level, this would raise questions about the effect of selective attention to auditory stimuli on the ERP familiarity effect in general. Is a negative ERP word familiarity effect possibly a reflection of more focused attention to the auditory stimuli, and do children who are better at selectively attending to auditory stimuli then develop language more quickly (as in [71])?

As described in the Introduction, the developmental shift from an initial positivity to a later negativity for word recognition responses has been ascribed to cortical maturation as well as to the development of the lexicon [71,83], with various stimulus and task characteristics giving rise to a 'lexical processing mode', resulting in the negative ERP familiarity effect. This repetition enhancement effect might reflect the strengthening of novel neural representations during active learning ([103]; see Introduction). The positive ERP familiarity effect for both speech and song stimuli for the 10-month-olds in the current study suggests that acoustic prominence, speaking rate, and selective auditory attention might be important to get infants into such an active lexical processing mode that is necessary for building a lexicon. Future research is needed to further clarify the development and precise characteristics of the infant ERP familiarity effect and to specify the exact role of lexical processing in its polarity—for example, by actively manipulating the lexical status of stimuli or the processing mode of the infant.

The absence of evidence for better segmentation from song compared to speech, although not the primary focus of the present study, could be considered surprising in light of previous research finding benefits of songs for language learning in adults and infants [4–6]. Two factors might have contributed to the lack of a song advantage in the present study. First, the present speech stimuli might have been more attractive than those used in the previous infant studies comparing language learning from song and speech. The speech stimuli of previous work were spoken in a completely flat contour [6], in adult-directed speech [4], or at a pitch intermediate between infant-directed and adult-directed speech [5]. In contrast, the present stimuli were elicited in an infant-directed register and had an overall pitch that was highly comparable to the average pitch in a sample of Dutch-speaking mothers interacting with their own 15-month-old infant (236/237 Hz and 234 Hz, respectively; [125]). Thus, the lack of a song advantage in the present study could reflect the beneficial effects of infant-directed speech for infant speech segmentation [85,86]; cf. [87]. A second factor contributing to the lack of an observed song advantage might have been the relative complexity of the songs in the present study. The stimuli of previous work all presented consistent syllable–tone pairings, which were combined into one [5], two [4], or four [6] four-syllable melodies that were repeatedly presented to the infants during familiarization. In contrast, the present stimuli consisted of 20 novel melodies, some of which included a melodic repetition in the second four-phrase verse. As melodic repetition facilitates word recall from songs [24], the children's songs in the present study may have been too novel and complex to enhance segmentation. We look forward to future research investigating whether song complexity and, in particular, experience with the songs increase the benefit of songs for word learning.

In sum, this study provided electrophysiological evidence that 10-month-old Dutch children can segment words from songs as well as from speech, although the segmentation response was less mature than that observed in previous studies and did not persist into the test phase. Close inspection of the stimuli suggested that a limited prosodic focus on the target words and a relatively fast speaking rate may have inhibited more mature responses to the speech stimuli. However, these same cues were strongly present in the song stimuli, suggesting that other factors may suppress mature segmentation from songs. Future research will need to establish which aspects of songs, including children's familiarity with them, contribute to children's segmentation from speech and song and to what extent children can recognize words segmented from songs in speech.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2076-3425/10/1/39/s1, Table S1: Acoustic Properties of the Experimental Stimuli, Table S2: Pre-Test on the Melodies, Table S3: Full Set of Stimulus Materials, Table S4: Additional Analyses to the Follow-up Analyses.

**Author Contributions:** Conceptualization, T.M.S., T.B., and P.F.; methodology, T.M.S., T.B., and P.F.; software, T.M.S. and T.B.; validation, T.M.S. and T.B.; formal analysis, T.M.S. and T.B.; investigation, T.M.S., and T.B.; resources, T.M.S., T.B., and P.F.; data curation, T.M.S. and T.B.; writing—original draft preparation, T.B. and T.M.S.; writing—review and editing, T.B., T.M.S., and P.F.; visualization, T.M.S. and T.B.; supervision, T.B., T.M.S., and P.F.; project administration, T.M.S.; funding acquisition, P.F. and T.M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by NWO (Nederlandse Organisatie voor Wetenschappelijk Onderzoek), grant number 275-89-023 (VENI) to T.M.S.

**Acknowledgments:** We thank all the lab assistants for their help with the pretest and with stimulus creation, particularly Catheleyne Creusen, Esther Kroese, Melissa van Wijk, and Lisa Rommers. We thank Annelies van Wijngaarden for lending her voice to the speech and song stimuli. Many thanks to Renske van der Cruijsen for scheduling and testing most of the infant EEG sessions, with the assistance of Laura Hahn and several lab rotation students. We are greatly indebted to all infants and parents who participated in the experiment. Thanks to Laura Hahn and other members of the FLA-group of Radboud University for commenting on earlier drafts of the paper, and to George Rowland for proofreading. We thank Caroline Junge for generously sharing her stimuli for the further acoustic analyses reported in the discussion of this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **How the Brain Understands Spoken and Sung Sentences**

**Sonja Rossi 1,\*, Manfred F. Gugler 2, Markus Rungger 3, Oliver Galvan 3, Patrick G. Zorowka <sup>3</sup> and Josef Seebacher <sup>3</sup>**


Received: 29 November 2019; Accepted: 6 January 2020; Published: 8 January 2020

**Abstract:** The present study investigates whether meaning is similarly extracted from spoken and sung sentences. For this purpose, subjects listened to semantically correct and incorrect sentences while performing a correctness judgement task. In order to examine the underlying neural mechanisms, a multi-methodological approach was chosen, combining two neuroscientific methods with behavioral data. In particular, fast dynamic changes reflected in the semantically associated N400 component of the electroencephalogram (EEG) were assessed simultaneously with the topographically more fine-grained vascular signals acquired by functional near-infrared spectroscopy (fNIRS). EEG results revealed a larger N400 for incorrect compared to correct sentences in both spoken and sung sentences. However, the N400 was delayed for sung sentences, potentially due to the longer sentence duration. fNIRS results revealed larger activations for spoken compared to sung sentences irrespective of semantic correctness in predominantly left-hemispheric areas, potentially suggesting a greater familiarity with spoken material. Furthermore, the fNIRS revealed a widespread activation for correct compared to incorrect sentences irrespective of modality, potentially indicating successful processing of sentence meaning. The combined results indicate similar semantic processing in speech and song.

**Keywords:** semantics; speech comprehension; singing; N400; event-related brain potentials (ERPs); functional near-infrared spectroscopy (fNIRS)

#### **1. Introduction**

Speech communication is a unique human ability. Likewise, listening to and playing music is found only in humans. Both speech and music include productive and perceptual aspects; in the present study, we focus on perception. Language as well as music processing can be partitioned into several sub-abilities such as the identification of single sounds, syntactic-combinatorial rule extraction, melodic perception, and meaning extraction [1]. We place the emphasis on the perception of linguistic meaning in speech and song. Singing is a form of music that carries direct semantic meaning, as spoken language does, but with additional melody. Speech and song differ in several respects: songs display more precise articulation and longer vowel durations than speech [2,3]. Furthermore, pitch is altered in song, exhibiting a more discrete F0 contour and more fine-grained, accurate pitch processing compared to speech [4]. Singing is an important evolutionary phenomenon, as early humans already used a protolanguage similar to singing [5], and even today parent–child interaction is characterized by singing, which supports bonding [6,7]. We opted to investigate this kind of music because hearing-impaired patients have more difficulty extracting meaning from sung than from spoken sentences (e.g., [8]). In an ongoing study in our lab, we aim to better understand the neural processing of speech comprehension in different groups of hearing-impaired patients, and we are particularly interested in whether they show similar or altered neural mechanisms of semantic processing. Before pathological data can be interpreted, however, processing mechanisms in healthy subjects must be clearly understood. Musical training in general, and singing in particular, have been found to benefit semantic processing [9], foreign language learning [10,11], and the perception of speech in noise [12–14]. Music is also beneficial for language production during the rehabilitation of language disorders such as aphasia [15–19], and deaf children supplied with a cochlear implant benefit from musical training, improving auditory abilities as well as language production and perception [20–22]. We are primarily interested in neural mechanisms as a direct measure of semantic processing and will compare these to behavioral performance during a correctness judgement task. Because different neuroscientific methods assess different neural signals of the brain, their results may lead to different conclusions. Hence, we opt for a multi-methodological approach in which we simultaneously apply electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS). EEG, and in particular the investigation of event-related brain potentials (ERPs), records electrical signals from the scalp and can capture fast dynamic processing mechanisms in the range of tens of milliseconds. Its topographical resolution, however, is coarse, and for this spatial information fNIRS is an ideal complement. fNIRS is an optical method that assesses the vascular response by means of near-infrared light (for a review on fNIRS see [23]). Even though this response evolves on a much slower timescale than the EEG signal, it allows involved brain areas to be identified reliably. It can measure brain responses from about 3 cm below the scalp, so that only cortical regions can be reached in adult participants. The combination of these two methods is well suited to auditory stimuli because (1) both are silent, in contrast to functional magnetic resonance imaging (fMRI), which produces loud noise during data acquisition; (2) they do not interfere with each other, unlike EEG and fMRI; and (3) they allow a comfortable measurement setting in which subjects are seated in a chair instead of lying in an MRI scanner.

#### *1.1. Electrophysiological Correlates of Semantic Processing in Speech and Song*

In language comprehension research, semantic processing has been investigated with several experimental designs. One of these is the priming design, in which a prime is presented followed by a target stimulus. Usually, the target stimulus is semantically related or unrelated to the preceding prime. An electrophysiological correlate of semantics elicited in priming paradigms is the N400 component. The amplitude of the N400 decreases with stimulus repetition and is thus enhanced for unrelated targets (e.g., [24–27]). This centro-parietal ERP component reflects the degree of semantic relatedness between prime and target, and is thus an index of semantic processing (for a review see [28]). A similar paradigm was also adopted for investigating meaning in music. Some studies addressed the question of whether instrumental music without lyrics can convey extra-musical meaning such as iconic, indexical/emotional, or symbolic/cultural meaning, or intra-musical meaning (i.e., the structure of musical elements) (please refer to the reviews by [29,30]). As primes, musical excerpts without lyrics [31], single chords [32], or single tones [33] were used, followed by a semantically related or unrelated target word. An N400 modulation similar to that for speech (i.e., a larger amplitude for unrelated targets) was also found in this musical context, suggesting shared mechanisms of semantic priming in speech and music. Electrophysiological studies specifically investigating sung material are scarce. However, semantic processing in songs also elicited larger N400s for target words unrelated to the final word of familiar and unfamiliar pop song excerpts compared to related words [34], or when sung target words differed from sung prime words compared to repetitions of the same words [35].

Another important paradigm for investigating semantic processing is to introduce selection restriction errors into sentences and compare them to semantically correct sentences [36,37]. Such a design can be adopted in both spoken and sung sentences or phrases. As in the semantic priming paradigm, the N400 component in the EEG reliably indexes the integration of semantic information in this experimental context; its amplitude is larger for semantically incorrect compared to correct sentences [36,38,39]. To our knowledge, there is only one electrophysiological study introducing semantic errors into sung musical excerpts [40]. In this study, familiar excerpts from French operas sung a cappella were presented to professional opera musicians. Songs were manipulated in such a way that the original version was contrasted with a version containing either semantically incorrect final words or melodic incongruities. Semantically incorrect words elicited a larger N400 component compared to correct words in these sung excerpts. This finding is in line with a magnetoencephalographic study in professional singers and actors using spoken and sung excerpts from Franz Schubert in which the final word was either semantically correct or incorrect [41].

Electrophysiological findings seem to suggest that semantic processing in speech and song is supported by similar processing mechanisms (for a review please refer to [42]). However, no study to date has directly compared electrophysiological responses to selection restriction errors in spoken and sung sentences.

#### *1.2. Brain Regions Supporting Semantic Processing in Speech and Song*

Several models have tried to assign different aspects of speech and music to the two hemispheres of the brain. Zatorre and colleagues postulated that the auditory cortices of both hemispheres are specialized for different auditory analyses, whereby the left temporal area is more sensitive to fine-grained temporal analyses and the right temporal area reacts more to spectral variations [43,44]. This difference led to the conclusion that speech is processed predominantly by left and music by right temporal areas [43]. The multi-time resolution hypothesis proposed by Poeppel and colleagues postulates a dichotomy of the left and right auditory cortices based on the temporal variations contained in speech and music [45]. Fast auditory transitions are assumed to be processed bilaterally, while slow transitions predominantly recruit right temporal areas. Such a hemispheric specialization is already visible in newborn infants when they are confronted with auditory stimuli with varying temporal modulations [46]. These models predominantly focus on the auditory cortex. The Dynamic Dual Pathway model [47], in contrast, differentiates between different linguistic functions and allocates them to cortical regions of the two hemispheres. The model postulates that segmental information such as phonology, syntax, and semantics is predominantly processed by a fronto-temporal network in the left hemisphere, while prosody is located primarily in homologous right-hemispheric areas.

When focusing on semantic processing in particular, a ventral stream including the superior and middle portions of the temporal lobe has been proposed [48]. This stream seems to be bilaterally distributed with a weak left-hemispheric dominance. How dominant the lateralization is usually depends on which linguistic or musical aspects are contrasted with each other.

Brain regions activated by priming paradigms, and thus supporting semantic relatedness, access to lexical storage, and semantic selection, were found to be located predominantly in temporal regions (particularly the middle temporal gyrus (MTG) and the superior temporal sulcus (STS)) and frontal regions (especially the inferior frontal gyrus (IFG) and orbitofrontal cortex (OFC)) in speech (e.g., [49–51]) but also in music [32]. Using unfamiliar songs that were repeated either with the same or different lyrics or with the same or different tunes, non-musicians showed a larger left-hemispheric activation in the anterior STS for lyrics than for tunes, suggesting a greater autonomy of linguistic meaning, probably because participants could rely more on their linguistic than on their musical expertise [52]. It should be noted that this study did not introduce any experimental task; subjects simply listened passively to the same/different repetitions. In contrast, Schön and colleagues [53] presented pairs of spoken, vocalized (i.e., sung without words), or sung words and asked subjects to explicitly judge whether word pairs were the same or different. The authors found that similar brain regions were activated for spoken, vocalized, and sung words relative to a noise stimulus. Differences arose at a quantitative rather than qualitative level. A larger activation in the left IFG was found for sung compared to vocalized words, as they contained real words and thus more linguistically relevant features. Temporal areas (MTG and the superior temporal gyrus (STG)), in contrast, were activated by both linguistic and non-linguistic features, leading to the conclusion that these regions are recruited in a domain-independent manner.

Some neuroimaging studies compared different degrees of melodic information together with speech/lyrics and found differences especially in hemispheric lateralization. Merrill and colleagues [54] presented six different sentence categories: spoken, sung, with hummed melody (i.e., including prosodic pitch) or song melody (i.e., including melodic pitch), and with speech rhythm or musical rhythm. While activations were similar in bilateral temporal lobes, differences were present with respect to the inferior frontal areas. Sung sentences elicited increased activations in the right IFG, similar to melodic pitch. Prosodic pitch, on the contrary, gave rise to activations predominantly in the left IFG. These findings fit with results obtained in a study [55] contrasting linguistic prosody (i.e., whether the phrase was spoken as a statement or question) with speech recognition (i.e., identifying whether a word was the same or different relative to the previous word). In this contrast, a larger recruitment of a right-hemispheric temporo-frontal network was found for linguistic prosody because of a stronger reliance on prosodic, thus melodic, aspects. The degree of linguistic or melodic features contained in the presented acoustic material therefore seems relevant for a correct interpretation of the observed activations. Along these lines, direct comparisons between listening to or producing spoken and sung material (e.g., familiar songs, words, phrases) showed an increased right-hemispheric dominance of the middle STG, the planum temporale (PT), and the OFC for sung compared to spoken songs [56–58]. The authors interpret the PT as being involved in the transformation of auditory input into motor representations relevant for speech production. The OFC is assumed to process pleasant and unpleasant emotional aspects during music perception. Interestingly, when comparing sung music (i.e., linguistic and melodic information) as well as instrumental music (i.e., only melodic information) to spoken songs (i.e., only linguistic information), an increased activation was found not only in the right planum temporale but also in the bilateral anterior planum polare, suggesting that these regions encode music/timbre in both instrumental and sung music [57]. Furthermore, spoken and sung material activated the STS bilaterally, indicating that this area is sensitive to human nonlinguistic vocalizations.

Brain regions subserving semantic processing at the sentential level in speech similarly include left or bilateral temporal areas (particularly STG and MTG), left or right frontal areas, and sometimes left parietal areas (i.e., the angular gyrus) [59–62]. Bilateral temporal areas are assumed to reflect the semantic integration or semantic evaluation of the sentence; however, some fMRI studies found increased activations in this region for correct compared to incorrect sentences [59], while others revealed a reversed activation pattern [62]. Frontal regions were found to be associated with semantic selection processes [61,62], whereas left temporal and temporo-parietal areas were also discussed as being involved in the integration of different syntactic and semantic pieces of information in a sentence [59,63]. While this paradigm of integrating semantic errors into sentences has been successfully applied in language research, to date no neuroimaging study has used it to investigate semantic processing in songs. In the present study, we therefore integrated semantic anomalies into sentences that were either spoken or sung in order to directly compare the underlying neural processing mechanisms.

#### *1.3. The Present Study*

The focus of the present study lies on semantic processing in speech and song. Even though several studies have examined semantics by means of a priming design, it is not well understood how melodic aspects contribute to the extraction of meaning from semantically correct and incorrect sentences. Thus, we created a set of semantically correct and incorrect sentences that were either spoken or sung. While subjects listened to these sentences and performed a correctness judgement task, neural processing was assessed via the simultaneous application of electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS). This multi-methodological approach was chosen for several reasons: (1) only one electrophysiological study [40] has so far investigated semantic errors in sung sentences, in professional musicians, and no direct comparison with spoken sentences in the same non-musically trained subjects has been performed until now; (2) no neuroimaging study has so far directly investigated semantic errors in sung sentences; and (3) the use of fNIRS, in contrast to fMRI in particular, is very advantageous because the method is completely silent, without any scanner noise, and is thus well suited to acoustic stimuli. In the EEG we focus on the well-established N400 ERP component, while the fNIRS is capable of identifying the underlying brain areas. In particular, the involvement of the same or different neural networks in sung and spoken sentences, as well as the degree of lateralization, will provide important insights into the neural underpinnings of semantic processing in speech and song and may be relevant for therapeutic interventions in hearing-impaired patients in the future.

#### **2. Materials and Methods**

#### *2.1. Participants*

Twenty German native speakers (10 female) participated in the study (mean age: 38.65 years; range: 28–53 years). All participants grew up monolingually with German and had learned foreign languages mostly at school or through friends, but not intensively in their family environment as bilingual subjects do. All subjects learned English as their first foreign language, at a mean age of 10.67 years (range: 3–11 years; data missing for one subject). Other foreign languages were also learned (1 subject had learned 4 additional foreign languages, 2 subjects had learned 3, 4 subjects had learned 2, and 7 subjects had learned 1 additional foreign language). All subjects were right-handed according to the Oldfield Handedness Inventory [64] (mean % right-handedness: 73.68; range: 0–100), had no neurological disorders, were not born prematurely, took no medication affecting cognitive functioning, and had normal or corrected-to-normal vision and normal hearing in both ears (assessed by an otorhinolaryngologist and an audiologist of the Department of Hearing, Speech, and Voice Disorders of the Medical University of Innsbruck by means of a pure tone audiogram (PTA), with thresholds <30 dB HL at the audiometric test frequencies 500, 1000, 2000, and 4000 Hz (PTA4 average)). No subject was a professional musician. Eighteen subjects entered the EEG analyses (2 were excluded due to technical problems) and 18 subjects entered the fNIRS analyses (2 were excluded due to technical problems). The excluded subjects differed between EEG and fNIRS because each technical problem affected only one of the two methods; the respective other method remained unaffected.

#### *2.2. Materials*

The language material consisted of 88 German sentences (44 semantically correct, 44 semantically incorrect) with the following structure: definite article–subject–auxiliary–definite article–object–past participle. All sentences were constructed in the past perfect tense. All nouns (subject and object) were bisyllabic; past participle verbs were trisyllabic. Example of a correct sentence: "Der Forscher hat die Firma gegründet" (English translation preserving German word order: "The researcher has the company founded"). Example of an incorrect sentence: "Der Forscher hat die Birke gegründet" (English translation preserving German word order: "The researcher has the birch founded"). Semantic incorrectness was achieved by a selection restriction error.

All correct and incorrect sentences were naturally spoken and sung by a male speaker who worked as a speech therapist and was trained as a professional singer. Sung sentences were set to four different melodies (2 rising, 2 descending) in order to provide greater melodic variety to subjects, while the rhythm was kept constant (please refer to the Supplementary Materials for auditory examples of correct/incorrect spoken/sung sentences). Acoustic stimuli were digitally recorded in an anechoic chamber at a sampling rate of 44 kHz and 16 bits. Afterwards, the acoustic stimuli were edited using the editing program Audacity (www.audacityteam.org). This included inserting 30 ms of silence at the onset and offset of each sentence as well as loudness normalization. Furthermore, each individual word of the sentences was marked, and the individual onset times of each word were extracted. This was necessary in order to insert the exact timing of each word into the EEG and fNIRS marker files for the neuroscientific analyses. The duration of the critical verb was as follows: correct spoken: 1198 ms, incorrect spoken: 1190 ms, correct sung: 1744 ms, and incorrect sung: 1722 ms. An ANOVA with the within-subject factors *condition* (correct vs. incorrect) and *modality* (spoken vs. sung) revealed a significant main effect of *modality* [*F* (1,43) = 449.293, *p* < 0.0001], indicating longer verbs in sung compared to spoken sentences. We also tested the duration of the whole sentences and found a similar main effect of *modality* [*F* (1,43) = 130.862, *p* < 0.0001]. Again, sung sentences were longer than spoken ones (correct spoken: 4492 ms, incorrect spoken: 4362 ms, correct sung: 5193 ms, and incorrect sung: 5069 ms).
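
For illustration, the following Python sketch shows the kind of stimulus editing described above (silence padding, level normalization, duration measurement). It is not the authors' actual Audacity workflow; the file name and RMS target are hypothetical, and a mono 16-bit PCM recording is assumed.

```python
# A rough sketch of the stimulus editing described above (not the authors' Audacity
# workflow): pad 30 ms of silence at onset and offset, apply a simple RMS-based level
# normalization, and report the resulting duration.
import numpy as np
from scipy.io import wavfile

PAD_MS = 30          # silence added at onset and offset (ms)
TARGET_RMS = 0.1     # assumed normalization target (full scale = 1.0)

rate, data = wavfile.read("sentence_correct_spoken_01.wav")   # hypothetical file
data = data.astype(np.float64) / np.iinfo(np.int16).max       # scale to -1..1 (assumes mono, 16-bit)

pad = np.zeros(int(rate * PAD_MS / 1000))
data = np.concatenate([pad, data, pad])                        # 30 ms silence at both ends

data *= TARGET_RMS / np.sqrt(np.mean(data ** 2))               # RMS normalization

print(f"duration: {len(data) / rate * 1000:.0f} ms")
wavfile.write("sentence_correct_spoken_01_edited.wav", rate,
              (data * np.iinfo(np.int16).max).astype(np.int16))
```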

#### *2.3. Experimental Procedure*

The present study was approved by the Ethics Committee of the Medical University of Innsbruck (approval code: 1041/2017). Prior to participating in the experiment, subjects were informed in detail about the aims of the study, the sequence of the experiment, the methods, the exact application procedures, the risks, and the actions to minimize these risks. After having the possibility to clarify any questions, subjects gave written informed consent to take part in the study. Subjects did not receive any compensation for participation.

The experiment was controlled by means of the software Presentation (www.neurobs.com). The presentation sequence started with a fixation cross shown for 500 ms on a 24'' monitor positioned 1 m in front of the subject. Afterwards, the acoustic presentation of the sentence started via stereo loudspeakers positioned below the monitor. Sentences were presented at a sound level of approximately 70 dB. The maximum duration of the sentence presentation slot was 6 s. During this time the fixation cross remained on the screen in order to mitigate effects of eye movements on the EEG signal. After the sentence, the fixation cross was again presented for 500 ms. This was followed by the visual presentation of a sad and a happy smiley, initiating the correctness judgement task. During this task, subjects had to press either the left or the right mouse button to indicate whether the previously heard sentence was semantically correct (indicated by a happy smiley) or not (indicated by a sad smiley). The position of the smileys on the monitor as well as the required button presses were counter-balanced across participants. Subjects had to respond within 3 s, and the presentation sequence continued as soon as they pressed the button. Afterwards, a variable inter-stimulus interval (ISI) of 6 s on average (range: 4–8 s) followed. This long ISI had to be introduced because functional near-infrared spectroscopy measures the sluggish hemodynamic response function (HRF), which peaks around 5 s and returns to baseline after 15–20 s [65]. Because the HRFs of successive sentences would overlap in time, a variable ISI prevents systematic overlap and allows the brain activation for each experimental condition to be disentangled.
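
As a toy illustration of this trial timing (not the authors' Presentation script), the snippet below draws jittered ISIs uniformly between 4 and 8 s, which average about 6 s and de-correlate the slow hemodynamic responses of successive trials; the trial count and random seed are arbitrary.

```python
# A toy illustration of the trial timing (not the authors' Presentation script):
# fixed fixation and response windows plus a jittered inter-stimulus interval.
import numpy as np

FIXATION_S = 0.5          # fixation cross before and after the sentence
SENTENCE_SLOT_S = 6.0     # maximum sentence presentation slot
RESPONSE_WINDOW_S = 3.0   # maximum time for the correctness judgement

rng = np.random.default_rng(42)
isis = rng.uniform(4.0, 8.0, size=88)   # one jittered ISI per trial (seconds)

print(f"mean ISI: {isis.mean():.2f} s (range {isis.min():.2f}-{isis.max():.2f} s)")
print(f"longest possible trial: "
      f"{2 * FIXATION_S + SENTENCE_SLOT_S + RESPONSE_WINDOW_S + isis.max():.1f} s")
```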

Eight different pseudo-randomization versions were created based on the following rules: (1) not more than 3 correct or incorrect sentences in succession, (2) not more than 3 spoken or sung sentences in succession, (3) at least 10 items between sentences of the same sentence pair, (4) in each experimental half an equal amount of correct and incorrect sentences, and (5) in each experimental half the same amount of spoken and sung sentences.
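
The five rules can be expressed as simple checks on a candidate trial order. The sketch below is only an illustration: the item fields ("pair", "correct", "sung") and the way the example item list is built are assumptions for demonstration, not the authors' actual randomization procedure.

```python
# The five pseudo-randomization rules expressed as checks on a candidate trial order.
from itertools import groupby

def max_run(values):
    """Length of the longest run of identical consecutive values."""
    return max(len(list(g)) for _, g in groupby(values))

def min_pair_distance(order):
    """Smallest distance between the two members of any sentence pair."""
    seen, dist = {}, len(order)
    for i, item in enumerate(order):
        if item["pair"] in seen:
            dist = min(dist, i - seen[item["pair"]])
        seen[item["pair"]] = i
    return dist

def check_rules(order):
    half = len(order) // 2
    halves = (order[:half], order[half:])
    return {
        "rule 1: <= 3 correct/incorrect in a row": max_run([i["correct"] for i in order]) <= 3,
        "rule 2: <= 3 spoken/sung in a row": max_run([i["sung"] for i in order]) <= 3,
        "rule 3: >= 10 items between pair members": min_pair_distance(order) >= 10,
        "rule 4: halves balanced for correctness": all(sum(i["correct"] for i in h) == half // 2 for h in halves),
        "rule 5: halves balanced for modality": all(sum(i["sung"] for i in h) == half // 2 for h in halves),
    }

# Illustrative items: 44 sentence pairs, each with a correct and an incorrect member.
items = [{"pair": p, "correct": c, "sung": (p + c) % 2}
         for p in range(44) for c in (0, 1)]

for rule, satisfied in check_rules(items).items():
    print(rule, "->", "OK" if satisfied else "violated")
```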

Completing the experiment took about 45 min on average. In order to prevent fatigue, two standardized pauses were introduced, one after every 15 min of testing.

#### *2.4. Neuroscientific Recording*

#### 2.4.1. EEG Recording

The electroencephalogram (EEG) was recorded from 13 Ag/AgCl active electrodes (Brain Products GmbH, Gilching, Germany). Nine electrodes were placed on the scalp (F3, Fz, F4, C3, Cz, C4, P3, Pz, P4; see Figure 1), while the ground electrode was positioned at AFz and the reference electrode at the nasal bone. One electrode above the right eye (at position FP2) measured the vertical electro-oculogram, while one electrode at the outer canthus of the right eye (at position F10) recorded the horizontal electro-oculogram. Electrode impedance was controlled using the actiCap Control software (Brain Products GmbH, Gilching, Germany) and kept below 10 kΩ. The EEG signal was recorded with the software Brain Vision Recorder (Brain Products GmbH, Gilching, Germany) at a sampling rate of 1000 Hz and amplified between 0.016 and 450 Hz. An anti-aliasing filter with a cut-off at 450 Hz (slope: 24 dB/oct) was applied prior to analogue-to-digital conversion.

**Figure 1.** EEG-fNIRS positioning. PFi = prefrontal inferior, PFs = prefrontal superior, F = frontal, FT = fronto-temporal, T = temporal, TP = temporo-parietal, P = parietal, L = left, R = right.

#### 2.4.2. fNIRS Recording

Functional near-infrared spectroscopy (fNIRS) was recorded by means of the NIRScout device (NIRx Medical Technologies, LLC, USA), using a dual wavelength (850 and 760 nm) continuous-wave system. Signals were recorded from 14 channels, resulting from a combination of 8 light emitters and 8 light detectors positioned over bilateral prefrontal, frontal, temporal, temporo-parietal, and parietal areas (see Figure 1). The emitter–detector distance was 3.5 cm. Sampling rate was 7.81 Hz.

#### *2.5. Data Analyses*

#### 2.5.1. Behavioral Data Analyses

In the correctness judgement task, subjects had to indicate whether the sentence they had heard was semantically correct or incorrect. The percentage of correct responses as well as the associated reaction times were extracted and analyzed by means of an ANOVA with the within-subject factors *condition* (correct vs. incorrect) and *modality* (spoken vs. sung). The significance level was set at *p* < 0.05. In the case of a significant interaction, post-hoc *t*-tests were performed, adjusted using the False Discovery Rate [66].
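
A minimal sketch of this analysis in Python, assuming long-format data with one mean value per subject and cell; the data generated here are synthetic and the variable names are hypothetical, not the study's actual results.

```python
# A minimal sketch of the behavioral analysis: a 2 x 2 repeated-measures ANOVA and
# FDR-corrected post-hoc paired t-tests, on synthetic reaction times.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
rows = [{"subject": s, "condition": cond, "modality": mod,
         "rt": rng.normal(470 if mod == "sung" else 450, 30)}
        for s in range(20)
        for cond in ("correct", "incorrect")
        for mod in ("spoken", "sung")]
df = pd.DataFrame(rows)

# 2 x 2 within-subject ANOVA on mean reaction times (one value per subject and cell)
print(AnovaRM(df, depvar="rt", subject="subject", within=["condition", "modality"]).fit())

# Post-hoc paired t-tests (here: spoken vs. sung within each condition), FDR-corrected
pvals = []
for cond in ("correct", "incorrect"):
    spoken = df.query("condition == @cond and modality == 'spoken'").sort_values("subject")["rt"]
    sung = df.query("condition == @cond and modality == 'sung'").sort_values("subject")["rt"]
    pvals.append(stats.ttest_rel(spoken.values, sung.values).pvalue)
reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
print("FDR-adjusted p-values:", p_adj, "significant:", reject)
```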

#### 2.5.2. EEG Data Analyses

EEG data analyses were performed with the software Brain Vision Analyzer 2 (Brain Products GmbH, Gilching, Germany). EEG data were first low-pass filtered with a cut-off of 30 Hz (slope: 12 dB/oct, Butterworth zero-phase filter). Afterwards, the data were segmented relative to the critical verb in the sentence, from 200 ms before verb onset until 1500 ms after verb onset. An ocular correction based on the Gratton and Coles algorithm [67] was applied in order to correct vertical eye blinks; other artifacts were rejected manually. A baseline correction (−200–0 ms) was applied. Event-related brain potentials (ERPs) were extracted for each subject and each experimental condition (correct spoken, incorrect spoken, correct sung, incorrect sung), followed by the calculation of grand averages in a time window from −200 ms until 1500 ms time-locked to verb onset. After artifact rejection, 75.6% (range: 50%–96.2%) of correct spoken, 75% (range: 39.4%–97%) of incorrect spoken, 76.6% (range: 51.3%–94.4%) of correct sung, and 77.8% (range: 51%–95.2%) of incorrect sung sentences entered the final statistical analyses.
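
A rough open-source analogue of this pipeline could look as follows in MNE-Python. This is a sketch, not the authors' Brain Vision Analyzer workflow: the file name, EOG channel names, and event codes are hypothetical, and a regression-based EOG correction merely stands in for the Gratton and Coles algorithm, which MNE does not provide under that name.

```python
# A rough MNE-Python analogue of the EEG preprocessing described above.
import mne

raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)   # hypothetical file
raw.set_channel_types({"VEOG": "eog", "HEOG": "eog"})                # hypothetical EOG channel names
raw.filter(l_freq=None, h_freq=30.0)                                 # 30 Hz low-pass

events, _ = mne.events_from_annotations(raw)
event_id = {"correct/spoken": 11, "incorrect/spoken": 12,            # hypothetical marker codes
            "correct/sung": 21, "incorrect/sung": 22}

# Epochs from -200 to 1500 ms around verb onset, with baseline correction
epochs = mne.Epochs(raw, events, event_id, tmin=-0.2, tmax=1.5,
                    baseline=(-0.2, 0.0), preload=True)

# Regression-based ocular correction (stand-in for Gratton & Coles)
weights = mne.preprocessing.EOGRegression(picks="eeg", picks_artifact="eog").fit(epochs)
epochs = weights.apply(epochs)

# Per-condition averages (ERPs) for one subject
evokeds = {cond: epochs[cond].average() for cond in event_id}
```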

Statistical analyses were conducted on mean amplitudes. Because the difference between correct and incorrect sentences was delayed in time for both spoken and sung sentences, two time windows, 500–900 ms and 800–1200 ms, were chosen based on visual inspection of the grand averages. The first time window captured the N400 difference between correct and incorrect sentences for spoken sentences, while the second time window captured the difference for sung sentences. For these analyses, a repeated-measures ANOVA with the within-subject factors *condition* (correct vs. incorrect), *modality* (spoken vs. sung), *region* (anterior vs. central vs. posterior), and *hemisphere* (left vs. right) was performed for lateral electrodes. Midline electrodes underwent an ANOVA with the factors *condition*, *modality*, and *electrodes*. For each modality, the mean amplitudes from the respective time window entered the above-mentioned statistical analyses. The significance level was set at *p* < 0.05. Post-hoc *t*-tests were performed, and the False Discovery Rate [66] was applied to correct for multiple comparisons. Whenever Mauchly's test of sphericity became significant, the Greenhouse–Geisser correction [68] was applied.
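
The window-based mean-amplitude extraction can be sketched as follows. The ERP arrays here are synthetic stand-ins for one subject's condition averages, and the electrode grouping is simplified (hemisphere is ignored for brevity); one such value per subject, condition, modality, and region would then enter the repeated-measures ANOVA, for example with AnovaRM as in the behavioral sketch above.

```python
# A self-contained sketch of the window-based mean-amplitude extraction:
# 500-900 ms for spoken and 800-1200 ms for sung sentences, pooled per region.
import numpy as np

sfreq = 1000.0
times = np.arange(-0.2, 1.5, 1 / sfreq)              # seconds relative to verb onset
channels = ["F3", "F4", "C3", "C4", "P3", "P4"]
regions = {"anterior": ["F3", "F4"], "central": ["C3", "C4"], "posterior": ["P3", "P4"]}
windows = {"spoken": (0.5, 0.9), "sung": (0.8, 1.2)}  # modality-specific time windows

rng = np.random.default_rng(0)
erps = {(cond, mod): rng.normal(0, 2e-6, (len(channels), times.size))   # channels x samples, volts
        for cond in ("correct", "incorrect") for mod in ("spoken", "sung")}

rows = []
for (cond, mod), data in erps.items():
    t0, t1 = windows[mod]
    samples = np.flatnonzero((times >= t0) & (times <= t1))
    for region, chs in regions.items():
        idx = [channels.index(ch) for ch in chs]
        rows.append({"condition": cond, "modality": mod, "region": region,
                     "mean_uV": data[np.ix_(idx, samples)].mean() * 1e6})

for row in rows:
    print(row)
```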

#### 2.5.3. fNIRS Data Analyses

fNIRS data were first separated into artifact-free segments by removing potentially artifact-contaminated segments at the beginning and end of the experiment, as well as the additionally introduced pauses during the experiment in which no markers were presented. Further artifacts during the experiment were visually selected and corrected by a linear interpolation approach (e.g., [69]). A low-pass Butterworth filter of 0.4 Hz (filter order: 3) was applied. Stimulus duration was set to 3 s and subsequently used in the general linear model (GLM). Light attenuation was converted into concentration changes of oxygenated hemoglobin [oxy-Hb] and deoxygenated hemoglobin [deoxy-Hb] by means of the modified Beer–Lambert law [70]. For the statistical analyses, a GLM approach was used in which a box-car predictor of the stimulus duration was convolved with a canonical hemodynamic response function peaking at 5 s [71] and fitted to the measured data. This procedure resulted in Beta-values corresponding to μmolar changes, which were used for the statistical analyses. These comprised repeated-measures ANOVAs with the within-subject factors *condition* (correct vs. incorrect), *modality* (spoken vs. sung), *region* (each of the 7 channels), and *hemisphere* (left vs. right), performed separately for [oxy-Hb] and [deoxy-Hb]. The significance level was set at *p* < 0.05. Post-hoc *t*-tests were performed, and the False Discovery Rate [66] was applied to correct for multiple comparisons. Whenever Mauchly's test of sphericity became significant, the Greenhouse–Geisser correction [68] was applied. Increases in [oxy-Hb] as well as decreases in [deoxy-Hb] are both signs of increased brain activation; the two chromophores were analyzed separately.
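
Numerically, the GLM step amounts to convolving a boxcar with a canonical HRF and estimating one beta per channel and regressor by least squares. The sketch below illustrates this with a single synthetic [oxy-Hb] channel and a single condition regressor; the recording length, trial onsets, noise level, and exact HRF shape are assumptions, not the authors' implementation.

```python
# A numerical sketch of the fNIRS GLM: boxcar (3 s stimulus) convolved with a
# canonical HRF peaking near 5 s, then an ordinary least-squares fit per channel.
import numpy as np
from scipy.stats import gamma

FS = 7.81                          # fNIRS sampling rate (Hz)
t = np.arange(0, 30, 1 / FS)       # HRF support (s)

# Double-gamma canonical HRF: peak near 5 s, late undershoot
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6
hrf /= hrf.max()

n_samples = int(600 * FS)                      # 10 min of (synthetic) recording
onsets = np.arange(20, 580, 15)                # hypothetical trial onsets (s)

boxcar = np.zeros(n_samples)
for onset in onsets:
    start = int(onset * FS)
    boxcar[start:start + int(3 * FS)] = 1.0    # 3 s stimulus duration

predictor = np.convolve(boxcar, hrf)[:n_samples]

# Design matrix: condition regressor + constant; synthetic channel with a known effect
X = np.column_stack([predictor, np.ones(n_samples)])
rng = np.random.default_rng(0)
oxy = 0.4 * predictor + rng.normal(0, 0.2, n_samples)

beta, *_ = np.linalg.lstsq(X, oxy, rcond=None)
print(f"estimated Beta for the condition regressor: {beta[0]:.3f}")
```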

#### **3. Results**

#### *3.1. Behavioral Results*

The ANOVA on the percentage of correctly answered trials during the correctness judgement task yielded no significant main effects or interactions, indicating an equally high percentage of correct responses across conditions: correct spoken (95%), incorrect spoken (97%), correct sung (94%), and incorrect sung (97%) (please refer to Figure 2).

**Figure 2.** Behavioral data from the correctness judgement task. (**a**) Percentage of correctly answered trials per experimental condition including SEMs. (**b**) Reaction times (in ms) of correctly answered trials per experimental conditions including SEMs. \* indicates the significant main effect of *modality* reflecting longer reaction times for sung compared to spoken sentences.

The ANOVA for reaction times of correctly answered trials during the correctness judgement task yielded a significant main effect of *modality* [*F* (1,19) = 4.602, *p* = 0.045]. Post-hoc *t*-tests revealed longer reaction times for sung (474 ms) compared to spoken sentences (454 ms) (see Figure 2).

#### *3.2. EEG Results*

The ANOVA for lateral electrodes revealed significant main effects of *condition* and *modality* as well as significant interactions *condition* × *region* and *modality* × *region* (Table 1). Subsequent post-hoc *t*-tests resolving the interaction *condition* × *region* revealed a larger negative amplitude for incorrect compared to correct sentences at central [C3 and C4: *t* (17) = 3.326, *p* = 0.004] and posterior regions [P3 and P4: *t* (17) = 3.223, *p* = 0.005] (see Figures 3 and 4). Post-hoc *t*-tests resolving the interaction *modality* × *region* revealed a larger negativity for spoken compared to sung sentences at central [C3 and C4: *t* (17) = −3.064, *p* = 0.007] and posterior regions [P3 and P4: *t* (17) = −3.076, *p* = 0.007].

**Table 1.** Statistical results of the ANOVA *condition* × *modality* × *region* × *hemisphere* for event-related brain potential (ERP) data on lateral electrodes. Data from the time window 500–900 ms were considered for spoken sentences, while data from the time window 800–1200 ms entered the analyses for sung sentences. Significant effects (*p* < 0.050) are marked in bold.


Findings for midline electrodes revealed the following significant effects (Table 2): a main effect of *condition*, a main effect of *modality*, and an interaction *condition* × *electrodes*. The main effect of *modality* reflected a more negative shift for spoken compared to sung sentences. Subsequent post-hoc *t*-tests resolving the interaction *condition* × *electrodes* revealed a larger negativity for incorrect compared to correct sentences at Fz [*t* (16) = 3.199, *p* = 0.006], Cz [*t* (17) = 3.340, *p* = 0.004], and Pz [*t* (17) = 3.264, *p* = 0.005] (see Figures 3 and 4).

**Table 2.** Statistical results of the ANOVA *condition* × *modality* × *region* × *hemisphere* for ERP data on midline electrodes. Data from the time window 500–900 ms was considered for spoken sentences, while data from the time window 800–1200 ms entered analyses for sung sentences. Significant effects (*p* < 0.050) are marked in bold.


**Figure 3.** ERP results for spoken sentences. Grand averages from −200 ms to 1500 ms after verb onset. Negativity is plotted upwards. An 8 Hz low-pass filter was applied for presentation purposes only.

**Figure 4.** ERP results for sung sentences. Grand averages from −200 ms to 1500 ms after verb onset. Negativity is plotted upwards. An 8 Hz low-pass filter was applied for presentation purposes only.

#### *3.3. fNIRS Results*

#### 3.3.1. Results for [oxy-Hb]

The ANOVA revealed a significant main effect of *modality* as well as significant interactions *modality* × *region* and *modality* × *region* × *hemisphere* (Table 3). Subsequent post-hoc *t*-tests resolving the three-way interaction revealed a stronger activation for spoken compared to sung sentences at the following channels: left prefrontal inferior [PFiL: *t* (17) = 2.974, *p* = 0.009], left and right prefrontal superior [PFsL: *t* (17) = 2.615, *p* = 0.018; PFsR: *t* (17) = 2.814, *p* = 0.012], left temporal [TL: *t* (17) = 2.140, *p* = 0.047], left temporo-parietal [TPL: *t* (17) = 2.902, *p* = 0.010], as well as left and right parietal [PL: *t* (17) = 2.242, *p* = 0.039; PR: *t* (17) = 3.041, *p* = 0.007] (see Figure 5).

**Table 3.** Statistical results of the ANOVA *condition* × *modality* × *region* × *hemisphere* for [oxy-Hb] of functional near-infrared spectroscopy (fNIRS) data. Significant effects (*p* < 0.050) are marked in bold.


**Figure 5.** fNIRS results for [oxy-Hb]. Beta-values for spoken and sung sentences merged across correct and incorrect sentences. Red channels indicate significant differences. PFi = prefrontal inferior, PFs = prefrontal superior, T = temporal, TP = temporo-parietal, P = parietal, L = left, R = right. Please note that a more positive value indicates an increased activation.

#### 3.3.2. Results for [deoxy-Hb]

The ANOVA revealed a significant main effect of *condition* (Table 4) indicating a stronger activation for correct compared to incorrect sentences (see Figure 6).



**Figure 6.** fNIRS results for [deoxy-Hb]. Beta-values for correct and incorrect sentences merged across spoken and sung sentences and across all channels including SEMs. Please note that a more negative value indicates an increased activation.

#### **4. Discussion**

The present study investigated neural mechanisms of semantic processing in speech and song. Semantic processing was operationalized by acoustically presenting semantically correct and incorrect sentences that were either spoken or sung. Singing is a form of music that includes both melodic and linguistic aspects. However, is meaning extracted similarly or differently from singing compared to purely spoken information? This research question guided the present study. In order to assess the neural foundations of semantic processing, two neuroscientific methods were applied simultaneously, namely EEG and fNIRS.

#### *4.1. The N400 Differentiates between Correct and Incorrect Sentences*

EEG results for spoken and sung sentences showed a clear difference between semantically correct and incorrect sentences, indexed by a classical N400 component. The N400 is found in a variety of semantic contexts and reflects lexical access and semantic integration [27,28,37]; it shows larger amplitudes when semantic processing is difficult. Such a modulation was also found in our study, revealing larger N400 amplitudes for incorrect compared to correct sentences. This N400 effect was equally present in both modalities. However, an important difference was nevertheless observable: the N400 for spoken and sung sentences was generally delayed compared to previous studies, and the N400 for sung sentences was delayed even further (500–900 ms for spoken and 800–1200 ms for sung sentences). A first consideration regarding this general delay of the N400 was that the critical verb is a past participle containing the clear syntactic marker "ge" in German; only after this prefix is an identification of semantic correctness possible. We therefore also averaged ERPs aligned to the end of this prefix. However, the N400 for sung sentences was still delayed in time compared to spoken sentences (cf. Supplementary Figure S5), so we retained the standard analysis procedure of aligning ERPs to critical word onsets. Another explanation for the delayed N400 might concern the subjects' age range (mean age of 39 years). Studies investigating the N400 in differential semantic paradigms in younger (usually in the mid 20s) and older subjects show ambiguous results: some report delays of the N400 in older subjects [72–74], while others do not find any delayed processing [40,75,76]. Our delay is more likely driven by the longer duration of the spoken, and especially the sung, sentences and final words. A classical N400 to spoken sentences is usually reported between 300 and 500 ms [28]. In our study, the N400 to spoken sentences was found between 500 and 900 ms, and was thus delayed. However, a closer look at the grand averages of N400s to spoken sentences in previous studies shows that, even though smaller time windows were analyzed (400–700 ms in [38] and 250–700 ms in [39]), the differences between semantically correct and incorrect sentences lasted longer (until ~1000 ms). This was the case for young (around 25 years [38,39]) but also middle-aged (around 43 years [77]) and older participants (around 60 years [78,79]). It should be noted that sentence duration in these studies [38,39] was about 1700 ms, while in our study spoken sentences lasted much longer (around 4400 ms). This longer duration resulted from a slow presentation rate intended to approximate the length of spoken to sung sentences. Furthermore, this slow presentation rate was introduced because the study is currently also being performed in hearing-impaired patients supplied with cochlear implants and/or hearing aids, who have difficulties in language comprehension. In order to give these patients a chance to understand the sentences, they were spoken very slowly. In fact, normal-hearing participants noticed this slow presentation rate, indicating that they experienced the experiment as effortful; the patients, on the other hand, did not complain about it. Unfortunately, Besson and colleagues [40] do not report the exact duration of their sung final word. Gordon and colleagues [35], however, report the duration of the word stimuli used in their priming study. Their sung stimuli were 913 ms long, while our critical words lasted around 1700 ms, thus much longer. While in Gordon et al. the N400 occurred between 300 and 500 ms, the longer duration of the sung stimuli in our study could explain the delayed N400 effect. Further support for this assumption is provided by the reaction times during the correctness judgement task in the present experiment, which were also longer for sung compared to spoken stimuli. Overall, the EEG results seem to show qualitatively similar semantic processing in spoken and sung sentences, with a quantitative difference displayed in a delayed N400 component. These EEG findings might be important with respect to hearing-impaired patients, who clearly show more behavioral difficulties in extracting meaning from sung sentences than from spoken speech [8] but also benefit from musical training [21,22]. The findings are moreover interesting in the light of therapeutic interventions such as melodic intonation therapy (MIT), which postulates a beneficial effect on language processing in aphasic patients through singing [15,16]. It should, however, be considered that MIT predominantly reveals its favorable effects with respect to speech production, and not necessarily speech comprehension, which was the focus of the present study.

#### *4.2. Brain Areas Recruited for Semantic Processing in Spoken and Sung Sentences*

fNIRS results showed a twofold pattern: (1) an increased activation for spoken compared to sung sentences, irrespective of semantic correctness, in bilateral prefrontal, left temporal and temporo-parietal, and bilateral parietal areas, and (2) an increased activation for correct compared to incorrect sentences, irrespective of modality, widespread over the whole cortex.

The larger activation for spoken compared to sung sentences in the fNIRS is in line with the larger negativity for spoken versus sung sentences in the EEG. However, in the EEG this difference can hardly be interpreted because of the different time windows analyzed for spoken and sung sentences. The increased activation for spoken compared to sung sentences in the fNIRS shows a stronger left-hemispheric lateralization, which might be driven by the fact that our participants were non-musicians: they are more familiar with understanding spoken than sung language in everyday life. Furthermore, the correctness judgement task directed attention to the linguistic content and not to the melodic features of the sentences. Similar findings were also reported by Sammler and colleagues [52] in a repetition priming study with fMRI contrasting lyrics and tunes in unfamiliar songs. The authors also found larger activations in the left superior temporal sulcus for lyrics than for tunes in musically untrained subjects, suggesting a link between subjects' expertise with music and language and a predominant processing of linguistic meaning.

The second important fNIRS finding was the widespread increased activation for correct compared to incorrect sentences, irrespective of modality. This result is in line with previous studies that also contrasted semantically correct with incorrect sentences [59–61]. In particular, the direction of effects conforms to the fMRI findings of Humphries and colleagues [59], who contrasted semantically correct sentences with random sentences (i.e., sentences whose words were scrambled, resulting in a meaningless sentence). These authors also found increased activations for correct sentences in regions similar to those in our study. Temporo-parietal areas in particular were proposed to be related to combinatorial semantic processes at the sentence level, relevant for the formation of more complex meaning. Such an interpretation would also fit our activation pattern. The fact that a differentiation between correct and incorrect sentences was equally present for spoken and sung material might be attributed to the task in our experiment, which primarily directed attention to the semantic content of the sentences.

In general, however, topographic aspects of the fNIRS results should be considered with caution, as spatial resolution is limited compared to fMRI: neural activation can be assessed only down to about 3 cm below the scalp, so that only cortical areas can be reached. Moreover, due to the simultaneous assessment of EEG and fNIRS, only a limited number of light emitters and detectors could be positioned between the EEG electrodes. Consequently, specific tomographic analyses with multi-distance emitter–detector pairs, which could potentially lead to a better spatial resolution, were not possible.

#### **5. Conclusions**

Findings from our multi-methodological approach indicate that meaning is extracted from sentences in a similar way whether they are spoken or sung. Furthermore, a predominance of processing for spoken over sung sentences was observed. This effect seems to be at least partially driven by a stronger familiarity with spoken material as well as by the correctness judgement task, which directed subjects' attention to the linguistic content of the sentences. It would be interesting to conduct the same experiment without any experimental task, for example during passive listening to spoken and sung sentences. Importantly, these fine-grained mechanisms appear only in the neural responses and not in the behavioral data, which showed an equally high percentage of correct identifications of correct and incorrect sentences in both the spoken and the sung modality. Interestingly, both neuroscientific methods show concordant results with respect to the direction of effects. However, the EEG, with its high temporal resolution, additionally showed quantitative differences between spoken and sung sentences, as semantic processing in sung sentences was delayed in time. Based on these findings, our next step is to investigate semantics in spoken and sung sentences in hearing-impaired listeners supplied with either hearing aids or cochlear implants, as these patients experience language comprehension problems. This would provide insights into the neural processing mechanisms present at the beginning of and during the course of the rehabilitation process.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2076-3425/10/1/36/s1, Audio file S1: An example of a correct spoken sentence. Audio file S2: An example of a correct sung sentence. Audio file S3: An example of an incorrect spoken sentence. Audio file S4: An example of an incorrect sung sentence. Figure S5: Grand averages at the electrode Cz for semantically correct versus incorrect sentences for spoken and sung sentences aligned after the prefix "ge" of the critical past participle.

**Author Contributions:** Conceptualization, S.R. and J.S.; methodology, S.R. and J.S.; formal analysis, S.R., M.F.G. and J.S.; investigation, S.R. and M.R.; data curation, S.R.; writing—original draft preparation, S.R.; writing—review and editing, J.S., M.F.G., P.G.Z., M.R. and O.G.; visualization, S.R.; supervision, S.R.; project administration, S.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank Thomas Lungenschmid for speaking and singing the experimental material, as well as all participating subjects. Special thanks go to all collaborators who helped during neuroscientific data acquisition and analyses.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*
