**The Perception of Postalveolar English Obstruents by Spanish Speakers Learning English as a Foreign Language in Mexico**

#### **Mariela López Velarde and Miquel Simonet \***

Program in Second Language Acquisition and Teaching, University of Arizona, Tucson, AZ 85721, USA; marialopez@arizona.edu

**\*** Correspondence: simonet@arizona.edu

Received: 10 April 2020; Accepted: 15 June 2020; Published: 22 June 2020

**Abstract:** The present study deals with the perception (identification and discrimination) of an English phonemic contrast (/tʃ/–/ʃ/, as in *cheat* and *sheet*) by speakers of two Mexican varieties of Spanish who are learning English as a foreign language. Unlike English, Spanish does not contrast /tʃ/ and /ʃ/ phonemically. Most Spanish varieties have [tʃ], but not [ʃ]. In northwestern Mexico, [ʃ] and [tʃ] find themselves in a situation of "free" variation, perhaps conditioned, to some extent, by social factors, but not in complementary distribution. In this variety, [ʃ] and [tʃ] are variants of the same phoneme. The present study compares the perceptual behavior of English learners from northwestern Mexico with that of learners from central Mexico, whose native dialect includes only [tʃ]. The results of a word-categorization task show that both groups of learners find *cheat* and *sheet* difficult to identify in the context of each other, but that, relative to the other learner group, the group of learners in northwestern Mexico find this task to be particularly challenging. The results of a categorical discrimination task show that both learner groups find the members of the /tʃ/–/ʃ/ contrast difficult to discriminate. On average, accuracy is lower for the group of learners in northwestern Mexico than it is for the central Mexicans. The findings suggest that the phonetic variants found in one's native dialect modulate the perception of nonnative sounds and, consequently, that people who speak different regional varieties of the same language may face different obstacles when learning the sounds of their second language.

**Keywords:** second language acquisition; phonology; discrimination; cross-linguistic assimilation; obstruent; affricate; fricative; dialect; English; Spanish

#### **1. Introduction**

Most people "have an accent" when speaking a language other than their native one(s). This has been widely documented, and we currently have a sizeable scientific literature describing and explaining this phenomenon; see the following reviews (Best and Tyler 2007; Bohn 2017; Broselow and Kang 2013; Chang 2019; Colantoni et al. 2015; Davidson 2017, p. 201; Eckman 2012; Flege 1995; Piske et al. 2001; Simonet 2016). Interestingly, "having an accent" is not restricted to speech production, but is also manifested in perception. Current models of L2 speech acquisition account for those findings, typically from the perspective of perception and categorization, by postulating some sort of interaction between native and nonnative sounds in the representational network of bilinguals (Best and Tyler 2007; Escudero 2005; Flege 1995; van Leussen and Escudero 2015). L2 learners have an "accent" in their L2, these models state, because they already have internalized knowledge of a first language (L1). Native and nonnative sounds must find a way to co-exist, and this typically results in modifications to the nature of such sounds. In other words, L2 listeners assimilate the sounds of their L2 in terms of the categories that are robustly represented in their phonology by the time they are learning the L2 (i.e., their L1), and they acquire these new sounds as a function of how they map them.

In English, /tʃ/ and /ʃ/ constitute a phonemic contrast, as seen in minimal pairs such as *sheet–cheat* and *chair*–*share*. Spanish does not have this contrast. Most varieties of Spanish have [tʃ] in their inventory, but they do not have [ʃ] (Hualde 2005, pp. 152–72). In spelling, /tʃ/ is systematically represented by the digraph <ch>, as in *charco* 'puddle' [ˈtʃaɾko] and *chamarra* 'jacket' [tʃaˈmara], and most Spanish speakers would consistently pronounce this phoneme as a postalveolar affricate, [tʃ]. It follows that, if they are to acquire the English /ʃ/–/tʃ/ contrast successfully, native Spanish speakers who possess this particular phonological system must develop a new phoneme, /ʃ/, in opposition to one they can recycle from their native language, /tʃ/; they must create a new contrastive category, and they must assign to it a new phonetic substance. Learning new sounds and new oppositions typically presents a significant phonological challenge (Best and Tyler 2007; Colantoni et al. 2015; Escudero 2005).

Native speakers of some regional varieties of Spanish, on the other hand, may have an acquisitional obstacle of a different nature. In some dialects, both [tʃ] and [ʃ] are found, but not in phonemic opposition. One such variety is spoken in northwestern Mexico, where people are known to pronounce Spanish words that have <ch> variably, with either [ʃ] or [tʃ] (Alessi Molina and Díaz 1994; Amastae 1996; Brown 1989; Carreón Serna 2007; Martín Butragueño 2009; Méndez 2017; Moreno de Alba 1994; Serrano Morales 2000, 2009). In northwestern Mexico, therefore, [ˈtʃaɾko] and [ˈʃaɾko] are common variants of the same word, *charco* 'puddle.' It seems to follow that, in order to acquire the English /ʃ/–/tʃ/ contrast successfully, native Spanish speakers from northwestern Mexico do not need to learn any new sounds. They already have both [tʃ] and [ʃ] in their inventory of phonetic categories. However, and this might be crucial, what they must do is learn that these two sounds are not variants of the same phoneme, as they are in their native Spanish variety, but separate phonemes (or separate contrastive categories). Learning new mappings between surface and underlying phonological representations presents a substantial acquisitional obstacle of a different kind (Barrios et al. 2016b).

The present study aims to contribute to the literature on the effects of native linguistic experience on the acquisition of L2 sounds. Most importantly, it examines the relative difficulty of developing new categories (new sounds) *versus* that of developing new phonemic contrasts between sounds one can reuse from one's native phonetic inventory (new mappings). The study is concerned with categorization patterns in the perception of an English phonemic contrast, /ʃ/–/tʃ/, by two groups of L1 Spanish learners of English who speak different regional varieties of their native language.

#### *1.1. Cross-Linguistic Interactions in L2 Speech Acquisition*

The fact that native and nonnative sounds interact in the bilingual mind is illustrated by a well-known example, that of Spanish-speaking learners of English, who tend to have difficulties with the English /i/–/ɪ/ and /æ/–/ɑ/ contrasts (Barrios et al. 2016a; Casillas 2015; Escudero and Boersma 2004; Flege et al. 1994, 1997; Flege and Bohn 1989; Kondaurova and Francis 2008; Morrison 2008, 2009). Spanish has five phonemic vowels, /i, e, a, o, u/, and Spanish-speaking learners of English tend to assimilate both English /i/ and /ɪ/ to a single native Spanish vowel, /i/ (e.g., Flege and Bohn 1989). This two-to-one cross-linguistic assimilation pattern creates an acquisitional obstacle for this learner population because it makes the two members of the English /i/–/ɪ/ contrast difficult to discriminate from each other (e.g., Flege et al. 1994). The English /ɑ/–/æ/ contrast also presents a challenge for Spanish-speaking learners, as both English vowels are cross linguistically assimilated to a single Spanish vowel, /a/ (Barrios et al. 2016a; Casillas and Simonet 2016). These findings indicate that the obstacles L2 learners encounter when acquiring the phonology of their L2 are, at least in part, determined by the listeners' native language background and the cross-linguistic assimilations established between L1 and L2 sounds.

Several theoretical accounts have attempted to explain the obstacles learners face during their acquisition of the L2 phonology. Two such models are the Perceptual Assimilation Model applied to L2 learning (PAM-L2) (Best and Tyler 2007) and the Second Language Linguistic Perception model (L2LP) (Escudero 2005; van Leussen and Escudero 2015). Both of these frameworks postulate that the native and nonnative sounds of L2 learners interact (in some way). The manner in which this interaction is modelled, however, differs in the two accounts. The PAM-L2 proposes that difficulties arise as a function of the assimilability of L2 contrasts to L1 categories. Cross-linguistic assimilation is claimed to rely on the phonetic (in particular, articulatory) similarity between the L2 and the L1 sounds. The following are three of the possible assimilation patterns the PAM-L2 operationalizes: (i) When two L2 phones are cross linguistically assimilated or equated to two different L1 phonemes, a *two-category* assimilation (TC) is said to have occurred. In a TC scenario, discrimination of the two key L2 segments is predicted to be excellent, since the discrimination of the two corresponding L1 categories is assumed to be optimal. (ii) In contrast, if two L2 sounds are cross linguistically assimilated to the same L1 category and both are equally similar to the L1 sound, a *single-category* assimilation (SC) pattern occurs. The model predicts that, in cases of SC assimilation, the discrimination of the two key L2 phones is poor, as the two L2 sounds will have been categorized as variants of the same sound. This type of cross-linguistic assimilation pattern is particularly challenging for learners. (iii) A third type of assimilation pattern is called *category-goodness* assimilation (CG). In a CG pattern, two contrastive L2 sounds are assimilated to the same L1 category, but the cross-linguistic similarity is greater for one of the categories than for the other. 
In such a situation, discriminating between two key L2 categories is predicted to range from moderate to good, depending on the degree of category-goodness assimilation for each of the L2 segments.

The L2LP model differs from the PAM-L2 in some important ways. The L2LP claims that, at the initial stages of the L2 learning process, learners transfer or duplicate the entire L1 system to form an interlanguage system. Although the L2 system begins as a duplicate of the L1 grammar (a transferred grammar), this only occurs once, and it is subsequently handled as a separate phonological grammar. The novel system is equipped with the same learning mechanisms available in the L1, and it evolves as experience with the L2 increases. Technically, therefore, the L2LP rejects the claim that the L1 and L2 phonological systems interact because it rejects the claim that the two reside in a common representational network. Nevertheless, since the dedicated L2 system begins its course as an exact copy of the L1, the sound categories and mapping strategies learners developed for the L1 powerfully determine the manner in which L2 sounds are processed, perceived, produced, represented, and, ultimately, learned (or not learned). An aspect of the L2LP that resembles the PAM-L2 is that it operationalizes the existence of "cross-linguistic" *comparisons* in terms of L1 and L2 contrasts based on the phonetic (acoustic, in this case) similarity of L1 and L2 sounds. For instance, the cross-linguistic comparison described as single-category (SC) assimilation in the PAM-L2 is called *new scenario* in the L2LP, and both models predict that learning the L2 contrast in this particular scenario is challenging. What the PAM-L2 calls a two-category (TC) assimilation pattern, the L2LP calls a *similar scenario*; that is, when two L2 sounds each resemble a different L1 sound, learning these two categories is predicted to be easy. In sum, while some aspects of the PAM-L2 and the L2LP make them substantially different from each other, other principles of the two models are fundamentally identical.

Creating a new category during L2 acquisition is particularly difficult in cases in which two contrastive categories of the L2 are cross linguistically assimilated to a single category. A (perhaps) different kind of phonological obstacle presents itself when learners must develop new phonological mappings affecting sounds that already exist in their native inventory. For instance, Spanish has both [d] and [ð], but these are in complementary distribution—the two sounds are allophones (variants) of the same contrastive category, /d/. In English, on the other hand, these two sounds are contrastive, as illustrated by the minimal pair *den–then*. It follows that Spanish-speaking learners of English must develop new mappings between surface sounds that already exist in their inventory and new underlying representations (Barrios et al. 2016b). In other words, they must learn that two sounds that are linked to a single phonemic representation in their L1 are actually contrastive in their L2—they are linked to separate phonemes in the L2. It has been hypothesized that this type of acquisitional obstacle, called *allophonic split*, is particularly challenging for L2 learners (Eckman et al. 2001; Lado 1957). This prediction derives from the finding that, in native speech, discriminating between contrastive categories is much easier than doing so between categories that are phonetically distinct but not contrastive; in other words, sounds that are not contrastive are perceived to be more similar to each other than contrastive sounds are (Barrios et al. 2016b; Johnson and Babel 2010). The literature on this learning scenario is scant but, interestingly, a recent study has shown that an obstacle of the kind described here does not "cause consistent difficulty for advanced L2 learners in perception" (Barrios et al. 2016b, p. 14). 
At this juncture, therefore, it is not known which of the learning scenarios—learning a new sound *versus* learning a new mapping—presents a greater challenge.

The present study is singularly placed to compare the relative difficulty of two of the learning scenarios discussed above: (i) the need to acquire a new sound category (that is, a new phoneme together with a new surface allophone), and (ii) the need to acquire a new mapping between surface and underlying representations (that is, a new phoneme for an already existing surface allophone).

#### *1.2. Regional Dialects and L2 Speech Acquisition*

The present study compares the perceptual behavior of two groups of Spanish-speaking learners of English. The two groups of learners differ in their region of origin; that is, they were brought up as speakers of two different geographical varieties of Mexican Spanish. The premise of our study is that the particular, specific L1 experience of L2 learners can determine, to some extent, the obstacles they encounter (and progression paths they take) when learning their L2. It follows that the phonology of the native dialect can modulate the acquisition of the phonology of the L2. A handful of recent studies have examined the potential role of regional dialect on L2 development. Some have explored the acquisition of different L2 dialects, that is, how people who speak the same language but are learning different varieties of the L2 differ in their linguistic behavior (Baker and Smith 2010; Escudero and Boersma 2004). Others have analyzed the potential effects of the native dialect on the acquisition of the L2, that is, how people who speak different varieties of the same language progress towards learning the same L2 (Chládková and Podlipský 2011; Escudero et al. 2012; Escudero and Chládková 2010; Mayr and Escudero 2010).

It has been demonstrated that people who speak the same native language but are exposed to different regional varieties of their L2 can face different cross-linguistic assimilation scenarios, leading to potentially different learning paths. For instance, Escudero and Boersma (2004) examined how two groups of Spanish-speaking learners of English perceived the English /i/–/ɪ/ contrast. One of the groups was learning English in Scotland, whereas the other was hypothesized to have been exposed mostly to the variety spoken in the South of England. This study found that learners in the two exposure groups behaved differently in their vowel categorization patterns. The authors attributed this to the acoustic properties of the particular target vowels involved, which led to different cross-linguistic assimilation patterns.

One's native dialect also seems to modulate the cross-linguistic assimilation patterns one will establish. For instance, Escudero et al. (2012) investigated the perceptual assimilation patterns displayed by Dutch-speaking L2 learners of English. It is well known that speakers of Dutch tend to have difficulties with the acquisition of the English /æ/–/ε/ contrast. Interestingly, Escudero and colleagues noted that, since the acoustics of the vowel categories of two regional varieties of Dutch (North Holland and Flanders) differ, it would be reasonable to predict diverging patterns of cross-linguistic assimilation for learners of English living in these two regions. Indeed, it was found that differences in the native-dialect phonetic system led to differences in cross-linguistic assimilation of vowels for learners in these regions. There is ample evidence that native phonology determines, to some extent, the learning paths of people acquiring an L2; evidently, the term "native phonology" refers to the individual phonological system of a given learner (their native phonological competence, which is based on their personal linguistic experience) and not to that of the "standard" dialect of a given learner's L1.

The present study compares the perceptual behavior of speakers of two regional varieties of Mexican Spanish when confronting an acquisitional challenge, the English /ʃ/–/tʃ/ contrast. Dialectological descriptions of the native phonology of speakers of these two regional varieties suggest differences that could lead to diverging patterns of cross-linguistic assimilation between their native consonants and those of English. This could lead to differences in their acquisitional obstacles and phonological learning paths.

#### *1.3. Postalveolar Obstruents in Mexican Spanish*

Dialectologists identify four regional varieties of Spanish in Mexico: central, coastal, northern, and peninsular (Yucatán) (Lope Blanch 1990; Martín Butragueño 2011, 2014; Moreno de Alba 1994). One of the phonological variables used to map the regional varieties of Mexican Spanish concerns the pronunciation of the first consonant in words such as *charco*, 'puddle', and *chamarra*, 'jacket.' In most dialects of Spanish, both in the Americas and in the Iberian Peninsula, this consonant is pronounced as a postalveolar affricate. This is true of most varieties of Mexican Spanish as well. Therefore, in most regions in Mexico, including the central highlands (the socially prestigious regional variety), *charco* 'puddle' is pronounced as [ˈtʃaɾko]. The speech of people born and raised in the northwestern Mexican states (including Sonora, Chihuahua, and Baja California Norte, among others) is characterized by a pattern of phonetic variation in which the postalveolar obstruent in *charco* 'puddle' may be pronounced as either [tʃ], an affricate, or [ʃ], a fricative. Therefore, in northwestern Mexican speech, *charco* 'puddle' is pronounced sometimes as [ˈtʃaɾko] and sometimes as [ˈʃaɾko]. Although variationist studies vary significantly in their reported fricativization rates ("fricativization" refers to the practice of pronouncing <ch> as [ʃ], a diachronic innovation), what seems clear is that, in this Spanish dialect, both the fricative and affricate variants of this variable are found (Alessi Molina and Díaz 1994; Brown 1989; Carreón Serna 2007; Méndez 2017).

Note that the use of the two variants of <ch> is not determined by a phonological rule; in other words, the two variants are not in complementary distribution, but in a scenario of "free" variation. In reality, the variation is not completely free: the investigations that have explored the phonetic variation that affects the pronunciation of <ch> have identified a number of social factors that may, to some extent, modulate variant choice. Among the social factors involved are age, level of education, and gender. Studies vary in their reported effects of gender (Carreón Serna 2007; Méndez 2017), and some claim that gender is meaningful only when it interacts with age (Jaramillo and Bills 1982). Age might also be relevant only when correlated with level of education (Jaramillo and Bills 1982). At any rate, what is important for our present purposes is that the alternation between [tʃ] and [ʃ] is not determined by a phonological process. The two sounds are neither contrastive nor in complementary distribution, since the same lexical item may be pronounced with either variant.

The constant exposure to the variation that affects <ch> has been found to affect northwestern Mexican Spanish listeners' patterns of spoken word recognition. In a lexical access investigation, López Velarde and Simonet (2019) confirmed that listeners in northwestern Mexico are equally likely to accept Spanish word forms produced with either variant of <ch>. They also found that both variants are equally likely to prime listeners for the efficient recognition of spoken words (that is, [ˈʃaɾko] primes [ˈtʃaɾko] as much as [ˈtʃaɾko] does), which suggests that this group of listeners stores Spanish words with both variants within the same abstract (or prototypical) mental representation. This study confirms that the two variants of <ch> are indeed allophones of the same phoneme. These findings suggest that people who experience sociophonetic variability in their speech community may store more than one phonetic variant in their long-term mental representation of words.

#### *1.4. The Present Study*

The current study focuses on a phonemic contrast of English, that between /tʃ/ (as in *cheat*) and /ʃ/ (as in *sheet*), and investigates the perceptual identification and discrimination patterns pertaining to this contrast displayed by two groups of L2 learners of English whose native language is Spanish. Our learner sample was recruited from two dialectal regions, central and northwestern Mexico. More specifically, this study explores how speakers of northwestern Mexican Spanish, who are recurrently exposed to the sociophonetic variability that affects Spanish <ch> in their speech community, perceive the target English contrast (/ʃ/–/tʃ/), and how these perceptual habits differ (if at all) from those demonstrated by speakers of central Mexican Spanish, who lack experience with this specific variability pattern.

Even though we hypothesize that both of our target populations of L2 learners of English may find this contrast relatively difficult to master, we believe that the specific learning obstacles the two populations experience are different, and this, we hypothesize, results in different learning outcomes. On the one hand, we postulate that the English /ʃ/–/tʃ/ contrast will prove to be relatively difficult for central Mexican learners because [ʃ], as in *sheet*, does not correspond to any sound in their dialect of Spanish while [tʃ], as in *cheat*, does. It is possible that these learners assimilate the two phonemes of English to the same native category, /tʃ/. Nevertheless, since [tʃ] and [ʃ] are phonetically quite distinct, and since central Mexican Spanish has both affricates (/tʃ/) and fricatives (/s, f, h/), it could be the case that (adopting PAM's terminology) the English /ʃ/–/tʃ/ contrast presents a category-goodness (CG) assimilation pattern, one in which English /tʃ/ is assimilated to central Mexican Spanish /tʃ/ with a very high goodness of fit and English /ʃ/ also assimilates to this central Mexican Spanish phoneme but with a lower goodness of fit.

Northwestern Mexican Spanish speakers, unlike people from central Mexico, are exposed to two variants of <ch> in their native dialect, [tʃ] and [ʃ]. These are two allophones of the same phoneme. For this reason, we hypothesize that speakers of northwestern Mexican Spanish are likely to assimilate both English /tʃ/ and English /ʃ/ to the same native phoneme, and that the goodness of fit of these two cross-linguistic assimilation patterns is likely to be similarly high. This might create (adopting PAM's terminology again) a single-category (SC) assimilation pattern. Since CG assimilation patterns are expected to lead to better discriminability than SC ones (Best et al. 1988, 2001; Best and Tyler 2007), we hypothesize that the error rates in the identification and discrimination of our target English contrast will be larger for learners in northwestern Mexico than for central Mexican learners.

An alternative way to frame the learning scenario for the northwestern Mexican Spanish speakers is that of an allophonic split (Barrios et al. 2016b; Eckman et al. 2001). These learners must unlearn that both [tʃ] and [ʃ] are mapped onto the same phoneme (as they are in their native dialect of Spanish), and they must develop a new phoneme specific to the L2 to which only one of these two phonetic categories is mapped. In other words, speakers of northwestern Mexican Spanish must develop a new phonological, underlying category and remap their sound categories so that they are each assigned to a different contrastive unit. Are the northwestern Mexican Spanish speakers more or less likely than the central Mexicans to succeed in their discrimination of the members of the English /ʃ/–/tʃ/ contrast? This is the fundamental research question that motivates the present study.

To compare the acquisition of our target English phonemic contrast with a contrast about which much is known, we selected a second phonemic contrast of English to serve as a control condition, that between /i/ and /ɪ/. It is well known that native speakers of Spanish find the *seat–sit* contrast very difficult to discriminate and, therefore, learn (Casillas 2015; Escudero and Boersma 2004; Kondaurova and Francis 2008; Morrison 2008, 2009); they also find the two members of the contrast very difficult to identify against each other. Therefore, the *seat–sit* contrast, tested in our experiments alongside our target contrast, *sheet–cheat*, serves as a control condition, one that should be similarly challenging for both of our target learner populations.

#### **2. Method**

#### *2.1. Participants*

The data were collected in two locations in Mexico, Hermosillo and Santiago de Querétaro, which are the two largest cities in the states of Sonora and Querétaro, respectively. The participants in Hermosillo were lifelong residents of the state of Sonora. The majority of the participants had lived in Hermosillo from birth, and those born in other municipalities had moved to the city as children. Many of the participants tested in Santiago de Querétaro were not born in the city of Santiago, but they reported having moved there as children or as teenagers. The Querétaro residents in our sample who were not natives to Querétaro were born in other central states of the country, such as Guanajuato, Jalisco, Morelos, and Puebla. Particularly with respect to their treatment of /tʃ/, as well as to that of many other sounds, the central highlands of Mexico form a single dialectal area. In sum, data were collected in two dialectal areas, the northwest (exemplified by Hermosillo, Sonora) and the central highlands (exemplified by Santiago de Querétaro, Querétaro).

A total of 88 people (44 from Sonora, 44 from Querétaro) participated in this study. Participants' ages ranged between 18 and 43 years old. All but five participants were college students at the time of testing, graduate or undergraduate. Three participants had graduated with a college degree, and two had not completed college and were working in the industry. The high number of college students or college graduates in our sample is due to our having recruited our participants in college settings, the *Universidad de Sonora* (Hermosillo) and the *Universidad Autónoma de Querétaro* (Santiago de Querétaro). The educational and professional profile of the participants is not fully representative of the general population native to these locations—highly educated people are overrepresented. The profile, however, might be representative of the narrower population, in these locations, who have learned English as a foreign language. At any rate, the social profile of the two dialectal groups does not differ—both groups consist of highly educated people who are learning English as a foreign language in a school setting. All participants study (or studied) English in college.

Participants responded to the Bilingual Language Profile (BLP) questionnaire (Gertken et al. 2014). The questionnaire collects information regarding the listeners' linguistic background and L2 learning experience with a focus on attitudes, history, self-assessed proficiency, and daily usage of the two languages. The questionnaire produces a language dominance score along a spectrum centered around 0, which represents balanced bilingualism. The participants in our study are expected to be Spanish dominant; in our implementation of the survey, dominance in Spanish is captured with scores ranging between 0 and −218.

We also administered an English vocabulary-size test to assess the participants' English knowledge, the LexTALE. The LexTALE (www.lextale.com) is a standardized test designed to measure vocabulary size in language learners (Lemhöfer and Broersma 2012). To the extent that vocabulary size reflects overall knowledge of the language, the LexTALE provides an indicator of a person's knowledge of English. It seems reasonable to speculate that acquisition of phonemic, contrastive categories is based upon vocabulary knowledge (Simonet 2016). The test consists of 60 trials, comprising 40 English words and 20 nonwords, and these are presented to participants for them to make lexical decisions on. The resulting score is expressed in percent-correct units, and it is corrected for the unequal number of words and nonwords. In this study, the test was administered using PsychoPy 2 (Peirce et al. 2019). After responding to the BLP and the LexTALE, participants proceeded to complete the identification task followed by the discrimination task.
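The scoring rule described above can be sketched in a few lines. This is a minimal illustration, assuming the averaged percent-correct measure described by Lemhöfer and Broersma (2012), in which accuracy on words and accuracy on nonwords are computed separately and then averaged so that the 40 words do not outweigh the 20 nonwords. The function name `lextale_score` and its parameters are hypothetical and are not taken from the study's materials.

```python
def lextale_score(words_correct: int, nonwords_correct: int,
                  n_words: int = 40, n_nonwords: int = 20) -> float:
    """Return the mean of percent-correct on words and on nonwords.

    Averaging the two percentages (rather than pooling all 60 trials)
    corrects for the unequal number of words and nonwords.
    """
    if not 0 <= words_correct <= n_words:
        raise ValueError("words_correct out of range")
    if not 0 <= nonwords_correct <= n_nonwords:
        raise ValueError("nonwords_correct out of range")
    pct_words = 100.0 * words_correct / n_words
    pct_nonwords = 100.0 * nonwords_correct / n_nonwords
    return (pct_words + pct_nonwords) / 2.0


# A listener who accepts every item gets all 40 words right and all
# 20 nonwords wrong, which yields 50 rather than 66.7.
print(lextale_score(40, 0))
```

Under this rule, a "yes to everything" strategy scores at chance (50), which is the point of the correction.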

The two dialectal groups do not seem to differ with respect to their dominance scores, *t*(85.1) = −0.608, *p* > 0.05 [0.54], 95% c.i. [−14.9, 7.9], Cohen's *d* = −0.13, but they do with regard to their English vocabulary size scores, *t*(82.7) = −3.63, *p* < 0.001 [0.0004], 95% c.i. [−11.86, −3.47], Cohen's *d* = −0.77. The average BLP score for the Querétaro group is −97.3 (*SD* = 28.3, *range* [−142.2, −18.1]), and the average for the Sonora group is −100.8 (*SD* = 25.5, *range* [−140.1, −41.5]). This confirms that all participants are dominant in Spanish. The average LexTALE score for the Querétaro group is 69.6 (*SD* = 10.8, *range* [53.7, 97.5]), and the average for the Sonora group is 61.9 (*SD* = 8.9, *range* [42.5, 81.25]). Thus, the Queretaroans have, on average, higher vocabulary size scores than the Sonorans, but there is much overlap between the two groups. On average, neither of the two groups is near ceiling (i.e., 90% or higher). In terms of their vocabulary size scores, both groups include a relatively wide range of learners.
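For readers who wish to check the group comparisons, the following sketch recomputes the test statistic, the degrees of freedom, and the effect size from the summary statistics reported above. The fractional degrees of freedom suggest an unequal-variance (Welch) *t*-test; this is an inference, not something the text states. The function name `welch_from_summary` is hypothetical, group sizes are taken as 44 each, and the difference is computed as Sonora minus Querétaro to match the sign of the reported values. Because the inputs are rounded, the output can only approximate the reported figures.

```python
import math


def welch_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Welch's t, Welch-Satterthwaite df, and Cohen's d (pooled SD)."""
    v1 = sd1 ** 2 / n1  # squared standard error of mean 1
    v2 = sd2 ** 2 / n2  # squared standard error of mean 2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / pooled_sd
    return t, df, d


# BLP dominance scores: Sonora vs. Querétaro
t_blp, df_blp, d_blp = welch_from_summary(-100.8, 25.5, 44, -97.3, 28.3, 44)
# LexTALE scores: Sonora vs. Querétaro
t_lex, df_lex, d_lex = welch_from_summary(61.9, 8.9, 44, 69.6, 10.8, 44)

print(round(t_blp, 2), round(df_blp, 1), round(d_blp, 2))
print(round(t_lex, 2), round(df_lex, 1), round(d_lex, 2))
```

The sketch omits *p*-values and confidence intervals, which require the *t* distribution and are easiest to obtain from a statistics package.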

#### *2.2. Materials*

#### 2.2.1. Identification Experiment

The key data in this study were collected by means of two perception tasks, an identification task and a categorical discrimination task. In the identification task, participants were presented with 96 auditory stimuli consisting of one of four English words: *cheat*, *sheet*, *seat*, and *sit*. A total of 24 different iterations of each of the four words were played in random order to each participant. Listeners were asked to identify each stimulus by indicating, from a closed list of options, the lexical item each auditory stimulus corresponded to. Four options to select from were shown on a computer screen in alphabetical order, from left to right: *cheat*, *sheet*, *seat*, *sit*. The participants are hypothesized to misidentify *cheat* as *sheet*, *sheet* as *cheat*, *seat* as *sit*, and *sit* as *seat*. We have no hypothesis as to whether they would also misidentify *seat* or *sit* as *cheat* or *sheet*.
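The design of the identification task can be illustrated with a short sketch that builds one participant's randomized trial list (4 words × 24 iterations = 96 trials) and tallies stimulus-by-response confusions. The names `build_identification_trials` and `tally_confusions` and the integer `token` index are hypothetical; the sketch does not model how the 24 iterations of each word were distributed over talkers or recordings, and it is not the experiment script used in the study.

```python
import random
from collections import Counter

WORDS = ("cheat", "sheet", "seat", "sit")
REPS_PER_WORD = 24  # 4 words x 24 iterations = 96 trials


def build_identification_trials(seed=None):
    """Return a randomly ordered list of 96 identification trials."""
    rng = random.Random(seed)
    trials = [{"word": w, "token": i}
              for w in WORDS for i in range(REPS_PER_WORD)]
    rng.shuffle(trials)
    return trials


def tally_confusions(trials, responses):
    """Count (stimulus word, response word) pairs."""
    return Counter((t["word"], r) for t, r in zip(trials, responses))


trials = build_identification_trials(seed=1)
# Hypothetical listener who answers "cheat" to every cheat/sheet stimulus
# and is always correct on seat/sit.
responses = ["cheat" if t["word"] in ("cheat", "sheet") else t["word"]
             for t in trials]
confusions = tally_confusions(trials, responses)
print(confusions[("sheet", "cheat")])
```

A confusion table of this kind separates the hypothesized errors (*sheet* heard as *cheat*, *seat* heard as *sit*) from cross-contrast errors, for which no hypothesis is stated.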

#### 2.2.2. Discrimination Experiment

An ABX categorical discrimination task was designed to test two key contrasts: *sheet–cheat* (target) and *seat–sit* (control). In an ABX task, listeners hear a triad of auditory tokens (A, B, and X) presented in a sequence within the same trial and, upon hearing all three, they indicate whether the third token (X) matches either the first (A) or the second (B) item in the sequence. There were no "catch" trials in our version of the task, which means that there always was a correct answer. Importantly, all of the stimuli in each triad were acoustically different, including the two matching tokens, as each one of them had been recorded by a different talker. Under such conditions, comparisons cannot be based on acoustic memory, but must be based on phonological or lexical memory (participants are comparing abstract categories, not auditory tokens), which requires participants to access their mental representations to make their decisions. This is the reason why we refer to this task as a *categorical* ABX.

Each participant provided 48 observations to the data set: 24 trials focused on the *seat–sit* contrast (seat–sit–seat [6]; seat–sit–sit [6]; sit–seat–sit [6]; sit–seat–seat [6]), and 24 focused on the *sheet–cheat* contrast (sheet–cheat–sheet [6]; sheet–cheat–cheat [6]; cheat–sheet–cheat [6]; cheat–sheet–sheet [6]). The target word was always in the third position of the sequence. In 24 of the trials, the matching word was adjacent to the target word; that is, it was in the second position. In the other 24 trials, the matching word was not adjacent to the target; it was in the first position. Everything else being equal, matching adjacent categories is expected to be easier than matching non-adjacent ones (Best et al. 2001).
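The trial structure just described can be sketched as a small Python generator. The function name, the talker labels, the random assignment of three different talkers to each triad, and the shuffling of trial order are all assumptions made for illustration; the text specifies only the triad types, the six repetitions of each, and the requirement that the three tokens in a triad come from different talkers.

```python
import random

CONTRASTS = {"seat-sit": ("seat", "sit"), "sheet-cheat": ("sheet", "cheat")}
TALKERS = ["T1", "T2", "T3", "T4"]   # hypothetical labels for the four talkers
REPETITIONS = 6

def build_abx_trials(seed=0):
    """Return a shuffled list of 48 ABX trials (2 contrasts x 4 triads x 6)."""
    rng = random.Random(seed)
    trials = []
    for contrast, (w1, w2) in CONTRASTS.items():
        for a, b in ((w1, w2), (w2, w1)):      # both A-B orders
            for x in (a, b):                   # X matches A (primacy) or B (recency)
                for _ in range(REPETITIONS):
                    trials.append({
                        "contrast": contrast,
                        "A": a, "B": b, "X": x,
                        # three different talkers, one per token (assumed random)
                        "talkers": tuple(rng.sample(TALKERS, 3)),
                        "correct_key": 1 if x == a else 2,
                        "adjacency": "primacy" if x == a else "recency",
                    })
    rng.shuffle(trials)                        # randomized order is an assumption
    return trials

trials = build_abx_trials()
```

Under these assumptions, each participant's list contains 24 trials per contrast and 24 trials per adjacency condition, and every trial has exactly one correct answer.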

#### 2.2.3. Auditory Stimuli

Four native English speakers, all of them women, served as talkers. Their productions were recorded in a sound-treated booth using professional recording equipment: a Shure SM10A head-mounted dynamic microphone and a Sound Devices USBPre2 audio interface connected to a laptop computer. Speech productions were digitized at 44.1 kHz, with 16-bit quantization. Sound files were normalized for intensity.

The talkers were asked to produce the target words by embedding them in a constant carrier phrase, "\_\_ is the word." The materials were presented in random order to avoid any possible systematic effects of list intonation or exhaustion on the same lexical items. Talkers produced all target words four times (4 words × 4 iterations × 4 talkers = 64 items). One token of each target word per talker was selected (avoiding disfluencies and any extraneous noise) for a total of four target stimuli per lexical item.

#### *2.3. Procedure*

Participants completed the tasks individually. In Querétaro, participants were tested in a sound-attenuated booth, while in Sonora they were tested in a quiet library room. Stimuli were presented auditorily over a set of Audio Technica ATH-M50x closed-circumaural headphones connected to a laptop computer running PsychoPy2 (Peirce et al. 2019). Participants responded by pressing a key on a Logitech G512 Lightsync RGB mechanical keyboard. Prior to the completion of the experimental tasks, the first author, a native Spanish speaker from Sonora, provided them with a general description of the tasks and their instructions. This conversation took place in Spanish. Before participating in any of the perceptual tasks, people completed the Bilingual Language Profile questionnaire (Gertken et al. 2014), then the LexTALE (Lemhöfer and Broersma 2012).

For the identification task, participants were instructed to listen to each stimulus, one per trial, and indicate their answer as quickly and accurately as possible by pressing one of four keys on the keyboard (1, 2, 3, or 4). Trials began with a red cross in the center of the screen for 250 ms, which was followed by a screen showing the four response options: *cheat*, *sheet*, *seat*, *sit*. Words were shown in capital letters. Numbers—that is, key codes—were presented in yellow and shown below their corresponding lexical item. Response options were shown for 2500 ms. Auditory stimuli were played 500 ms from the onset of the screen displaying the response options. Participants were allotted 2 s to enter a response. If participants did not provide a response within the allotted time, a new trial began, and the response was left empty.

In the ABX task, participants were asked to listen to all three sounds presented in the trial and only then respond by pressing either number 1 or number 2 on the keyboard to indicate whether they believed the third sound matched the first (1) or the second (2) one in the triad. The words 'first' and 'second' were shown on the computer screen in upper case and accompanied by their matching key codes, 1 or 2. Each trial began with a red cross shown in the center of the screen for 1000 ms. The first stimulus of the triad was played at the 1 s mark and was then followed by the second sound of the triad at the 2 s mark. The stimulus onset asynchrony of these two stimuli was thus set at 1 s. The stimulus onset asynchrony between the second and third stimuli was set at 1.5 s. Simultaneously with the playing of the third auditory stimulus in the triad, a screen showing the two response options was shown. Participants had 2 s to enter their answer. If no answer was entered within this time, a new trial began, and the response was left empty.
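The timing of an ABX trial follows from the durations and stimulus onset asynchronies given above. The short Python sketch below computes the event onsets; the function name is hypothetical, and it assumes that the 2 s response window starts at the onset of the third stimulus, which the text implies but does not state outright.

```python
def abx_timeline(fixation=1.0, soa_ab=1.0, soa_bx=1.5, response_window=2.0):
    """Return event onsets, in seconds from trial start, for one ABX trial."""
    onset_a = fixation                 # A plays when the fixation cross ends
    onset_b = onset_a + soa_ab         # B follows A after the first SOA
    onset_x = onset_b + soa_bx         # X follows B after the second SOA
    return {
        "fixation": 0.0,
        "A": onset_a,
        "B": onset_b,
        "X_and_response_screen": onset_x,
        # assumed: the response window is timed from the onset of X
        "response_deadline": onset_x + response_window,
    }

timeline = abx_timeline()
```

On this reading, the third stimulus starts 3.5 s into the trial and an unanswered trial ends at 5.5 s.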

#### *2.4. Analysis*

All statistical analyses were run in *R*, with packages *tidyverse* (Wickham 2017), *afex* (Singman et al. 2018), and *effsize* (Torchiano 2018). For reproducibility, readers may obtain the *R* scripts and synthetic data frames from either of the authors.

#### 2.4.1. Identification

The analysis of the identification-task data was conducted in two steps. In the first step, we classified what participants heard (the lexical items the talkers had produced) as a function of what they responded (the lexical items the listeners had responded they had heard). This results in a contingency table. The original data set was comprised of 8448 rows, all of them listeners' responses to auditory stimuli. Nevertheless, a number of these observations were excluded from the analysis because the listener did not respond within their allotted time. There was a total of 424 not-responded-to trials, about 5% of the observations. The analysis was then conducted without such trials, with a data set containing 8024 observations.
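The first step, cross-classifying stimuli and responses after dropping unanswered trials, can be illustrated with a minimal Python sketch. The function name and the toy data are invented for illustration, and the actual analysis was run in *R*.

```python
from collections import Counter, defaultdict

WORDS = ["cheat", "sheet", "seat", "sit"]

def contingency_proportions(trials):
    """trials: iterable of (stimulus, response); response is None if timed out."""
    counts = defaultdict(Counter)
    for stimulus, response in trials:
        if response is None:           # drop not-responded-to trials
            continue
        counts[stimulus][response] += 1
    table = {}
    for stimulus in WORDS:
        total = sum(counts[stimulus].values())
        # each row is normalized by the number of answered trials for that stimulus
        table[stimulus] = {
            w: (counts[stimulus][w] / total if total else float("nan"))
            for w in WORDS
        }
    return table

# Toy data, invented for illustration only.
toy = [("cheat", "cheat"), ("cheat", "sheet"), ("cheat", None),
       ("sheet", "sheet"), ("seat", "sit"), ("sit", "sit")]
table = contingency_proportions(toy)
```

Because each row is normalized by the number of answered trials for that stimulus, rows sum to 1, as in Table 1.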

In a second step, we simply calculated the proportion of times a given participant was accurate versus the times they were inaccurate. In order to prepare the data for the statistical analysis, we ran an arcsine transformation of the proportion-correct scores by participant and condition.
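The text does not say which variant of the arcsine transformation was applied. A minimal Python sketch, assuming the common arcsine square-root form and using a hypothetical function name, is:

```python
import math

def arcsine_transform(p):
    """Arcsine square-root transform of a proportion in [0, 1] (assumed variant)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(p))

chance = arcsine_transform(0.5)   # equals pi / 4
```

Under this assumption, transformed scores range from 0 to π/2, and chance performance in a two-alternative task (0.5) corresponds to π/4, or about 0.785.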

#### 2.4.2. Discrimination

The original data set comprised a total of 4224 rows, 44 (listeners) × 2 (locations) × 48 (responses), of which 288 (about 7%) were empty, that is, trials that did not contain any information because the participant had failed to respond within the allotted time. The participants' responses were analyzed after removing the empty observations from the data set, which results in a data frame of 3936 observations. The analysis counted the proportion of correct responses per listener, per condition (contrast type and adjacency). Then, the accuracy scores (or proportion of correct responses) were arcsine-transformed for the statistical analysis.
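The aggregation described here (proportion correct per listener and condition, computed over answered trials only) can be sketched in Python as follows. The function name, the field names, and the toy rows are hypothetical.

```python
from collections import defaultdict

def accuracy_by_cell(rows):
    """rows: dicts with keys listener, contrast, adjacency, correct (True/False/None)."""
    hits = defaultdict(int)
    answered = defaultdict(int)
    for row in rows:
        if row["correct"] is None:     # empty trial: no response in the allotted time
            continue
        cell = (row["listener"], row["contrast"], row["adjacency"])
        answered[cell] += 1
        hits[cell] += int(row["correct"])
    return {cell: hits[cell] / answered[cell] for cell in answered}

# Toy rows, invented for illustration only.
rows = [
    {"listener": "P01", "contrast": "sheet-cheat", "adjacency": "primacy", "correct": True},
    {"listener": "P01", "contrast": "sheet-cheat", "adjacency": "primacy", "correct": False},
    {"listener": "P01", "contrast": "sheet-cheat", "adjacency": "primacy", "correct": None},
    {"listener": "P01", "contrast": "seat-sit", "adjacency": "recency", "correct": True},
]
scores = accuracy_by_cell(rows)
```

Each resulting score would then be arcsine-transformed before entering the ANOVA.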

#### **3. Results**

#### *3.1. Identification*

The analysis of the identification data focuses on the proportion of times the auditory stimuli were identified as each of the four lexical items. Table 1 shows the proportion of responses, calculated only for the trials that were responded to, as a function of stimulus played (rows) and response given (columns), further broken down by region of origin of the participants.

**Table 1.** Proportion of times each auditorily presented lexical item was identified as being an instance of one of four possible words (*cheat*, *sheet*, *seat*, *sit*), further broken down by region of origin, in Mexico, of the English learners (Sonora, Querétaro).


Note: Rows represent the auditory stimuli played and columns represent the labels displayed on the screen and, thus, the responses available to the participants. Responses below 5% are not shown. Within each participant group, rows add up to 1.

As may be observed in Table 1, the proportion values suggest that neither *seat* nor *sit* is likely to be categorized as either *sheet* or *cheat*. We may conclude that [s] is categorized as being distinct from both [tʃ] and [ʃ], and that this is true for both groups of learners: when a word begins with [s] and ends with [t], only *seat* and *sit* are viable options. Equivalently, when a word begins with a postalveolar obstruent, either [tʃ] or [ʃ], and ends with [t], neither *seat* nor *sit* is a viable option. We infer that it is reasonable for us to treat the *sheet–cheat* and the *seat–sit* contrasts as separate binary oppositions in our analysis. The scores in Table 1 also suggest that the identification of both *seat* and *sit* leads to a large number of categorization errors, and that both groups of learners are likely to confuse the two words with each other.

We now turn our attention to *cheat* and *sheet*. It appears that the identification patterns of these two words differ in the two groups of learners. In the case of Queretaroans, *cheat* and *sheet* do not appear to be very difficult to identify even in a task that plays these words in the context of each other—accuracy rates are relatively high, with 83% correct responses for *cheat* and 73% correct responses for *sheet*, but they can nevertheless be confused with each other at rates that are not negligible. On the other hand, the Sonorans made many categorization errors for *cheat* and *sheet*. Sonorans' accuracy rates are relatively low, with 51% correct responses for *cheat* and 57% for *sheet*.

Our statistical analysis focuses on accuracy rates. We select only the cells that may be interpreted as displaying 'correct' responses: [tʃ]*eat* identified as *cheat*, [ʃ]*eet* identified as *sheet*, *s*[iː]*t* identified as *seat*, and *s*[ɪ]*t* identified as *sit*. This analysis ignores all other cells. The arcsine-transformed accuracy scores are analyzed with a mixed-design, two-way 2 × (4) ANOVA with Location (Querétaro, Sonora) as a between-subjects factor, and Item (*cheat*, *sheet*, *seat*, *sit*) as a within-subjects factor. The ANOVA yields main effects of Location, *F*(1, 86) = 18.2, *p* < 0.0001, η<sup>2</sup> = 0.10, and of Item, *F*(2.13, 183.4) = 62.5, *p* < 0.0001, η<sup>2</sup> = 0.26. It also yields a statistical interaction between the two factors, *F*(2.13, 183.4) = 6.4, *p* < 0.05 [.002], η<sup>2</sup> = 0.03. The results reveal that, as a group, the Sonorans are more likely to make categorization mistakes than the Queretaroans with this closed lexical set, but this further depends on the lexical item.

To investigate the interaction, we divide the data set into four subsets as a function of Item. The results reveal that, for both *seat* and *sit*, accuracy rates are comparable across the two learner groups: *sit*, *t*(83.3) = −1.16, *p* > 0.05 [0.25], 95% c.i. [−0.287, 0.075], Cohen's *d* = −0.247; *seat*, *t*(66.5) = −1.97, *p* > 0.05 [0.053], 95% c.i. [−0.252, 0.001], Cohen's *d* = −0.42. In other words, both groups of participants are similarly likely to be accurate when identifying these two words. On the other hand, accuracy rates are different across learner groups for both *cheat*, *t*(83.9) = −8.08, *p* < 0.0001, 95% c.i. [−0.501, −0.303], Cohen's *d* = −1.722, and *sheet*, *t*(73.02) = −3.53, *p* < 0.001 [0.0007], 95% c.i. [−0.316, −0.088], Cohen's *d* = −0.753. In both cases, the Queretaroans are more likely to be accurate than the Sonorans.

To summarize, identifying the two members of the *seat–sit* contrast appears to be similarly challenging for both groups of Spanish-speaking learners of English, whereas asking participants to identify the two members of the *sheet–cheat* contrast is more likely to lead to errors for the Sonorans than for the Queretaroans. The results obtained in the identification task suggest the following hypotheses: (i) The Sonorans are just as likely as the Queretaroans to find the *seat–sit* contrast difficult to discriminate, and so we use this contrast as our control condition in the discrimination study; (ii) the Sonorans are likely to find the *sheet–cheat* contrast more difficult to discriminate than the Queretaroans.

#### *3.2. Discrimination*

Table 2 shows the untransformed proportion of correct responses by participant group and experimental condition. There are two experimental conditions in our design: (i) the lexical contrast tested in a given trial (*seat–sit*, *sheet–cheat*), and (ii) the adjacency condition between the target word and the matching one. When the matching stimulus is located in the first position in the triad, the matching and the target stimuli are not adjacent (*primacy* condition), whereas when the matching stimulus is found in the second position in the triad the matching and the target stimuli are adjacent (*recency* condition). Everything else being equal, recency trials are predicted to be easier to answer accurately than primacy ones, particularly for challenging contrasts (Best et al. 2001).

**Table 2.** Proportion of correct responses by learner group (Querétaro, Sonora), as a function of lexical contrast (*seat*–*sit*, /iː/–/ɪ/; *sheet*–*cheat*, /ʃ/–/tʃ/) and adjacency condition (primacy, recency).


Note: Primacy stands for trials in which target and matching stimuli are not adjacent; recency stands for trials in which target and matching stimuli are adjacent.

Firstly, we submit the arcsine-transformed proportion-correct scores to a mixed-design, three-way 2 × (2) × (2) ANOVA with Location (Querétaro, Sonora) as a between-subjects factor, and Contrast (*sheet–cheat*, *seat–sit*) and Adjacency (primacy, recency) as within-subjects factors. The ANOVA yields main effects of Location, *F*(1, 86) = 5.7, *p* < 0.05 [.02], η<sup>2</sup> = 0.02, Contrast, *F*(1, 86) = 117.8, *p* < 0.0001, η<sup>2</sup> = 0.19, and Adjacency, *F*(1, 86) = 14.8, *p* < 0.001 [.0002], η<sup>2</sup> = 0.05. Of these effects, the largest one is Contrast, then Adjacency, then Location. Importantly, there are two significant two-way interactions: Contrast by Adjacency, *F*(1, 86) = 4.43, *p* < 0.05 [.04], η<sup>2</sup> = 0.009, and Contrast by Location, *F*(1, 86) = 6, *p* < 0.05 [.02], η<sup>2</sup> = 0.01. There is no significant Location by Adjacency interaction and no significant three-way interaction.

The interactions are explored in three steps. Firstly, to explore the Contrast by Adjacency interaction, we average over Location and analyze the effects of adjacency for the two contrasts separately. This analysis pools the data for the two dialectal regions. According to a series of paired-sample *t*-tests, Adjacency does not trigger a significant effect for the *seat–sit* contrast, *t*(87) = 2.05, *p* > 0.016 [.04], 95% c.i. [0.002, 0.138], Cohen's *d* = 0.199, but it does for the *sheet–cheat* one, *t*(87) = 3.9, *p* < 0.001 [0.00019], 95% c.i. [0.084, 0.259], Cohen's *d* = 0.396. People are found to be less accurate in primacy trials than in recency ones, and more obviously so in trials that test the *sheet–cheat* contrast than in those that test the other one. Secondly, to explore the Contrast by Location interaction, we average over Adjacency and analyze the effects of Contrast for the two learner groups separately. For both groups, the *sheet*–*cheat* contrast leads to significantly more response errors than the *seat–sit* contrast (Querétaro: *t*(87) = −6.27, *p* < 0.0001, 95% c.i. [−0.27, −0.14], Cohen's *d* = −0.595; Sonora: *t*(87) = −8.92, *p* < 0.0001, 95% c.i. [−0.397, −0.252], Cohen's *d* = −0.971), but the effect is larger for the Sonorans than for the Queretaroans. Finally, returning once more to the Contrast by Location interaction, the potential effects of Location for the two contrasts are analyzed separately. According to two Welch *t*-tests, there is no significant effect of Location for the *seat–sit* contrast, *t*(172.5) = −0.616, *p* > 0.0125 [.54], 95% c.i. [−0.108, 0.057], Cohen's *d* = −0.093, whereas the effect is statistically significant for the *sheet–cheat* contrast, *t*(173.9) = −3.37, *p* < 0.0125 [.0009], Cohen's *d* = −0.508.

To summarize, both groups of learners find both lexical contrasts relatively difficult to discriminate. Interestingly, the *sheet–cheat* contrast appears to be more challenging than the *seat–sit* contrast. Additionally, both groups are similarly accurate in their discrimination of the *seat–sit* contrast, which we are taking to be our control condition. The most important finding is that, for the *sheet–cheat* contrast, the Sonorans are more likely than the Queretaroans to make discrimination errors. From these analyses, one could infer that the discrimination of the /ʃ/–/tʃ/ contrast is more challenging for the Sonorans than it is for the Queretaroans. Recall, however, that, according to the LexTALE, the Queretaroans in our sample have, on average, a larger English vocabulary than the Sonorans. Thus, the finding could be due, to some extent, to an asymmetry in English vocabulary size rather than to their native dialect phonologies.

To address the possibility that vocabulary size, rather than native phonology, explains these findings, we select the 22 learners with the lowest LexTALE scores in the Querétaro group and the 22 learners with the highest LexTALE scores in the Sonora sample to form a subset comprising 44 learners, half of the sample. In this subset, the vocabulary size of the Sonorans (*M* = 68.6, *SD* = 5.9) is larger than that of the Queretaroans (*M* = 61, *SD* = 4.2), according to a Welch *t*-test conducted on LexTALE scores, *t*(37.8) = 4.89, *p* < 0.0001, 95% c.i. [4.47, 10.76], Cohen's *d* = 1.47. A mixed-design ANOVA with arcsine-transformed accuracy scores obtained in the discrimination task yields only main effects of Contrast, *F*(1, 42) = 76.5, *p* < 0.0001, η<sup>2</sup> = 0.21, and Adjacency, *F*(1, 42) = 4.47, *p* < 0.05 [.04], η<sup>2</sup> = 0.04, and a Contrast by Adjacency interaction, *F*(1, 42) = 5.91, *p* < 0.05 [.02], η<sup>2</sup> = 0.02, but no other main effects, and no further interactions. Importantly, there is no main effect of Location, *F*(1, 42) = 0.023, *p* > 0.05 [.64], η<sup>2</sup> = 0.002, and Location does not interact with Contrast, *F*(1, 42) = 1.92, *p* > 0.05 [.17], η<sup>2</sup> = 0.007. Overall, participants are more accurate in their discrimination of the *seat–sit* contrast (*M* = 0.81, *SD* = 0.2) than of the *sheet–cheat* contrast (*M* = 0.59, *SD* = 0.24). They are also more accurate in recency trials (*M* = 0.75, *SD* = 0.23) than in primacy trials (*M* = 0.65, *SD* = 0.25). Additionally, the effects of adjacency are more robust in the *sheet–cheat* contrast than in the control contrast. In sum, an analysis of a subset of data in which the Sonorans have larger English vocabularies than the Queretaroans fails to reveal any differences between the groups in regard to their discrimination of the *sheet–cheat* contrast (or the *seat–sit* one, for that matter). Apparently, for the Sonorans to match the Queretaroans in their discrimination of the *sheet–cheat* contrast, they must have larger vocabularies than the Queretaroans do.
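The subset selection described above can be sketched in a few lines of Python. The function name, the tuple layout, and the toy scores are hypothetical, and the text does not say how tied LexTALE scores were handled; the sketch simply relies on the stable ordering of `sorted`.

```python
def vocabulary_reversed_subset(scores, k=22):
    """scores: list of (participant_id, location, lextale) tuples."""
    # lowest-scoring k learners from Querétaro
    queretaro = sorted(
        (s for s in scores if s[1] == "Querétaro"), key=lambda s: s[2]
    )[:k]
    # highest-scoring k learners from Sonora
    sonora = sorted(
        (s for s in scores if s[1] == "Sonora"), key=lambda s: s[2], reverse=True
    )[:k]
    return queretaro + sonora

# Toy scores, invented for illustration only.
toy_scores = [
    ("Q1", "Querétaro", 55.0), ("Q2", "Querétaro", 90.0), ("Q3", "Querétaro", 62.5),
    ("S1", "Sonora", 45.0), ("S2", "Sonora", 80.0), ("S3", "Sonora", 70.0),
]
subset = vocabulary_reversed_subset(toy_scores, k=2)
```

With `k=22` and 44 learners per group, this procedure keeps half of each group and, by construction, tends to reverse the direction of the vocabulary-size difference between the groups.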

In a final analysis, we explore the potential effects of vocabulary size on the discrimination of the target contrast by means of two linear regression models. These analyses address the following question: Do learners with larger vocabularies show increased sensitivity to the target consonant contrast? To conduct these comparisons, we select one of the experimental conditions (the most "difficult" one, the arcsine-transformed *sheet–cheat* contrast in primacy trials) so that we obtain a single accuracy score per participant, and then correlate this variable with the learners' LexTALE scores. We conduct two regression analyses, one per participant group. Whereas the regression model analyzing the Querétaro data yields a significant finding, *F*(1, 42) = 11.2, *p* = 0.002, adjusted *R*<sup>2</sup> = 0.191, the one analyzing the Sonorans does not, *F*(1, 42) = 0.03, *p* > 0.05, adjusted *R*<sup>2</sup> = −0.023. In other words, the Queretaroans with larger vocabularies seem to be more sensitive to the *sheet–cheat* contrast than the ones with smaller vocabularies, whereas no such relation exists for the Sonorans. According to a series of one-sample *t*-tests, in this particular experimental condition, the Sonorans (as a group) are not found to have accuracy rates higher than chance (0.5 proportion-correct scores), *t*(43) = −0.46, *p* > 0.025 [0.64], 95% c.i. [0.69, 0.84], while the Queretaroans are, *t*(43) = 3.03, *p* < 0.025 [.002], 95% c.i. [0.84, 1.007].
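For a regression with a single predictor, the *F* statistic and the (adjusted) *R*<sup>2</sup> are linked by a standard identity, so the reported values can be cross-checked against one another. The Python sketch below uses a hypothetical function name and relies only on the *F* values and degrees of freedom reported above.

```python
def r_squared_from_f(f_value, df_resid, n_predictors=1):
    """Recover R^2 and adjusted R^2 from an overall regression F test."""
    # F = (R^2 / p) / ((1 - R^2) / df_resid), solved for R^2
    r2 = (n_predictors * f_value) / (n_predictors * f_value + df_resid)
    n = df_resid + n_predictors + 1            # number of observations
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
    return r2, adj_r2

_, adj_queretaro = r_squared_from_f(11.2, 42)
_, adj_sonora = r_squared_from_f(0.03, 42)
print(round(adj_queretaro, 3), round(adj_sonora, 3))
```

The recovered adjusted values, about 0.19 for Querétaro and about −0.02 for Sonora, agree with the reported ones to within rounding. An *F* value as small as 0.03 corresponds to an *R*<sup>2</sup> of essentially zero, in line with the absence of a relation for the Sonorans.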

#### **4. Discussion**

#### *4.1. Summary of Findings*

The present study reported on two categorization experiments aimed at investigating the acquisition of the English /ʃ/–/tʃ/ (*sheet–cheat*) contrast by two groups of foreign-language learners, both with Spanish as their L1. The first group was recruited in central Mexico (Querétaro), where the local dialect uses [tʃ] but not [ʃ]. The second group was recruited in northwestern Mexico (Sonora), where the local dialect uses both [tʃ] and [ʃ] in free variation. Alongside the target contrast, we investigated the English /i/–/ɪ/ contrast as a control condition.

The findings of a perceptual identification experiment indicated that the learners found the members of the /ʃ/–/tʃ/ contrast difficult to identify when played in the context of each other. On average, 29% of the relevant stimuli were identified incorrectly. Interestingly, the Sonorans (40% mean error rate) made significantly more errors than the Queretaroans (18% mean error rate). The findings of a categorical discrimination experiment demonstrated that the learners found the members of the target contrast hard to discriminate from each other. Overall, the mean error rate for the /ʃ/–/tʃ/ contrast was 38%. Importantly, the groups differed from each other in their discrimination of the /ʃ/–/tʃ/ contrast. In particular, the average error rate of the Sonorans in the discrimination of this contrast was as high as 44%, while that of the Queretaroans was 32%.

The learner groups differed from each other in their identification and discrimination accuracy of the /ʃ/–/tʃ/ contrast, but not in that of the /i/–/ɪ/ contrast. Taken together, the findings of these two experiments suggest that the Sonorans find the acquisition of the English /ʃ/–/tʃ/ contrast more difficult than the Queretaroans.

#### *4.2. Interpretation and Implications*

The experimental evidence suggests that the English /i/–/ɪ/ contrast is particularly difficult for native Spanish speakers who are learning English as an L2 (Escudero and Boersma 2004; Kondaurova and Francis 2008; Morrison 2008, 2009). Studies demonstrate that Spanish-speaking learners of English tend to associate both of the English vowels with their native /i/, perhaps initially merging the three phonetic categories into one. Spanish speakers can certainly overcome this initial obstacle and can acquire this contrast. Nevertheless, even when they do, their representation of the /i/–/ɪ/ distinction is likely to be based on duration, not spectrum; that is, whereas native, monolingual speakers of English distinguish these two vowel categories based on their spectral properties (i.e., first- and second-formant frequencies), native Spanish speakers who have been able to acquire this contrast are likely to base their distinction on a seemingly parasitic correlate, duration (Kondaurova and Francis 2008; Morrison 2008). The results pertaining to both of our participant groups demonstrate that the identification and discrimination of items implementing the /i/–/ɪ/ contrast are challenging for this population.

Let us now focus on an unexpected secondary finding pertaining to the control contrast. It seems that learners were more likely to be accurate in the discrimination task than in the identification one. For the /i/–/ɪ/ contrast condition, the participants were above chance in their discrimination patterns while they were at chance (as a group) when asked to identify *sit* and *seat*. In fact, some participants were more likely to be wrong than right in the identification experiment. One possible interpretation is that our learners have been able to develop separate phonetic categories for /i/ and /ɪ/ while not having learned which word has which sound category. The results of the discrimination task suggest that participants can distinguish the two categories, though neither perfectly nor consistently; those of the identification task, on the other hand, suggest that participants do not associate the phonetic categories with lexical items. Insofar as a phoneme is a phonetic category associated with a particular lexical set (i.e., a category included in a phonolexical representation) (Simonet 2016), one could say that our participants may have developed (fuzzy) separate phonetic, but not *phonemic*, categories for /i/ and /ɪ/. The input they have received may have allowed them to form two phonetic categories, perhaps by means of distributional acoustic learning (Escudero and Williams 2014; Wanrooij et al. 2013); it may not have been sufficient, however, for them to form accurate, detailed phonolexical representations that include those categories. Indeed, experimental evidence suggests that forming phonetic categories in an L2 and associating them with phonolexical representations involve different stages of learning (Amengual 2016; Díaz et al. 2012; Sebastián-Gallés and Díaz 2012).

One factor that may have affected participants' identification patterns in the present study has to do with sound-to-spelling correspondences; recall that participants were asked to identify auditory stimuli in terms of visual ones. In Spanish, the phoneme-to-grapheme correspondence of /i/ consistently matches \<i\>. Note that the English lexical items chosen for the present study, *seat* and *sit*, had \<i\> corresponding with /ɪ/, not /i/. It is possible that the learners were variably affected by the Spanish spelling conventions during their identification of *sit*. This, together with their not having developed separate *phonemic* categories for these sounds, could explain the pattern of results. At any rate, the crucial finding pertaining to the /i/–/ɪ/ contrast is that the performance of the two groups of participants in the current study was comparable. We are justified in considering this contrast our control condition.

We now discuss the findings pertaining to the target contrast, /ʃ/–/tʃ/. We hypothesized that the perceptual behavior of our two groups of learners would differ in terms of their categorization of the English /ʃ/–/tʃ/ contrast. This hypothesis was based on the characteristics of the phonology of their native dialect of Spanish. An interesting finding was that both groups of participants seemed to find the English contrast somewhat challenging to acquire. This was not unexpected, as no variety of Spanish has a comparable contrast. The Queretaroans speak a variety of Spanish that uses [tʃ], but not [ʃ], whereas the Sonorans speak a variety that uses both [tʃ] and [ʃ] in free (i.e., not phonologically conditioned) variation. Whereas, as mentioned, all participants found the target contrast somewhat challenging to discriminate, the Queretaroans were found to be, on average, more accurate in their perceptual performance involving this particular contrast than the Sonorans. We propose that the acquisitional obstacles encountered by these two groups of learners, and thus their learning paths, differ on substantial grounds.

Let us first discuss the case of the Queretaroans. As mentioned in the Introduction, most current models of L2 phonological acquisition postulate that learners develop connections between the sound categories of their L2 and those of their L1. One such model is the PAM-L2 (Best and Tyler 2007) and another is the L2LP (Escudero 2005; van Leussen and Escudero 2015). The Queretaroans in our sample may have assimilated both English categories, /tʃ/ and /ʃ/, to the closest category in their L1, /tʃ/. Since Spanish has /tʃ/ but also has voiceless fricatives (/s f h/, just not /ʃ/), one could hypothesize that the acquisitional obstacle encountered by these learners is not insurmountable, provided that these learners extrapolate this aspect of their native system. Speakers of this dialect possess the capacity to represent affricates as being distinct from fricatives. We postulate that the Queretaroans in our sample have assimilated both English /tʃ/ and /ʃ/ to the same native phoneme but, crucially, have done so at different degrees of goodness of fit. Thus, whereas English /tʃ/ may be strongly assimilated to Spanish /tʃ/ (i.e., the match is close to perfect), the interlingual assimilation of English /ʃ/ to Spanish /tʃ/ may be rather poor (i.e., the match is imperfect). In the terminology used in the PAM, this would be an instance of CG assimilation. In cases of CG assimilation, the perceptual discrimination of the two members of the L2 contrast is expected to range from moderate to good. The Queretaroans' discrimination of the English /ʃ/–/tʃ/ contrast is indeed moderate. In sum, we believe that the obstacle the Queretaroans face when learning the English /ʃ/–/tʃ/ contrast is having to develop a new phonetic category, departing from an initial stage in which the two sounds are categorized as instances of the same sound. Postulating that the Queretaroans assimilate both English /tʃ/ and /ʃ/ to the same native phoneme, but at different degrees of fit, explains both the presence of the obstacle and its size (i.e., moderate).

Let us now discuss the case of the Sonorans. In Sonora, as well as in other northwestern Mexican states, the local Spanish dialect uses both [ʃ] and [tʃ] in free variation. Crucially, these phonetic variants are not in complementary distribution, but occupy the same segmental slots in the same lexical items. The obstacle for these particular learners lies in the fact that they must unlearn the phonological mapping patterns of their native dialect, that is, the mapping between phonetic categories (or allophones) and phonolexical representations (or phonemes), before they can learn those of English. Assuming that in the first stage of L2 acquisition learners transfer their L1 competence into their L2 system (Escudero 2005; van Leussen and Escudero 2015), Sonorans begin their development at a stage in which [ʃ] and [tʃ] are equivalent at some level of representation. Therefore, they must first unlearn that [ʃ] and [tʃ] are variants of the same phoneme before they can learn that these two sounds are contrastive in English and, thus, associated with different lexical sets. The acquisitional obstacle may be formalized in two ways.

The first way in which the acquisitional obstacle encountered by the Sonorans may be formalized makes use of the same theoretical constructs we have used to explain the behavior of the Queretaroans, interlingual phonetic category assimilations (Best and Tyler 2007; Escudero 2005). It appears that the lexical distribution of phonetic categories determines, to some extent, people's perceptual behavior. In particular, phonetic categories in complementary distribution are less likely to be perceived as being distinct from each other than categories that are contrastive in the lexicon (Johnson and Babel 2010). If two sound categories are in free variation, it is even more likely that they will be perceived as being very similar to each other. Sonorans, therefore, are likely to perceive [ʃ] and [tʃ] as being perceptually more similar (to each other) than speakers of other Spanish dialects do. If we extrapolate this to interlingual interactions in L2 acquisition, we postulate that Sonorans have assimilated both English /tʃ/ and /ʃ/ to the same native phoneme, and that both English categories are optimal matches to this native phoneme. The situation for the Sonorans could be one of SC assimilation (Best et al. 2001; Best and Tyler 2007), a *new scenario* (Escudero 2005). In cases of SC assimilation, discriminability is predicted to be very poor. Indeed, the discrimination of the English /ʃ/–/tʃ/ contrast by the Sonorans is very poor.

The second way in which the acquisitional obstacle encountered by the Sonorans may be formalized makes use of the concept of *allophonic split* (Barrios et al. 2016b; Eckman et al. 2001). This formalization does not depend on interlingual perceptual assimilations between phonetic categories but makes use of the concept of mapping between surface and underlying representations. Surface allophones that find themselves in either free variation or complementary distribution are mapped onto a single underlying segment, a phoneme. A Sonoran learner of L2 English would need to establish new mappings between familiar phones (Barrios et al. 2016b). The phonological competence of Sonoran Spanish speakers includes both [ʃ] and [tʃ], but, since the two sounds are mapped onto the same phoneme, learning the English /ʃ/–/tʃ/ contrast would require creating a new underlying representation and mapping the two surface allophones to separate phonemes. Eckman et al. (2001) hypothesize that this scenario is particularly challenging for L2 learners, but Barrios et al. (2016b) did not find evidence to support this claim. Insofar as we can conceptualize the learning scenario of Sonoran learners as a case of allophonic split, our data are fully in line with the hypothesis of Eckman et al. (2001). Interestingly, our data suggest that cases of allophonic split (Sonora) are more challenging to overcome than cases of new category formation (Querétaro).
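The difference between the two learning scenarios can be made concrete with a small sketch. The dictionaries below are a deliberately simplified illustration, not a model proposed in this study: each maps a surface phone to the phoneme it realizes, and labeling the single Sonoran phoneme as /tʃ/ is an assumption made only for the sake of the example. The names `QUERETARO_L1`, `SONORA_L1`, `ENGLISH_TARGET`, and `learning_task` are hypothetical.

```python
# Hypothetical, simplified phone-to-phoneme mappings (illustrative labels only).
QUERETARO_L1 = {"tʃ": "/tʃ/"}                # [ʃ] is absent from the native inventory
SONORA_L1 = {"tʃ": "/tʃ/", "ʃ": "/tʃ/"}      # free variants of one phoneme (label assumed)
ENGLISH_TARGET = {"tʃ": "/tʃ/", "ʃ": "/ʃ/"}  # two contrastive phonemes


def learning_task(l1, target):
    """For each target phone, say what the learner would have to do."""
    tasks = {}
    for phone, phoneme in target.items():
        if phone not in l1:
            # The phone itself is unfamiliar: a new phonetic category is needed.
            tasks[phone] = "new category"
        elif l1[phone] != phoneme:
            # The phone is familiar but mapped to the wrong phoneme.
            tasks[phone] = "remap (allophonic split)"
        else:
            tasks[phone] = "keep"
    return tasks


print(learning_task(QUERETARO_L1, ENGLISH_TARGET))
print(learning_task(SONORA_L1, ENGLISH_TARGET))
```

Under these assumptions, the Queretaroan learner faces a missing phone, whereas the Sonoran learner already has both phones and must instead change where one of them points.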

Since Sonoran learners possess a phonological system that includes both [ʃ] and [tʃ], one could have hypothesized that acquiring the English /ʃ/–/tʃ/ contrast would be particularly easy in their case. Obviously, this is not what our data suggest. Our data suggest that, in addition to existing phonetic categories, lexico-distributional patterns (i.e., the patterns of lexical distribution of phones, which determine contrastivity, among other things) determine, to some extent, the significance of acquisitional obstacles.

Recall that our two participant samples differed, not only in their region of origin, but also in the sizes of their English vocabularies. We found that, on average, our Queretaroans had slightly larger vocabularies in English than our Sonorans. One could attribute the difference between the two participant groups in terms of their perceptual behavior with respect to the /ʃ/–/tʃ/ contrast to their overall proficiency in English. The Queretaroans may have been more accurate than the Sonorans in their discrimination of the /ʃ/–/tʃ/ contrast because they may be generally more proficient in English than the Sonorans. While this is certainly possible, we would like to highlight the following. Firstly, along with our main data set, we compared the behavior of Sonorans and Queretaroans in a controlled subset. For this subset, we selected 10 Sonorans and 10 Queretaroans so that the Sonorans had larger vocabularies than the Queretaroans. In this subset, participants' perceptual behaviors with respect to the /ʃ/–/tʃ/ contrast were found to be indistinguishable. Sonorans with larger vocabularies discriminate the /ʃ/–/tʃ/ contrast just as poorly as Queretaroans with smaller English vocabularies; thus, vocabulary size is not the only determinant of /ʃ/–/tʃ/ discrimination. Secondly, the control contrast condition, /i/–/ɪ/, led to comparable behavior across the two groups: whereas vocabulary sizes may differ between groups, their overall state of phonological development may not. Thirdly, Queretaroans, but not Sonorans, seem to become "better" at discriminating the /ʃ/–/tʃ/ contrast as their vocabulary increases. These observations suggest that the difference between Sonorans and Queretaroans reported in the present study is indicative of a larger issue, such as a difference in their native phonologies, not simply a difference in overall English competency. Nevertheless, only future research can resolve this conundrum.

#### **5. Conclusions**

The present study investigated the significance of two types of acquisitional obstacles in L2 phonology. The study reported on the identification and discrimination of the English /ʃ/–/tʃ/ (*sheet–cheat*) contrast by two groups of learners whose L1 is Spanish. The first group was recruited in central Mexico, where the local dialect uses [tʃ] but not [ʃ]. To learn the English /ʃ/–/tʃ/ contrast, speakers of this Spanish variety must learn a new phonetic category, [ʃ]. The second group was recruited in northwestern Mexico, where the local dialect uses both [tʃ] and [ʃ] in free variation. Since both obstruents are variants of the same phoneme in this variety, to learn the English /ʃ/–/tʃ/ contrast, speakers of this dialect must develop new mappings between familiar phonetic categories and underlying (contrastive) representations. The study found that the acquisitional obstacle encountered by speakers of the northwestern Mexican variety of Spanish is of a larger magnitude than the one encountered by speakers from central Mexico. Native dialect phonology is a powerful determinant of L2 phonology learning paths.

**Author Contributions:** The research study reported here is part of M.L.V.'s doctoral dissertation, supervised by M.S. M.L.V. and M.S. participated in the design and planning of the experiments, and both shared in the analysis of the data. M.L.V. collected all of the data, and M.S. drafted the manuscript. All authors have read and agree to the published version of the manuscript.

**Funding:** This research was funded by the College of Humanities of the University of Arizona (Graduate Student Research Grant, 2019) and by a Research and Project (ReaP, 2019) grant provided by the Graduate and Professional Student Council of the University of Arizona.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

Alessi Molina, María T., and Ana Luisa Torres Díaz. 1994. Aspectos fonéticos del habla sonorense. In *Estudios de Lingüística y Sociolingüística*. Edited by Gerardo López Cruz and José Luis Moctezuma Zamarrón. Hermosillo: Universidad de Sonora, pp. 285–92.


Moreno de Alba, José G. 1994. *La Pronunciación del Español en México*. México City: El Colegio de México.


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The Contributions of Crosslinguistic Influence and Individual Differences to Nonnative Speech Perception**

#### **Charles B. Chang <sup>1,</sup>\* and Sungmi Kwon <sup>2</sup>**


Received: 1 September 2020; Accepted: 29 October 2020; Published: 31 October 2020

**Abstract:** Perception of a nonnative language (L2) is known to be affected by crosslinguistic transfer from a listener's native language (L1), but the relative importance of L1 transfer vis-à-vis individual learner differences remains unclear. This study explored the hypothesis that the nature of L1 transfer changes as learners gain experience with the L2, such that individual differences are more influential at earlier stages of learning and L1 transfer is more influential at later stages of learning. To test this hypothesis, novice L2 learners of Korean from diverse L1 backgrounds were examined in a pretest-posttest design with respect to their perceptual acquisition of novel L2 consonant contrasts (the three-way Korean laryngeal contrast among lenis, fortis, and aspirated plosives) and vowel contrasts (/o/-/ʌ/, /u/-/ɯ/). Whereas pretest performance showed little evidence of L1 effects, posttest performance showed significant L1 transfer. Furthermore, pretest performance did not predict posttest performance. These findings support the view that L1 knowledge influences L2 perception dynamically, according to the amount of L2 knowledge available to learners at that time. That is, both individual differences and L1 knowledge play a role in L2 perception, but to different degrees over the course of L2 development.

**Keywords:** second language acquisition; perceptual learning; individual differences; phonetic sensitivity; crosslinguistic influence; Korean; laryngeal contrast; voice onset time; vowel inventory

#### **1. Introduction**

Whereas infants are described as "citizens of the world" (Werker and Tees 1984) when it comes to perceiving the sounds of a nonnative language (L2), adults are known to be comparatively poor at L2 speech perception. This disparity follows from a process of perceptual specialization for the native language (L1), which begins well before the end of the first year of life and results in an apparent decline in perception of L2 sound contrasts (Kuhl et al. 1992; Polka and Werker 1994; Werker and Pegg 1992). Although the characterization of this process varies throughout the literature, in recent years researchers have converged upon the view that specialization for the L1 does not involve loss of perceptual ability per se, but rather shifts in attention, which may be attributed to lexical and/or phonemic development (Werker 1994, 1995; Werker and Curtin 2005) or to the impetus for perceptual routines to be automatic and robust to adverse conditions (Strange 2011).

Under the view that mature L1 speakers maintain access to the perceptual and cognitive abilities undergirding L1 phonological development (e.g., Flege 1995), it becomes relevant to ask how, in L2 learning, these underlying abilities interact with the consequences of L1 specialization, especially in light of individual variation in these abilities. That is, how are the processes and outcomes of L2 acquisition affected by language-specific properties of a learner's L1 vs. the personally-specific abilities of the learner? The effect of the L1 has been discussed extensively in terms of transfer or crosslinguistic influence (CLI) of aspects of L1 knowledge and/or experience (Major 2008; Odlin 1989) and plays a central role in theories of L2 speech learning (see Section 1.1). However, the effect of the learner (i.e., individual differences among speakers who may share the same L1; Dörnyei 2006), long a focus of research in the interdisciplinary field of second language acquisition (SLA), has only recently begun to be considered systematically in the study of L2 perceptual learning (see Section 1.2).

The study reported in this article is an attempt to explore the interaction of L1 transfer and individual differences in L2 perceptual learning of Korean phonological contrasts. In what follows, we review the literature on transfer and individual differences in L2 learning, describe the target Korean contrasts (the three-way laryngeal contrast among lenis, fortis, and aspirated voiceless stops and two vowel contrasts between rounded and unrounded back vowels), and motivate a Transfer Ramp-up Hypothesis predicting a delayed onset of L1 transfer effects in L2 perceptual development.

#### *1.1. Crosslinguistic Influence in Speech Learning*

That L2 speech learning may be influenced by previously acquired linguistic knowledge is well-established in the field of L2 speech and in SLA more broadly. Major theories of nonnative speech perception, such as the Perceptual Assimilation Model (PAM; Best 1994, 1995) and its extension to L2 learners, PAM-L2 (Best and Tyler 2007), the Native Language Magnet theory (NLM; Kuhl and Iverson 1995; Kuhl et al. 1992, 2008), the Speech Learning Model (SLM; Flege 1995), and the Automatic Selective Perception framework (ASP; Strange 2011), all incorporate a role for a listener's L1, although they vary in terms of how L1 influence is conceptualized. For example, NLM emphasizes the role of developing L1 category prototypes, whereas PAM(-L2) focuses on the ways in which L2 contrasts may be perceptually assimilated to L1 contrasts, which have consequences for both discrimination and identification of L2 sounds (e.g., Tyler et al. 2014).

Despite their differences, these theories of nonnative speech perception are similar in providing an account for ostensibly negative effects of L1 development in perception of an L2. Although L1 experience is not always a handicap in L2 perception (Bohn and Best 2012; Chang 2016, 2018), it often appears that listeners with an already-developed L1 are at a disadvantage compared to listeners who are less far along in L1 development. For example, English-speaking adults were found to discriminate contrasts of Thompson Salish poorly compared to English-learning infants (Werker and Tees 1984). Such disparities have been interpreted in terms of a decline in sensitivity to nonnative contrasts during L1 development (e.g., Kuhl et al. 1992; Werker and Pegg 1992). It remains unclear, however, whether the disadvantage for adult listeners in L2 perception has to do with reduced phonetic sensitivity or "negative transfer" of L1 structures such as phonological categories (or both).

Indeed, a fundamental component of many theories of L2 speech is the misleading influence of L1 categories. For example, the SLM posits that the most difficult L2 sounds to acquire accurately in the long term are not "new" sounds, but rather "similar" sounds, which resemble the sounds of the L1. This is because a "similar" sound, unlike a "new" sound, gets perceptually linked with an L1 sound, which inhibits the formation of a distinct representation for the L2 sound. According to the SLM, L1 influence (in the form of crosslinguistic perceptual linkage) takes place at the level of the position-specific allophone, but there is reason to believe that the phoneme, rather than the allophone, may be the structure most liable to exert L1 influence. In particular, some findings suggest that access to L1 allophones, which are typically below the level of speaker consciousness, may depend on factors such as the level of acquisition and the type of orthography (Bergier 2014; Eckman et al. 2001; Vokic 2010).

Besides formalizing the influence of L1 categories, theories such as the SLM also posit that "mechanisms and processes used in learning the L1 sound system ... remain intact over the life span, and can be applied to L2 learning" (Flege 1995, p. 239), suggesting that the basic perceptual abilities supporting L1 acquisition are maintained in adulthood. This view is consistent with findings showing little effect of L1 specialization in certain perceptual tasks with nonnative speech. For instance, although [s] and [ʃ] are phonemic in English and allophonic in Dutch, L1 Dutch adults were just as good as L1 English adults at discriminating the two sounds in English nonce items; however, L1 Dutch adults also rated the sounds as more similar than did L1 English adults (Johnson and Babel 2010). This type of result is at odds with the view that adult listeners necessarily have reduced sensitivity to L2 contrasts, and further suggests that any effect of the L1 sound system is closely related to task type (in particular, the levels of processing tapped by the task; cf. Díaz et al. 2012).

If adults are capable of perceiving nonnative contrasts, this still leaves the question of why they tend to perform worse than infants in L2 perception. Strange's (2011) ASP framework addresses this question with the construct of selective perception routines (SPRs). According to ASP, L1 acquisition involves developing SPRs specialized for the target L1. These SPRs are selective in the sense that they target only those acoustic cues that are relevant for distinguishing L1 phonemes, which allows L1 perception to become automatic and robust in adverse conditions. Crucially, a language-universal mode of perception never disappears, but high task demands may block access to this mode, encouraging instead a default to L1 SPRs. In ASP, then, what distinguishes adult and infant listeners is not reduced sensitivity, but shifted attention. As in L1 acquisition, ASP posits that L2 learning involves developing SPRs that weight and direct attention to cues in a manner specialized for the target L2. This implies that L2 learning generally leads to a (positive) change in L2 perception, which has indeed been found (e.g., Kilman et al. 2014; Levy and Strange 2008; cf. Holliday 2016).

Insofar as better L2 perception reflects less L1 transfer, the finding of higher L2 perceptual accuracy in more advanced learners supports the view that L1 transfer decreases over the course of L2 learning. This view is at the heart of a different theory of L2 phonological development, the Ontogeny Phylogeny Model (OPM; Major 2001). OPM differs from other theories of L2 speech in addressing the time course of all three factors that play a role in general SLA theories: L1, L2, and universal components (cf. Universal Grammar; White 2003). The central claim is that, within a learner's developing L2 system, L1 components decrease over time, while L2 components increase; meanwhile, universal components show an inverse U-shaped pattern, first increasing and then decreasing. This theory also formalizes aspects of intra-speaker variation by describing the influence of style/register, crosslinguistic similarity, and markedness. However, inter-speaker variation is not addressed in this model (or in the others discussed above), and is the subject of a different literature.

#### *1.2. Individual Differences in Speech Learning*

Interest in individual differences (IDs) has existed for decades, going back to the 1970s when SLA researchers started examining the predictive power of various personally-specific properties of the learner (for reviews, see Dörnyei 2006; Dörnyei and Skehan 2003). These include (foreign) language (learning) aptitude, general intelligence or cognitive ability (for a recent review, see Bowles et al. 2016), personality (e.g., Dewaele and Furnham 2000; Verhoeven and Vermeer 2002), and musicality (e.g., Milovanov et al. 2008, 2010), all of which are understood to be complex constructs consisting of several subconstructs (e.g., Robinson 2001). For example, language aptitude may consist of variables such as phonemic coding ability, grammatical sensitivity, inductive learning ability, and rote learning ability. Based on Carroll (1981), Skehan (1989, 1991, 1998, 2002) attempted to explain three abilities (namely, auditory, linguistic, and memory-based), while others (e.g., Jilka et al. 2008) have conceptualized IDs in terms of a "talent" dimension.

In regard to L2 perception specifically, IDs may reflect a construct that has been called phonetic sensitivity in the L2 speech literature (e.g., Kwon 2013; Munro 2008). Note that it is phonetic, as opposed to psychoacoustic, sensitivity that is of interest here due to the neurolinguistic evidence of a "speech-specific origin of individual variability in L2 phonetic mastery" (Díaz et al. 2008, p. 16083). According to Piske (2008), phonetic sensitivity can be distinguished from phonetic or phonological awareness in not necessarily referring to a conscious level of speech processing, and we use this term in a similar sense, to refer generally to sensitivity to properties of the speech signal that may serve to cue a linguistic contrast (regardless of whether or not the contrast exists in the listener's L1). Under the view that the ability to perceive nonnative contrasts does not go away during L1 development, but may be masked by attentional shifts associated with L1 specialization, it becomes relevant to ask whether IDs in L2 perceptual learning can be traced to variation in learners' (remaining) phonetic sensitivity to nonnative contrasts. That is, might the degree of (non)commitment to perceptual cues relevant for the L1 constitute an "aptitude" for L2 perceptual learning? If not, then what does?

These questions have begun to be explored in a number of studies examining IDs in L2 perception and/or production. For example, research on L1 English speakers trained on Mandarin tonal contrasts found that IDs in learning outcomes were related to learners' pre-training inclination to attend to the most informative cue for Mandarin tones (i.e., pitch direction) as well as to the interaction between their basic pitch perception abilities and the type of training they underwent (Chandrasekaran et al. 2010; Perrachione et al. 2011). Effects of preexisting variation in cue weighting, which can be arbitrary (i.e., not determined by cue informativeness; Idemaru et al. 2012), were also observed in L1 Spanish learners of Dutch and L1 Korean learners of English (Wanrooij et al. 2013; Schertz et al. 2016); however, IDs in L2 perception were not predicted by learners' cue use in L2 production (Schertz et al. 2015). Other studies have linked IDs to variation in compactness and location of L1 categories (Kartushina and Frauenfelder 2013, 2014), self-perception of one's own L2 speech (Baker and Trofimovich 2006), formation and structure of L2 representations (Golestani and Zatorre 2009; Hattori and Iverson 2009), L2 mispronunciation detection (Hanulíková et al. 2012), L2 working memory (Darcy et al. 2015), and different neural correlates (Raizada et al. 2010; Sebastián-Gallés et al. 2012; for a recent review, see Myers 2014).

Some of the work on IDs has focused on the specific target language in the present study—Korean. As reported for learners of other L2s, learners of Korean were found to evince wide variation in L2 perception, both in discrimination of L2 contrasts prior to extensive L2 exposure as well as in identification of L2 categories following weeks of L2 learning; however, pre-learning performance (i.e., phonetic sensitivity to the target L2 contrasts) showed no, or only a weak, correlation with post-learning performance (Jung and Kwon 2010; Kwon 2013). Notably, these results were found with learners representing two or more L1 backgrounds, including typologically diverse and genetically unrelated language families (e.g., Austronesian, Finno-Ugric, Indo-European, Sino-Tibetan). The majority of the data on Korean L2 speech, however, comes from studies focusing on L1 speakers of English and/or Mandarin (e.g., Chang 2010, 2012; Holliday 2014, 2015; Schmidt 2007). These studies show, on the one hand, many dominant patterns of response in perceptual tasks such as crosslinguistic classification (e.g., L1 Mandarin learners tended to classify Korean lenis and aspirated stops as the same Mandarin category) yet, on the other hand, substantial IDs, especially in production (e.g., L1 English learners produced Korean stop contrasts in eight different ways). In the present study, we investigate the interaction of IDs with L1 transfer, focusing on Korean as the target L2. Next, we consider the specific target contrasts, each of which may pose difficulty for L2 learners.

#### *1.3. Korean Phonological Contrasts*

#### 1.3.1. Stop Laryngeal Contrasts in Korean

Korean is known for a typologically rare laryngeal contrast among three series of voiceless stops: lenis, fortis, and aspirated. This three-way laryngeal contrast occurs at four places of articulation in all (bilabial, denti-alveolar, and velar stops, as well as alveolo-palatal affricates) and has been described in terms of several different acoustic dimensions, including voice onset time (VOT), closure duration, and properties of the following vowel such as onset fundamental frequency (*f*0), voice quality, vowel duration, intensity buildup, and formant trajectories (Cho et al. 2002; Kagaya 1974; Park 2002a, 2002b; for a recent review, see Chang 2013).

The most widely-studied acoustic cues to the stop laryngeal contrasts in utterance-initial position are VOT and *f*0. Traditionally, fortis, lenis, and aspirated stops are described as showing, respectively, very short, medium-lag, and very long VOTs, as well as high, low, and very high onset *f*0 values. However, studies have noted a recent diachronic shift in the phonetic implementation of the three laryngeal categories, which is especially evident in Seoul Korean and in female speakers (Kang 2014; Oh 2011; Silva 2006a, 2006b). In particular, the VOT of lenis stops has lengthened while that of aspirated stops has shortened, leading to highly overlapping VOT distributions; these developments have resulted in an increased role of *f*0 in realizing the lenis-aspirated contrast. Nevertheless, it is clear from other varieties of Korean as well as data from relatively careful speech that no two of the laryngeal categories have fully merged in VOT, such that there is still something resembling a tripartite VOT contrast in Korean. For example, VOTs of the female model talker in the Korean textbook used by participants in the current study (see Section 2.2) were distinct among the three stop series, as shown in Table 1 (see also the data on native Korean teachers in Chang 2010, 2012).

**Table 1.** VOT (in ms) of Korean stops as produced by the female model talker in participants' Korean textbook (Seoul National University Language Education Institute 2010). All measures represent VOT of stops in utterance-initial position before the low vowel /a/.


Thus, regardless of the degree to which lenis and aspirated stops overlap in VOT for individual speakers, it is reasonable to posit that the role of VOT in distinguishing three laryngeal categories in Korean may place a higher burden on this cue in Korean than in languages with a two-way laryngeal contrast (e.g., voiced-voiceless), by far the most common type of spoken language (Maddieson 1984). L2 learners of Korean, likely to have been exposed only to a two-way VOT opposition in their L1, may therefore experience difficulty in acquiring Korean stop laryngeal contrasts, making this set of contrasts useful to examine in a study of L2 perceptual learning.

#### 1.3.2. Vowel Contrasts in Korean

The vowel inventory of modern Korean as spoken in South Korea contains, for most speakers alive today, the monophthongs /i ɛ a u ɯ o ʌ/. A former length contrast has largely disappeared, leaving only short vowels in contemporary varieties (Kim-Renaud 2012). Older descriptions of Korean (e.g., Lee 1993; Sohn 1999; Yang 1992) also include the vowels /e y ø/; however, these vowels are no longer present in the inventory as distinct monophthongs for most speakers. The front rounded vowels have developed into diphthongs (i.e., /ɥi/, /wɛ/; see, e.g., Kim-Renaud 2012), while the former contrast between /e/ and /ɛ/ has merged to one mid front vowel phoneme (Ko 2009; Eychenne and Jang 2015, 2018). Thus, apart from several diphthongs (/ɥi wɛ wa wʌ jɛ ja ju jo jʌ ɯi/), modern Korean has a vowel space consisting of seven basic vowel qualities (Chang 2012; Yoon and Kang 2014).

Of interest in the current study are two vowel contrasts that depart from the typologically most common five-vowel inventory of /i e a o u/ (Maddieson 1984): that between the high back (i.e., non-front) vowels /u/ and /ɯ/ and that between the mid back vowels /o/ and /ʌ/. These contrasts can be described in terms of a difference in phonological labiality or roundedness, as only the former vowel in each pair alternates with labial stops in certain "p-irregular" verbal paradigms and is prohibited from occurring with the rounded on-glide /w/ in diphthongs (Kim-Renaud 1974, 2012; Sohn 1999). Phonetically, however, the two contrasts are realized differently. Whereas /u/ and /ɯ/ differ primarily in terms of the second formant (*F*2) and less so in terms of the first (*F*1) or third formant (*F*3), /o/ and /ʌ/ show substantial differences in both *F*1 and *F*2 (Table 2; cf. similar data in Chang 2012; Yang 1992; Yoon and Kang 2014; Yoon and Kim 2015).

In short, the two Korean contrasts /u/-/ɯ/ and /o/-/ʌ/, as relatively marked vowel contrasts, may pose a challenge for L2 learners, who may or may not have been exposed to similar contrasts in their L1. Both are therefore examined as target contrasts in the current study.

**Table 2.** The first three formants (*F*1, *F*2, *F*3) in the target Korean vowels as produced by the female model talker in participants' Korean textbook (Seoul National University Language Education Institute 2010). All measures are in Hz for vowels produced in isolation.


#### *1.4. The Present Study*

Given the scarcity of L2 speech research examining the roles of the L1 and IDs in tandem, we conducted a study of L2 perceptual learning of Korean with two goals: (1) examining how L1 transfer and IDs in phonetic sensitivity interact during L2 development, and (2) contributing empirical data on L2 acquisition of Korean, still an underinvestigated L2 despite its increasing popularity as a foreign language worldwide (Byon 2008; Byon and Pyun 2012; Gordon 2015).

With regard to the interaction of transfer with IDs, we hypothesized that L1 transfer in L2 perception would, over time, show an inverse U-shaped pattern instead of a more simple decline. Because the crucial difference between this hypothesis and previous models of L2 learning is in its description of the initial portion of L2 development as showing an increase (as opposed to decrease) in L1 transfer, we call this the Transfer Ramp-up Hypothesis. Note that this hypothesis does not differ from previous models (e.g., OPM; see Section 1.1) in predicting a decline in L1 transfer. Rather, the claim is that the timing of this decline—and, by implication, of L1 transfer effects to begin with—incorporates a delay. That is, the influence of the L1 takes time to "ramp up", peaking at an intermediate, if still early, point in L2 development (see Figure 1).<sup>1</sup> Weak L1 influence at L2 onset thus allows certain non-L1 factors, such as IDs in phonetic sensitivity, to play a greater role earlier (i.e., in the pre-ramp-up stage) than later in L2 development.

**Figure 1.** Schematic of the Transfer Ramp-up Hypothesis, showing L1 transfer over time in L2 development. The dotted part of the curve indicates the pre-ramp-up stage and ramp-up to maximal L1 transfer; the solid part, the post-ramp-up stage of declining L1 transfer.

Since the Transfer Ramp-up Hypothesis contradicts several models of L2 learning in positing that L1 transfer does not peak at L2 onset, it is worth explaining the motivation for this hypothesis in more detail. In short, this view follows from two principles alluded to above. The first is that language users maintain access to the perceptual resources that supported acquisition of their L1; the second is that the nature of L2 perception changes with L2 experience. Together, these principles set the stage for L2 perception ab initio, at which point nothing is yet known about the L2, to involve postponing L1 transfer in favor of a language-universal mode of perception. However, the development of a phonological framework for the L2 during L2 learning eventually provides a linguistic basis for mapping L2 speech to the L1, thus encouraging transfer at later stages of learning.

<sup>1</sup> To be more specific, the timescale for the ramp-up is hypothesized to correspond to how long it takes to develop phonological knowledge of the target L2, which is likely to differ depending on the L2 as well as the L1 background of the learner (i.e., the particular pattern of phonological alignment between the L1 and L2). However, in general, we believe that the timescale of the ramp-up to high L1 transfer will be short (less than the five-week interval of L2 learning we observed; see Section 2.2) because, especially in an instructed L2 context, the phonological system of the L2 (at least, the basic phonological inventory) is one of the first aspects of the language to be acquired, supported by learning the orthographic system in the case of alphabetic orthographies. This is why in Figure 1 the ramp-up is represented as a steep, as opposed to a shallow, incline.

Following from the Transfer Ramp-up Hypothesis, two predictions were tested in the present study. In regard to ab initio perception of the L2, it was predicted that there would be no significant effect of learners' L1 background in a perceptual task not requiring knowledge of L2 phonological categories (P1). That is, performance in such a task was expected to reflect primarily IDs in phonetic sensitivity; therefore, taking ID dimensions as intrinsic properties of the learner, we expected to find a wide range in performance for learners of all L1 backgrounds, and little to no predictive value of linguistically relevant dimensions of a learner's L1. On the other hand, in regard to L2 perception after L2 learning, it was predicted that there would be a significant effect of L1 background, at least in a perceptual task drawing upon knowledge of L2 categories (P2).

Although these predictions were general, and thus applied both to stop and to vowel perception, the linguistically relevant dimensions of a learner's L1 were different for stops and vowels. In the case of stops, the centrality of VOT in distinguishing Korean stop laryngeal categories led us to focus on the number and type of VOT oppositions as the linguistically relevant dimension of the L1. In particular, given that Korean laryngeal categories are all characterized by positive VOTs, we predicted that L1s with VOT oppositions on the positive side of the VOT space (e.g., short- vs. long-lag) would lead to the most beneficial transfer. In the case of vowels, we focused on the occurrence of the target L2 contrasts as the linguistically relevant dimension of the L1 (as a proxy for helpful acoustic oppositions in the back portion of the vowel space), predicting that L1s with one or both of the target contrasts would result in more beneficial transfer than L1s containing neither contrast.

Because this is one of the first studies to examine L2 perceptual learning of Korean by learners from multiple L1 backgrounds, we endeavored to broaden the empirical contribution, as well as the generalizability of the results, by adopting an inclusive approach to the research. As such, we admitted into the study all learners who were eligible rather than targeting a few specific L1 backgrounds. The resulting diversity of L1 backgrounds, some of which are represented by only one participant, naturally presents a challenge for examining the effects of L1 background; we address this by using a group-based analysis, as described in the next section.

#### **2. Methods**

The study received ethical approval from the Institutional Review Board (Pukyong National University) under approval code 1041386-202008-HR-47-02. To determine the number of participants to recruit for the study, we carried out a power analysis anticipating multiple regression models with up to eight coefficients apart from the intercept (accounting for group and category predictors consisting of up to three levels, along with interaction coefficients) and assuming 80% power, an alpha level of 0.05, and a model *r*<sup>2</sup> of 0.24 (based on exploratory modeling of pilot data). Using pwr.f2.test() in the pwr package (Champely 2018) in R (R Development Core Team 2020), we determined the target number of participants to be approximately 56, so we recruited participants until we reached a final sample of at least 56 participants.
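The conversion from the assumed model *r*<sup>2</sup> to the effect size that a regression power routine expects can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study. The function names are hypothetical, and the denominator degrees of freedom passed to `total_n` (47) is an illustrative value chosen to be consistent with the reported target of about 56, not an output reported in the paper.

```python
import math

def cohens_f2(r2: float) -> float:
    """Convert a model R^2 into Cohen's f^2 effect size: f2 = R^2 / (1 - R^2)."""
    return r2 / (1.0 - r2)

def total_n(v: float, u: int) -> int:
    """Total sample size implied by numerator df u and denominator df v (N = u + v + 1)."""
    return math.ceil(v) + u + 1

f2 = cohens_f2(0.24)   # r^2 assumed in the power analysis
print(round(f2, 3))    # effect size handed to the power routine
print(total_n(47, 8))  # hypothetical v = 47 with u = 8 coefficients
```

The extra 1 in `total_n` accounts for the intercept, which is why the text counts the eight coefficients "apart from the intercept".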

#### *2.1. Participants and Groups*

A total of 59 adult L2 learners of Korean participated in the study, with two excluded from analysis due to failure to complete the study or outlier performance (i.e., lower than 10% accuracy) in the posttest portion (see Section 2.4.2). The 57 participants in the final analysis (43 female; *M*<sub>age</sub> = 21.5 yr, *SD* = 4.5) came from 10 L1 backgrounds: North American and British English (*n* = 22), Mandarin Chinese (*n* = 14), Finnish (*n* = 5), Swedish (*n* = 5), Slovenian (*n* = 3), Castilian Spanish (*n* = 2), European French (*n* = 2), European Portuguese (*n* = 2), Malay (*n* = 1), and Turkish (*n* = 1). They were sorted into groups according to crucial features of their L1 hypothesized to influence perception of the target L2 contrasts, namely type of VOT contrast and type of vowel inventory.

For the stop study, learners were assigned to one of three groups based on the VOT opposition in their L1 stops: lead-short (e.g., [b]-[p]), short-long (e.g., [p]-[pʰ]), or lead-long (e.g., [b]-[pʰ]). The lead-short group included the L1 speakers of Finnish, French, Malay, Portuguese, Slovenian, and Spanish; the short-long group, English and Mandarin; and the lead-long group, Swedish and Turkish. Using the VOT ranges in Keating (1984), the group classifications were made on the basis of published descriptions of the respective L1s, which suggest that the L1s in the lead-short, short-long, and lead-long groups contrast, respectively, lead (i.e., negative) vs. short-lag VOT (Cruz-Ferreira 1995; Fougeron and Smith 1993; Herrity 2000; Martínez-Celdrán et al. 2003; Shahidi and Aman 2011; Suomi et al. 2008), short- vs. long-lag VOT (Duanmu 2007; Labov et al. 2006; Ladefoged 1999; Roach 2004), and lead vs. long-lag VOT (Helgason and Ringen 2008; Öğüt et al. 2006). Background data on the groups are summarized in Table 3. The groups did not differ significantly in age [|*t*|s < 1.7, *p*s > 0.05].


**Table 3.** Background information on participant groups in the stop study.

For the vowel study, learners were assigned to one of two groups based on whether their L1 contained none or some of the target vowel contrasts, namely /u/-/ɯ/ and /o/-/ʌ/. The no-contrast group included the L1 speakers of Finnish, Mandarin, Spanish, and Swedish; the some-contrast group, English, French, Malay, Portuguese, Slovenian, and Turkish. As in the stop study, these group classifications were made on the basis of published descriptions of the respective L1s, which suggest that the L1s in the no-contrast group contain neither contrast (Bradlow 1995; Eklund and Traunmüller 1997; Iivonen and Harnud 2005; Lee and Zee 2003) whereas the L1s in the some-contrast group contain oppositions resembling one or both of the contrasts (Clynes and Deterding 2011; Escudero et al. 2009; Hillenbrand et al. 1995; Jurgec 2005; Kiliç et al. 2004; Strange et al. 2007). Because we relied on published descriptions and there is variation in the transcription conventions used for different languages, we took a somewhat liberal approach to looking for a given contrast; for example, for /u/-/ɯ/, we looked not only for /ɯ/ but also for /ɨ/. Table 4 summarizes the background data on the two groups in the vowel study, which did not differ significantly in age [*t*(55) = 1.239, *p* > 0.05].

**Table 4.** Background information on participant groups in the vowel study.


Participants in all L1 groups tended to have considerable experience with additional languages, such that the majority had been exposed to a type of VOT contrast and/or vowel inventory different from that in their L1. All participants who were not L1 English speakers were also proficient in English (to a level sufficient for college coursework), and most (82%) reported knowledge of one or more other languages (e.g., Cantonese, German, Italian, Japanese, Polish, Russian, Serbo-Croatian, Tagalog, Thai). Given this self-reported multilingualism, we examined whether there were disparities in "diversifying" types of additional language (L*n*) exposure across groups—which could potentially result in one group having an inherent advantage in either study—by coding participants in terms of the number of their L*n*s (including English) that would have provided exposure to a type of VOT contrast (or vowel inventory) different from that in their L1. Between-group comparisons on this dimension revealed no significant difference between the no-contrast and some-contrast groups in the vowel study [*t*(55) = 1.210, *p* > 0.05]; however, in the stop study, the short-long group reported significantly fewer "diversifying" L*n*s than the other two groups [|*t*|s > 3.1, *p*s < 0.01], while the lead-short and lead-long groups did not differ from each other in this respect [*t*(19) = 0.156, *p* > 0.05]. The disadvantage of the short-long group in terms of L*n* exposure—largely due to the fact that, unlike the lead-short and lead-long groups, participants in this group did not have English to count as a "diversifying" L*n*—was not, in fact, reflected in systematically lower L2 Korean performance (cf. Section 4); therefore, we assume that L*n* exposure, at least operationalized in terms of number of L*n*s, did not have a detectable impact on the relative patterning of groups on L2 Korean.

#### *2.2. L2 Learning Context*

Participants were recruited in 2013–2014 from an international summer program at a university in Seoul, where they were enrolled in an elementary Korean course as well as other courses. The Korean course lasted five weeks and consisted of three hours of instruction per day for 3–4 days a week (18 total class meetings), amounting to a total of 54 contact hours. Consistent (≥90%) attendance was required to pass the course, and every participant received at least a passing grade, meaning that all received a similarly large amount of classroom instruction in the L2. Participants' non-language courses were conducted in English. Furthermore, students in the program resided in a campus dormitory and, besides contact with Korean students in extracurricular activities, communicated with other students primarily in English. Therefore, in spite of their residence in an L2 environment, participants were not generally immersed in the L2 outside of the classroom.

Although the participants came from different Korean classes split among a team of five instructors (average class size < 15), each class used the same syllabus and teaching materials and followed the same instructional format with English as the main language of instruction. Additionally, the allotment of classroom time to content was uniform across classes, with the first 10 hours allotted to orthographic and phonemic familiarization (i.e., reading and writing of the Korean alphabet). Thus, by the end of the first week of classes, students had mostly finished the portion of the course focused on spelling and pronunciation. Nevertheless, the posttest experiment was not conducted until five weeks later (i.e., at the end of the course) so that participants would be maximally comfortable with using Korean graphemes to provide their responses in this experiment.

#### *2.3. Materials*

Stimuli in both the pretest and posttest experiments were designed to test perception of the same Korean phonological contrasts tested in Jung and Kwon (2010): the three-way stop laryngeal contrast and two vowel contrasts, /u/-/ɯ/ and /o/-/ʌ/. Thus, the stimuli included all nine stops (lenis /p t k/, fortis /p\* t\* k\*/, aspirated /pʰ tʰ kʰ/), each placed before /a/, and four lone vowels (/u ɯ o ʌ/).

The stimuli were prepared as follows. First, they were produced by a female native Korean speaker in a standard formal register within the carrier sentence /\_\_-ka is\*ɯpnita/ "There is (a) \_\_". For example, for /k/, the target syllable was /ka/, and this was uttered in the sentence /ka-ka is\*ɯpnita/. The set of 13 sentences (9 target stops + 4 target vowels) was arranged in a random order and recorded three times. The recordings were made at 22.1 kHz and 16 bps, in a soundproof booth using a CSL 4500 recording device and a Shure SM48 dynamic microphone.

After the sentences were recorded, they were edited in Praat (Boersma and Weenink 2013) to isolate the target syllables. Each syllable was cut out from its carrier sentence and saved in a separate file. All tokens were then inspected auditorily, and one token of each syllable was chosen for the perception stimuli (usually the second token, unless this token sounded unnatural). The resulting 13 stimuli were then randomized and submitted for perceptual evaluation to five native Korean-speaking judges to confirm their quality. The judges were asked to listen to and identify the stimuli (by transcribing them in Korean orthography), as well as rate their confidence in each identification judgment (on a 1–5 scale, 1 being "least confident" and 5 being "very confident"). The rate of correct identification was 100% and the mean confidence rating was 4.9, suggesting that the stimuli were highly intelligible, as well as comprehensible, and thus suitable for use in this study.

#### *2.4. Procedure*

Both the pretest and posttest experiments were carried out in a quiet classroom equipped similarly to the rooms for participants' Korean classes. Participants were tested in a group, using the classroom's audio speakers (which were mounted all around the room) and paper answer sheets. Test materials for the pretest and posttest are publicly accessible via the Open Science Framework (OSF) at https://osf.io/rkxdh/.

#### 2.4.1. Pretest: Oddball Discrimination

The pretest experiment was conducted before the beginning of participants' Korean course and was meant to investigate the degree to which these ab initio learners could already perceive the target sounds. Because at this point in time participants had no knowledge of Korean, an identification task was eschewed in favor of a relatively difficult discrimination task (namely, oddball discrimination) with reduced stimulus variability (i.e., only one token of each target syllable) to prevent the task from being overly difficult. Prior to beginning the experiment, participants were told that the sounds they were going to hear were Korean speech sounds (so as to preclude a nonlinguistic mode of auditory perception), and they then completed a short practice session consisting of three trials with stimuli different from the test stimuli to familiarize them with the task.

Each trial of the oddball discrimination task presented a three-item sequence of auditory stimuli to the participant, who had to indicate which (if any) of the three items was different from the other two (i.e., the oddball). For example, one trial presented the sequence /ka/-/k\*a/-/k\*a/, and the correct answer on this trial was /ka/ (= item #1). The items in a test sequence were separated by an inter-item interval of 700 ms, and each sequence was played twice before participants had to respond. Participants responded by circling the number on their answer sheet corresponding to the serial position of the oddball or, alternatively, one of two other options: "all same" (if no item was different from the other two) and "all different" (if every item was unique). The inter-trial interval was 5 s. Trials were ordered randomly but blocked by contrast type, such that all trials testing laryngeal contrasts were presented before trials testing vowel contrasts. For each of the nine laryngeal contrasts and two vowel contrasts, two test sequences were included (e.g., /p\*a/-/pʰa/-/p\*a/ and /pʰa/-/pʰa/-/p\*a/ for the /p\*/-/pʰ/ contrast), with the position of the oddball distributed across the three possibilities in a ratio of 1:2:2.5. Each test sequence was iterated twice, for a total of 44 trials (11 target contrasts × 2 test sequences × 2 repetitions).
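The response logic of an oddball trial, together with the trial count given above, can be sketched in a few lines of Python. This is only an illustration of the task description; the function name and the plain-text syllable labels are hypothetical and are not taken from the study materials.

```python
from collections import Counter

def oddball_answer(sequence):
    """Return the correct response for a three-item trial:
    1, 2, or 3 (serial position of the oddball), "all same", or "all different"."""
    if len(sequence) != 3:
        raise ValueError("an oddball trial has exactly three items")
    counts = Counter(sequence)
    if len(counts) == 1:
        return "all same"
    if len(counts) == 3:
        return "all different"
    # Exactly one item occurs once; report its 1-based serial position.
    odd_item = next(item for item, n in counts.items() if n == 1)
    return sequence.index(odd_item) + 1

# Trial count described in the text: contrasts x test sequences x repetitions.
N_TRIALS = (9 + 2) * 2 * 2

print(oddball_answer(["ka", "k*a", "k*a"]))  # the example trial from the text
print(N_TRIALS)
```

With three serial positions plus the "all same" and "all different" options, each trial offers five possible responses.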

To check that the pretest was in fact reliable, we calculated split-half reliability by randomly dividing the dataset from the pretest into two subsets according to contrast type (i.e., each subset containing half of the items, randomly selected, for each of the five contrast types: lenis vs. fortis, lenis vs. aspirated, fortis vs. aspirated, /o/ vs. /ʌ/, /u/ vs. /ɯ/). This calculation suggested that the pretest had good reliability (Cronbach's α = 0.82).
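One way such a split-half check could be computed is sketched below. The text does not spell out the exact computation, so this is an assumption: items are split at random within a contrast type, each participant receives a score on each half, and Cronbach's α is computed over the two half-scores. The function names and the example scores are hypothetical.

```python
import random
from statistics import variance

def split_items(item_ids, rng):
    """Randomly split item IDs into two halves (call once per contrast type)."""
    shuffled = list(item_ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def cronbach_alpha(half_a, half_b):
    """Cronbach's alpha treating the two half-scores as k = 2 parts."""
    totals = [a + b for a, b in zip(half_a, half_b)]
    k = 2
    part_variance = variance(half_a) + variance(half_b)
    return (k / (k - 1)) * (1 - part_variance / variance(totals))

# Hypothetical per-participant scores on each half (not data from the study).
half_a = [1, 2, 3, 4]
half_b = [2, 1, 4, 3]
print(cronbach_alpha(half_a, half_b))

first, second = split_items(range(8), random.Random(0))
print(first, second)
```

Passing a seeded `random.Random` instance keeps the split reproducible across runs.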

#### 2.4.2. Posttest: Forced-Choice Identification

The posttest experiment was conducted at the end of the Korean course and was meant to examine participants' (relative) degree of success in perceptually acquiring the target Korean sounds. By this point, participants had spent a considerable amount of time learning Korean, including the alphabet, so it was assumed they were familiar enough with its phonological categories and alphabet to perform an orthographic labeling task. Therefore, in contrast to the pretest, the posttest experiment used the identification paradigm to provide a measure more closely reflecting the task of real-world speech perception. Consequently, it should be noted that absolute accuracy levels in the pretest and posttest cannot be directly compared to each other; however, such a comparison is not needed to address our research questions, which concern relative levels of performance across L1 backgrounds rather than absolute accuracy. Prior to beginning this experiment, participants were once again told that they were going to hear Korean speech sounds, and then they completed a short practice session consisting of three trials using stimuli different from the test stimuli to familiarize them with the task.

The identification task was a ten-alternative forced-choice (10AFC) task for trials testing stops and a six-alternative forced-choice (6AFC) task for trials testing vowels. The response options for stop trials were /pa p\*a pʰa ta t\*a tʰa ka k\*a kʰa/ and the distractor option "other". The response options for vowel trials were /u ɯ o ʌ/ and the distractor options /a e/. On each trial, one auditory stimulus consisting of a target syllable in isolation was played twice before participants had to respond. Participants responded by circling the option on their answer sheet (written in Korean orthography) that matched what they had just heard. The inter-trial interval was 5 s. As in the pretest, trials were ordered randomly but blocked by contrast type, such that all trials testing stops were presented before the trials testing vowels. For each of the 13 target syllables, there were three repetitions (of the same token), for a total of 39 trials.
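The chance levels implied by these response formats follow from the number of response options in each task. The short Python sketch below makes the arithmetic explicit; the function name is hypothetical, and uniform random guessing is assumed.

```python
def chance_level(n_options: int) -> float:
    """Accuracy expected from uniform random guessing among n_options responses."""
    return 1.0 / n_options

# Pretest: positions 1-3 plus "all same" and "all different" -> 5 options.
# Posttest stops: 9 syllables plus "other" -> 10 options.
# Posttest vowels: 4 targets plus 2 distractors -> 6 options.
tasks = [("pretest oddball", 5), ("posttest stops", 10), ("posttest vowels", 6)]
for task, n in tasks:
    print(task, f"{chance_level(n):.0%}")

# Posttest trial count: 13 target syllables x 3 repetitions.
N_POSTTEST_TRIALS = 13 * 3
```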

To check that the posttest was reliable, we again calculated split-half reliability by randomly dividing the dataset from the posttest into two subsets according to category type (i.e., each subset containing half of the items, randomly selected, for each of five category types: lenis stops, fortis stops, aspirated stops, mid vowels, high vowels). This calculation suggested that, like the pretest, the posttest also had good reliability (Cronbach's α = 0.82).

#### *2.5. Statistical Analysis*

In both the stop study and the vowel study, the likelihood of an accurate response in the pretest and posttest experiments was analyzed using mixed-effects logistic regression. To maximize the stability and generalizability of the final models, a two-stage modeling process was used, consisting of initial exploration of potential predictors (in models with lone fixed effects) followed by incremental model building, with model comparisons conducted via likelihood-ratio tests. This process resulted in four final models, one for the pretest and posttest in each of the two studies. All data from the pretest and posttest are publicly accessible via the OSF at https://osf.io/wgcen/.

In the first stage of modeling, we examined a series of models built with single fixed-effect predictors to get a sense of the informativeness of each fixed effect on its own. The random-effects structure in all models comprised intercepts by Participant and Stimulus, except where there was only one stimulus per L2 category (i.e., in the model of posttest vowel identification).<sup>2</sup> The fixed effects explored in pretest models were the participant's L1 group (Group; treatment-coded with reference level "lead-short" in the stop study and "no contrast" in the vowel study) and the contrast tested on that trial (Contrast; treatment-coded with reference level "lenis-fortis" in the stop study and "/u/-/ɯ/" in the vowel study). The fixed effects explored in posttest models were Group (coded as above), the category tested on that trial (Category; treatment-coded with reference level "lenis" in the stop study and "/u/" in the vowel study), and the participant's overall accuracy in the pretest (PretestAcc; centered and standardized).

In the second stage of modeling, we built a full model on each dataset (e.g., pretest trials in the stop study) with a view toward providing a strong test of the presence of a Group effect. Starting from a base model containing the random-effects structure from above, all potential fixed predictors besides Group were examined first (in decreasing order of informativeness, as indicated by AIC values in the single-predictor models from the first stage of modeling), including all possible interaction terms.

<sup>2</sup> A more complex random-effects structure including random slopes was not used in either stage of modeling because models with such a random-effects structure failed to converge or, alternatively, showed signs of overparameterization and/or less stable fit.

Once these predictors were tested and all those which failed to significantly improve the model were removed, the Group term was added to see if it significantly improved the model. Thus, for each dataset, the model improvement (or lack thereof) resulting from the addition of the Group term is interpreted as evidence of the presence/absence of an effect of L1 background.
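The likelihood-ratio comparison used to decide whether a term improves a model can be sketched as follows. This is a generic illustration in Python, not the software used in the study; the function names and the two log-likelihood values are hypothetical. The chi-square tail probability is computed with standard-library functions only, using the closed forms for 1 and 2 degrees of freedom and a standard recurrence for higher integer degrees of freedom.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    h = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0              # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5   # closed form for df = 1
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(ll_reduced: float, ll_full: float, df_diff: int):
    """Return the LR statistic 2 * (ll_full - ll_reduced) and its p-value."""
    stat = 2.0 * (ll_full - ll_reduced)
    return stat, chi2_sf(stat, df_diff)

# Hypothetical log-likelihoods for a model without and with a 3-level Group term
# (two extra coefficients, hence df_diff = 2).
stat, p = likelihood_ratio_test(ll_reduced=-962.0, ll_full=-958.7, df_diff=2)
print(round(stat, 3), round(p, 4))
```

The degrees of freedom equal the number of coefficients added by the term, so a three-level factor such as Group in the stop study contributes 2.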

#### **3. Results**

#### *3.1. Individual Differences at Pretest*

To check whether IDs were confounded with L1 background (leading to unbalanced levels of IDs across the various L1-based groups), we first examined the range and distribution of variation in pretest performance (i.e., global discrimination accuracy) within each L1 group. Recall that there were three groups in the stop study and two groups in the vowel study; therefore, there were three between-group comparisons to be made in the stop study and one in the vowel study.

As shown in Figure 2, there was some variation across groups in the shape of the distribution in pretest accuracies (e.g., in the stop study, a longer lower tail in the lead-short group vs. a shorter lower tail in the short-long group); however, paired comparisons via Welch-corrected two-sample *t*-tests provided little evidence of a systematic disparity in the degree of IDs between any two groups. For the groups in the stop study, the range in pretest accuracy was 22–97%, 50–100%, and 47–89%, respectively, for the lead-short, short-long, and lead-long groups. None of the between-group differences were significant [|*t*|s < 2.026, *p*s > 0.05]. As for the groups in the vowel study, the range in pretest accuracy was 47–97% and 22–100%, respectively, for the no-contrast and some-contrast groups. Here, too, the between-group difference was not significant [*t*(54.7) = −0.502, *p* > 0.05]. Given these results, we conclude that IDs in phonetic sensitivity did not simply reflect participants' diverse L1 backgrounds. In fact, there were substantial, and not significantly different, levels of individual variation in all of the L1 groups discussed below.
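For readers unfamiliar with the Welch correction, the statistic and its fractional degrees of freedom (which is why a value such as 54.7 can appear) can be sketched as below. This is a minimal Python illustration with hypothetical data, not the code or data used in the study.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)   # squared standard error of the mean of x
    vy = variance(y) / len(y)   # squared standard error of the mean of y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical accuracy scores for two groups (not data from the study).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t, df = welch_t(group_a, group_b)
print(round(t, 3), round(df, 2))
```

Because the two variances are not pooled, the degrees of freedom shrink toward the smaller group's when the variances or group sizes differ.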

**Figure 2.** Violin plots showing the probability density of pretest discrimination accuracies (standardized) in each L1 group in (**a**) the stop study and (**b**) the vowel study. Superimposed points (jittered along the *x*-axis) represent individual participants.

#### *3.2. Study 1: Stop Perception*

Analysis of pretest performance in the stop study revealed considerable variation in discrimination accuracy across the different L2 laryngeal contrasts, but no systematic effect of the learner's L1 group. As shown in Figure 3, accuracy was well above chance level (= 20%) in all cases, but tended to be lower on the lenis-fortis contrast (*M*<sub>acc</sub> = 58%) than on the other two contrasts (fortis-aspirated: *M*<sub>acc</sub> = 85%; lenis-aspirated: *M*<sub>acc</sub> = 81%). This was reflected in the modeling results, which revealed a significant effect of Contrast [χ<sup>2</sup>(2) = 13.769, *p* < 0.01] but no effect of Group [χ<sup>2</sup>(2) = 5.604, *p* > 0.05]. The final model (shown in Table 5) confirmed that, compared to the lenis-fortis contrast (represented in the intercept), the fortis-aspirated and lenis-aspirated contrasts were both significantly more likely to be discriminated accurately [βs > 1.476, *p*s < 0.001].

**Figure 3.** Pretest performance in the stop study (accuracy in oddball discrimination), by L2 laryngeal contrast and L1 group (i.e., VOT type). Error bars indicate *SE* of the mean over participants; the dotted line marks chance-level performance (20%).

**Table 5.** Fixed-effect coefficients in the final model of pretest stop discrimination [*N* = 2052, log-likelihood = −958.7]. Significance code: \*\*\* *p* < 0.001.


Analysis of posttest performance in the stop study revealed both variation in identification accuracy across the different L2 laryngeal categories and an advantage of the short-long group (and, to a lesser extent, the lead-short group) over the lead-long group. As shown in Figure 4, with the exception of the lead-long group on fortis stops, accuracy was well above chance level (i.e., 10%); nevertheless, accuracy tended to be lower on fortis stops (*M*<sub>acc</sub> = 51%) than on the lenis (*M*<sub>acc</sub> = 68%) or aspirated stops (*M*<sub>acc</sub> = 62%), especially for the lead-short and lead-long groups. These patterns were reflected in the modeling results, which showed no effect of PretestAcc [χ<sup>2</sup>(1) = 0.834, *p* > 0.05] but a significant effect of Category [χ<sup>2</sup>(2) = 13.770, *p* < 0.01], of Group [χ<sup>2</sup>(2) = 8.564, *p* < 0.05], and of the Category × Group interaction [χ<sup>2</sup>(4) = 36.291, *p* < 0.001].

**Figure 4.** Posttest performance in the stop study (accuracy in 10AFC identification), by L2 laryngeal category and L1 group (i.e., VOT type). Error bars indicate *SE* of the mean over participants; the dotted line marks chance-level performance (10%).

The nature of the interaction between Category and Group can be seen in the final model of posttest accuracy (Table 6). Relative to accuracy on lenis stops, the lead-short group was significantly less likely to be accurate on fortis stops [β = −1.394, *p* < 0.001] but not aspirated stops [β = 0.163, *p* > 0.05], and the lead-long group showed a similar pattern, reflected in non-significant interaction coefficients. The short-long group, however, showed a different pattern, with a much smaller decrease in accuracy on fortis stops vis-a-vis lenis stops and a slight decrease in accuracy on aspirated stops (contrasting with the slight increase observed in the other two groups); these differences were reflected in a significant positive interaction coefficient for fortis stops [β = 0.860, *p* < 0.05] and a significant negative interaction coefficient for aspirated stops [β = −0.810, *p* < 0.05].

**Table 6.** Fixed-effect coefficients in the final model of posttest stop identification [*N* = 1539, log-likelihood = −885.7]. Significance codes: \* *p* < 0.05, \*\*\* *p* < 0.001.


In addition to analyzing accuracy, we also inspected the pattern of errors in the posttest to check whether they involved laryngeal category, place (of articulation), or both. As shown in Figure 5 (see also Figures A1–A3 in Appendix A), errors for every stimulus mostly involved laryngeal category only. The two exceptions were /pʰa/ and /ta/, which each elicited a high proportion of errors involving place. Crucially, however, errors involving laryngeal category only were by far the most common error type, both overall (83%) and across groups (77–88%).


**Figure 5.** Confusion matrix of errors in the posttest for stop items (vertical = stimuli, horizontal = responses). Each cell shows the percentage of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all groups) is shown in parentheses.

Although our modeling results suggested that pretest performance was not a predictor of posttest stop perception, we further examined the relationship between the two through Pearson's correlations to provide additional evidence of the (non)predictiveness of pretest performance. These analyses revealed no significant correlations between pretest and posttest accuracy on stops, overall [*r*(55) = 0.13, *p* > 0.05] or within any individual group: lead-short [*r*(13) = 0.10, *p* > 0.05], short-long [*r*(34) = −0.18, *p* > 0.05], or lead-long [*r*(4) = 0.64, *p* > 0.05] (see Figure 6). In short, results of the stop study produced some evidence of L1 transfer, but crucially only following extensive L2 exposure; moreover, there was no evidence of a link between preexisting phonetic sensitivity (as measured in the pretest) and L2 stop perception following L2 learning (as measured in the posttest).

**Figure 6.** Individual posttest accuracy in the stop study by pretest performance, with regression lines (**a**) over all L1 groups together, and (**b**) over each L1 group separately. Points are jittered; the shaded areas indicate the 95% confidence interval around the regression line.

#### *3.3. Study 2: Vowel Perception*

Analysis of pretest performance in the vowel study revealed ceiling-level discrimination accuracy for both L2 vowel contrasts and across both L1 groups. As shown in Figure 7, accuracy was well above chance in all cases and did not differ between the /u/-/ɯ/ and /o/-/ʌ/ contrasts (*M*<sub>acc</sub> = 93% for both). Modeling results showed no effect of Contrast [χ<sup>2</sup>(1) = 0, *p* > 0.05] or of Group [χ<sup>2</sup>(1) = 0.040, *p* > 0.05], so the final model [*N* = 456, log-likelihood = −83.5] contained no fixed predictors. Consistent with the high accuracies in Figure 7, the intercept in this model indicated that L2 vowel contrasts were, overall, discriminated with significantly better than 50-50 odds [β = 6.983, *z* = 4.456, *p* < 0.001].

**Figure 7.** Pretest performance in the vowel study (accuracy in oddball discrimination), by L2 vowel contrast and L1 group (i.e., vowel inventory type). Error bars indicate *SE* of the mean over participants; the dotted line marks chance-level performance (20%).

Compared to pretest performance, there was more variation in posttest performance, but accuracies were generally high, especially relative to stop identification rates. There were some differences among vowels, with /ɯ/ showing the highest accuracies and /ʌ/ the lowest. Crucially, there was also a group difference: the some-contrast group outperformed the no-contrast group, both overall (no-contrast: *M*<sub>acc</sub> = 74%; some-contrast: *M*<sub>acc</sub> = 88%) and on every vowel category, as shown in Figure 8. Thus, modeling results showed a significant effect of Category [χ<sup>2</sup>(3) = 52.729, *p* < 0.001] and of Group [χ<sup>2</sup>(1) = 4.233, *p* < 0.05] though no effect of the Category × Group interaction [χ<sup>2</sup>(3) = 0.955, *p* > 0.05]. Additionally, as in the stop study, there was no effect of PretestAcc [χ<sup>2</sup>(1) = 0.395, *p* > 0.05]. The final model (Table 7) indicated that, compared to /u/, /ɯ/ was significantly more likely to be identified accurately [β = 2.090, *p* < 0.001] while /ʌ/ was significantly less likely to be identified accurately [β = −0.768, *p* < 0.05]; furthermore, the some-contrast group was significantly more likely to be accurate than the no-contrast group [β = 1.210, *p* < 0.05].
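Because these coefficients are on the log-odds scale, it can help to convert them to odds ratios. The sketch below does this in Python for the three coefficients quoted above; the function names are hypothetical, and the conversion is a general property of logistic models rather than a calculation reported in the paper. In a mixed-effects model, such odds ratios are conditional on the random effects (i.e., they describe a typical participant).

```python
import math

def odds_ratio(beta: float) -> float:
    """Multiplicative change in the odds of an accurate response for a log-odds coefficient."""
    return math.exp(beta)

def inv_logit(log_odds: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficients quoted in the text (log-odds scale).
coefficients = {
    "second target vowel vs. /u/ (beta = 2.090)": 2.090,
    "fourth target vowel vs. /u/ (beta = -0.768)": -0.768,
    "some-contrast vs. no-contrast group (beta = 1.210)": 1.210,
}
for label, beta in coefficients.items():
    print(f"{label}: odds ratio = {odds_ratio(beta):.2f}")
```

An odds ratio above 1 indicates higher odds of accurate identification than in the reference level, and a value below 1 indicates lower odds.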

**Figure 8.** Posttest performance in the vowel study (accuracy in 6AFC identification), by L2 vowel category and L1 group (i.e., vowel inventory type). Error bars indicate *SE* of the mean over participants; the dotted line marks chance-level performance (17%).

**Table 7.** Fixed-effect coefficients in the final model of posttest vowel identification [*N* = 684, log-likelihood = −248.0]. Significance codes: \* *p* < 0.05, \*\*\* *p* < 0.001.


As in the study of stop perception, here too we inspected the pattern of errors in the posttest to see which vowels were most confusable with each other. As shown in Figure 9 (see also Figures A4 and A5 in Appendix A), /u/ was most often confused with /o/, /ɯ/ with /u/ (although there were very few errors on /ɯ/ overall), /o/ with /ʌ/, and /ʌ/ with /a/. Relatively few errors involved confusion with the vowel /ɯ/ or /e/. In short, vowel identification errors tended to involve confusion with a vowel sharing a specification for roundedness/labiality and/or height, which was unsurprising; however, predominant vowel confusions were not symmetrical (e.g., /o/ was most often misidentified as /ʌ/ but /ʌ/ was not most often misidentified as /o/).


**Figure 9.** Confusion matrix of errors in the posttest for vowel items (vertical = stimuli, horizontal = responses). Each cell shows the percent of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all groups) is shown in parentheses.

As in the stop study, we further examined the relationship between pretest performance and posttest vowel perception via Pearson's correlations. In line with the modeling results as well as the lack of correlations seen in the stop study, these analyses revealed no significant correlations between pretest performance and posttest accuracy on vowels, overall [*r*(55) = 0.09, *p* > 0.05] or within either individual group: no-contrast [*r*(24) = 0.16, *p* > 0.05] or some-contrast [*r*(29) = −0.002, *p* > 0.05] (see Figure 10). In short, results of the vowel study were consistent with those of the stop study: again, we found evidence of L1 transfer only in the posttest and no evidence of a link between preexisting phonetic sensitivity and L2 vowel perception after L2 learning.
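As a rough illustration of the correlation analysis just described, the sketch below computes Pearson's *r* and the associated *t* statistic in plain Python. It is not the authors' analysis code. It assumes the conventional reporting format in which *r*(df) has df = *n* − 2, so that *r*(55) implies 57 participants overall (and *r*(24) and *r*(29) imply 26 and 31 participants in the two groups). The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Return (t, df) for testing r against zero, with df = n - 2."""
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2)), df


# Reported overall correlation: r(55) = 0.09, i.e., n = 57 under df = n - 2.
t, df = t_statistic(0.09, 57)
print(f"t({df}) = {t:.2f}")  # t is about 0.67, far below conventional cutoffs
```

A *t* value this small is consistent with the reported *p* > 0.05.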

**Figure 10.** Individual posttest accuracy in the vowel study by pretest performance, with regression lines (**a**) over all L1 groups together, and (**b**) over each L1 group separately. Points are jittered; the shaded areas indicate the 95% confidence interval around the regression line.

#### **4. Discussion**

In summary, evidence of L1 transfer was not found in L2 perception of Korean ab initio, but was found after a prolonged period of L2 learning. Results of the pretest showed, as expected, substantial individual differences (IDs) in learners' preexisting phonetic sensitivity to the target stop and vowel contrasts, but no systematic effect of the crucial L1 features hypothesized to affect L2 perception of these contrasts. In contrast, results of the posttest showed a significant effect of these L1 features, both for stops and for vowels, but no effect of phonetic sensitivity as reflected in pretest performance. These findings thus provide support for the first part of the Transfer Ramp-up Hypothesis: rather than peaking at L2 onset, L1 influence takes time to set in during L2 perceptual development (at which point the relative role of IDs in phonetic sensitivity may become smaller).

Although these findings converge with the results of previous studies on L2 Korean (Jung and Kwon 2010; Kwon 2013), they also diverge from the literature supporting the Perceptual Assimilation Model (PAM; Best 1994, 1995), much of which also examined naive listeners of an L2 yet found patterns of ostensible L1 influence in discrimination of L2 contrasts. In our view, the disparity between the current results and PAM-framed results is likely due to different formulations of the dependent measure. That is, whereas studies testing PAM have generally examined L2/nonnative perception by individual contrast (because PAM makes different predictions for different types of L2 contrasts), the present study, which was not focused on testing PAM, evaluated L2 perception mainly in terms of an overall likelihood of accuracy. In fact, when the (pretest) discrimination results are separated out by contrast, there is some indication of a possible L1 effect for certain contrasts (e.g., the short-long group showing the greatest advantage over the lead-short group on the Korean fortis vs. aspirated contrast; see Figure 3), even if not for others (e.g., the no-contrast group showing ceiling-level performance similar to that of the some-contrast group on both Korean vowel contrasts; see Figure 7).

Thus, while we interpret the current findings as support for the Transfer Ramp-up Hypothesis, we are also careful to point out that our observation of a low degree of L1 transfer at L2 onset (vis-à-vis later points in L2 development) does not mean that there is no L1 influence at L2 onset. It is not possible to draw such a conclusion on the basis of the evidence in this study, and the Transfer Ramp-up Hypothesis, moreover, does not make this extreme claim. The core claim is rather that there is an increase in L1 transfer during the early part of L2 development. Because this does not entail the total absence of L1 effects at L2 onset, it stands to reason that the L1 may indeed have a detectable effect on the ab initio perception of specific L2 contrasts (e.g., "Single Category" contrasts that assimilate to the same L1 sound; Best 1994), which may amount to an L1 effect that is weak or undetectable overall when multiple L2 contrasts are considered together. The locus of the difference between the present study and previous theoretical frameworks, therefore, is in the observed trajectory of L1 transfer. In particular, our findings contradict the view, made explicit in the Ontogeny Phylogeny Model (OPM) and left implicit in other frameworks such as PAM(-L2), that L1 influence is at its height at the start of L2 development and only decreases from that point.

Apart from the implications for theories of L1 transfer, the current findings also have implications for views of IDs in L2 acquisition. In particular, they make two contributions to our understanding of IDs in L2 perceptual development. First, given that no effect of L1 background was found in the pretest despite ample variation among learners, the pretest results provide evidence that the magnitude of IDs in phonetic sensitivity to L2 contrasts—at least, to the target L2 Korean contrasts examined in this study—can be relatively large compared to the magnitude of L1 effects. These results thus argue in favor of making the analysis of IDs central to studies of L2 perception (as is becoming increasingly common in the L2 speech literature; see Section 2.2) because IDs may actually prove to be a more powerful predictor of L2 perceptual behavior (at certain points in L2 development) than the more extensively examined factor of L1 background. Second, together with the pretest results, the posttest results provide evidence that the effect of IDs changes over the course of L2 development. Crucially, IDs in phonetic sensitivity were found not to predict L2 perceptual accuracy after L2 learning, which supports a view of L2 perception, at any given point during L2 development, as the outcome of a dynamic interaction between L1 transfer and IDs.

In light of these theoretical implications, it would be remiss of us not to mention the limitations of the present study, which motivate a number of different directions for future research. First, as alluded to in Section 2.4, the composition of the participant sample, influenced in large part by worldwide trends in who elects to study Korean as an L2, was unevenly distributed in terms of L1 backgrounds, necessitating an approach that grouped L1s together rather than analyzing them separately. Although adequate for addressing research questions related to broadly formulated phonological features of an L1, this approach does not lend itself to a nuanced examination of features that may be more language-specific (e.g., palatalization in Russian; pharyngealization in Arabic). It would, therefore, provide additional insight to replicate this type of study with a larger sample in which all L1 backgrounds are robustly represented, allowing for analyses that focus on specific L1s.

Second, as mentioned in Section 2.4.2, the study was designed to use different task paradigms (discrimination and identification) for the pretest and posttest, considering both ecological validity and appropriateness for different stages of L2 learning. Although both paradigms are widely used to measure perceptual ability, and the measures from these paradigms have been shown to be highly correlated with each other at the same point in time (e.g., Pearson's *r* > 0.6 for L2 discrimination and identification of Mandarin tones; Bowles et al. 2016), the fact remains that discrimination performance and identification performance cannot be directly compared to each other, which prevents us from being able to draw conclusions about participants' L2 perceptual development that are truly longitudinal. Thus, it would be useful in future research to collect longitudinal data from the exact same task, which, with some design modifications, could be made appropriate for learners at different stages of L2 development (e.g., use of iconic images for response options in ab initio L2 identification; see Bowles et al. 2016).

Third, this study included observations of two time points in L2 development, whereas at least three are needed to fully demonstrate the inverse U-shaped pattern of L1 transfer postulated in the Transfer Ramp-up Hypothesis (Figure 1). That is, we have only assumed the part of the pattern in which L1 transfer declines at later points in L2 development on the basis of prior findings in the literature (see Section 1.1), since it was not possible to observe a third, later time point in the case of the current participants (who did not necessarily continue learning Korean after the end of their Korean language course). In future work, it would therefore be helpful to track learners further into their trajectory of L2 learning (e.g., in a year-long course of L2 instruction) so as to provide direct evidence of the hypothesized post-ramp-up decline in L1 transfer.

#### **5. Conclusions**

In closing, we would like to highlight one of the chief challenges of designing developmental perceptual research such as the present study, and outline a possible approach to addressing this challenge in future research. In our view, truly longitudinal perceptual data (i.e., data from the same individual completing the same perceptual task, including the same auditory stimuli, at different points in time) may not be the ideal data for investigating perceptual change over time, because it is not only development, but also extraneous factors such as familiarity with (or memory of) the test stimuli, that may lead to listeners performing differently in the same task across two points in time. The challenge for future research, therefore, is to identify and control for such extraneous factors appropriately. In the case of test stimuli, for example, one way of addressing the issue of familiarity/memory would be to use similarly constructed, but non-identical, stimulus sets at different time points, normed in advance to be equivalently difficult. Studies designed "pseudo-longitudinally" in this manner would be better able to show change that could be confidently interpreted as reflecting development.

Despite the challenges of incorporating a longitudinal design in developmental perceptual research, however, the need for longitudinal studies in working toward a theory of L2 perceptual learning that incorporates a role for both L1 transfer and individual differences cannot be overemphasized. Given that cross-sectional studies, by their very nature, are ill-equipped to examine the temporal dynamicity of individual difference effects, longitudinal studies, including both the laboratory training approach and the classroom learning approach taken in the present study, are uniquely positioned to shed new light on the roles, and interaction, of language-specific and personally-specific variables in L2 perceptual development.

**Author Contributions:** Conceptualization, C.B.C. and S.K.; Data curation, C.B.C.; Formal analysis, C.B.C.; Funding acquisition, S.K.; Investigation, S.K.; Methodology, C.B.C. and S.K.; Project administration, S.K.; Resources, S.K.; Supervision, S.K.; Validation, C.B.C. and S.K.; Visualization, C.B.C.; Writing—original draft, C.B.C.; Writing—review & editing, C.B.C. and S.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Pukyong National University, grant number C-D-2016-0831.

**Acknowledgments:** The authors gratefully acknowledge Peter Jurgec's expert advice on Slovenian phonetics and phonology and helpful comments and feedback from two anonymous reviewers.

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**


**Figure A1.** Confusion matrix of errors by the lead-short group in the posttest for stop items (vertical = stimuli, horizontal = responses). Each cell shows the percent of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all participants in the group) is shown in parentheses.


**Figure A2.** Confusion matrix of errors by the short-long group in the posttest for stop items (vertical = stimuli, horizontal = responses). Each cell shows the percent of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all participants in the group) is shown in parentheses.


**Figure A3.** Confusion matrix of errors by the lead-long group in the posttest for stop items (vertical = stimuli, horizontal = responses). Each cell shows the percent of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all participants in the group) is shown in parentheses.


**Figure A4.** Confusion matrix of errors by the no-contrast group in the posttest for vowel items (vertical = stimuli, horizontal = responses). Each cell shows the percent of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all participants in the group) is shown in parentheses.


**Figure A5.** Confusion matrix of errors by the some-contrast group in the posttest for vowel items (vertical = stimuli, horizontal = responses). Each cell shows the percent of all errors on the given stimulus represented by the given response (rows may not add to 100% due to rounding); the most common error type for each stimulus is bolded. The total number of errors for each stimulus (across all participants in the group) is shown in parentheses.

#### **References**

Baker, Wendy, and Pavel Trofimovich. 2006. Perceptual paths to accurate production of L2 vowels: The role of individual differences. *International Review of Applied Linguistics in Language Teaching* 44: 231–50.


Sebastián-Gallés, Núria, Carles Soriano-Mas, Cristina Baus, Begoña Díaz, Volker Ressel, Christophe Pallier, Albert Costa, and Jesús Pujol. 2012. Neuroanatomical markers of individual differences in native and non-native vowel perception. *Journal of Neurolinguistics* 25: 150–62.

Seoul National University Language Education Institute. 2010. *Active Korean 1*. Seoul: Munjin Media.


Yoon, Tae-Jin, and Yoonjung Kang. 2014. Monophthong analysis on a large-scale speech corpus of read-style Korean. *Malsoriwa Eumseong Gwahak [Phonetics and Speech Sciences]* 6: 139–45.

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Perceived Phonological Overlap in Second-Language Categories: The Acquisition of English /r/ and /l/ by Japanese Native Listeners**

**Michael D. Tyler**

School of Psychology and the MARCS Institute for Brain, Behavior and Development, Western Sydney University, Penrith, NSW 2751, Australia; m.tyler@westernsydney.edu.au

**Abstract:** Japanese learners of English can acquire /r/ and /l/, but discrimination accuracy rarely reaches native speaker levels. How do L2 learners develop phonological categories to acquire a vocabulary when they cannot reliably tell them apart? This study aimed to test the possibility that learners establish new L2 categories but perceive phonological overlap between them when they perceive an L2 phone. That is, they perceive it to be an instance of more than one of their L2 phonological categories. If so, improvements in discrimination accuracy with L2 experience should correspond to a reduction in overlap. Japanese native speakers differing in English L2 immersion, and native English speakers, completed a *forced category goodness rating task*, where they rated the goodness of fit of an auditory stimulus to an English phonological category label. The auditory stimuli were 10 steps of a synthetic /r/–/l/ continuum, plus /w/ and /j/, and the category labels were L, R, W, and Y. Less experienced Japanese participants rated steps at the /l/-end of the continuum as equally good versions of /l/ and /r/, but steps at the /r/-end were rated as better versions of /r/ than /l/. For those with more than 2 years of immersion, there was a separation of goodness ratings at both ends of the continuum, but the separation was smaller than it was for the native English speakers. Thus, L2 listeners appear to perceive a phonological overlap between /r/ and /l/. Their performance on the task also accounted for their responses on /r/–/l/ identification and AXB discrimination tasks. As perceived phonological overlap appears to improve with immersion experience, assessing category overlap may be useful for tracking L2 phonological development.

**Keywords:** Perceptual Assimilation Model; second language speech learning; English /r/ and /l/; Japanese; English as a second language; categorical perception; speech perception

#### **1. Introduction**

Adult second language (L2) learners almost invariably speak with a recognizable foreign accent (Flege et al. 1995; Flege et al. 1999). Less obvious for the casual observer is that they are also likely to have difficulty discriminating certain pairs of phonologically contrasting phones in the target language—that is, they also *hear* with an accent (Jenkins et al. 1995). Research into cross-language speech perception by naïve listeners has shown that attunement to the native language affects the discrimination of pairs of phonologically contrasting non-native phones (e.g., Best et al. 2001; Polka 1995; Tyler et al. 2014b; Werker and Logan 1985), often resulting in poor discrimination when both non-native phones are perceived as the same native phonological category.

For learners of an L2, the question is whether and to what extent they are able to overcome their perceptual accent and acquire new phonological categories. Discrimination of initially difficult contrasts, such as English /r/–/l/ for Japanese native listeners, can improve with naturalistic exposure (MacKain et al. 1981) and laboratory training (Bradlow et al. 1999; Bradlow et al. 1997; Lively et al. 1993; Lively et al. 1992; Lively et al. 1994; Logan et al. 1991; Shinohara and Iverson 2018). For example, in Logan et al. (1991), learners identified minimal-pair words containing /r/ or /l/ (e.g., *rake* or *lake*) and were given corrective feedback for incorrect responses. After 15 training sessions of 40 min each, learners' identification showed a small but significant improvement. The results of the series of experiments showed that: (a) the learning generalized to novel talkers and tokens; (b) the perceptual training resulted in improved production; and (c) improvements in both perception and production were still evident when learners were tested months later. However, in spite of the improvements from high variability training, the learners' performance did not reach the same level of accuracy as that of native speakers of English.

**Citation:** Tyler, Michael D. 2021. Perceived Phonological Overlap in Second-Language Categories: The Acquisition of English /r/ and /l/ by Japanese Native Listeners. *Languages* 6: 4. https://doi.org/10.3390/languages6010004

Received: 1 September 2020; Accepted: 22 December 2020; Published: 28 December 2020

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The fact that learners were able to identify words according to whether they contained /r/ or /l/ suggests that they had some sensitivity to the phonetic characteristics that define English /r/ versus /l/. To establish separate lexical entries for /r/–/l/ minimal pairs (e.g., *rock* and *lock*), it could be argued that they must have developed separate phonological categories for /r/ and /l/. If this is the case, it remains to be explained how they might have learned a phonological distinction without being able to discriminate it to the same level as a native speaker. The purpose of this paper is to propose a learning scenario that may account for these observations. L2 learners may, under certain learning circumstances, develop L2 phonological categories that correspond to similar sets of phonetic properties. That is, when they encounter an L2 phone, they may perceive a phonological overlap, such that the L2 phone is consistent with more than one L2 phonological category. This proposal will be tested using perception of English /r/ and /l/ by native speakers of Japanese, which has a long history of investigation. As this idea emerges from recent developments in the Perceptual Assimilation Model (PAM, Best 1994a, 1995), it complements the theoretical framework of its extension to L2 learning, the Perceptual Assimilation Model of Second Language Speech Learning (PAM-L2, Best and Tyler 2007). Thus, this chapter will present up-to-date reviews of both PAM and PAM-L2, and these will be followed by a review of Japanese listeners' perception of /r/ and /l/, and an outline of the present study.

#### *1.1. The Perceptual Assimilation Model*

The Perceptual Assimilation Model (PAM) was devised to account for the influence of first language (L1) attunement, both on infants' discrimination of native and nonnative contrasts (Best 1994a; Best and McRoberts 2003; Tyler et al. 2014b) and on adults' discrimination of previously unheard non-native contrasts (PAM, Best 1994b; Best 1995; Best et al. 2001; Best et al. 1988; Faris et al. 2018; Tyler et al. 2014a). Following the ecological theory of perception (e.g., Gibson 1966, 1979), PAM proposes that articulatory gestures are perceived directly. Phonological categories are the result of perceptual learning, where the perceiver tunes into the higher-order invariant properties that define the category across a range of phonetic variability. While they are abstract, in the sense of being coarser grained than specific articulatory movements, or the acoustic structure that corresponds to those movements, phonological categories are perceptual units rather than mental constructs (Goldstein and Fowler 2003). According to PAM, discrimination accuracy for non-native phones depends on whether and how they are *assimilated* to the native phonological system. When adults encounter non-native contrasts, their natively tuned perception may sometimes help perception (e.g., when the non-native phones are assimilated to two different native categories) and it may sometimes hinder perception (e.g., when both non-native phones are assimilated to the same native category).

An individual non-native phone can be perceptually assimilated as either *categorized* or *uncategorized*, or not assimilated to speech (Best 1995). Categorized phones are perceived as good, acceptable, or deviant versions of a native phonological category. Uncategorized phones are those that are perceived as speech, but not as any one particular native category, and non-assimilable phones are perceived as non-speech. For example, Zulu clicks were often perceived by native English speakers as finger snaps or claps (Best et al. 1988).

The perceptual learning that shapes phonological categories is driven not only by detecting the higher-order phonetic properties that define category membership, but also those that set a category apart from other phonological categories in the system (see Best 1994b, pp. 261–62). PAM predicts relative discrimination accuracy for pairs of phonologically contrasting non-native phones according to how each one is assimilated to the native phonological space. That is, discrimination accuracy for never-before-heard non-native phones depends, to a large extent, on whether the listener detects a *native* phonological distinction between the non-native phones. Consider, first, those contrasts where both non-native phones are assimilated to a native category. If each phone is assimilated to a different category (a *two-category* assimilation), then a native phonological distinction will be detected, and discrimination will be excellent. When both contrasting non-native phones are assimilated to the same native category, there is no native phonological distinction to detect and discrimination depends on whether the listener detects a difference in phonetic goodness of fit to the same native category.<sup>1</sup> If there is a goodness difference (a *category-goodness* assimilation), then discrimination will be moderate to very good, depending on the magnitude of the perceived difference, and if there is no goodness difference (a *single-category* assimilation), then discrimination will be poor. Thus, for discrimination accuracy, PAM predicts: two category > category goodness > single category. This main PAM prediction has been confirmed in studies on non-native consonant (Antoniou et al. 2012; Best et al. 2001) and vowel perception (Tyler et al. 2014a).
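The three contrast types for categorized phones, and the predicted ordering of discrimination accuracy, can be summarized in a small decision sketch. The function name, the numeric goodness ratings, and the `min_diff` cutoff for what counts as a goodness difference are all hypothetical choices made for illustration; PAM itself does not specify a numeric criterion.

```python
# Predicted discrimination for each contrast type, as stated in the text.
PREDICTED_DISCRIMINATION = {
    "two-category": "excellent",
    "category-goodness": "moderate to very good",
    "single-category": "poor",
}


def contrast_assimilation(cat_a, cat_b, goodness_a, goodness_b, min_diff=1.0):
    """Classify a contrast between two *categorized* non-native phones.

    cat_a, cat_b: native category each phone is assimilated to.
    goodness_a, goodness_b: goodness-of-fit ratings for each phone.
    min_diff: HYPOTHETICAL cutoff for a meaningful goodness difference.
    """
    if cat_a != cat_b:
        return "two-category"
    if abs(goodness_a - goodness_b) >= min_diff:
        return "category-goodness"
    return "single-category"


# Illustrative calls with invented categories and ratings.
for args in [("p", "b", 6.0, 6.0), ("l", "l", 6.5, 3.0), ("l", "l", 5.0, 5.2)]:
    kind = contrast_assimilation(*args)
    print(args, "->", kind, "/", PREDICTED_DISCRIMINATION[kind])
```

In practice, whether a goodness difference exists would be established statistically across listeners rather than by a fixed cutoff on two ratings.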

Turning to contrasts involving uncategorized phones, Best (1995) suggested that when one phone is uncategorized and the other is categorized (an *uncategorized-categorized* assimilation), discrimination should be very good. When both are uncategorized (an *uncategorized-uncategorized* assimilation), discrimination should vary depending on their phonetic proximity to each other. Faris et al. (2016) expanded on this description by describing the different ways that uncategorized non-native phones might be assimilated to the native phonological space. On each trial of a categorization task, Egyptian-Arabic listeners assigned a native orthographic label to an Australian-English vowel (in a /hVbə/ context, where V denotes the target vowel) and then rated its goodness-of-fit. To rule out random responding, Faris et al. tested whether each label was selected above chance for each vowel. An uncategorized non-native phone was deemed to be *focalized* if only one label was selected above chance (but below the categorization threshold, usually 50% or 70%, Antoniou et al. 2013; Bundgaard-Nielsen et al. 2011b), *clustered* if more than one label was selected above chance, or *dispersed* if no label was selected above chance. It is important to note that, by definition, listeners recognize weak similarity to native phonological categories when non-native phones are perceived as focalized or clustered, but no native phonological similarity when they are perceived as dispersed.
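The focalized, clustered, and dispersed labels can be illustrated with a simplified classifier over label-selection proportions. This sketch makes two loud assumptions: it treats "above chance" as a plain numeric comparison with a chance level supplied by the caller, whereas Faris et al. tested this statistically, and it treats a phone as categorized when its top label reaches the threshold. The function name and the example proportions are invented.

```python
def assimilation_type(label_props, chance, threshold=0.5):
    """Classify a single non-native phone from its label proportions.

    label_props: {native label: proportion of trials it was chosen}.
    chance: chance-level proportion (ASSUMPTION: plain comparison, not
            the statistical test used in the original studies).
    threshold: categorization threshold (0.5 or 0.7 in the cited work).
    """
    if max(label_props.values()) >= threshold:
        return "categorized"
    above_chance = [lab for lab, p in label_props.items() if p > chance]
    if len(above_chance) == 1:
        return "focalized"
    if len(above_chance) > 1:
        return "clustered"
    return "dispersed"


# Invented proportions over five response labels (chance = 0.2).
examples = {
    "categorized": {"a": 0.80, "e": 0.05, "i": 0.05, "o": 0.05, "u": 0.05},
    "focalized": {"a": 0.40, "e": 0.15, "i": 0.15, "o": 0.15, "u": 0.15},
    "clustered": {"a": 0.35, "e": 0.35, "i": 0.10, "o": 0.10, "u": 0.10},
    "dispersed": {"a": 0.20, "e": 0.20, "i": 0.20, "o": 0.20, "u": 0.20},
}
results = {name: assimilation_type(props, chance=0.2) for name, props in examples.items()}
print(results)
```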

These expanded assimilation types lead to new predictions for uncategorized-categorized and uncategorized-uncategorized assimilations. Faris et al. (2018) suggested that the discrimination of contrasts involving focalized and clustered assimilations should vary according to the overlap in the set of categories that are consistent with one non-native phone versus the set that is consistent with the other phone. This is known as *perceived phonological overlap* (see also, Bohn et al. 2011). For example, if both non-native phones were clustered, and they were weakly consistent with the same set of native phonological categories, then they would be completely overlapping. Discrimination for completely overlapping contrasts should be less accurate than for contrasts that are only partially overlapping, and most accurate for non-overlapping contrasts (i.e., those where unique sets of labels are chosen for each non-native phone). To test this prediction, Australian-English listeners completed categorization with goodness rating and AXB discrimination tasks with Danish vowels. In AXB discrimination, participants were presented with three different vowel (V) tokens (in a /hVbə/ context) and asked to indicate whether the middle element (X) was the same as the first (A) or third (B) element. While Faris et al. did not observe any completely overlapping contrasts, they demonstrated that non-overlapping contrasts were discriminated more accurately than partially overlapping contrasts. This shows that naïve listeners are influenced by perceptual assimilation to the native phonological space even when a non-native phone is perceived as weakly consistent with one or more native phonological categories. The purpose of this study is to demonstrate how perceived phonological overlap might account for speech perception in learners who have acquired new L2 phonological categories, which is the focus of PAM's extension to L2 speech learning, PAM-L2.

<sup>1</sup> See Tyler (2021) for a discussion of the different sources of information that might be available for the discrimination of non-native contrasts.
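Perceived phonological overlap, as described above, amounts to comparing the set of native labels consistent with one phone against the set consistent with the other. A minimal sketch of that comparison follows; the function name and the example label sets are hypothetical, and how a label comes to count as "consistent" with a phone (e.g., selected above chance) is left to the caller.

```python
# Predicted ordering of discrimination accuracy, from most to least accurate.
PREDICTED_ORDER = ["non-overlapping", "partially overlapping", "completely overlapping"]


def overlap_type(labels_a, labels_b):
    """Classify the overlap between two phones' sets of consistent labels."""
    a, b = set(labels_a), set(labels_b)
    if not a & b:
        return "non-overlapping"       # no label shared by both phones
    if a == b:
        return "completely overlapping"  # identical label sets
    return "partially overlapping"     # some, but not all, labels shared


# Invented label sets for three hypothetical contrasts.
print(overlap_type({"i", "e"}, {"o", "u"}))   # non-overlapping
print(overlap_type({"i", "e"}, {"e", "a"}))   # partially overlapping
print(overlap_type({"i", "e"}, {"e", "i"}))   # completely overlapping
```

The set comparison captures only the categorical side of the prediction; it says nothing about how strongly each label was endorsed.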

#### *1.2. PAM-L2*

The Perceptual Assimilation Model of Second Language Speech Learning (PAM-L2, Best and Tyler 2007) was devised for predicting the likelihood of new category formation when a learner acquires an L2. Taking the naïve perceiver described by PAM as its starting point, learners are assumed to accommodate phonological categories from all of their languages in a common interlanguage phonological system. Phonological categories may be shared between the L1 and L2 (and subsequently learned languages), or the learner may establish new L2 categories. Importantly, and in contrast to the Speech Learning Model (Flege 1995, 2003), PAM-L2 proposed that learners could maintain common L1-L2 phonological categories with language-specific L1 and L2 phonetic categories, in much the same way that allophones of a phoneme might be thought to correspond to a single phonological category. For example, an early-sequential Greek-English bilingual's common L1-L2 phonological /p/ category could incorporate language-specific phonetic variants for long-lag aspirated English [pʰ] and short-lag unaspirated Greek [p] (Antoniou et al. 2012).

PAM-L2's predictions for new L2 phonological category formation are made on the basis of PAM contrast assimilation types. To illustrate the PAM-L2 principles, Best and Tyler (2007) outlined a number of hypothetical scenarios involving a previously naïve listener acquiring an L2 in an immersion context. In the case of a two-category assimilation, the learner comes already equipped with the ability to detect the phonological difference between the non-native phonemes, through attunement to the L1. Discrimination of two-category contrasts is predicted to be excellent at the beginning of L2 acquisition and the learner would develop a common L1/L2 phonological category for each of the non-native phonemes. No further learning would be required for that contrast, but L2 perception would be more efficient if they developed new phonetic categories for the L2 pronunciations of their common L1/L2 phonological categories. For single-category assimilations, the learner is unlikely to establish a new phonological category for either phone and discrimination will remain poor. In fact, both L2 phonemes are likely to be incorporated into the same L1-L2 phonological category and contrasting words in the L2 that employ those phonemes should remain homophonous for the L2 learner. For uncategorized-categorized assimilations, new L2 phonetic and phonological categories are likely to be established for the uncategorized phone, with the likelihood being higher for non-overlapping and partially overlapping assimilations than for completely overlapping assimilations (Tyler 2019).

Perhaps the most interesting case is category-goodness assimilation. According to PAM-L2, the learner is likely to develop L2 phonetic and phonological categories for the more deviant phone of the contrast. Best and Tyler (2007) speculated that the learner would first establish a new phonetic category for the more deviant phone. Initially, the deviant phone would simply be a phonetic variant of the common L1-L2 phonological category, but as the learner came to recognize that the phonetic difference between the phones signaled an L2 phonological contrast, a new L2 phonological category would emerge for the newly developed phonetic category. This new L2 phonological category would support the development of an L2 vocabulary that maintains a phonological distinction between minimally contrasting words.

Best and Tyler (2007) suggested that this perceptual learning might occur fairly early in the learning process for adult L2 acquisition. Learners with 6–12 months of immersion experience were considered to be "experienced". An increasing vocabulary was seen as a limiting influence on perceptual learning, but more recent work has shown that vocabulary size may assist learners in establishing L2 categories (Bundgaard-Nielsen et al. 2012; Bundgaard-Nielsen et al. 2011a, 2011b). It is possible that vocabulary expansion constrains perceptual learning in the case of a single category assimilation, but facilitates perceptual learning when there is a newly established L2 phonological category (i.e., for category goodness assimilations and those involving an uncategorized phone). For example, once new L2 phonological categories are established, learners could use lexically guided perceptual retuning (McQueen et al. 2012; Norris et al. 2003; Reinisch et al. 2013) to accommodate to the phonetic properties of the newly acquired category.

The scenarios outlined in Best and Tyler (2007) assumed an idealized situation where an adult learner with no previous L2 experience is immersed in the L2 environment and where L2 input is entirely through the spoken medium. However, the majority of learners do not acquire an L2 solely from spoken language—the L2 is often acquired first in a formal learning situation using both written and spoken language, and this may occur in the learner's country of origin from a teacher who may speak the L2 with a foreign accent. While category formation could occur from the bottom up in an immersion scenario, the classroom learning environment may expose learners to information about an L2 phonological contrast before they have had the opportunity to attune to the phonetic properties necessary to discriminate it. When the L2 has a phonographic writing system, the most likely source of information would be from orthographic representations of words (see Tyler 2019, for a discussion of how orthography might influence category acquisition in the classroom).

The classroom learning environment would not change the PAM-L2 predictions for contrasts that were initially two-category assimilations, and the uncategorized phone of an uncategorized-categorized assimilation may even benefit from more rapid acquisition under those circumstances (Tyler 2019). However, for those contrasts where both L2 phones are assimilated to the same L1 category, single-category and category-goodness assimilations, the language learning environment could have profound effects on attunement. In fact, this situation may result in a new type of scenario that was not considered in Best and Tyler (2007). Let us reconsider the case of a category-goodness assimilation under those circumstances. The L1 category to which both L2 phonemes are assimilated would form a common L1-L2 phonological category with the more acceptable L2 phoneme, as in the immersion case. For the more deviant phone, rather than discovering a new L2 phonological category on the basis of attunement to articulatory-phonetic information, as was proposed for the immersion context, the learners could possibly discover the L2 phonological contrast via other sources (e.g., via orthography when it has unambiguous grapheme-phoneme correspondences). Since phonological categories are perceptual units, according to PAM-L2, the learners would need to establish a new L2 phonological category for the more deviant phone to acquire L2 words that preserved the phonological contrast. If they had not yet tuned into the phonetic differences that signal the phonological contrast in the L2, then the new L2 phonological category would correspond to a similar set of phonetic properties as the common L1-L2 category. In that situation learners may continue to have difficulty discriminating the L2 contrast, not because the two phones are assimilated to the same phonological category, but because both phones are perceived as being instances of the same two phonological categories. Just as Faris et al. (2016, 2018) have shown that a pair of non-native phones may fall in a region of phonetic space that corresponds to the same set of L1 phonological categories (i.e., a completely overlapping uncategorized-uncategorized assimilation), it is proposed here that a pair of L2 phones might be perceived as consistent with the same two L2 phonological categories.

The aim of this study is to test whether L2 learners who have acquired English in the classroom prior to immersion perceive phonological overlap between L2 phonological categories for contrasting L2 phones that were likely to have been initially perceived as category-goodness assimilations. Furthermore, with immersion experience, the categories should start to separate, such that learners who have recently arrived in an English-speaking country should exhibit greater category overlap than L2 users who have been living in the L2 environment for a long period (MacKain et al. 1981).

#### *1.3. Perception of English /r/ and /l/ by Japanese Native Listeners*

The learner group for the present study will be Japanese native speakers and the contrast to be tested will be the English /r/–/l/ contrast. Although the /r/–/l/ contrast was originally thought to be a single-category assimilation (Best and Strange 1992), there is now widespread agreement that it is a category-goodness assimilation (Aoyama et al. 2004; Guion et al. 2000; Hattori and Iverson 2009), with /l/ being rated as a more acceptable version of the Japanese /ɾ/ category than is English /r/. Using this contrast allows the results to be interpreted in light of the rich history of investigations into that listener group/contrast combination.

One of the earliest investigations into Japanese identification and discrimination of /r/ and /l/ was conducted by Goto (1971). He established that Japanese native speakers learning English had difficulty identifying and discriminating minimal-pair words, such as *play* and *pray* and concluded that /r/–/l/ pronunciation difficulties were likely to be perceptual in origin. Miyawaki et al. (1975) had participants discriminate steps on a synthetic /r/–/l/ continuum, in which F1 and F2 were held constant and F3 varied. Unlike native English speakers, Japanese native speakers living in Japan with at least 10 years of formal English language training did not show a categorical peak in discrimination, suggesting that they did not perceive a categorical boundary between /r/ and /l/. Interestingly, the Japanese listeners performed similarly to native English listeners when they were presented with a non-speech continuum that contained only the F3 transition (F1 and F2 values were set to zero). Thus, the Japanese listeners were able to detect the differences in frequencies along the F3 continuum, but when the same acoustic patterns were presented in a speech context, they failed to discriminate them.

In contrast to the findings of Miyawaki et al. (1975), which tested discrimination only, Mochizuki (1981) found that a group of Japanese listeners residing in the USA split the continuum into two separate categories in an /r/–/l/ identification task. This naturally led to the hypothesis that Japanese native speakers can acquire the /r/–/l/ distinction given sufficient naturalistic exposure. To test this, MacKain et al. (1981) directly compared the identification and discrimination of an /r/–/l/ continuum by Japanese native speakers who differed in their exposure to English conversation. Both groups were living in the USA at the time of testing; one group had received extensive English conversation training from a native speaker of US English whereas the other group had received little or no native English conversation training. To optimize the possibility that the less experienced group might discriminate the stimuli, MacKain et al. enriched the /r/–/l/ continuum by providing multiple redundant cues. Whereas Miyawaki et al. held F1 and F2 constant while varying F3, the stimuli of MacKain et al. varied along all three dimensions. In spite of the redundant cues, the less experienced group showed no evidence of categorical perception and their discrimination was close to chance. The more experienced group, on the other hand, perceived the stimuli categorically, and in a similar way to native US-English speakers, and although their discrimination was less accurate than the native speakers, the shape of the response function was similar. They concluded that it was indeed possible for Japanese native speakers to acquire categorical perception of English /r/ and /l/ that approximates that of native speakers.

An individual-differences approach to identification of /r/ and /l/ by Japanese native speakers was undertaken by Hattori and Iverson (2009). Their 36 participants varied in age (19 to 48 years), amount of formal English instruction (7 to 25 years), and the amount of time spent living in an English-speaking country (1 month to 13 years, with a median of 3 months). Participants completed two types of identification task. One was an /r/–/l/ identification task, in which participants heard minimal-pair English words, such as *rock* and *lock*, and indicated whether they began with /r/ or /l/. There was a wide range of accuracy, from close to chance to 100% correct, and a mean of 67% correct. The other identification task was a "bilingual" task, with consonant-vowel syllables formed from the combination of five vowels and three consonants (English /r/, English /l/, and Japanese /ɾ/). Participants were asked to indicate whether the first sound was R, L, or Japanese R. The identification of /r/ was quite accurate, at 82% correct, and it was almost never confused with /ɾ/ (2%). Identification was less accurate for /l/ (58% correct), with errors split evenly between /r/ and /ɾ/. Japanese /ɾ/ was identified reasonably accurately (77% correct), with most confusions occurring with /l/ (16%). Clearly, the fewest confusions occurred between /r/ and /ɾ/, and the most confusion occurred for /l/ with both /r/ and /ɾ/. While the authors suggested that /l/ appears to be assimilated to Japanese /ɾ/, they did not observe a correlation between /r/–/l/ identification accuracy and the degree of confusion between /ɾ/ and /r/. That is, those who performed poorly on /r/–/l/ identification appear not to have done so because they assimilated both /r/ and /l/ to Japanese /ɾ/. Instead, they suggested that the learners had an /r/ category that was not optimally tuned to English. It is possible that those results could be explained by perceived phonological overlap. The results as a whole are consistent with the idea being proposed here that the participants had developed a new L2 phonological category for /r/, with a corresponding English [ɹ] phonetic category, and that /l/ had assimilated to /ɾ/ to form a common L1-L2 phonological category, with language-specific [l] and [ɾ] phonetic variants.<sup>2</sup> Variability in identification for /r/, /l/, and /ɾ/ could be explained by the different patterns of phonetic and phonological overlap between the categories.

#### *1.4. The Present Study*

The aim of this study is to test the idea that learners who had acquired English in a formal learning setting may perceive phonological overlap when they encounter L2 phones, and that the overlap decreases with immersion experience in an L2 environment. Faris et al. (2018) determined overlap in L1 cross-language speech perception using a categorization task with goodness rating. Across the sample, if the same set of categories was selected above chance for two non-native vowels then the contrast was deemed to be overlapping. While this approach has been shown to provide a reliable indication of overlap in cross-language speech perception, particularly for vowels where categorization is more variable than it is for consonants, there is an assumption built into the categorization task that an individual only perceives one phonological category in the stimulus. If the participant perceives the stimulus as consistent with more than one category, then the categorization task may only provide an imperfect approximation of the amount of overlap. For example, if the stimulus is clearly more acceptable as one category than another, then the participant may only ever select the best-fitting category, and the task would fail to reveal perceived phonological overlap.

For the purposes of this study, a task is required that can identify perceived phonological overlap, for a given L2 phone, and that is capable of detecting differences in the amount of overlap between groups. This can be achieved by eliminating the categorization stage and simply rating the goodness of fit of each stimulus to a phonological category that is provided on each trial—a *forced category goodness rating task*. For example, participants would be presented with an English category label, e.g., "R as in ROCK", and an auditory stimulus, and they would be asked to rate the goodness of fit of the auditory stimulus to the given category. That task will be used in the present study to assess category overlap in Japanese learners of English with more versus less experience in an immersion environment, and in a native English control group.

The stimuli will be the 10-step rock-lock continuum, developed at Haskins Laboratories (Best and Strange 1992; Hallé et al. 1999; MacKain et al. 1981), that uses multiple redundant cues for F1, F2, and F3. The /w/ and /j/ end points of the wock-yock continuum will also be included as control stimuli, as participants are expected to indicate that those stimuli are not similar to either /r/ or /l/. Participants will rate the goodness of fit of each stimulus to four English phonological categories: /l/, /r/, /w/, and /j/. Results from across the continuum will give insight into the internal structure of the phonological category, at least along one axis of acoustic-phonetic variability, and they will also provide a link to previous studies that have used those stimuli. To further assist with such comparisons, participants will complete discrimination and /r/–/l/ identification tasks in addition to the forced category goodness rating task. In line with those previous studies, native English listeners should show categorical perception of stimuli from the rock-lock continuum in both identification and discrimination, with progressively less categorical perception observed for more experienced, then less experienced Japanese listeners.

<sup>2</sup> Note that square brackets are used here to denote a phonetic category rather than a specific phone.

In the forced category goodness rating task, native English listeners should rate [w], [j], and the [ɹ] and [l] ends of the continuum as good examples only for the corresponding phonological category (e.g., /w/ for [w]), and as low on each of the other three categories. Ratings for the continuum steps should vary in a similar way to the identification task. The more and less experienced Japanese participants should both rate [w] and [j] as good only for the corresponding phonological category, and low on each of the other three categories. If participants perceive phonological overlap for stimuli along the rock-lock continuum, then forced goodness ratings will be above the lowest rating for more than one phonological category. It is anticipated that the ratings as /r/ and /l/ will be high for both continuum end points, but that the difference between /r/ and /l/ goodness ratings for the same stimulus (e.g., [ɹ]) will be greater for the more experienced than for the less experienced group. That is, the overlap should be smaller for more experienced than less experienced Japanese learners of English.

#### **2. Materials and Methods**

#### *2.1. Participants*

Japanese native speakers were recruited from Sydney, Australia, via word-of-mouth, noticeboard advertisements at local universities, and a Japanese-language electronic bulletin board service targeted at expatriate Japanese people living in Sydney. The aim was to recruit two samples of Japanese native speakers: (1) migrants who had been living in Australia for a long period, and (2) recent arrivals. Fifty-five Japanese native speakers were tested. Data were discarded for four participants who were immersed in an English-speaking country at the age of 16 or younger, and for one participant who was raised in Hong Kong. To establish a clear difference in length of residence (LOR) between the more experienced and less experienced groups, data were retained for participants who had been living in Australia for a minimum of 2 years, or for 3 months or less. Data for 13 participants with LORs ranging from 5 to 19 months were therefore discarded (*M* = 0.89 years).

The final sample of Japanese native speakers consisted of 19 more experienced English users (16 females, *M*<sub>age</sub> = 38 years, Age Range: 21 to 59 years, *M*<sub>LOR</sub> = 8 years, LOR Range: 2 to 27 years) and 18 less experienced English users (13 females, *M*<sub>age</sub> = 25 years, Age Range: 20 to 35 years, *M*<sub>LOR</sub> = 8 weeks, LOR Range: 1 to 13 weeks). The participants were given a small payment for their participation in the study.

The participants were asked to report any languages that they learned outside the home (i.e., in a formal education context) and the age at which they began to learn them. For the more experienced group, all participants began to learn English in Japan between 9 and 13 years of age (*M* = 12 years, *SD* = 0.91 years). One participant did not report the number of years of English study, but the remainder reported between 6 and 15 years (*M* = 9 years, *SD* = 2.41 years). For the less experienced group, the participants began to learn English in Japan between 5 and 13 years of age (*M* = 11 years, *SD* = 1.9 years). Two participants did not report the number of years of English study, with the remainder completing between 6 and 15 years (*M* = 10 years, *SD* = 2.42 years).

The Australian-English native speakers were recruited from the graduate and undergraduate student population at Western Sydney University, Australia, who received course credit for participation. There were 16 participants (14 females, *M*<sub>age</sub> = 21 years, Age Range: 18 to 35 years). Data for an additional four participants were collected but discarded due to childhood acquisition of a language other than English (*n* = 1), self-reported history of a language disorder (*n* = 2), or brain injury (*n* = 1).

#### *2.2. Stimuli and Apparatus*

Participants were presented with the 10-step /rak/–/lak/ (*rock-lock*) continuum that was first used in MacKain et al. (1981), and the endpoints of the /wak/–/jak/ (*wock-yock*) continuum from Best and Strange (1992); see also (Hallé et al. 1999). The first consonant and vowel portions of the stimuli were generated with the OVE-IIIc cascade formant synthesizer at Haskins Laboratories, and the /k/ was appended to the synthesized syllables. See the original articles for additional stimulus details, including F1, F2, and F3 parameters. A questionnaire was used to collect basic demographic information, and information about the participants' language learning history. The experiment was run on a MacBook laptop running Psyscope X B50 (http://psy.ck.sissa.it/). Participants listened to stimuli through Koss UR-20 headphones set at a comfortable listening level.

#### *2.3. Procedure*

Participants completed the forced category goodness rating task, followed by /r/–/l/ identification, and AXB discrimination. The language background questionnaire was completed at the end of the session.

*Forced category goodness rating*. Participants were instructed that on each trial they would hear a syllable in their headphones and that their task would be to decide how similar the first sound of the syllable was to one of four English consonant categories, presented on screen (R as in ROCK, L as in LOCK, W as in WOCK, or Y as in YOCK). They were asked to rate the similarity on a 7-point scale, using the numbers on the computer keyboard, where 1 indicated that it was highly similar to the given category, 4 was somewhat similar, and 7 was highly dissimilar.<sup>3</sup> Participants were encouraged to try to use the entire scale from 1–7 across the experiment, and they were instructed not to reflect too long on their response. If they did not respond within 4 s, the trial was aborted, and a message instructed them to respond more quickly. Missed trials were repeated later in the list to ensure that each participant provided a full set of rating data. Each of the 12 stimulus tokens (10 /r/–/l/ continuum steps plus the /w/–/j/ endpoints) was presented five times in the context of each of the four rating categories, resulting in a total of 240 trials. Stimuli were randomized within each of the five blocks. To maintain participant vigilance and to give an opportunity to take a short break, the participant was asked to press the space bar to continue every 20 trials. The task took approximately 20 min to complete.
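The trial structure of the rating task can be sketched in a few lines of Python. This is only an illustration of the counts described above, not the Psyscope script that was used: the stimulus labels, the function names, and the assumption that each block contains every stimulus-category pairing exactly once are all hypothetical. The `reverse_score` helper illustrates the scale reversal described in footnote 3.

```python
import random

# Hypothetical labels: steps 1-10 of the rock-lock continuum plus the two
# wock-yock endpoints (12 stimulus tokens in total).
STIMULI = [f"step{n}" for n in range(1, 11)] + ["w", "j"]
CATEGORIES = ["R as in ROCK", "L as in LOCK", "W as in WOCK", "Y as in YOCK"]
N_BLOCKS = 5


def build_rating_trials(seed=0):
    """Return a list of (block, stimulus, category) tuples.

    Assumption: every block pairs each stimulus with each category once,
    and trial order is shuffled within each block.
    """
    rng = random.Random(seed)
    trials = []
    for block in range(N_BLOCKS):
        block_trials = [(block, s, c) for s in STIMULI for c in CATEGORIES]
        rng.shuffle(block_trials)
        trials.extend(block_trials)
    return trials


def reverse_score(raw):
    """Map a raw response (1 = highly similar, 7 = highly dissimilar)
    onto the reported scale (7 = highly similar, 1 = highly dissimilar)."""
    if raw not in range(1, 8):
        raise ValueError("rating must be an integer from 1 to 7")
    return 8 - raw


trials = build_rating_trials()
print(len(trials))  # 12 stimuli x 4 categories x 5 blocks
```

Under these assumptions each block holds 48 trials, which matches the reported total of 240.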

*Identification*. The 10 steps of the /rak/–/lak/ continuum were used in the identification task. Participants were instructed to listen to syllables through headphones and indicate whether the first sound was more like "r" as in "rock" or "l" as in "lock". They responded using the D and L keys on the keyboard, which were labeled with "R" and "L", respectively. The letters R and L were also displayed on the left- and right-hand side of the screen. Participants were instructed not to reflect for too long on their response. The trial timed out after 2 s, which was followed by a message on screen to respond more quickly. Missed trials were reinserted at a random point in the remaining trial sequence. Participants were presented with 20 randomly ordered blocks of the 10 steps (200 trials in total) and they pressed the space bar at the end of each block to continue. The test took approximately 10 min to complete.

*AXB Discrimination*. Following Best and Strange (1992), participants were tested using AXB discrimination. Three tokens were presented sequentially. The first and third were different steps on the continuum and the middle item was identical to the first or the third token. Participants were tested on steps that differed by three along the continuum ([ɹ]-4, 2-5, 3-6, 4-7, 5-8, 6-9, 7-[l]). Stimuli were presented in all four possible AXB trial combinations (AAB, ABB, BAA, BBA). These 28 trials (7 step pairs × 4 AXB trial combinations) were randomized within blocks, which were presented 5 times. Each step pair was therefore presented 20 times in total across 140 trials. Participants were told that on each trial they would hear three syllables in their headphones one after the other, that the first and third syllables were different, and the second syllable was either the same as the first or the third syllable. They were instructed to indicate whether the second syllable was the same as the first or the third syllable, using the keys 1 and 3 on the computer keyboard, basing their decision only on the first sound in each syllable (i.e., the consonant). The software would not allow the participants to respond until the third syllable had finished playing. Participants were instructed not to reflect for too long on their response. A trial timed out after 2 s and was repeated later in the experiment. It took around 5 min to complete.

<sup>3</sup> Although participants had no difficulty using this scale, the scale is reversed in the results section to assist the reader in interpreting the data patterns. That is, 7 is reported as *highly similar* and 1 is reported as *highly dissimilar*.
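The AXB design described above (7 step pairs, 4 presentation orders, 5 blocks) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the software used in the study: the function and field names are hypothetical, and continuum steps are represented as the integers 1 to 10, with step 1 standing for the /r/ endpoint and step 10 for the /l/ endpoint.

```python
import random

# Step pairs that differ by three along the 10-step continuum: 1-4, 2-5, ..., 7-10.
STEP_PAIRS = [(a, a + 3) for a in range(1, 8)]
ORDERS = ["AAB", "ABB", "BAA", "BBA"]


def build_axb_trials(n_blocks=5, seed=0):
    """Return AXB trials as dicts with the step pair, the three tokens
    in presentation order, and the correct response key (1 or 3)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_blocks):
        block = []
        for a, b in STEP_PAIRS:
            for order in ORDERS:
                tokens = tuple(a if ch == "A" else b for ch in order)
                # The middle token (X) matches either the first or the third token.
                correct_key = 1 if tokens[1] == tokens[0] else 3
                block.append(
                    {"pair": (a, b), "tokens": tokens, "correct_key": correct_key}
                )
        rng.shuffle(block)  # randomize the 28 trials within each block
        trials.extend(block)
    return trials


trials = build_axb_trials()
print(len(trials))  # 7 pairs x 4 orders x 5 blocks
```

Each step pair appears 20 times in the resulting list, and the correct key is 1 on half of the trials and 3 on the other half, so a response bias toward one key cannot by itself raise accuracy above 50%.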

#### **3. Results**

#### *3.1. Forced Category Goodness Rating*

Participants rated the 10 steps of the /r/–/l/ continuum and the endpoints of the /w/–/j/ continuum against the categories /l/, /r/, /w/, and /j/. The results for the three groups are presented in Figure 1. The top-left panel (a) shows the data for less experienced Japanese listeners, the top-right panel (b) for more experienced Japanese listeners, and the bottom panel (c) for Australian-English listeners. The auditory stimulus is plotted on the *x*-axis. Step 1 of the /r/–/l/ continuum is denoted as [ɹ], step 10 is [l], and the other steps are denoted by their step number. The endpoints of the /w/–/j/ continuum are denoted as [w] and [j]. The mean ratings are plotted on the *y*-axis. Thus, the top left point on the English native listener plot represents participants' mean ratings for how well auditory [ɹ] fit the /r/ category, "R"; the bottom left point represents their ratings for auditory [ɹ] to the /l/ category, "L".

**Figure 1.** Mean forced goodness ratings for each auditory stimulus, to /l/ (L), /r/ (R), /w/ (W), and /j/ (Y) categories, where 1 is "highly dissimilar" and 7 "highly similar". (**a**) Less experienced Japanese (<3 months immersion); (**b**) More experienced Japanese (>2 years immersion); (**c**) Native English listeners. Error bars represent standard error of the mean.

The Australian-English group results, shown in the bottom panel of Figure 1, show a classic categorical perception pattern for the /r/–/l/ continuum for ratings as /l/ and /r/, with the cross-over point between steps 6 and 7. The phones [w] and [j] were each rated as highly similar to /w/ and /j/, respectively, and highly dissimilar to any other category. It is interesting to note that the ambiguous regions of the /r/–/l/ continuum also appear to have attracted higher acceptability ratings for /w/ and /j/ than the /r/–/l/ endpoints did. To test this, the ratings to /w/ and /j/ for step 7 were compared to those for [l] and [ɹ], respectively, in separate 2 × 2 repeated measures analyses of variance (ANOVAs). For step 7 versus [l], there was a main effect of step, *F*(1, 15) = 20.79, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.27,<sup>4</sup> and a significant two-way interaction between step and category, *F*(1, 15) = 9.09, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.06, indicating that the rating difference between step 7 and [l] was greater for /w/ than for /j/. Simple effects tests showed that the rating difference between step 7 and [l] was significant for both /w/, *F*(1, 15) = 20.92, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.38, and /j/, *F*(1, 15) = 8.46, *p* = 0.01, η<sup>2</sup><sub>G</sub> = 0.13. For step 7 versus [ɹ], there were main effects of step, *F*(1, 15) = 26.92, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.22, and category, *F*(1, 15) = 10.36, *p* = 0.006, η<sup>2</sup><sub>G</sub> = 0.06, but no interaction, *F*(1, 15) = 3.86, *p* = 0.07. This suggests that both step 7 and [ɹ] are more /w/-like than /j/-like, and that, overall, the ratings for step 7 are higher than those for [ɹ]. Together, these results indicate that the most ambiguous step was perceived as somewhat /w/-like and weakly /j/-like, in addition to being perceived as somewhat /r/- and /l/-like. The [l] endpoint was not perceived as similar to /w/ or /j/, and it appears that the listeners may have perceived the [ɹ] endpoint to have some similarity to /w/ but not /j/.

An initial comparison of the pattern of results for the two Japanese groups, in the top-left and top-right panels of Figure 1, suggests that there may be a difference in their ratings of the /r/–/l/ continuum. Considering first the endpoints, the less experienced group appear to rate [l] as both an acceptable /l/ and an acceptable /r/, whereas [ɹ] appears to be rated as a more acceptable /r/ than /l/. This supports the idea that they perceive a phonological overlap between /r/ and /l/ for both [ɹ] and [l], but at first glance it suggests that they already have a reasonable sensitivity to the difference between /r/ and /l/ at the [ɹ] end of the continuum. It is important to note, however, that the ratings as /r/ across the entire continuum are uniformly high. The separation in ratings for [ɹ] as /r/ and as /l/ may be due to that phone being perceived as a poorer /l/ rather than a more acceptable /r/. The more experienced Japanese group, on the other hand, appear to have rated [ɹ] as a more acceptable /r/ than /l/, and [l] as a more acceptable /l/ than /r/. The lower of the two goodness ratings for both [ɹ] and [l] is around 4 ("Somewhat Acceptable"). There is also a remarkable similarity between the shape of the response curve for ratings as /l/ for both groups at the /l/-end of the continuum, which is consistent with the idea that they have established a common L1-L2 category for English /l/ and Japanese /ɾ/.

To test these observations, the ratings as /r/ and /l/ for the 10 steps of the /r/–/l/ continuum were subjected to a 2 × (2) × (10) mixed ANOVA. The between-subjects variable was *group* (less experienced vs. more experienced Japanese listeners) and the two within-subjects variables were *category* (/l/ vs. /r/) and *step*. There were main effects of category, *F*(1, 35) = 40.47, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.09, and step, *F*(9, 315) = 12.17, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.09, and a significant two-way interaction between them, *F*(9, 315) = 18.96, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.13. The differential responding by the more experienced and less experienced groups, which can be seen in Figure 1, was confirmed by a significant three-way interaction between category, step, and group, *F*(9, 315) = 2.67, *p* = 0.005, η<sup>2</sup><sub>G</sub> = 0.02. To explore the three-way interaction further, separate two-way ANOVAs were run for each group. There were two-way interactions between category and step for both the more experienced, *F*(9, 162) = 14.93, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.19, and the less experienced groups, *F*(9, 153) = 5.20, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.07. Another set of two-way ANOVAs was conducted to test whether the groups differed for each category. There was a two-way interaction between step and group for /r/, *F*(9, 315) = 2.59, *p* = 0.007, η<sup>2</sup><sub>G</sub> = 0.04, but not for /l/, *F*(9, 315) = 0.68, *p* = 0.73, suggesting that the differences between the groups can be accounted for by improvements in perception of their /r/ category only. To test for differences in goodness ratings as /r/ versus /l/ at each continuum step, post-hoc paired *t*-tests were run separately for each group, with a Bonferroni-adjusted alpha level of 0.005. The results are presented in Table 1. For the less experienced group, there are significant differences between ratings as R versus L at steps [ɹ] through 4, and for step 6. For the more experienced group, the ratings are also different for steps [ɹ] through 4. Importantly, and unlike the less experienced group, they are also different for step [l].

<sup>4</sup> Generalized eta squared (η<sup>2</sup><sub>G</sub>) is a measure of effect size that is appropriate for mixed designs (Olejnik and Algina 2003). It is compatible with Cohen's (1988) benchmarks for interpreting eta squared (small = 0.01, medium = 0.06, and large = 0.14).

**Table 1.** Post-hoc paired *t*-tests for ratings as /r/ versus /l/ at each continuum step, for less experienced and more experienced Japanese listeners.


Values in boldface are significant at a Bonferroni-adjusted alpha rate of 0.005.

Like the Australian-English listeners, both Japanese groups rated [w] and [j] as highly similar to /w/ and /j/, respectively, and not to any other category. The ratings as /w/ across the /r/–/l/ continuum appear to be similar in the groups. They also appear to have similar ratings as /j/, but with a flatter response than the English native listeners. The ratings as /w/ and /j/ for step 7 and [l], and for step 7 and [ɹ], were compared for the two Japanese groups using separate 2 × (2) × (2) ANOVAs. The between-subjects variable was *group* (less experienced vs. more experienced) and the within-subjects variables were *category* (/w/ vs. /j/) and *step* (step 7 vs. [l] or step 7 vs. [ɹ]). For step 7 versus [l], there were main effects of category, *F*(1, 35) = 35.77, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.20, and step, *F*(1, 35) = 21.22, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.10, and a significant two-way interaction between them, *F*(1, 35) = 15.90, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.07. Crucially, there was no three-way interaction between category, step, and group, *F*(1, 35) = 0.05, *p* = 0.83, suggesting that the more and less experienced Japanese listeners responded similarly to each other. The significant two-way interaction was further probed with tests of simple effects, which showed that the differences in ratings for step 7 and [l] were significant for ratings as /w/, *F*(1, 35) = 22.08, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.03, but not for ratings as /j/, *F*(1, 35) = 1.37, *p* = 0.25. The results were similar for step 7 versus [ɹ], with main effects of category, *F*(1, 35) = 48.30, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.27, and step, *F*(1, 35) = 13.73, *p* = 0.001, η<sup>2</sup><sub>G</sub> = 0.06, and a significant two-way interaction between them, *F*(1, 35) = 4.18, *p* < 0.001, η<sup>2</sup><sub>G</sub> = 0.02.

#### *3.2. Identification*

The mean percent L responses for identification of steps from the /r/–/l/ continuum are presented in Figure 2. The native English listeners show the classic ogive-shaped categorical perception function that was observed in previous studies using the same stimuli (Best and Strange 1992; Hallé et al. 1999; MacKain et al. 1981). The boundary location appears to be between steps 6 and 7, mirroring the results from the forced category goodness rating task. This seems to be closer to the /l/ end of the continuum than the American-English listeners in the previous studies, whose boundary location was between steps 5 and 6. The more experienced Japanese participants appear to have a shallower function than the native English participants and the endpoints of the continuum do not reach 0% or 100% (at 19% and 77%, respectively). The less experienced Japanese participants' function appears to be even shallower than the more experienced participants' and their endpoints are closer to the chance level of 50% ([ɹ] = 32%, [l] = 58%).

**Figure 2.** Mean percent "L" responses from the R-L identification task for the three participant groups. Error bars represent standard errors of the mean.

Given the English listeners' ceiling performance for steps [ɹ]–4, 9, and [l], and the corresponding lack of variability, it is not appropriate to use standard parametric tests to compare their performance to the Japanese groups'. It is clear that they perform differently from the more experienced Japanese listeners. Furthermore, given that the shapes of the response curves for most of the Japanese participants in this data set do not appear to follow an ogive-shaped function, the data have not been fit to a cumulative distribution function, as they were in previous studies (Best and Strange 1992; Hallé et al. 1999). A 2 × (10) ANOVA was conducted to test whether the response curves differed for more experienced versus less experienced Japanese listeners. The between-subjects factor was *experience* and the within-subjects factor was *step*. The shape of the response curve was tested using orthogonal polynomial trend contrasts on the step factor: linear, quadratic (one turning point), cubic (two turning points), and quartic (three turning points) (see Winer et al. 1991, for contrast coefficients). There was no significant main effect of experience, *F*(1, 35) = 0.75, but there were significant overall linear, *F*(1, 35) = 60.39, *p* < 0.001, quadratic, *F*(1, 35) = 7.89, *p* = 0.008, cubic, *F*(1, 35) = 17.76, *p* < 0.001, and quartic trends, *F*(1, 35) = 5.34, *p* = 0.02. Those significant trend contrasts simply indicate that the curve has a complex shape. The important question is whether there are any significant interactions between the trend contrasts and experience. The only significant interaction was with the linear trend contrast, *F*(1, 35) = 7.91, *p* = 0.008. Therefore, while there is no evidence for a difference in the shape of the Japanese participants' response curve, the significant interaction shows that the more experienced group has a generally steeper function than the less experienced group, as can be seen in Figure 2.
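The orthogonal polynomial trend contrasts mentioned above can be sketched as follows. This is a minimal illustration, assuming equally spaced continuum steps and deriving the contrast vectors by Gram-Schmidt orthogonalization rather than taking them from the tables in Winer et al. (1991); the resulting vectors are unscaled, so they match tabled coefficients only up to a constant multiplier. All function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def orthogonal_polynomial_contrasts(n_levels, max_degree):
    """Contrast vectors for degrees 1..max_degree over equally spaced levels."""
    # Centre the level positions so the linear vector already sums to zero.
    xs = [i - (n_levels - 1) / 2 for i in range(n_levels)]
    basis = [[1.0] * n_levels]  # degree-0 (constant) vector
    for degree in range(1, max_degree + 1):
        vec = [x ** degree for x in xs]
        # Gram-Schmidt: subtract the projection onto each lower-degree vector.
        for prev in basis:
            coef = dot(vec, prev) / dot(prev, prev)
            vec = [v - coef * p for v, p in zip(vec, prev)]
        basis.append(vec)
    return basis[1:]  # drop the constant vector


def trend_score(curve, contrast):
    """Weighted sum of a response curve under one trend contrast."""
    return dot(curve, contrast)


# Linear, quadratic, cubic, and quartic contrasts for a 10-step continuum.
linear, quadratic, cubic, quartic = orthogonal_polynomial_contrasts(10, 4)
```

Applying `trend_score` to a participant's 10-step response curve gives one score per trend, and a group difference in those scores corresponds to a trend-by-group interaction of the kind reported above. A perfectly straight response curve yields a quadratic score of zero, because the quadratic contrast is orthogonal to both the constant and the linear components.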

#### *3.3. Discrimination*

Mean percent correct responses for discrimination of the seven pairs of steps from the continuum are presented in Figure 3. Again, there are clear differences between the English and Japanese listeners' performance. The English listeners show the classic categorical perception discrimination response, with poorer discrimination for steps that are on the same side of the categorical boundary and more accurate discrimination for steps that cross the category boundary. The Japanese participants, on the other hand, show a clear double peak.

**Figure 3.** Mean percent correct discrimination of pairs of steps from the /r/–/l/ continuum for the three participant groups. Error bars represent standard errors of the mean.

As the performance of the native speakers is not at ceiling on this task, it is possible to conduct a 3 × (7) ANOVA comparing all three groups. The between-subjects factor was *group*, with two planned contrasts. The *language background* contrast compared the English listeners with the combined results of the two Japanese groups and the *experience* contrast compared the two Japanese groups only. The within-subjects factor was *step pair*, and the shape of the curve was tested again using orthogonal polynomial contrasts. The language background contrast was not significant, *F*(1, 50) = 3.47, but a significant experience contrast showed that the more experienced group performed more accurately, overall, than the less experienced group, *M*<sub>difference</sub> = 5.53%, *SE* = 2.74, *F*(1, 50) = 4.08, *p* = 0.049. There was a significant overall linear trend, *F*(1, 50) = 27.48, *p* < 0.001, reflecting the gradual increase in accuracy, collapsed across all three groups, from the [ɹ]-end to the [l]-end of the continuum. There was no interaction between the linear trend contrast and language background, but it interacted significantly with experience, *F*(1, 50) = 4.49, *p* = 0.04. This can be seen in Figure 3, where the less experienced group appears to be relatively less accurate at the [ɹ]-end than the [l]-end of the continuum, whereas the more experienced group shows more similar levels of accuracy at both ends. There was also a significant overall quadratic trend, *F*(1, 50) = 33.81, *p* < 0.001, which interacted with the language background contrast only, *F*(1, 50) = 20.90, *p* < 0.001. This reflects the fact that the English participants' responses show a quadratic shape, whereas the Japanese participants' responses appear to follow a quartic shape.
Indeed, there was no significant cubic trend, and no interactions, and no significant overall quartic trend, but there was a significant interaction between the quartic trend and language background, *F*(1, 50) = 9.67, *p* = 0.003.

#### **4. Discussion**

The aim of this study was to test whether certain L2 learners, who initially acquired their L2 in a formal learning context, might perceive an L2 phone as an instance of more than one L2 phonological category, and whether that perceived phonological overlap is smaller with longer versus shorter periods of L2 immersion. To test for perceived phonological overlap, participants completed a forced category goodness rating task, where they were presented with an auditory token and rated its goodness of fit to a given English phonological category label, L, R, W, or Y. It was hypothesized that Japanese native speakers who had first been exposed to English in Japan would perceive a phonological overlap between /r/ and /l/, but that the overlap would be smaller for those with a long period of immersion in an English-speaking environment (>2 years) than those with a short period of immersion (<3 months). Native English speakers rated [ɹ] as highly similar to /r/, but not /l/, /w/, or /j/, and [l] as highly similar to /l/, but not to the other three categories. In contrast, the Japanese native speakers rated [ɹ] as highly similar to /r/, moderately similar to /l/, and dissimilar to /w/ and /j/. The two groups of Japanese speakers differed from each other in their ratings of [l]. The less experienced Japanese group rated [l] as highly similar to *both* /r/ and /l/, whereas the more experienced Japanese group rated [l] as highly similar to /l/ and only moderately similar to /r/. In short, the hypotheses were supported. When the Japanese listeners perceived stimuli along the continuum between /r/ and /l/, they perceived varying degrees of phonological overlap with /r/ and /l/, and the overlap was smaller for the more experienced than the less experienced group. This study is cross-sectional, so it is not possible to conclude definitively that there is a reduction in overlap due to immersion experience, but these results are certainly consistent with that idea.

The general pattern of results observed for the less experienced group was predicted by a learning scenario, proposed here, that had not previously been considered by Best and Tyler (2007) when they outlined PAM-L2. The scenarios presented in Best and Tyler were based on functional monolinguals who were acquiring an L2 in an immersion setting. In contrast, the participants in both Japanese groups had learned English in Japan prior to immersion. As the /r/–/l/ contrast is a category-goodness assimilation for Japanese native speakers (e.g., Guion et al. 2000), the PAM-L2 learning scenario suggests that [l] would initially be perceived as a common L1-L2 phonetic category with Japanese [ɾ], and that Japanese /ɾ/ and English /l/ would form a common L1-L2 phonological category. English [ɹ] would first be perceived as an allophonic variant of the common L1-L2 /ɾ/–/l/ category, and then a new L2-only /r/ phonological category would be established when the learner recognized that the phonetic contrast between /r/–/l/ signaled a phonological distinction. Thus, the non-native category-goodness contrast would become an L2 two-category contrast. Such a situation should have resulted in a data pattern that resembled the native English speakers' results. Since the Japanese native speakers in this study acquired English in a formal learning situation in Japan, it was argued here that they may have needed to establish a new phonological category for /r/ before they had managed to tune in to its phonetic properties. That learning scenario is consistent with the data pattern observed in this study. The participants were able to indicate that steps along the continuum were perceived as similar to /r/, but they rated the same stimuli as also having various degrees of similarity to /l/. The fact that they rated the steps as being dissimilar to /j/ shows that they were not simply indicating that any L2 consonant was similar to /r/.
Rather, it is consistent with the idea that they had developed a new phonological category for /r/ that was poorly tuned to the phonetic properties that distinguish it from /l/. In PAM terms, both /r/ and /l/ would be uncategorized clustered assimilations. Rather than becoming a two-category assimilation, /r/–/l/ would be a completely overlapping uncategorized-uncategorized contrast (Faris et al. 2016, 2018).

A close inspection of Figure 1 shows that the response pattern for ratings as /l/ appears to be similar for the less and more experienced groups, with stimuli closer to [ɹ] rated as poorer /l/s than those closer to the [l] end. Indeed, follow-up analyses of the three-way interaction between group, category, and step showed that the difference between the two groups was entirely due to differences in ratings as /r/. The less experienced group gave high ratings as /r/ across the continuum, whereas the more experienced group gave lower ratings as /r/ towards the [l]-end of the continuum. The finding that the ratings as /l/ were unaffected by immersion experience fits with the idea that English /l/ is a common L1-L2 category with Japanese /ɾ/, as an L1 category may be more resistant to change with L2 experience than a new L2-only phonological category. Future research could investigate this further by including Japanese /ɾ/ as a rating category and auditory stimulus, in addition to English /r/ and /l/.

Presenting the /r/–/l/ continuum, rather than isolated tokens of /r/ and /l/, allows us to compare the internal structure of the phonological categories. It is clear from comparing the three groups in Figure 1 that the English native listeners have a much clearer separation of their /r/ and /l/ categories than the Japanese listeners. Steps [ɹ]–4 have phonetic properties that appear to be prototypically /r/ for the English listeners, and those of steps 9–[l] appear to be prototypically /l/. The uniformly high ratings as /r/ across the continuum for the less experienced Japanese group suggest that their L2 English /r/ category may initially cover a broad region of phonetic space, with minimal differences in phonetic goodness of fit. Improvements that have been observed previously after periods of immersion may, therefore, be due more to improvements in category tuning to /r/ than /l/ (e.g., MacKain et al. 1981). Training regimes for improving /r/–/l/ perception in Japanese native listeners have traditionally focused on identifying minimal-pair words (e.g., Bradlow et al. 1997; Hattori and Iverson 2009; Lively et al. 1993; Lively et al. 1992; Lively et al. 1994), that is, on recognizing those phonetic characteristics that define category membership of /r/ versus those that define /l/. If the learner perceives a phonological overlap, and the token is perceived as an equally good example of both /r/ and /l/, then the utility of that training may be limited. The findings of the present study suggest that there may be some benefit in training learners to recognize which tokens should be a good versus poor fit to a category rather than identifying which category they belong to, before transitioning to training regimes that focus on detecting the relational phonetic differences that characterize phonological distinctions (e.g., using a discrimination task).

#### *4.1. Identification and Discrimination*

In contrast to the forced category goodness rating task, where participants rated the phonetic goodness of fit to a phonological category that was provided on each trial, the identification task required them to make a forced choice between two phonological categories. In an identification task, a participant could decide that a given stimulus was /r/, for example, because either it clearly sounded like /r/ or it clearly did *not* sound like /l/. As the forced category goodness rating task provides some insight into the listeners' judgement of the extent to which each continuum step resembled or did not resemble /l/ or /r/, their ratings should be related to their identification response function. As the forced category goodness ratings as /l/ and /r/ are basically mirror images of each other for the English native listeners, they could have made their decision using either criterion, and their identification response curve would have been similar to the one presented in Figure 2. For both Japanese groups, the separation in forced category goodness ratings as /l/ versus /r/ is wider at the [ɹ]-end than the [l]-end of the continuum. This is reflected in the identification accuracy, given that both groups were more accurate at identifying [ɹ] than [l] (see Figure 2). That result is consistent with Hattori and Iverson (2009), and with a recent study where Japanese listeners were more accurate for /r/ than /l/ in a *ranby* versus *lanby* identification task (Kato and Baese-Berk 2020). The authors of that study suggested that the asymmetry was due to a bias towards the category that was more dissimilar to the closest native category. The results of this study complement that conclusion by providing a novel theoretical explanation for the source of the bias. Japanese listeners identify /r/ more accurately than /l/ because the perceived phonological overlap between /r/ and /l/ is less pronounced for /r/ than it is for /l/.
It would be interesting to see whether the same pattern of overlap is observed for the more dissimilar phone in category-goodness contrasts from other languages and language groups.

Given that the two groups did not differ on their ratings as /l/ in the forced category goodness rating task, they should have had similar judgements in the identification task about what did and did not sound like /l/. Relative differences between the groups' identification function should be attributable to their divergent ratings of each step as /r/. Indeed, all of the steps were acceptable-sounding versions of /r/ for the less experienced group (Figure 1a) and the shape of their identification function (Figure 2) is strikingly similar to that for their ratings as /l/. The more experienced group appears to have a slightly wider separation than the less experienced group between ratings as /r/ and /l/ for steps [ɹ]–4 (Figure 1b), which is reflected in identification by a lower percentage of "L" responses on steps [ɹ]–4. Ratings as /r/ dropped and remained steady from steps 5–[l] (Figure 1b) and this corresponds to the relatively higher accuracy of the more versus less experienced group on identification of those steps (Figure 2). Thus, the forced category goodness rating task has provided insights into the reason why Japanese listeners do not show categorical perception across an /r/–/l/ continuum. Differences in goodness of fit to simultaneously perceived phonological categories modulate their judgements of whether the stimulus is or is not a member of each category.

MacKain et al. (1981) compared identification and discrimination of the same continuum by Japanese native speakers living in the US who had undergone intensive conversation training in English, another group with little or no such training, and native speakers of English. The experienced group in that study showed categorical perception along the rock-lock continuum, with an identification response that did not differ from the native speakers', whereas the inexperienced group showed a fairly flat response function. The identification results in the present study were comparable to MacKain et al. for the less experienced group, but the more experienced group was not similar to the native speakers in this study. One key difference between the studies was the criterion used to select the more experienced group. Whereas MacKain et al. selected participants on the basis of conversational training, here they were selected on the basis of length of residence. Focused conversational training in an immersion situation may have resulted in more native-like L2-learning outcomes than simply residing in an L2 environment. A comparison of the present results with those of MacKain et al. would support Flege's (2009, 2019) contention that access to quality native-speaker input is a more direct predictor of L2 perceptual learning than length of residence.

In discrimination, the English speakers showed a clear peak across the categorical boundary, in line with previous research. There is a clear double peak for both Japanese groups, which is in contrast to the relatively flat distribution observed in MacKain et al. (1981). The identification results do not seem to provide any explanation for why a double peak was observed. For example, the less experienced group showed a peak for step 3 versus step 6, but these were identified similarly. A double peak is an indication that the participants may have perceived a third category in the middle of the continuum. The forced category goodness ratings suggest that there was some degree of /w/ perceived in the middle of the continuum. Although it may seem unlikely that participants would have identified /w/ in the middle of the continuum when the goodness ratings for /w/ were no higher than they were for /r/ or /l/, previous findings of /w/ identification in the middle of other /r/–/l/ continua (Iverson et al. 2003; Mochizuki 1981) make this a plausible explanation. Another possibility is that they perceived a different category in the middle of the continuum (e.g., their native Japanese /ɾ/ or possibly the vowel-consonant sequence /ɯɾ/, see Guion et al. 2000), but as they were not asked to rate the stimuli against other categories, that is a question that would need to be addressed in future research. Discrimination accuracy generally increased for the less experienced group along the continuum from [ɹ] to [l], whereas the relatively more accurate discrimination of the more experienced group was fairly level. This may reflect the more experienced group's greater sensitivity to goodness differences in both /r/ and /l/, whereas the less experienced group may have relied primarily on goodness differences relative to /l/ only.

#### *4.2. Conflicting Findings between Pre-Lexical and Lexical Tasks*

As phonological categories are pre-lexical perceptual units for PAM-L2, perceived phonological overlap would be a logical consequence of acquiring a phonological category before sufficient perceptual learning had taken place to differentiate it from other categories in the phonological system. The results of this study, and of Kato and Baese-Berk (2020), are consistent with that account. However, in a study examining the time course of L2 spoken word recognition, Cutler et al. (2006) observed an asymmetry that appears to be the reverse of the one observed here in pre-lexical tasks. Japanese and English native speakers heard an instruction to click on one of four objects presented on the screen while an eye tracker monitored their eye movements. On critical trials, one picture depicted an object containing /r/ (e.g., *writer*), another containing /l/ (e.g., *lighthouse*), and there were two non-competitor pictures containing neither /r/ nor /l/. The /r/–/l/ word pairs were chosen so that the onsets overlapped phonologically, such that the participants would need to wait for disambiguating information (e.g., the /h/ of *lighthouse*) if they were unable to tell /r/ and /l/ apart. When the word containing /r/ was the target, the Japanese participants took longer to settle their gaze on the correct object than the English native speakers did, suggesting that they were unable to disambiguate the words on the basis of /r/ or /l/. However, when the /l/-word was the target, they settled on the correct picture early, at the same point in time as the English native speakers. Thus, there was an asymmetry in word recognition, such that they were apparently more efficient at recognizing words beginning with /l/ than those beginning with /r/ (see Weber and Cutler 2004, for similar results on Dutch listeners' recognition of English words containing /ε/ and /æ/). Given that perceived phonological overlap in this study was smaller for /r/ than /l/, and /r/ is identified more accurately than /l/ (Kato and Baese-Berk 2020), it is surprising that spoken word recognition should show an asymmetry in the opposite direction (see Amengual 2016; Darcy et al. 2013 for other examples of a mismatch between performance on pre-lexical and lexical tasks). Cutler et al. explained their results in terms of lexical processing, rather than perception of phonological categories, which may account for the difference.
They suggested that the Japanese listeners had established lexical entries that preserved the /r/–/l/ phonological distinction, even though they could not reliably discriminate the contrast, and provided two possible explanations for the asymmetry. One possibility (also suggested by Weber and Cutler 2004) is that when /r/ is included in a lexical entry, it does not receive any bottom-up activation from speech, nor does it inhibit (or is it inhibited by) the activation of other words as they compete for selection as the most likely word candidate. Activation of the word containing /r/ (and inhibition of other competitors) would only proceed via input that matched its other phonemes. The second possibility is that both /l/ and /r/ words contain the L1 /ɾ/ category in their lexical entries. Words containing /l/ would be activated when a reasonable sounding /ɾ/ is encountered and those containing /r/ would be activated when encountering a poorer match. By that account, the asymmetry arose because /r/ would never be perceived as a reasonable match for /ɾ/, but /l/ could be perceived as a poorer match for /r/. Thus, /l/ would only ever contact words in the lexicon containing /l/, but there is a reasonable probability that /r/ would contact words containing both /l/ and /r/.

Darcy et al. (2013) also concluded that lexical encoding was responsible for the asymmetry. They showed that, in spite of accurate discrimination of L2 Japanese singleton-geminate contrasts or German front-back vowel contrasts, lexical decision performance for L2 learners was poorer than it was for native speakers, particularly for nonword items. They also observed an asymmetry in lexical decision. The stimulus words contained either a more or less native-like L2 phoneme and the nonwords were created by swapping the target phoneme with the other member of the pair (e.g., the German word for 'honey', *Honig* /honɪç/, became the nonword \**Hönig* /hønɪç/). Accuracy for words was higher when the category was more versus less native-like, and accuracy for minimal-pair nonwords was higher when the category was less versus more native-like. Similar to Cutler et al. (2006), Darcy et al. concluded that lexical coding for the less native-like category is fuzzy, and that it encodes the goodness of fit to the dominant L1 category. Interestingly, advanced German learners did not show the asymmetry, which suggests that lexical encoding can improve with L2 experience.

PAM-L2 (Best and Tyler 2007) may provide a slightly different perspective to the conclusions of Cutler et al. (2006) and Darcy et al. (2013). For PAM (Best 1995) and PAM-L2, phonological categories are perceptual, and they are the result of attunement to the higher-order phonetic properties that are relevant both for recognizing words and for telling them apart from other words in the language. As /l/ is initially perceived by Japanese native listeners as a good instance of /ɾ/, their existing L1 /ɾ/ category would be used for acquiring any English words containing /l/ (a common L1-L2 category), and a new L2-only phonological category would be established for English /r/. In spoken word recognition, then, words containing /l/ would benefit from an existing L1 category that is already integrated into processes of lexical competition. In contrast, it may take some time for words containing a new L2 category (i.e., /r/) to establish inhibitory connections that would reduce the activation of competitor words (as may have eventually occurred for the advanced German learners in Darcy et al. 2013). Thus, when the Japanese native listeners in Cutler et al. (2006) perceived /l/, their native lexical competition processes would have inhibited activation of competitor words, including the one containing /r/. They may also have perceived /r/ pre-lexically, but without the benefit of inhibitory connections to other words, the /r/ competitor word would have been inhibited by the word containing /l/. For target words containing /r/, both the /r/- and /l/-words would be activated, but the poorer fit to the /l/ category would limit the activation of the /l/-word competitor. Thus, both candidates remained activated until disambiguating information was encountered. Clearly, more research needs to be done to tease apart the pre-lexical and lexical influences on L2 speech perception.

#### *4.3. Methodological Considerations*

The forced category goodness rating task was devised for this study because categorization with goodness rating might underestimate the perceived phonological overlap. To illustrate, Japanese participants rated step 8 as having various degrees of similarity to /l/, /r/, and /j/, but the ratings for /j/ were lower than for the other two categories. Had they completed a categorization test first, they may not have selected "Y" at all because the other three categories are clearly a better fit. A categorization task was not included for comparison here because the session was already quite long. Nevertheless, it is clear that the forced category goodness rating task is capable of detecting category overlap, and that category overlap was observed for both the native and non-native listeners.

The success of the forced category goodness rating task at detecting perceived phonological overlap raises the question of whether it should be adopted in favor of the standard categorization with goodness rating task. Indeed, Faris et al. (2018) suggested that it might be necessary to reconsider the use of arbitrary categorization criteria and the forced category goodness rating task removes the necessity of specifying a threshold for categorization. It may also solve a problem with the categorization of vowels; some participants have difficulty using the keyword labels for identifying vowels, particularly for a language like English, where some of the grapheme–phoneme correspondences are ambiguous. Faris et al. familiarized participants with the 18 English vowel labels, using English vowel stimuli and providing feedback, but they found that up to a quarter of the participants had difficulty selecting the correct label. While this could mean that those participants had difficulty categorizing native vowels, a more likely possibility is that they had poor phonological awareness, and that affected their ability to perform well on that metalinguistic task. Forced category goodness rating might alleviate that problem because the participants are provided with the category against which to judge the auditory stimuli and they do not need to search for the category that corresponds to the vowel that they heard. However, one clear limitation of the forced category goodness rating is that it is much more labor-intensive than categorization. In the case of Faris et al., participants would need to have rated 32 Danish vowels multiple times against the 18 English vowel categories. This would have resulted in thousands of trials. This is not to say that a forced category goodness rating task should be avoided. If it proved to provide a more accurate estimation of perceptual assimilation, then researchers would need to devote the time necessary to collecting those data.

If the forced category goodness rating task were adopted as a test of perceptual assimilation, then there would no longer be arbitrary thresholds for determining whether a non-native phone was assimilated as categorized to the native phonological system. Instead, a non-native phone could be deemed to be categorized as a given native phonological category if the mean rating of the stimulus to that category was significantly above the lowest possible rating (e.g., 1 out of 7, where the 1 is defined as "no similarity"). Expanding on Faris et al. (2016), non-native phones would be categorized as focalized if only one category had a non-negligible rating, categorized as clustered if more than one category had a non-negligible rating, or uncategorized as dispersed if no category had a non-negligible rating.
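The decision rule proposed above can be sketched in a few lines. This is a minimal illustration, not a validated procedure: the text calls for a rating "significantly above the lowest possible rating", and the sketch stands in for a formal significance test with a simple margin of two standard errors above the scale floor. That margin, the function names, and the example ratings are all hypothetical assumptions.

```python
from math import sqrt
from statistics import mean, stdev

SCALE_FLOOR = 1.0  # lowest scale point, defined in the text as "no similarity"
SE_MARGIN = 2.0    # hypothetical stand-in for a formal significance test


def is_non_negligible(ratings, floor=SCALE_FLOOR, se_margin=SE_MARGIN):
    """True if the mean rating clears the floor by the standard-error margin.

    Requires at least two ratings, because a standard deviation is needed.
    """
    standard_error = stdev(ratings) / sqrt(len(ratings))
    return mean(ratings) - se_margin * standard_error > floor


def classify_assimilation(ratings_by_category):
    """Label a phone from its ratings against each native category."""
    supported = [
        category
        for category, ratings in ratings_by_category.items()
        if is_non_negligible(ratings)
    ]
    if len(supported) == 1:
        label = "categorized-focalized"
    elif len(supported) > 1:
        label = "categorized-clustered"
    else:
        label = "uncategorized-dispersed"
    return label, supported


# Hypothetical ratings (1-7 scale) from five listeners for one stimulus.
example = {
    "r": [6, 7, 6, 6, 7],
    "l": [6, 5, 6, 7, 6],
    "w": [1, 2, 1, 1, 1],
    "j": [1, 1, 1, 1, 1],
}
label, supported = classify_assimilation(example)
```

With these invented ratings, the stimulus is supported by both "r" and "l" and is therefore labeled as clustered. In practice the margin would be replaced by whichever inferential test the researcher adopts.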

In spite of the potential for this task to support future theoretical advances in research on cross-language and second-language speech perception, careful comparisons need to be made between the data obtained from current methods and the forced category goodness rating task before suggesting a change to standard research protocols. For example, the English native speakers in this study did not give the lowest possible score for [ɹ] as /w/ (2.05 out of 7), even though they clearly perceived that stimulus as belonging to their /r/ category ([ɹ] as /r/ was rated at 6.90 out of 7). This suggests that ratings at the lower end of the scale may reflect phonetic similarity rather than phonological category membership. It may nevertheless be necessary to ask participants to make a decision about category membership rather than relying solely on a goodness-of-fit judgement. In fact, Tyler (2021) has identified four different sources of information that non-native listeners could use to discriminate non-native phones. Any new method for assessing perceptual assimilation would need to be capable of assessing listeners' sensitivity to any information available to them for discriminating a given non-native contrast. Until careful methodological studies have been completed comparing different approaches to categorization, studies testing PAM/PAM-L2 predictions should continue to use categorization with goodness rating, giving participants the opportunity to select from among all possible vowel or consonant categories (Bundgaard-Nielsen et al. 2011a, 2011b; Faris et al. 2016).

#### **5. Conclusions**

In an ideal learning situation, adults would tune in to the phonetic and phonological properties of an L2 prior to establishing a large L2 vocabulary. However, this is not the way that L2s are generally learned. Classroom-based learning is more common and it provides opportunities to learn about phonological distinctions before attuning to the phonetic properties that define phonological categories and distinguish them from each other. It was argued here that such a situation may give rise to perceived phonological overlap between L2 categories. The results of a forced category goodness rating task showed that Japanese native speakers who first acquired English in the classroom perceived varying degrees of phonological overlap between English /r/ and /l/ when they encountered either category in speech. The overlap was smaller for those with more than two years of immersion experience, as compared to those with less than three months, suggesting that learners continue to attune to the phonological distinction with appropriate input. Assessment of perceived phonological overlap in L2 learners may help with tracking phonological development and with tailoring perceptual training to those contrasts where lexically guided perceptual retuning is most likely to be effective (see Tyler 2019, for a discussion of how PAM-L2 might apply to classroom foreign language acquisition). Future research should investigate category overlap using natural stimuli, and test whether discrimination accuracy for /r/ and /l/ can be predicted by the degree of phonological overlap between the L2 /r/ and /l/ categories.

**Funding:** This research was funded by the Australian Research Council, grant number DP0880913.

**Institutional Review Board Statement:** The protocol was approved by the University of Western Sydney Human Research Ethics Committee (Approval 07/040).

**Informed Consent Statement:** All subjects gave their informed consent for inclusion before they participated in the study.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available as informed consent was not obtained for publishing the data at the time of testing.

**Acknowledgments:** Thank you to Rikke Bundgaard-Nielsen for assistance with participant recruitment, Louise de Beuzeville for assistance with manuscript preparation, Kikuko Nakamura, Atomi Ohama, Susan Wijngaarden, and Mark Antoniou for research assistance, and two anonymous reviewers for helpful comments that greatly improved the manuscript. The Haskins continua were kindly provided by Pierre Hallé and Catherine Best.

**Conflicts of Interest:** The author declares no conflict of interest.



## *Article* **The Role of Acoustic Similarity and Non-Native Categorisation in Predicting Non-Native Discrimination: Brazilian Portuguese Vowels by English vs. Spanish Listeners**

**Jaydene Elvin 1,2,\*, Daniel Williams 2,3, Jason A. Shaw 4,5, Catherine T. Best 4,6 and Paola Escudero 2,4**


**Abstract:** This study tests whether Australian English (AusE) and European Spanish (ES) listeners differ in their categorisation and discrimination of Brazilian Portuguese (BP) vowels. In particular, we investigate two theoretically relevant measures of vowel category overlap (acoustic vs. perceptual categorisation) as predictors of non-native discrimination difficulty. We also investigate whether the individual listener's own native vowel productions predict non-native vowel perception better than group averages. The results showed comparable performance for AusE and ES participants in their perception of the BP vowels. In particular, discrimination patterns were largely dependent on contrast-specific learning scenarios, which were similar across AusE and ES. We also found that acoustic similarity between individuals' own native productions and the BP stimuli were largely consistent with the participants' patterns of non-native categorisation. Furthermore, the results indicated that both acoustic and perceptual overlap successfully predict discrimination performance. However, accuracy in discrimination was better explained by perceptual similarity for ES listeners and by acoustic similarity for AusE listeners. Interestingly, we also found that for ES listeners, the group averages explained discrimination accuracy better than predictions based on individual production data, but that the AusE group showed no difference.

**Keywords:** acoustic similarity; perceptual similarity; non-native discrimination; non-native categorisation

#### **1. Introduction**

It is well known that learning to perceive and produce the sounds of a new language can be a difficult task for many second language (L2) learners. Models of speech perception such as Flege's Speech Learning Model (SLM; Flege 1995), Best's Perceptual Assimilation Model (PAM; Best 1994, 1995), its extension to L2 acquisition PAM-L2 (Best and Tyler 2007) and the Second Language Linguistic Perception model (L2LP; Escudero 2005, 2009; van Leussen and Escudero 2015; Elvin and Escudero 2019; Yazawa et al. 2020) claim that phonological similarity, together with articulatory-phonetic (PAM, PAM-L2) or acoustic-phonetic (SLM, L2LP) similarity, between the native and target languages is predictive of L2 discrimination patterns. This suggests that discrimination difficulties are not uniform across groups of L2 learners, at least at the initial stage of learning, as a result of their differing native (L1) phonemic inventories.

**Citation:** Elvin, Jaydene, Daniel Williams, Jason A. Shaw, Catherine T. Best, and Paola Escudero. 2021. The Role of Acoustic Similarity and Non-Native Categorisation in Predicting Non-Native Discrimination: Brazilian Portuguese Vowels by English vs. Spanish Listeners. *Languages* 6: 44. https://doi.org/10.3390/languages6010044

Received: 28 October 2020; Accepted: 18 February 2021; Published: 5 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

When non-native sounds are categorised according to native categories, this is known as a "learning scenario" in the L2LP theoretical framework, as "perceptual assimilation patterns" in PAM, and as "equivalence classification" in SLM. However, it is important to note that whereas L2LP and PAM explore these learning scenarios or assimilation patterns by investigating L2 or non-native phonemic contrasts, SLM focuses on the similarity or dissimilarity between individual L1 and L2 sound categories, rather than contrasts. Specifically, L2LP and PAM posit that contrasts which are present in the native language inventory may be easier to discriminate in the L2 than contrasts which are not present in the L1. This has been demonstrated by Spanish listeners' difficulty in perceiving and producing the English /i/–/ɪ/ contrast (Escudero and Boersma 2004; Escudero 2001, 2005; Flege et al. 1997; Morrison 2009). This may be attributed to the fact that the Spanish vowel inventory does not contain /ɪ/ and Spanish listeners often perceive both sounds in the English contrast as one native sound category. The Spanish listeners' difficulty with the English /i/–/ɪ/ contrast can be considered an example of the NEW scenario in L2LP and single-category assimilation in PAM. That is, the two sounds in the non-native contrast are perceived as one single native category. Both models predict that this type of learning scenario (or assimilation pattern) will result in difficulties for listeners when discriminating these speech sounds. Specifically, according to the L2LP framework, the NEW learning scenario is predicted to be difficult because in order for listeners to acquire both sounds (the learning task), a learner must either create a new L2 category or split an existing L1 category (van Leussen and Escudero 2015; Elvin and Escudero 2019).

In contrast, German learners, who have an /i/–/ɪ/ contrast in their L1 vowel inventory, have fewer difficulties when perceiving the contrast in English than Spanish learners (Bohn and Flege 1990; Flege et al. 1997; Iverson and Evans 2007). It is likely that this is an example of the SIMILAR learning scenario in L2LP<sup>1</sup>, and PAM's two-category assimilation, whereby the two non-native sounds in the contrast are mapped onto two separate native vowel categories. Both PAM and L2LP would predict that a scenario (or assimilation pattern) of this type would be less problematic for listeners to discriminate than a NEW scenario (or single-category assimilation) as they can rely on their existing L1 categories to perceive the difference between the L2 phones. In the L2LP framework, the learning task is considered to be easier because learners simply need to replicate and adjust their L1 categories so that their boundaries match those of the L2 contrast (van Leussen and Escudero 2015; Elvin and Escudero 2019; Yazawa et al. 2020). A third scenario, known as the SUBSET scenario (i.e., multiple category assimilation) in L2LP, occurs when one or both sounds in the L2 contrast are perceived as two or more native L1 categories. This scenario may be comparable to focalised, clustered or dispersed uncategorised assimilation in PAM (Faris et al. 2016). While some studies suggest that this learning scenario is not problematic for L2 learners (e.g., Gordon 2008; Morrison 2003, 2009), other studies have shown the SUBSET scenario (or PAM's uncategorised assimilation) to lead to difficulties in discrimination (Escudero and Boersma 2002), particularly when a perceptual or acoustic overlap between the two non-native sounds in the contrast and the perceived native categories occurs (Bohn et al. 2011; Elvin et al. 2014; Tyler et al. 2014; Vasiliev 2013). That is, the two vowels in the contrast are perceived as (or are acoustically similar to) the same multiple categories.
For example, Elvin (2016) found that the Brazilian Portuguese vowels /i/ and /e/ were both acoustically similar to and perceived as the same multiple categories in Australian English, namely /iː/, /ɪ/, and /ɪə/. In the L2LP framework, this poses a difficult challenge for learners as they must first realise that certain features or sounds in the target language do not exist, that they cannot process them in the same manner as their L1, and must therefore proceed in a similar manner as with the learning task for the NEW scenario (Elvin and Escudero 2019). The SUBSET scenario can therefore be divided into two categories: SUBSET EASY (or uncategorised non-overlapping in PAM), when the two vowels in the L2 contrast are acoustically similar to and perceived as multiple L1 categories without any overlap, and SUBSET DIFFICULT (or uncategorised overlapping), where the two vowels in the L2 contrast are acoustically similar to and perceived as multiple categories with overlap. A diagram that shows examples of the L2LP scenarios can be seen in Figure 1.

<sup>1</sup> The L2LP terms "NEW" and "SIMILAR" scenarios differ notably from SLM's use of these terms, and should not be confused with them. The difference in terminology arises from the different foci of the two models: L2LP addresses phonemic contrasts, whereas SLM focuses on individual phones. SLM posits that when listeners are presented with an L2 phone that does not closely resemble any L1 phoneme they form a *new* phonetic category which should be easier to acquire than an L2 phone that is *similar* to an existing L1 phoneme, which should be more difficult to acquire despite the phonetic differences (Colantoni et al. 2015). In contrast, in L2LP a NEW scenario requires the listener to establish a new *contrast* in the L2, which does not exist in the L1, while a SIMILAR scenario reflects a contrast that is similar to one already existing in the L1. SIMILAR scenarios are therefore predicted to be much easier to acquire than NEW scenarios in L2LP.

**Figure 1.** Visual representation of the L2LP learning scenarios.
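The scenario definitions given in the text can also be expressed as a small decision rule. The sketch below is a deliberate simplification: it assumes each vowel in a contrast has already been reduced to the set of native categories it is mapped onto, whereas real categorisation data are graded. The function name and category labels are hypothetical.

```python
def l2lp_scenario(cats_a: set, cats_b: set) -> str:
    """Name the L2LP learning scenario for one non-native contrast.

    `cats_a` and `cats_b` hold the native categories that each vowel of the
    contrast is mapped onto (a simplifying assumption for illustration).
    """
    if not cats_a or not cats_b:
        raise ValueError("each vowel must map onto at least one native category")
    if len(cats_a) == 1 and len(cats_b) == 1:
        # Both vowels go to one category (NEW) or to two separate ones (SIMILAR).
        return "NEW" if cats_a == cats_b else "SIMILAR"
    # At least one vowel is spread over several native categories.
    return "SUBSET DIFFICULT" if cats_a & cats_b else "SUBSET EASY"


# English /i/-/ɪ/ heard through a single native category (label assumed).
print(l2lp_scenario({"i"}, {"i"}))
# The same contrast heard through two separate native categories.
print(l2lp_scenario({"i"}, {"ɪ"}))
# Both vowels mapped onto the same set of several native categories.
print(l2lp_scenario({"iː", "ɪ", "ɪə"}, {"iː", "ɪ", "ɪə"}))
```

A graded version would replace the set intersection with a measure of overlap between categorisation proportions.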

As the above theoretical models claim, it is the similarity of L2 sounds to native categories that determines L2 discrimination accuracy. It could be the case that individuals whose L1 vowel inventory is larger and more complex than that of the L2 may be faced with relatively less difficulty discriminating L2 vowel contrasts simply because there are many native categories available onto which the L2 vowels can be mapped. Indeed, Iverson and Evans (2007, 2009) found that listeners with a larger vowel inventory (e.g., German and Norwegian) than the L2 were more accurate and had higher levels of improvement post-training at perceiving L2 vowels (e.g., English) than those with a smaller vowel inventory (e.g., Spanish). However, other studies have found that having a larger native vowel inventory than the L2 does not always provide an advantage in L2 discrimination. For instance, recent studies have shown that Australian English listeners do not discriminate Brazilian Portuguese (Elvin et al. 2014) or Dutch (Alispahic et al. 2014; Alispahic et al. 2017) vowels more accurately than Spanish listeners, despite the fact that the Australian English vowel inventory is larger than Brazilian Portuguese and approximately similar in size to that of Dutch, while the Spanish vowel inventory is smaller than those of both Brazilian Portuguese and Dutch. In fact, the findings in Elvin et al. (2014) indicate that Australian English and Spanish listeners found the same Brazilian Portuguese contrasts perceptually easy or difficult to discriminate despite their differing vowel inventory sizes, and that overall, Spanish listeners had higher discrimination accuracy scores than English listeners.

Thus, it seems that vowel inventory size was a good predictor of L2 discrimination performance for some of the aforementioned studies, but not all, which may suggest that this factor alone is not sufficient for predicting L2 discrimination performance. After all, theoretical models such as L2LP and PAM claim that the acoustic-phonetic or articulatory-phonetic similarity between the vowels in the native and target languages, rather than phonemic inventory *per se*, predicts L2 discrimination performance. In fact, the L2LP model claims that individuals detect phonetic information in both the L1 and L2 by paying attention to specific acoustic cues (e.g., duration, voice onset time and formant frequencies) in the speech signal. As a result, any acoustic variation in native and target vowel production can influence speech perception (Williams and Escudero 2014). Specifically, the model proposes that the listener's initial perception of the L2 vowels should closely match the acoustic properties of vowels as they are produced in the listener's first language (Escudero and Boersma 2004; Escudero 2005; Escudero et al. 2014; Escudero and Williams 2012). In this way, the L2LP model proposes that both L2 and non-native categorisation patterns and discrimination difficulties can be predicted through a detailed comparison of the acoustic similarity between the sounds of the native and target languages.

This L2LP hypothesis is supported by a number of studies which show that acoustic similarity successfully predicts non-native and L2 categorisation and/or discrimination (e.g., Elvin et al. 2014; Escudero and Chládková 2010; Escudero and Williams 2011; Escudero et al. 2014; Escudero and Vasiliev 2011; Gilichinskaya and Strange 2010; Williams and Escudero 2014). For example, acoustic comparisons successfully predicted that Salento Italian and Peruvian Spanish listeners would categorise Standard Southern British English vowels differently, despite the fact that their vowel inventories contain vowels that are typically represented with the same IPA symbol. The difference was predicted because, despite those shared transcriptions, the acoustic realisations of the five vowels are not identical across the two languages (Escudero et al. 2014). Furthermore, as previously mentioned, Elvin et al. (2014) investigated Australian English and Iberian Spanish listeners' discrimination accuracy for Brazilian Portuguese vowels and found that a comparison of the type and number of vowels in native and non-native phonemic inventories was not sufficient for predicting L2 discrimination difficulties, and that accurate predictions can be achieved if acoustic similarity is considered. Specifically, the L2LP model posits that for the most accurate predictions, the acoustic data should be collected from the same group of listeners intended for perceptual testing. It is this postulate that differentiates L2LP from both PAM/PAM-L2 and SLM, which is the reason we use L2LP as the framework for the current research.

#### *The Present Study*

The present study investigates the non-native categorisation and discrimination of Brazilian Portuguese (BP) vowels by Australian English (AusE) and European Spanish (ES) listeners. Similar to Elvin et al. (2014), these language groups were chosen on the basis of their differing inventory sizes. The AusE vowel inventory contains thirteen monophthongal vowels, namely /iː, ɪ, ɪə<sup>2</sup>, e, eː, ɜː, ɐ, ɐː, æ, oː, ɔ, ʊ and ʉː/, and is larger than BP, which has seven oral vowels, /i, e, ɛ, a, o, ɔ, u/, while ES has the smallest vowel inventory of the three languages, containing five vowels, /i, e, a, o, u/. Unlike Spanish and Portuguese vowels which are relatively stable in their production, AusE vowels are known to be more dynamic and this has been shown to affect discrimination of some AusE contrasts (see Williams et al. 2018 and Escudero et al. 2018). In this study, we use the L2LP theoretical framework to investigate (1) whether detailed acoustic comparisons using the AusE and ES participants' own native production data successfully predict their non-native categorisation of BP vowels, (2) whether the L2LP learning scenarios identified in non-native categorisation subsequently predict their BP discrimination patterns, and (3) whether measures of acoustic and perceptual (categorisation) overlap are equally good predictors of discrimination accuracy at both group and individual levels (i.e., using individual overlap scores vs. group averages).

While most empirical research in L2 vowel perception investigates L2 development for groups of learners, the present study investigates non-native perception from a group versus an individual perspective. Studies typically focus on learner groups rather than individuals because speech communities have shared linguistic knowledge that allows them to understand each other. As a result, most researchers are particularly interested in how populations behave and how their shared L1 knowledge is relevant to L2 learning. Despite the fact that many researchers are aware that some variability does exist among individuals (Mayr and Escudero 2010; Smith and Hayes-Harb 2011), the group data obtained are generally sufficient for their purposes of demonstrating that shared knowledge of the sound patterns of the L1 influences L2 speech perception. Importantly, however, other studies (e.g., Díaz et al. 2012; Smith and Hayes-Harb 2011; Wanrooij et al. 2013) have shown that an investigation of individual differences can be important for understanding L2 development. For example, Smith and Hayes-Harb (2011) warn that researchers need to be careful in drawing general conclusions about typical performance patterns for L2 listeners based on group averages, as individual data may be crucial to interpreting group results, especially given the large variety of situations that influence L2 learning by individual learners.

<sup>2</sup> The /ɪə/ vowel is traditionally considered a diphthong in Australian English. However, recent studies have shown that this vowel is produced as a monophthong when presented in a closed CVC context (see Elvin et al. 2016) as in this study.

Most of the studies that investigate individual differences in L2 speech perception focus predominantly on factors such as age of acquisition, length of residence, language use or motivation (Escudero and Boersma 2004; Flege et al. 1995). In particular, much of the research conducted under the SLM theoretical framework (e.g., Flege et al. 1997, 1995) investigated the above extra-linguistic factors as a means of explaining the degree of foreign accent in an L2 learner. However, even when these factors are controlled, individual differences still seem to persist (Jin et al. 2014; Sebastián-Gallés and Díaz 2012). Furthermore, studies have shown that there are differences in how people hear phonetic cues despite having similar productions that may be related to their auditory processing or their auditory memory (see Wanrooij et al. 2013; Antoniou and Wong 2015). The fact that individual differences persist even when possible factors that influence such variations are controlled suggests that there are real cognitive differences amongst individuals, such as processing style, that influence second language learning. Therefore, language learners, even those at the initial stage (i.e., the onset of learning), may follow different developmental paths to successful acquisition of L2 speech based on their differing cognitive styles and exposure (for a review on recent literature relating to individual differences in processing, see Yu and Zellou 2019). While SLM investigates differences among L2 learners at the level of experiential factors such as age of acquisition and language exposure, the approach is to group learners according to these factors prior to comparing their performance (Colantoni et al. 2015). Studies investigating perception under the framework of PAM also acknowledge the existence of individual differences among listeners; however, few studies have yet explained such differences. In fact, Tyler et al.
(2014) found individual differences in assimilation of non-native vowel contrasts, and proposed that individual variation should be considered when predicting L2 difficulties, but did not examine the sources of the individual differences they had observed. This is where the L2LP model may be particularly relevant: it was specifically designed to account for individual variation among non-native speakers at all stages of learning and across different learning abilities (i.e., perception, word recognition and production). As a result, L2LP predictions can be made for individual learners based on detailed acoustic comparisons of their L1 categories and the categories of the specific target language variety (Colantoni et al. 2015, p. 44).

In our investigation of individual variation, we focus specifically on the fact that individuals from the same native language background may have different acoustic realisations of vowels and this factor may predict individual differences in perceptual performance. That is, the within-category variation in native production may influence non-native categorisation and discrimination. Very few studies (e.g., Levy and Law 2010) have collected vowel productions from the same listeners that they tested in perception, which, according to the L2LP model, is an essential ingredient for accurate predictions of L2 difficulty and for the identification of any individual variation that may be caused by individuals' different acoustic realisations of their own native vowels. Thus, although representative acoustic measurements from the listener populations have successfully explained L2 perceptual difficulty, such comparisons may not account for individual variation among listeners.

The present study reports native acoustic production as well as non-native categorisation and discrimination data from the same participants across all tasks. Although we look at individual versus group data in this study, it is important to note that unlike most other studies of perception and production, the group data we use for perception and production are from the same individuals, which may make the group data more reliable than data for perception and production taken from different groups. Furthermore, the BP acoustic data that we use to measure acoustic similarity are the same recordings that we use as stimuli in the non-native categorisation and discrimination tasks. By doing so, we are able to make predictions relating to the actual stimuli that the participants were presented with, rather than averages taken from other speakers and for vowels in other phonetic contexts. We also control for variation within languages and speakers by ensuring that the participants in each BP-naïve listener group, as well as the speakers in our target BP dialect, were all of similar ages selected from a single urban area within each of their respective countries. By controlling for variation relating to language experience, age and native background, we are able to conduct a carefully controlled investigation of individual differences in non-native perception that may be explained by individual differences in L1 production.

We chose the /fVfe/ context as our target BP stimuli to ensure that our data were comparable to previous studies, specifically Elvin et al. (2014) and Vasiliev (2013). Vasiliev (2013) originally selected target vowels extracted from a voiceless fricative rather than stop context because the voiceless stops differ in VOT (voice onset time) and formant transitions among Spanish, Portuguese, and English. In Elvin et al. (2014), the Australian English acoustic predictions were based on the Cox (2006) corpus, which contained acoustic measurements of adolescent speakers from the Northern Beaches (north of Sydney in New South Wales), collected in the 1990s and extracted from an /hVd/ context. However, Elvin et al. (2016) found that vowel duration and formant trajectories varied depending on the consonantal context in which they were produced. Specifically, vowels produced in the /hVd/ context were acoustically the least similar to the vowels produced in all of the remaining consonantal contexts. Thus, /hVd/ may not be the most representative phonetic context for predicting L2 vowel perception difficulty; in this study, we instead formulated predictions based on native vowels produced in the same phonetic context used as stimuli in testing.

To measure acoustic similarity between vowels, Elvin et al. (2014) used Euclidean Distances between the reported F1 and F2 averages for each vowel. However, because native production data were available for the present study we instead used cross-language discriminant analyses as a method of measuring acoustic similarity, to use in predicting performance in the non-native categorisation and discrimination tasks. This should improve predictions of acoustic similarity over those from simple Euclidean Distance, as we are able to include more detailed acoustic information relevant for vowel perception as input parameters for each individual participant<sup>3</sup>.
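For readers unfamiliar with the simpler measure, a Euclidean distance between two vowels in F1–F2 space can be sketched as below. The formant values are invented placeholders in Hz, not measurements from either study, and the function names are hypothetical; the discriminant analyses used in the present study are not reproduced here.

```python
from math import hypot


def euclidean_distance(vowel_a, vowel_b):
    """Distance between two vowels given as (mean F1, mean F2) pairs."""
    return hypot(vowel_a[0] - vowel_b[0], vowel_a[1] - vowel_b[1])


def closest_native(target, native_vowels):
    """Return the label of the native vowel nearest to the target vowel."""
    return min(native_vowels, key=lambda label: euclidean_distance(target, native_vowels[label]))


# Placeholder formant averages in Hz (assumptions for illustration only).
native = {"i": (320.0, 2250.0), "e": (480.0, 2000.0)}
target = (300.0, 2300.0)

print(closest_native(target, native))
print(euclidean_distance((300.0, 2300.0), (600.0, 1900.0)))
```

Because this measure uses only two averaged formant values per vowel, it cannot reflect duration or within-category variation, which is the limitation that motivates the richer approach described above.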

Considering that patterns of non-native categorisation underlie discrimination difficulties, which according to the L2LP model are predictable based on acoustic properties, the inclusion of non-native categorisation data in the present study further allows for an investigation of whether or not listeners' individual categorisation patterns do in fact predict difficulty in discrimination. The incorporation of a categorisation task also allows us to investigate whether the L2LP learning scenarios at the onset of learning (unfamiliar BP stimuli) are similar across the two listener groups of differing vowel inventories (ES and AusE).

It was essential that we replicated and extended the discrimination task reported in Elvin et al. (2014) with this new set of participants who also completed the native production and non-native categorisation tasks, in order to adequately test the individual difference assumptions of the L2LP model. The L2LP model explicitly states that different listeners have different developmental patterns and it is important to conduct all tasks on the same set of listeners. To this end, we selected naïve listeners in both non-native groups who represent the initial stage of language learning in the L2LP framework. Their inclusion provides a good opportunity for assessing differences in language learning ability that are not confounded by other factors that vary widely among actual L2 learners.

<sup>3</sup> We note, however, that there are reasons as to why a listener's own productions might not be the best predictors of how they perceive other speakers. Part of a listener's knowledge about vowels includes the ways that different members of their speech community vary (e.g., vocal tract anatomy and social factors). However, we do believe a good way to find symmetry between perception and production is to compare those in the same group of people as in Chládková and Escudero (2012).

The discrimination task in the present study further differs from that reported in Elvin et al. (2014) in that the vowels are presented in a nonce word context rather than as vowels in isolation. We made this change because, outside of the laboratory, learners are faced with words rather than vowels in isolation. The L2LP model assumes continuity between lexical and perceptual development, specifically positing that perceptual learning is triggered when learners attempt to improve recognition by updating their lexical representations (van Leussen and Escudero 2015). Furthermore, if listeners do not interpret the stimuli as speech, which could potentially occur with isolated vowels (particularly synthesized rather than natural vowels), then language-specific L1 knowledge may play less of a role in their perception. That is, listeners from different L1 backgrounds may perceive non-speech in a similar manner but differ in how they perceive the vowels that they perceive to be speech. Given the fact that there were very few group differences in Elvin et al. (2014), it might be that the stimuli were not engaging native language phonology sufficiently reliably for all listeners. Thus, presenting vowels in the context of a nonce word not only brings the task closer to a real-world learning situation; these more speech-like materials also allow us to determine whether language-specific knowledge played less of a role in their discrimination of BP.

The present study is therefore, to our knowledge, one of the first to evaluate predictions about L2 perception (both non-native categorisation and discrimination) based on the listeners' own native productions, thereby providing a novel test of one of L2LP's core assumptions. In Section 2, native AusE and ES listeners' native vowel productions are compared to the BP production data that are used as stimuli in the non-native categorisation task (Section 3) and the XAB discrimination task (Section 4). Results from the cross-language acoustic comparisons are used to predict the non-native categorisation patterns in Section 3 and the discrimination results in Section 4. As mentioned above, the participants in the cross-language acoustic comparisons were the same as the participants in the non-native categorisation and discrimination tasks. We do note that the results presented in the cross-language acoustic comparisons and the non-native categorisation tasks are descriptive, as we use listeners' categorisation patterns to predict discrimination results in Section 4. With regard to a power analysis of the sample size, for experimental designs with repeated measures analysed with mixed-effects models, Brysbaert and Stevens (2018) recommend a sample size of at least 1600 observations per condition. In our non-native discrimination task, each of the 40 participants completed 40 trials per BP contrast; this recommendation was therefore met (40 participants × 40 trials = 1600 observations per BP contrast). We do acknowledge a loss of five participants in the non-native categorisation task and we address how this affects our power in our modelling analyses in Section 4.

#### **2. Cross-Language Acoustic Comparisons**

#### *2.1. Participants*

Twenty Australian English (AusE) monolingual listeners from Western Sydney and twenty European Spanish (ES) monolingual participants from Madrid participated in this study. All participants were Australian English or European Spanish listeners currently residing in Greater Western Sydney or Madrid, respectively, and aged between 18 and 30 years old. The AusE participants reported little to no knowledge of any foreign language. The ES participants reported little to intermediate knowledge of English and little to no knowledge of any other foreign language. AusE participants were recruited through the Western Sydney University psychology pool or from the Greater Western Sydney region, and received \$40 AUD for their participation. ES participants were recruited from universities and institutes around the Universidad Nacional de Educación a Distancia and received €30 for their time. All participants were part of a larger-scale study that looked at the interrelations among non-native speech perception, spoken word recognition and non-native speech production. All participants provided informed consent in accordance with the ethical protocols in place at the Universidad Nacional de Educación a Distancia and the Western Sydney University Human Research Ethics Committee.

#### *2.2. Stimuli and Procedure*

AusE and ES participants completed a native production task in which they read pseudo-words containing one of the 13 Australian English monophthongs, namely, /i:, I, I@, e, e:, 3:, 5, 5:, æ, o:, O, 0 and 0:/, or one of five European Spanish vowels, /i, e, a, o, u/, in the /fVf/ (AusE) or /fVfo/ (ES) context. There were 10 repetitions of each vowel, presented in a randomised order, which provided a total of 130 tokens for AusE and 50 tokens for ES per participant. The tokens we used for the analysis of BP vowels were the same as those we used as stimuli in the non-native categorisation and non-native discrimination tasks. That is, they were tokens presented in pseudo-words in the /fVfe/<sup>4</sup> context, produced by five male and five female speakers from São Paulo, selected from the Escudero et al. (2009) corpus. There were a total of 70 BP vowel tokens (one repetition per vowel, per speaker). These BP pseudo-words were produced in isolation and within a carrier sentence, e.g., "Fêfe. Em fêfe e fêfo temos ê", which translates to "Fêfe. In fêfe and fêfo we have ê" (Escudero et al. 2009). In our analyses, we selected the vowel in the first syllable of the isolated word, which was always stressed and corresponded to one of the seven Portuguese vowels /i, e, E, a, o, O, u/. We used WebMaus (Kisler et al. 2012), an online tool for automatically segmenting and labelling speech sounds, to segment vowels within each target word in each language (AusE, ES and BP). The automatically generated start and end boundaries were checked and manually adjusted to ensure that they corresponded to the onset/offset of voicing and vocalic formant structure. Vowel duration was measured as the time (ms) between these start and end boundaries. Formant measurements for each vowel token were extracted at three time points (25%, 50%, 75%) following the optimal ceiling method reported in Escudero et al. (2009), in order to ensure that our methods of formant extraction are comparable across both the target and native languages. In the optimal ceiling method, the "ceiling" for formant measurements is selected by vowel and by speaker to minimize variation for the first and second formant values. Formant ceilings ranged between 4500 and 6500 Hz for females and between 4000 and 6000 Hz for males.

#### *2.3. Results: Cross-Language Acoustic Comparisons*

Figure 2 shows the average (of all speakers) midpoint F1 and F2 normalised values of the thirteen AusE (black) and five ES (blue) vowels, together with the average (of all speakers) midpoint F1 and F2 normalised values for the BP (purple, circled) vowels that were selected from Escudero et al. (2009) and used as stimuli in the present study. The Lobanov (1971) method was implemented to normalise vowels using the NORM suite (Thomas and Kendall 2007) in R. This specific normalisation method was chosen because it resulted in the best classification performance for the same Brazilian Portuguese vowels used in this study as shown by Escudero and Bion (2007).
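As an illustration of the kind of speaker-intrinsic normalisation described here, the following is a minimal Python sketch of a Lobanov-style z-score transform. It is not the NORM suite implementation used in the study (which runs in R). The dictionary keys, the function name `lobanov_normalise`, the use of the sample standard deviation and the formant values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def lobanov_normalise(tokens):
    """Z-score F1 and F2 within each speaker (Lobanov-style normalisation).

    `tokens` is a list of dicts with the hypothetical keys
    'speaker', 'vowel', 'f1' and 'f2' (formants in Hz).
    Returns copies of the dicts with 'f1_norm' and 'f2_norm' added.
    """
    # Group tokens by speaker, because the transform is speaker-intrinsic.
    by_speaker = {}
    for tok in tokens:
        by_speaker.setdefault(tok["speaker"], []).append(tok)

    normalised = []
    for toks in by_speaker.values():
        # Per-speaker mean and (sample) standard deviation of each formant.
        stats = {}
        for formant in ("f1", "f2"):
            values = [t[formant] for t in toks]
            stats[formant] = (mean(values), stdev(values))
        for t in toks:
            out = dict(t)
            for formant in ("f1", "f2"):
                mu, sd = stats[formant]
                out[formant + "_norm"] = (t[formant] - mu) / sd
            normalised.append(out)
    return normalised


# Made-up midpoint values for one speaker, for illustration only.
demo = [
    {"speaker": "s1", "vowel": "i", "f1": 300.0, "f2": 2300.0},
    {"speaker": "s1", "vowel": "a", "f1": 800.0, "f2": 1400.0},
    {"speaker": "s1", "vowel": "u", "f1": 350.0, "f2": 900.0},
]
normalised_demo = lobanov_normalise(demo)
```

After the transform, each speaker's formant values are centred on zero, so tokens from speakers with different vocal tract sizes can be compared in a shared space.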

Visual inspection of the plot reveals that although AusE has many more vowels in its native vowel inventory than ES, the vowels of both languages fall in and around similar locations along a rough inverted triangle within the acoustic space. Following Strange et al. (2004) and Escudero and Vasiliev (2011), we conducted a series of discriminant analyses as a quantitative measurement of acoustic similarity and used these analyses to predict listeners' non-native categorisation patterns. Before comparing our target language's acoustic similarity with Brazilian Portuguese, we first needed to determine how a trained AusE or ES discriminant analysis model would classify tokens from the same native language (known as a cross-validation method). To this end, we fit four separate linear discriminant analysis models: AusE females; AusE males; ES females; ES males. These analyses were conducted to determine the underlying acoustic parameters that predict the vowel categories for test tokens from the BP corpus. The input parameters were F1 and F2 (normalised) values measured at the vowel midpoint (i.e., 50%) as well as duration. We also ran discriminant analyses using F1, F2 and F3 (Bark), duration and formant trajectory as input parameters. We report the results for the discriminant analyses using normalised values as they were more accurate than the values in Bark for both languages. The ES model yielded 98% correct classifications for both males and females, and the AusE model yielded 91.2% (females) and 90.4% (males) correct classifications.

<sup>4</sup> Given the fact that the CVCV context is the most common word structure in Spanish, we specifically chose to analyse the /fVfo/ context in the ES native production task. This also prevented the ES participants from producing the target BP stimuli and thereby having an unfair advantage over the AusE participants. We do acknowledge, however, that the post-stressed /e/ in the second syllable of the BP target items is different to the post-stressed /o/ in the second syllable of the Spanish targets, which may have a minor impact on the stressed V acoustic parameters in BP as compared to ES.

<sup>5</sup> See Elvin et al. (2016); Williams et al. (2018) and Escudero et al. (2018) for an overview and visualization of the AusE formant trajectories.

**Figure 2.** The left panel shows the averaged normalised F1 and F2 values (Hz) for the thirteen AusE<sup>5</sup> (black) and seven BP (gray, circled) vowels. The right panel shows the averaged normalised F1 and F2 values (Hz) for the five ES (black) and seven BP (gray, circled) vowels.

We then conducted a cross-language discriminant analysis, using F1 and F2 normalised values (measured at 50%) and duration as input parameters, to determine how likely the BP vowel tokens would be categorised in terms of AusE and ES vowel categories. We fit one model for each individual AusE and ES listener for a total of 40 LDA models. For each individual AusE model, the training data consisted of the 6 tokens of each AusE vowel produced by the same speaker for whom the model was being tested, resulting in a total of 78 native tokens. The test tokens were the same for each of the individual LDA models, that is, the 70 male and female BP tokens which were also used as stimuli in the non-native categorisation and discrimination tasks.

In a typical discriminant analysis, as used in some previous work, the vowels in the test corpus (in our case BP) are categorised with respect to linear combinations of acoustic variables established by the input corpus (Strange et al. 2004). In other words, the discriminant analysis tests how well the BP tokens fit with the centres of gravity of the input corpus tokens (AusE or ES), providing a predicted probability that each vowel token will be categorised as one of the native vowel categories (Strange et al. 2004). The native vowel category that receives the highest probability for a given BP vowel indicates the native vowel that is acoustically closest to the non-native vowel.

Given the fact that we only have one token per vowel per BP speaker (5 male and 5 female), rather than reporting the overall percentage of times a BP vowel was categorised as a native vowel as is commonly reported (and is usually based on many more tokens), we instead report the probabilities of group membership averaged across the BP vowel tokens: for each individual BP vowel token, we report the predicted probability of it being categorised as any of the 13 native AusE or 5 native ES vowels and average these probabilities over all speakers' tokens for that BP vowel. The benefit of reporting average probabilities across tokens in the present study is that it takes into account that some BP tokens may be acoustically close to more than one vowel, which can be masked by categorisation percentages. The predicted probabilities, averaged first across the BP tokens for each individual listener and then across all listeners in a group, are shown in Table 1 for AusE and in Table 2 for ES.
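The two-step averaging described above, and the way hard categorisation percentages can mask secondary similarity, can be sketched as follows. This is an illustrative Python sketch only: the function names, listener labels and probability values are hypothetical and are not taken from the study's data.

```python
def mean_probabilities(rows):
    """Average a list of {category: probability} dicts category by category."""
    categories = sorted({c for row in rows for c in row})
    return {c: sum(row.get(c, 0.0) for row in rows) / len(rows) for c in categories}


def group_average(per_listener_token_probs):
    """Average across tokens within each listener model, then across listeners.

    `per_listener_token_probs` maps a listener id to a list of posterior
    probability dicts, one dict per test token of a given non-native vowel.
    """
    per_listener = [mean_probabilities(toks) for toks in per_listener_token_probs.values()]
    return mean_probabilities(per_listener)


def winner_percentages(rows):
    """Proportion of tokens whose most probable category is each category."""
    winners = [max(row, key=row.get) for row in rows]
    return {c: winners.count(c) / len(winners) for c in sorted(set(winners))}


# Hypothetical posteriors for two tokens of one vowel under two listener models.
demo = {
    "listener_1": [{"e": 0.55, "i": 0.45}, {"e": 0.60, "i": 0.40}],
    "listener_2": [{"e": 0.90, "i": 0.10}, {"e": 0.80, "i": 0.20}],
}
averaged = group_average(demo)
hard = winner_percentages(demo["listener_1"])
```

In this made-up example, the winner-takes-all tally for `listener_1` assigns every token to "e", whereas the averaged probabilities retain the substantial secondary similarity to "i".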

**Table 1.** Average probability scores of predicted group membership for male and female BP tokens tested on each individual AusE listener model. Probabilities are averaged across the individual discriminant analysis for each speaker. The native vowel category with the highest probability appears in a cell in bold, with no shading, probabilities above chance appear in cells shaded dark grey and probabilities below chance (i.e., 0.08) appear in cells shaded light grey.


**Table 2.** Average probability scores of predicted ES vowel group membership for BP male and female tokens tested on the ES model. Probabilities are averaged across the individual discriminant analysis for each speaker. The native vowel category with the highest probability appears in a cell in bold, with no shading, probabilities above chance appear in cells shaded dark grey and probabilities below chance (i.e., 0.20) appear in cells shaded light grey.


We take the across-individual average probability of vowel group membership to correspond to the degree of acoustic similarity, i.e., a high probability indicates a high level of acoustic similarity. Table 1 shows the averaged probabilities of predicted group membership, which we interpret as being representative of the "average listener". The table of averaged probability scores reveals that each BP vowel showed a strong similarity to a single AusE vowel. However, six of these also showed lower levels of above-chance similarity to one or more other AusE vowels. In the cases where a BP vowel is acoustically similar to two or more AusE vowels, the similarity is not equal across the vowel categories, with the probability scores indicating a greater likelihood of classification of one vowel over the other. An acoustic categorisation overlap can be observed for BP contrasts /i/–/e/ and /o/–/u/, where each vowel in the BP contrast is acoustically similar to the same native AusE vowel(s). In the case of BP /i/–/e/, there is a 0.71 probability that BP /i/ will be categorised as AusE /I/ and a 0.59 probability that BP /e/ will also be categorised as AusE /I/. There is also a 0.22 probability that BP /i/ will be categorised as AusE /i:/, and a probability of 0.13 that BP /e/ will be categorised as AusE /i:/. There is also a 0.11 probability that BP /e/ will be categorised as AusE /I@/ as well as a 0.10 probability that it could be categorised as AusE /0:/. For BP /o/–/u/, there is a 0.53 probability that BP /o/ will be categorised as AusE /0/ and a 0.92 probability that BP /u/ will be categorised as AusE /0/. We also observe a 0.20 probability that BP /o/ will be categorised as AusE /o:/. Partial acoustic overlapping is also observed in the BP /o/–/O/ contrast, with a 0.26 probability for BP /o/ and a 0.25 probability for BP /O/ to be categorised as AusE /O/. There is also a 0.32 probability that BP /O/ will be categorised as AusE /5/ and we therefore observe a very minimal acoustic overlap with BP /a/. Although there is a 0.74 probability that BP /a/ will be categorised as AusE /æ/, we do see a 0.15 probability that BP /a/ will be categorised as AusE /5/. Finally, we do not see any acoustic overlapping in the BP /a/–/E/ contrast.

Table 2 shows the BP tokens tested on the ES model. The results indicate that BP /i/, /E/, /a/ and /u/ are each acoustically similar to different single native categories, namely ES /i/, /e/, /a/ and /u/, while the remaining three BP vowel categories (/e/, /o/ and /O/) show moderate (but much lower) acoustic similarity to a second ES vowel. Similar to the AusE categorisation above, when a BP vowel is acoustically similar to two ES vowels, the similarity is not equal across both categories, with the probability scores indicating a greater likelihood of classification of one vowel over the other. For example, there is a 0.77 probability that BP /e/ will be categorised as ES /e/ and a 0.23 probability that it will be categorised as ES /i/. There is also a 0.78 probability that BP /o/ will be categorised as ES /u/ and only a 0.22 probability that it will be categorised as ES /o/. In the case of BP /O/, there is a 0.64 probability that it will be categorised as ES /o/ and a 0.27 probability that it will be categorised as ES /e/.

Cases of acoustic overlap can be identified in the ES predicted probabilities of categorisation for the BP contrasts /i/–/e/, /e/–/E/, /o/–/u/ and /o/–/O/. Specifically, in the case of BP /e/–/E/, both vowels were acoustically closest to ES /e/, and in the case of BP /o/–/u/ both vowels were closest to ES /u/. Furthermore, there is a small amount of acoustic overlap in the BP contrasts /i/–/e/ and /o/–/O/. In BP /i/–/e/, while the majority of /i/ tokens and the majority of /e/ tokens were acoustically similar to ES /i/ and /e/, respectively, a smaller percentage of the BP /i/ tokens were acoustically similar to ES /e/ and a smaller percentage of BP /e/ tokens were acoustically similar to ES /i/. A similar result is found with /o/–/O/ as the majority of the BP /O/ tokens were acoustically similar to ES /o/ and a smaller percentage of BP /o/ tokens were acoustically similar to ES /o/. Finally, we do not see any evidence of acoustic overlap for BP /a/–/E/ and /a/–/O/.

#### *2.4. L2LP Predictions for Non-Native Categorisation*

The acoustic similarity as determined by the probability scores from our discriminant analyses is used to predict perceived phonetic similarity in a categorisation task. For AusE, there are several cases where the two vowels in the BP contrast are acoustically similar to more than two native categories (in other words, there were predicted probabilities that the BP vowel tokens could be categorised as more than two native categories, with a predicted probability greater than chance). We therefore predict several cases of L2LP's SUBSET EASY and SUBSET DIFFICULT scenarios. Based on the acoustic results averaged across participants, it is likely that all BP vowels will be categorised as more than one native category. For BP /a/–/E/, there is no acoustic overlapping and thus, in non-native categorisation, we expect to find the SUBSET EASY scenario where each BP vowel in the contrast is perceived as more than one native category, but there is no overlapping between the response choices for the two vowels. Where acoustic overlapping occurs, we expect to find the SUBSET DIFFICULT learning scenarios in the non-native categorisation patterns. Specifically, we expect to find perceptual overlap in the non-native categorisation of the BP contrasts /i/–/e/ and /o/–/u/. Partial acoustic overlapping might also lead to instances of the SUBSET DIFFICULT scenario for BP /a/–/O/ and /o/–/O/.

For the ES listeners, we predict on the basis of the LDA results that most BP vowels should be categorised as one single native category. In particular, we expect that BP /i/ and /a/ will be categorised as /i/ and /a/, respectively. Furthermore, we expect to see cases of L2LP's NEW and SIMILAR scenarios. Specifically, we expect to observe instances of the SIMILAR scenario for ES participants in the BP /i/–/e/ and /a/–/E/ contrasts because both vowels in the BP contrast are acoustically similar to separate native categories, with predicted probabilities above 75%. Despite the fact that categorisation of BP /O/ is spread across multiple response categories, we would still predict a case of the SIMILAR scenario for BP /a/–/O/ given the fact that there is no acoustic overlap in the response categories. In contrast, examples of the NEW scenario are predicted for BP /e/–/E/ and /o/–/u/ because both BP /e/ and /E/ are acoustically similar to ES /e/, and both /o/ and /u/ are acoustically similar to ES /u/, with predicted probabilities above 75%. It is likely that BP /o/ will be categorised into two native categories, as there is a 0.78 probability that it will be categorised as ES /u/ and a 0.22 probability that it will be categorised as ES /o/. Finally, BP /O/ should predominantly be categorised as ES /o/, but it might also be categorised as ES /e/.

#### *2.5. L2LP Predictions for Non-Native Discrimination*

The L2LP model claims that discrimination difficulty can be predicted by the acoustic similarity between native and target language vowel categories, unlike PAM, which makes predictions based on articulatory-phonetic similarity and collects perceptual assimilation data and category-goodness ratings to test its predictions. Perhaps the reason that acoustic similarity can be used to predict discrimination difficulty is that acoustic properties and articulation relate to one another (Noiray et al. 2014; Blackwood Ximenes et al. 2017; Whalen et al. 2018). For example, Noiray et al. (2014) have shown that variation in vowel formants corresponds closely to variations in the vocal tract area function and even coarser grained articulatory measures such as height of the tongue body. Whalen et al. (2018) compared articulatory and acoustic variability using data from an x-ray microbeam database and found that, contrary to popular belief, articulation was not more variable than acoustics, but that variability was consistent across vowels and that articulatory and acoustic variability were related for the vowels. Given this relationship, it seems reasonable that acoustic similarity be equal to perceptual similarity in its ability to predict discrimination difficulty. As mentioned in the introduction, we are interested in whether or not acoustic similarity can predict discrimination accuracy and, in particular, whether it is comparable to perceptual similarity. One way to measure acoustic and/or perceptual similarity is to calculate the amount of acoustic/perceptual overlap that can be found in a given BP contrast. When two vowels in BP are acoustically/perceptually similar to the same listener vowel category(ies), discrimination of the BP vowels is predicted to be difficult. In this section we use the LDA results to determine how much BP vowels overlap with our listeners' native vowel categories, and a similar method will be used to measure the amount of perceptual similarity in the non-native categorisation task. We predict that the perceptually easy contrasts for both groups of listeners to discriminate would be those with little to no acoustic overlap (i.e., the two vowels in the BP contrast are acoustically similar to different native categories). The BP contrasts with a large amount of acoustic overlap (i.e., the two BP vowels in the contrast are acoustically similar to the same native category(ies)) should be difficult to discriminate.

To quantify acoustic overlap, we adopted Levy's (2009) "cross language assimilation overlap" method. This method provides a quantitative score of overlap between the members of a non-native contrast and native categories. Although the method was originally designed to compute perceptual overlap scores based on listeners' perceptual assimilation patterns for testing predictions in PAM (and we do in fact apply it to our categorisation data), here we apply it to our LDA results. Each overlap score was calculated by adding categorisation probabilities in cases where the two vowels in the BP contrast were categorised as the same native categories. This gives an aggregate probability of perceiving the two BP vowels as the same native category. For example, in the case of BP /i/–/e/, as observed in Table 1, there was a non-zero probability that both BP /i/ and BP /e/ would each be categorised as AusE /i:/, /I/, /I@/ and /0:/. To calculate the overlap score for this contrast, we took, for each native vowel category to which both BP vowels had a probability of being categorised, the smaller of the two probabilities, and added these together. Thus, in the case of AusE /I/, there was a 0.71 probability that BP /i/ would be categorised as this vowel and a 0.59 probability that BP /e/ would also be categorised as AusE /I/, so the smaller proportion is 0.59. The smaller proportions for AusE /i:/ (0.13), AusE /I@/ (0.05) and AusE /0:/ (0.02) were likewise included in the calculation. Summing the smaller proportions, the acoustic overlap for BP /i/–/e/ was calculated as follows: AusE /I/ 0.59 + AusE /i:/ 0.13 + AusE /I@/ 0.05 + AusE /0:/ 0.02 = 0.79 acoustic overlap. Table 3 shows the acoustic overlap scores for each language.
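The overlap calculation for BP /i/–/e/ can be reproduced with a short Python sketch. The function name `overlap_score` and the dictionary layout are illustrative assumptions. The probabilities are those quoted in the text for the AusE models, with the 0.05 and 0.02 values attributed to BP /i/ on the assumption that they are the smaller member of each pair (the text gives 0.11 and 0.10 for BP /e/).

```python
def overlap_score(probs_a, probs_b):
    """Sum, over native categories shared by both vowels, the smaller probability."""
    shared = set(probs_a) & set(probs_b)
    return sum(min(probs_a[c], probs_b[c]) for c in shared)


# Predicted probabilities of AusE category membership quoted in the text.
# Keys use the same ASCII vowel labels as the surrounding prose.
bp_i = {"I": 0.71, "i:": 0.22, "I@": 0.05, "0:": 0.02}  # 0.05 and 0.02 assumed to belong to BP /i/
bp_e = {"I": 0.59, "i:": 0.13, "I@": 0.11, "0:": 0.10}

score = overlap_score(bp_i, bp_e)
print(round(score, 2))  # 0.79
```

A score near 0 indicates that the two non-native vowels map onto different native categories, whereas a score near 1 indicates that they map onto the same categories in similar proportions.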

**Table 3.** Acoustic overlap scores for AusE and ES listeners.


Based on the acoustic overlap scores in Table 3, we would predict that the BP contrasts with little to no acoustic overlap would be perceptually easier to discriminate than those with higher overlap scores. In particular, both groups should find BP /o/–/u/ difficult to discriminate and BP /a/–/E/ easy to discriminate. However, these acoustic comparisons do predict group differences: ES listeners should find BP /a/–/O/ and /i/–/e/ easier to discriminate than AusE listeners, whereas BP /e/–/E/ should be easier for AusE than ES listeners to discriminate.

#### **3. Non-Native Categorisation**

#### *3.1. Participants*

The participants in the non-native categorisation task were the same as those previously reported in the cross-language acoustic comparisons. However, the non-native categorisation results of five ES participants were excluded due to an error that occurred during testing.

#### *3.2. Stimuli and Procedure*

Participants were presented with the same BP pseudo-words that served as the test data for the discriminant analyses in the cross-language acoustic comparisons. There were a total of 70 /fVfe/ target words (7 vowels × 10 speakers), as well as three additional nonsense words by each speaker (/pipe/, /kuke/ and /sase/), included as filler words. Thus, in the non-native categorisation task we had a total of 100 BP word tokens (70 target and 30 fillers).

In keeping with Vasiliev (2013; see also, e.g., Tyler et al. 2014), this vowel categorisation task followed the discrimination task (reported in the next section) because we wanted to avoid any familiarisation with the natural stimuli in the discrimination task. We present the results of the categorisation task first because these results are used to make predictions about discrimination. In the categorisation task, participants categorised the stressed vowel sound of each target BP word (i.e., the target vowel) to one of their own 13 AusE or 5 ES vowel categories. Unlike Spanish, English is not orthographically transparent. Thus, while the ES listeners saw the 5 vowel categories (i, e, a, o, u) on the screen, the AusE vowels were presented in one of the 13 keywords, heed, hid, heared, head, haired, heard, hud, hard, had, hoard, hod, hood and who'd, which correspond to the AusE phonemes /i:, I, I@, e, e:, 3:, 5, 5:, æ, o:, O, 0, 0:/, respectively. Participants heard each target and filler item once, and were required to choose one of their own native response options on each trial, even when unsure. The task did not move on to the next trial until a response had been chosen. All trials were presented in a randomised order. Participants received a short practice session before beginning the task and took approximately 10 min to complete it.

#### *3.3. Results*

Tables 4 and 5 present the percentage of times each BP vowel was categorised by each group as a native AusE or a native ES vowel, respectively.
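Classification percentages of this kind, together with the chance level implied by the number of response options (1/13 ≈ 0.08 for AusE and 1/5 = 0.20 for ES), can be tallied as in the following Python sketch. The function name and the response list are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def classification_percentages(responses, n_options):
    """Return ({category: (percentage, above_chance)}, chance_percentage).

    `responses` is the list of native categories chosen for one non-native
    vowel; `n_options` is the number of response options on the screen.
    """
    counts = Counter(responses)
    total = len(responses)
    chance = 100.0 / n_options  # chance level for a uniform random choice
    table = {}
    for category, count in counts.items():
        pct = 100.0 * count / total
        table[category] = (pct, pct > chance)
    return table, chance


# Made-up responses from a five-option (ES-style) task for one BP vowel.
demo_responses = ["o"] * 7 + ["u"] * 2 + ["e"] * 1
demo_table, demo_chance = classification_percentages(demo_responses, n_options=5)

# Chance level for a thirteen-option (AusE-style) task, as a proportion.
ause_chance = 1 / 13
```

In this invented example, "o" (70%) is above the 20% chance level, while "u" (20%) and "e" (10%) are not.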

**Table 4.** Australian English listeners' classification percentages. The native vowel category with the highest classification percentage appears in bold, classification percentages above chance appear in cells shaded dark grey and classification percentages below chance (i.e., 0.08) appear in cells shaded light grey.


**Table 5.** European Spanish listeners' classification percentages. The native vowel category with the highest classification percentage appears in bold, classification percentages above chance appear in cells shaded dark grey and classification percentages below chance (i.e., 0.20) appear in cells shaded light grey.


The categorisation percentages reported in Table 4 (AusE) and Table 5 (ES) are in line with our prediction based on acoustic similarity that most BP vowels would be categorised into more than two native categories by AusE listeners, and that most BP vowels would instead be categorised into one single native category by ES listeners.

Indeed, as predicted, all BP vowels were categorised as two or more native AusE categories, and there was some evidence of perceptual overlap between the expected pairs of BP vowels. In particular, we found examples of the SUBSET DIFFICULT scenario in the BP contrasts /i/–/e/, /e/–/E/, /o/–/O/ and /o/–/u/, where both vowels in the contrast were categorised into two or more of the same native AusE vowels.

With respect to the BP /i/–/e/ contrast, BP /i/ was categorised as AusE /i:/ and as AusE /I/ 43% of the time each. This finding was predicted by acoustic similarity; however, instead of there being a larger percentage of categorisation to AusE /I/ as expected, categorisation was split equally across the two AusE categories. As for BP /e/, our discriminant analysis indicated it would be categorised across four native vowel categories, namely /i:/, /I/, /I@/ and /0:/, with the largest classification percentage predicted to be to AusE /I/. This prediction was largely consistent with AusE listeners' non-native categorisation patterns, with /e/ being categorised as AusE /i:/ (23%), /I/ (20%) and /I@/ (15%). Although the discriminant analysis prediction was that BP /e/ would be categorised as AusE /0:/ to a small extent (10%), this was not observed, and listeners instead rather substantially categorised the vowel to two unpredicted AusE vowels, /e/ (14%, i.e., as often as to /I@/) and /e:/ (23%, i.e., equal to the actual choices of the top acoustically predicted choice /i:/).

As for the BP contrast /e/–/E/, BP /E/ tokens were expected to be predominantly categorised as AusE /e/ and /e:/, two of the AusE vowels to which BP /e/ was categorised. This was indeed the case, as BP /E/ was categorised as AusE /e:/ 58% of the time and as /e/ 14% of the time.

For the BP /o/–/O/ contrast, a large majority of BP /o/ tokens were acoustically predicted to be categorised as AusE /0/, with a much lower probability that some would also be categorised as AusE /o:/ and /O/. However, the listeners actually reversed the balance between the two AusE vowel categories: the majority of BP /o/ tokens were instead categorised as AusE /o:/ (53% of the time), and as AusE /0/ only 23% of the time. Furthermore, our acoustic predictions suggested that BP /O/ tokens would be categorised as a number of different AusE vowels, specifically AusE /5/, /O/ and /e/. However, the non-native categorisation patterns indicate that the great majority of BP /O/ tokens were categorised as AusE /o:/ (58% of the time), with only 15% being categorised as /O/, 13% as /5:/ and none as AusE /e/.

Finally, with respect to the BP /o/–/u/ contrast, while our acoustic analysis successfully predicted the majority of BP /u/ tokens to be categorised as AusE /0/ (65%), interestingly 12% of the tokens were categorised as /0:/, which was not predicted to be a listener choice. Recall that the results of the discriminant analysis indicated a 10% probability that BP /e/ would be categorised as AusE /0:/ but this did not occur; conversely, here we see that BP /u/ was categorised as AusE /0:/ 12% of the time, although this was not predicted acoustically.

For the BP /a/–/E/ and /a/–/O/ contrasts, our results are partially consistent with the patterns of acoustic similarity. We found that BP /a/ was categorised as /æ/ 38% of the time. However, instead of the predicted moderate level of choice of AusE /æ/ (as in had) for BP /a/, the long AusE /5:/ (as in heart) was selected most frequently, 50% of the time. Given that no perceptual overlap is observed in BP /a/–/E/, the categorisation pattern would correspond to a SUBSET EASY scenario. The same might also apply to BP /a/–/O/. However, we do see a partial overlap, with 13% of the BP /O/ tokens perceived as AusE /5:/ (which was the most frequent response for BP /a/).

The minor discrepancies between the acoustic predictions and the categorisation results could be related to the fact that we selected the best fitting discriminant model that in this case did not include F3, which conveys information related to lip rounding. It may be that although in machine learning vowels can be categorised with high accuracy using duration and normalised F1 (height) and F2 (backness) values only, human listeners may not be able to help but pay attention to other aspects of the signal. Thus, human listeners may primarily use F3 when it cues rounding, an articulatory property, but ignore it (or give it unequal weight) in cases where rounding is not present. This therefore suggests that listeners seem to perceive rounding (as opposed to F3), and the likely reason they did not choose /0:/ in the categorisation of BP /e/ is that they did not detect the rounding that they may have detected when categorising BP /u/.

Turning to the categorisation results for ES presented in Table 5, we found that acoustic similarity was largely consistent in predicting ES listeners' non-native categorisation patterns. As expected, BP /i/, /E/, /a/ and /u/ were each categorised as the different single native ES vowel categories identified in acoustics, namely /i/, /e/, /a/ and /u/, respectively.

Also in accordance with our acoustic analyses, BP /e/, /o/ and /O/ showed some degree of categorisation to more than one ES category. BP /e/ was categorised as ES /i/ and /e/ as expected. BP /o/ was categorised as both ES /o/ and /u/ as expected. However, the majority of tokens were categorised as ES /o/ instead of ES /u/, reversing the acoustic prediction. Finally, our discriminant analysis predicted a 64% probability that BP /O/ would be categorised as ES /o/ and only a 27% probability that it would be categorised as ES /e/. However, the non-native categorisation task indicated that 97% of the /O/ tokens were categorised by actual listeners as ES /o/. Thus, here we also see a case where our acoustic prediction regarding the categorisation of BP /O/ to ES /e/ was not borne out. Again, it seems as though this discrepancy could be explained by the amount of rounding in the BP /O/ and /o/ vowels. Recall that the best fitting LDA for our data was the one that included normalised duration and F1 and F2 values only, thus it did not take into consideration F3, which as mentioned above usually corresponds to lip rounding. Although our acoustic analysis for ES indicated a potential categorisation of BP /O/ to ES /e/, it is not surprising that this was not the case as BP /O/ is a rounded vowel whereas ES /e/ is not. It seems likely that human listeners would weight rounding heavily in their categorisation of BP /O/, i.e., hear it as ES /o/ rather than ES /e/.

Furthermore, the results indicate that SIMILAR and NEW L2LP scenarios are represented in these non-native categorisation results. Non-native categorisation of BP /a/–/E/ and /a/–/O/ both show evidence of the SIMILAR scenario, as neither BP contrast yielded perceptual overlapping in the ES response categories. For the remaining contrasts, we see evidence of L2LP's NEW scenario as the perceptual overlapping occurred over just one single response category.

#### *3.4. Discussion*

Based on the above findings, it appears that the non-native categorisation patterns for ES listeners were largely in line with predictions based on acoustic similarity between the target BP vowels and the listeners' production of their native vowel categories reported in the cross-language acoustic comparisons.

For AusE listeners, however, there seemed to be more discrepancies between acoustic predictions and categorisation patterns. For example, there were some cases where the acoustic analyses indicated a small probability that AusE /0:/ would be a likely response category and it was not, or vice versa. We also observed that BP /a/ tokens were categorised more frequently as AusE /5:/ rather than AusE /æ/, which was acoustically predicted to be the most likely response category. These differences between predicted and actual categorisation could be related to cue weighting as well as to dynamic features in AusE vowels. The L2LP theoretical framework includes a strong emphasis on acoustic and auditory cue-weighting (the relative importance of acoustic cues in the learner's native and target languages). Thus, it may be that listeners weight certain cues (e.g., lip rounding or duration) more than others, as has been shown in previous studies (e.g., Curtin et al. 2009).

Studies have also shown that AusE vowels are marked by dynamic formant features (Watson and Harrington 1999; Elvin et al. 2016; Escudero et al. 2018), thus it could be that listeners are searching for these dynamic features in order to categorise the target BP vowels. Future studies may consider improving the acoustic analyses by measuring the amount of spectral change in the native and target language and running discriminant analyses on those data. For example, Escudero et al. (2018) measured the amount of spectral change in three AusE vowels (/i/, /I/ and /u:/) by extracting formant values at 30 equally spaced time points. Discrete cosine transform (DCT) coefficient values were obtained, which correspond to the vowel shape in the F1/F2 space (formant means, magnitude and direction). These DCT values were used in discriminant analyses, resulting in better overall categorisation of the AusE vowels than discriminant analyses run on F1, F2 and F3 values alone. Thus, using DCT values for the native and target language may provide more reliable acoustic predictions that correspond more closely to actual human performance in non-native categorisation.
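As a rough illustration of how a formant trajectory sampled at 30 equally spaced points can be summarised with DCT coefficients, consider the following minimal sketch. The function name, the 1/N scaling and the example F2 values are assumptions made here for illustration; they are not taken from Escudero et al. (2018). With this scaling, the zeroth coefficient equals the mean of the track and the first coefficient reflects the magnitude and direction of its overall slope.

```python
import math

def dct_coefficients(track, n_coef=3):
    """Return the first n_coef DCT-II coefficients of a formant track.

    The 1/N scaling is an assumption chosen so that the zeroth
    coefficient equals the mean of the track.
    """
    n = len(track)
    return [
        sum(x * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
            for t, x in enumerate(track)) / n
        for k in range(n_coef)
    ]

# Hypothetical rising F2 track (Hz) sampled at 30 equally spaced points
n_points = 30
f2_track = [1500.0 + 10.0 * t for t in range(n_points)]
coefs = dct_coefficients(f2_track)

# coefs[0] is the track mean; coefs[1] is negative for a rising track
# under this sign convention; coefs[2] relates to curvature.
print(coefs)
```

Coefficients of this kind, computed per formant, could then replace or supplement static F1 and F2 values as predictors in a discriminant analysis.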

#### Predictions for Discrimination Accuracy

In order to predict listeners' performance in discrimination, we calculated perceptual overlap scores based on the amount of overlapping in the listeners' non-native categorisation of the BP vowels. We computed the perceptual overlap scores following Levy (2009) for our categorisation data to determine how predictions of discrimination difficulty based on non-native categorisation would compare with our predictions based on our acoustic comparisons as described in the cross-language acoustic comparisons (see Table 3). To determine a perceptual overlap score based on the categorisation percentages reported in Tables 4 and 5, we sum the smaller percentages of when both BP vowels in a given contrast are categorised as the same native vowel category. Table 6 presents the acoustic and perceptual overlap scores.
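The overlap calculation described above can be sketched as follows. This is a minimal illustration, assuming that each BP vowel's categorisation is stored as a mapping from native response category to percentage; the function name and the example percentages are hypothetical and are not the values reported in Tables 4 and 5.

```python
def perceptual_overlap(cat_a, cat_b):
    """Sum, over native categories chosen for both vowels,
    the smaller of the two categorisation percentages."""
    shared = set(cat_a) & set(cat_b)
    return sum(min(cat_a[c], cat_b[c]) for c in shared)

# Hypothetical categorisation percentages for the two vowels of a contrast
vowel_1 = {"i": 60.0, "e": 40.0}
vowel_2 = {"i": 25.0, "e": 70.0, "a": 5.0}

# Shared categories are "i" (smaller value 25) and "e" (smaller value 40)
score = perceptual_overlap(vowel_1, vowel_2)
print(score)  # 65.0
```

A contrast whose two vowels are never assigned to the same native category receives a score of zero, which corresponds to the absence of perceptual overlap.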


**Table 6.** Acoustic and perceptual overlap scores for AusE and ES listeners expressed as proportions.

When comparing the predictions for discrimination difficulty based on perceptual overlap scores with those based on acoustic overlap, the predictions are rather similar for BP /a/–/E/. That is, both acoustic similarity and non-native categorisation patterns predict that this contrast should be perceptually easy for both groups of listeners to discriminate. That is because this contrast appears to correspond to the L2LP SUBSET EASY learning scenario for AusE listeners and the SIMILAR learning scenario for Spanish listeners. Acoustic similarity and categorisation patterns indicate the same L2LP scenarios to apply to BP /a/–/O/ in both languages, and so this contrast should also be perceptually easy to discriminate.

For the remaining four contrasts, predictions based on acoustics and non-native perceptual categorisation differ. Acoustic similarity predicts BP /i/–/e/ to be perceptually difficult for AusE listeners to discriminate, due to the L2LP SUBSET DIFFICULT scenario, but the ES categorisation results suggest that this contrast is also likely to be difficult for ES listeners to discriminate, as there is evidence of the L2LP NEW scenario. Acoustic similarity predicts that both groups should find BP /o/–/u/ to be one of the most difficult contrasts to discriminate, whereas perceptual overlap scores suggest that /o/–/O/ should be the most difficult to discriminate. Difficulty for both contrasts is predicted by the L2LP SUBSET DIFFICULT scenario for AusE listeners and the NEW scenario for ES listeners.

From these findings, two possible predicted patterns of difficulty can be identified. Predictions based on acoustic similarity would suggest the following order of difficulty for the two groups (ranging from the lowest acoustic overlap score to the highest):

AusE: /a/-/E/>/a/-/O/>/e/-/E/>/o/-/O/>/i/-/e/~/o/-/u/

ES: /a/-/E/~/a/-/O/>/e/-/E/>/o/-/O/~/i/-/e/~/o/-/u/ (1)

On the other hand, non-native categorisation patterns, i.e., perceptual similarities, would predict that both AusE and ES listeners would share the same pattern of difficulty:

AusE and ES: /a/-/E/>/a/-/O/>/o/-/u/>/e/-/E/>/i/-/e/>/o/-/O/

In all cases, BP /a/–/E/ is predicted to be the easiest to discriminate, with the order of difficulty differing among the rest of the contrasts for the acoustic and perceptual predictions. An examination of the pattern of difficulty in the results for discrimination accuracy will shed light on whether discrimination difficulty is in line with predictions based on acoustic similarity or those based on non-native categorisation patterns.

#### **4. Non-Native Discrimination**

#### *4.1. Participants*

Participants in this task were the same 20 AusE and 20 ES participants previously reported in the cross-language acoustic comparisons and non-native categorisation task.<sup>6</sup>

#### *4.2. Stimuli and Procedure*

Listeners were presented with the same 70 naturally produced BP /fVfe/ target words (7 vowels × 10 speakers), selected from Escudero et al.'s (2009) corpus, previously reported and analysed in the cross-language acoustic comparisons and the non-native categorisation task.

To test for discrimination accuracy, participants completed an auditory two-alternative forced choice (2AFC) task in the XAB format, similar to that of Escudero and Wanrooij (2010), Escudero and Williams (2012) and Elvin et al. (2014). The task was run on a laptop using the E-Prime 2.0 software program.

Three stimulus items were presented per trial. The second (A) and third (B) items were always from different BP vowel categories and the first item (X) was the target item about which a matching decision was required. In each trial, X was always one of the 70 target BP words, produced by the five male and five female speakers reported above. The A and B stimuli were always produced by the seventh male or seventh female speaker from the Escudero et al. (2009) corpus to avoid any confusion of overlapping target stimuli and response categories. Within a trial, the speaker of the A stimulus was always of the same gender as the speaker of the B stimulus. This differs from the Elvin et al. (2014) study where the A and B stimuli were synthetic. Furthermore, the order of the A and B responses was counterbalanced (namely, XAB and XBA). On each trial, participants were instructed to listen to the three words using headphones and were required to make a decision as to whether the first word they heard sounded more like the second or the third.

For the first ten participants for each language group, testing consisted of six blocks of categorical discrimination tasks, with a short break permitted between blocks. Each block consisted of 40 trials with one of the six BP contrasts, namely /a/–/O/, /a/–/E/, /i/–/e/, /o/–/u/, /e/–/E/ and /o/–/O/, with the blocks presented in a randomised order. To determine whether discrimination accuracy differs when stimuli are blocked by contrast or randomised, the remaining 10 participants per group completed the same discrimination task, with the same breaks, but with the stimulus contrasts presented in random order, unblocked.

<sup>6</sup> We note that the five participants missing from the non-native categorisation are included in our analyses of non-native discrimination. We do address the issue of the missing data from the categorisation task in our comparison of cross-linguistic acoustic similarity vs. perceptual similarity below.

#### *4.3. Results*

We conducted a repeated-measures Analysis of Variance (ANOVA), with contrast as a within-subjects factor and condition (blocked, randomised) as a between-subjects factor, to evaluate whether blocking by BP contrast has an effect on overall performance. The results yielded no significant effect of condition on performance, [*F*(1, 38) = 1.905, *p* = 0.176, η<sub>p</sub><sup>2</sup> = 0.048], suggesting that listeners had similar accuracy scores regardless of the condition (blocked vs. randomised). Thus, Figure 3 shows discrimination accuracy for the AusE and ES groups, including their variability, across the six BP vowel contrasts for both conditions pooled together.

**Figure 3.** Overall discrimination accuracy including variability for AusE and ES.

The figure shows that the average accuracy scores are comparable across the two language groups. Both groups appear to have highest accuracy for /a/–/O/ and /a/–/E/ and lowest accuracy for /i/–/e/, /o/–/O/ and /o/–/u/, with intermediate accuracy on /e/–/E/.

In order to test for differences across the contrasts and between the two groups, a mixed-effects binary logistic model was fitted in SPSS with participant, X stimulus and trial included as random effects and BP contrast and language group included as fixed effects. Recall that for experiment designs with repeated measures analysed with mixed-effects models, Brysbaert and Stevens (2018) recommend a sample size of at least 1600 observations per condition. As each of the 40 participants completed 40 trials per BP contrast, this recommendation was met (40 participants × 40 trials = 1600 observations per BP contrast).

The model revealed a significant main effect of contrast [χ<sup>2</sup> (5, N = 9599) = 646.212, *p* ≤ 0.001]. This significance is based on a comparison of nested models by the likelihood ratio test. There was no significant effect for language group [χ<sup>2</sup> (1, N = 9599) = 0.880, *p* = 0.348]. However, the interaction of BP contrast × language group [χ<sup>2</sup> (5, N = 9599) = 19.35, *p* = 0.002] was significant. This confirms that discrimination accuracy varies depending on the BP contrast and that although there are no reliable differences between AusE and ES in terms of overall accuracy, the two groups did differ in their performance on some of the BP contrasts. We ran Fisher's LSD-corrected post-hoc pairwise comparisons to determine the group differences across the contrasts and found that the ES listeners had higher accuracy than AusE participants for discrimination of BP /a/–/O/ (*p* = 0.035, 95% CI [−0.06, 0.00]), whereas the AusE participants performed better than the ES participants on BP /o/–/O/ (*p* ≤ 0.001, 95% CI [0.05, 0.14]).

Fisher's LSD-corrected post-hoc pairwise comparisons were also used to compare discrimination accuracy for each language group across the six BP contrasts. The results indicated that both groups found the same contrasts equally easy/difficult to discriminate. In particular, both groups had significantly higher accuracy scores for BP /a/–/E/ than the remaining contrasts (AusE: all *p*s ≤ 0.013, ES: all *p*s ≤ 0.001), with the exception of the ES listeners' performance on BP /a/–/O/ which was comparable to BP /a/–/E/ (*p* = 0.628). The results further indicated that both groups found BP /a/–/O/ to be significantly easier to discriminate than the remaining four contrasts (/i/–/e/, /o/–/u/, /e/–/E/ and /o/–/O/) (AusE and ES: all *p*s ≤ 0.001). The AusE participants had significantly lower accuracy scores for BP /i/–/e/ and /o/–/u/ than the other four BP contrasts (all *p*s ≤ 0.018), but comparable levels of difficulty among the latter four contrasts (*p* = 0.339). Likewise, the ES participants had comparable levels of difficulty for BP /i/–/e/ and /o/–/u/, but also /o/–/O/ (*p*s = 0.233–0.676), with significantly lower accuracy scores on these contrasts than the remaining three contrasts (all *p*s ≤ 0.001). The results indicate that there was no significant difference between BP /e/–/E/ and BP /a/–/O/ or /a/–/E/ for both AusE and ES listeners. However, BP /e/–/E/ was significantly easier for both groups to discriminate than the remaining three contrasts (AusE and ES: all *p*s ≤ 0.001). Based on the results from the statistical analyses, the order of difficulty from least difficult to most difficult (where "~" means equal or comparable difficulty and ">" signifies higher accuracy) is as follows:

AusE: /a/-/E/>/a/-/O/>/e/-/E/>/o/-/O/>/i/-/e/~/o/-/u/

ES: /a/-/E/~/a/-/O/>/e/-/E/>/o/-/O/~/i/-/e/~/o/-/u/ (2)

#### *4.4. Acoustic vs. Perceptual Similarity as a Predictor of Non-Native Discrimination*

Recall that we predicted two possible patterns of discrimination difficulty, depending on whether findings would be more consistent with predictions based on acoustic similarity or those based on non-native categorisation patterns as determined by the degree of perceptual overlap. As predicted by both acoustic similarity and perceptual overlap, BP /a/–/E/ was indeed easiest for both groups to discriminate. We also find that in line with the acoustic and perceptual overlap predictions, the BP /a/–/O/ contrast was indeed perceptually easy for ES listeners. However, it was also perceptually easy for AusE listeners. As predicted acoustically, BP /i/–/e/ was indeed difficult for AusE listeners and, in line with the perceptual overlap predictions, this contrast was also difficult for ES listeners. We also find that in line with our acoustic predictions, BP /o/–/u/ was difficult for both groups to discriminate. In comparison, the AusE listeners' results for BP /e/–/E/ and /o/–/O/ were more in line with predictions based on the perceptual overlap scores of their non-native categorisation patterns.

To assess quantitatively how different measures of vowel category overlap (acoustic vs. perceptual) relate to discrimination, we fit mixed-effects binomial logistic regression models using the glmer function (binomial family) in R (3.5.1). Accuracy was the dependent variable (correct vs. incorrect) and either perceptual overlap or acoustic overlap was the predictor (fixed factor). Rather than use the raw values for the predictor, acoustic and perceptual overlap scores for each BP contrast were rank coded from least overlap (=1) to greatest overlap (=6) in light of Levy's (2009) treatment of the overlap scores as ordinal and not as interval measures. For any instances of a tie, the average rank was assigned (as shown in Table 7). Subsequently, overlap was centred around the middle of the ranking scale, meaning that the models' intercepts represent average accuracy between ranks 3 and 4 and that the fixed effect of overlap represents the average decrease in accuracy associated with a one-unit increase in overlap rank.
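The rank coding and centring steps can be illustrated with a short sketch. The overlap scores below are hypothetical and the function names are introduced here for illustration only; the sketch simply shows how ties receive the average of the ranks they span and how ranks on a six-point scale are centred on its midpoint (3.5).

```python
def average_ranks(scores):
    """Rank contrasts from least (1) to greatest overlap,
    assigning tied scores the average of the ranks they occupy."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every contrast tied with position i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = mean_rank
        i = j + 1
    return ranks

def centre(ranks):
    """Centre ranks on the midpoint of the ranking scale."""
    midpoint = (1 + len(ranks)) / 2
    return {name: r - midpoint for name, r in ranks.items()}

# Hypothetical overlap proportions for the six BP contrasts
overlap = {"a-E": 0.0, "a-O": 0.0, "e-E": 0.2,
           "o-O": 0.5, "i-e": 0.7, "o-u": 0.7}

ranks = average_ranks(overlap)
centred = centre(ranks)
print(ranks)
print(centred)
```

With centred ranks as the predictor, the intercept of a logistic model corresponds to accuracy at the midpoint between ranks 3 and 4, as described above.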



The random factors were Participant (with random slopes for either perceptual or acoustic overlap rank, as this factor was repeated across listeners), Item (the X Stimulus-Contrast combination with random slopes for either perceptual or acoustic overlap rank, as this factor was repeated across items) and Trial.

As five ES listeners lacked perceptual assimilation data, for these participants, we used the mean individual perceptual overlap values from the remaining ES listeners and ranked the six BP contrasts accordingly. For the ES model on individual perceptual overlap, we checked whether controlling for the subgroup of five ES listeners with imputed individual overlap scores would provide a closer model fit. To do so, a likelihood ratio test was conducted comparing a model not controlling for the subgroup and a model including an effect of subgroup (the five ES listeners versus the remaining ES listeners) and its interaction with individual perceptual overlap. This showed that the more complex model provided almost no improvement over the simpler model (χ<sup>2</sup> (2) = 0.42, *p* = 0.81).

Before accepting the results of the mixed-effects models, we tested whether they were sufficiently powered to detect the smallest meaningful effect size of perceptual or acoustic overlap. This was because the models were run on the two groups' data separately unlike the previous analysis examining discrimination accuracy with both groups together, meaning there were far fewer than the recommended 1600 observations per condition (Brysbaert and Stevens 2018). We defined the smallest meaningful effect size as one fewer correct response with each one-unit increase in overlap rank (equivalent to 2.5% of trials within each listener's set of responses per BP contrast). For each model, using the SIMR package in R (Green and MacLeod 2016), 1000 Monte Carlo simulations were run where correct and incorrect responses were randomly generated such that the regression coefficient for the smallest meaningful effect of overlap rank remained the same. We deemed a model to have sufficient power if at least 80% of its simulations detected this smallest effect with a *p*-value less than 0.05. All models passed this test.

The results from the mixed models presented in Table 8 indicate that the level of acoustic overlap and perceptual overlap based on both individual and group calculations indeed influenced the participants' discrimination accuracy. This means that these measures can be reliably used to predict discrimination difficulty. To examine whether one measure of overlap (acoustic vs. perceptual and group vs. individual) better explained our discrimination data, we conducted pairwise comparisons on the Bayesian Information Criterion (BIC) from each model. BIC is intended for model selection and takes into account the log-likelihood of a model and its complexity. To quantify the weight of evidence in favour of one model over an alternative model, Bayes Factors (BFs) can be computed based on each model's BIC (Wagenmakers 2007). BFs < 3 provide weak evidence, BFs > 3 indicate positive support and BFs > 150 indicate very strong support for the alternative model (Wagenmakers 2007). For AusE listeners, the models containing group or individual acoustic overlap scores were very strongly supported over their counterpart models containing perceptual overlap scores (BFs > 150). For ES listeners, on the other hand, the opposite was the case, namely, the models containing group or individual perceptual overlap scores were very strongly supported over the counterpart models containing acoustic overlap scores (BFs > 150).
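The BIC-based approximation to the Bayes Factor described by Wagenmakers (2007) is the exponential of half the difference in BIC between two models. A minimal sketch follows; the function names and the BIC values are hypothetical and are not taken from Table 8, and the verbal labels simply follow the thresholds stated above.

```python
import math

def bayes_factor(bic_preferred, bic_other):
    """Approximate Bayes Factor in favour of the preferred model:
    exp((BIC_other - BIC_preferred) / 2)."""
    return math.exp((bic_other - bic_preferred) / 2)

def evidence_label(bf):
    """Label the evidence using the thresholds given in the text."""
    if bf > 150:
        return "very strong"
    if bf > 3:
        return "positive"
    return "weak"

# Hypothetical BIC values for two competing models of the same data
bic_acoustic = 5000.0
bic_perceptual = 5004.6

bf = bayes_factor(bic_acoustic, bic_perceptual)
print(round(bf, 2), evidence_label(bf))
```

Because the approximation depends only on the BIC difference, a difference of about 10 points already corresponds to a Bayes Factor close to 150.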


**Table 8.** Results of the mixed models for acoustic and perceptual models for groups and for individuals.

Next, we compared each group model, as reported in Table 8, with its counterpart individual model. For AusE listeners, the group acoustic overlap model provided modest positive support over the individual acoustic overlap model (BF = 9.97), whereas the group perceptual overlap model provided weak support over the individual perceptual overlap model (BF = 1.82). For ES listeners, both the group perceptual and acoustic overlap models provided very strong support over the individual models (BFs > 150). In summary, the pairwise model comparisons indicate that acoustic overlap scores better predict AusE listeners' discrimination performance, whereas perceptual overlap scores better predict ES listeners' performance. Group overlap scores better predict ES listeners' performance than individual scores, but the evidence in favour of group overlap scores is weaker for AusE listeners.

#### **5. General Discussion**

The present study investigated whether Australian English (AusE) and European Spanish (ES) listeners differed in their categorisation and discrimination of Brazilian Portuguese (BP) vowels. Specifically, we were interested in whether acoustic similarity (based on individuals' own native production data) predicted their non-native categorisation patterns, as predicted by L2LP (Escudero 2005, 2009; van Leussen and Escudero 2015; Elvin and Escudero 2019), as well as whether perceptual similarity also predicted discrimination accuracy to a better, worse or same degree as acoustics. We further investigated whether individual native vowel production and categorisation patterns better predicted non-native discrimination than production and/or perception averages. We conducted a comprehensive acoustic analysis of the cross-linguistic differences between the listeners' own native vowel production and the target BP vowels in order to predict their non-native categorisation patterns from acoustic similarities, according to L2LP principles. We further calculated the amount of acoustic and perceptual overlap (i.e., where the two vowels in a BP contrast were acoustically similar to/perceived as the same native vowel category[ies]) in order to predict discrimination difficulty. We predicted that the greater the acoustic and/or perceptual overlap, the more difficult the BP contrast would be to discriminate.

Our results indicated that AusE and ES listeners' patterns of non-native categorisation were partially consistent with L2LP predictions based on the cross-linguistic acoustic similarity between the listeners' own native vowel productions and the target BP vowels. For AusE listeners, acoustic similarity successfully predicted cases of L2LP's SUBSET scenario in that each BP vowel was categorised to multiple categories of L1 vowels, as expected. For ES listeners, instances of L2LP's NEW scenario, in which two L2 categories were mapped on to the same native category, were identified, also in line with acoustic predictions. Interestingly, the acoustic comparison also successfully predicted that BP /e/ and BP /o/ would be mapped to two native ES categories, which contributed to perceptual overlap.

We do find some discrepancies between our L2LP acoustic predictions and non-native categorisation patterns, particularly for the AusE listeners, in that some listener categorisation responses were not predicted by acoustic similarity. These differences are likely caused by the fact that our acoustic analysis only used F1, F2 and durational values, and did not include additional features of the vowel such as F3 (e.g., for lip rounding/vocal tract lengthening) and dynamic formant trajectories, which have been shown to play an important role in AusE vowel perception and production (Elvin et al. 2016; Escudero et al. 2018; Williams et al. 2018). Another possible explanation for these differences could be the influence of orthographic labels (Escudero and Wanrooij 2010; Bassetti et al. 2015) in non-native vowel categorisation or the number of response options (Benders et al. 2012), both of which differed between the AusE and ES listeners. For instance, our acoustic analyses are not influenced by orthography, whereas listeners were presented with orthographic labels to represent each native vowel category in the non-native categorisation task. The influence of orthography on vowel perception has been demonstrated in Escudero and Wanrooij (2010), where Spanish learners of Dutch exhibited different patterns of vowel categorisation across an auditory-only task and an auditory task with orthography (for a full review see: Escudero and Wanrooij 2010).

Turning to our results for non-native discrimination, in line with Elvin et al. (2014), the results from the present study indicate that both groups found some of the same BP contrasts easy versus difficult to discriminate. However, unlike that previous study, we found an interaction between language group and BP contrast and therefore the exact patterns and rankings of discrimination difficulty differed slightly across the two groups between these studies (see Table 9). In Elvin et al. (2014), both AusE and ES had comparable accuracy scores for BP /a/–/O/, /e/–/E/, /o/–/O/. However, in the current study, the AusE participants found /a/–/O/ to be easier than /e/–/E/, which was easier than /o/–/O/. For the ES listeners, BP /a/–/E/ and /a/–/O/ were equally the easiest to discriminate, followed by /e/–/E/; and /o/–/O/ was as difficult to discriminate as /i/–/e/ and /o/–/u/.

**Table 9.** Reported patterns of discrimination accuracy in Elvin et al. (2014) and the present study beginning from perceptually easy to perceptually difficult.


We also investigated acoustic and perceptual similarity as predictors of discrimination accuracy. The results from our generalised linear mixed models suggest that both measures are indeed reliable predictors of discrimination accuracy. Specifically, the higher the perceptual and/or acoustic similarity, the lower the accuracy scores. We ran further model comparisons in order to determine whether one measure of vowel category overlap better explained the discrimination accuracy scores for each group. Interestingly, we found that perceptual overlap scores were a better predictor of discrimination difficulty than acoustic similarity for the ES listeners. However, the opposite was true for AusE, that is, acoustic overlap scores better predicted discrimination accuracy.

The differing vowel inventory sizes could possibly explain why perceptual similarity was a better predictor of discrimination accuracy for the ES participants. Spanish has a five vowel inventory with transparent spelling, whereas Australian English has 13 vowels and opaque spelling. Therefore, in the non-native categorisation task, the ES participants are making a decision between fewer response categories than the AusE participants. Therefore, perhaps the fewer response options reduce the chance of labelling error in that task leading to strong results in the non-native categorisation task. Relatedly, data based on acoustic similarity may be better at predicting AusE discrimination accuracy because of potential labelling errors in the non-native categorisation task. In that task, AusE listeners have 13 vowels to choose from, and due to the opaque nature of English vowel spellings, the vowels were embedded in an orthographic context, which may have increased the demands and/or posed difficulties for the AusE participants. In a task with 13 response categories, there are many potential non-native categorisation patterns available, but also a greater chance of labelling errors. This may suggest that the AusE categorisation trends are not always especially strong or clear in terms of their response frequencies. This has been found in Shaw et al. (2018), where the experiment presented all English vowels with a grid of the 20 corresponding response options and AusE listeners show poor categorisation results for native Australian accented vowels. In the current study, the AusE listeners' categorisation of BP /O/ is a good example of this, where the most selected option had a 32% categorisation frequency. This issue of labelling errors is not applicable in acoustic data, which may be why the acoustic analysis yields a more consistent result. 
To determine whether the number of response categories and/or the opaqueness/transparency of the language influenced perception results, future studies could compare languages that have the same number of vowels, but differ in terms of the degree of orthographic transparency.

We also investigated whether measures of acoustic and perceptual categorisation overlap are equally good predictors of discrimination accuracy at both group and individual levels (i.e., using individual overlap scores vs. group averages). This was made possible by our inclusion of native production data. The results from our analyses revealed that both measures do in fact predict discrimination accuracy. However, we found that the model based on group averages was a better predictor of discrimination accuracy than individual score averages for the ES group. Yet the evidence for group scores was not as strong for predicting the AusE listeners' discrimination performance. This finding goes against the L2LP model claims as well as studies that show the importance of individual variation in predicting L2 perceptual development. It is possible that the group model provides a better estimate of individual behaviour because the averages across the individuals are less noisy than the individual averages. That is, population data are less affected by response errors/variability in responses when doing a non-native categorisation task, as it aggregates responses from many trials (compared to a single listener with fewer trials), weakening the influence of any "outlier" behaviour. Finally, the fact that there is not actually much difference between individual and group data for the AusE group suggests that assimilation/similarity patterns at individual and group level are not as reliable or are quite variable (because there are so many assimilation possibilities available). Further investigation into the effect of individual differences for L2 development is required.

In sum, our findings indicate that listeners' non-native categorisation patterns are largely predicted by a detailed acoustic comparison of the native and target languages, with data collected from the same populations for both vowel productions and perceptual testing, mostly in keeping with L2LP predictions. Importantly, we find that AusE listeners do not have an advantage when perceiving non-native vowels despite their native language (English) having a larger and more complex vowel inventory than that of the ES listeners (Spanish). In fact, we find that listeners' discrimination patterns are largely dependent on the L2LP learning scenario identified for each vowel contrast, which were similar across the two language groups. That is, contrasts which contained evidence of L2LP's NEW or SUBSET scenarios (containing an acoustic or perceptual overlap where the two vowels in a given BP contrast are acoustically similar to, or categorised to, the same native category[s]) resulted in similar discrimination difficulties for ES and AusE listeners, as both scenarios resulted in a failure to detect a distinction between the non-native vowels. In addition, both of these learning scenarios are likely to be more difficult than contrasts where only the SIMILAR scenario is present. These findings are also consistent with previous studies (Bohn et al. 2011; Levy 2009; Tyler et al. 2014; Best et al. 2019) testing PAM's theoretical predictions that a higher degree of perceived phonetic similarity (i.e., perceptual overlap) between members of a non-native contrast, as observed by perceptual assimilation patterns, is associated with a greater level of discrimination difficulty for that contrast.

We further found that performance in non-native discrimination can be predicted by measures of acoustic and perceptual similarity using both group and individual data, although we found that perceptual similarity was a better predictor for ES and acoustic similarity for AusE. We also found that group data better explained ES discrimination accuracy, but not as clearly so for AusE listeners. We suspect that these findings are related to vowel inventory size differences between their native languages, and the nature of the response categories in the non-native categorisation task. At this stage, it is difficult to make a direct comparison between the overlap measures (acoustic vs. perceptual and group vs. individual) because both predict the discrimination data. Further studies on acoustic predictions should include F3 and dynamic vowel measures in addition to F1 and F2 static and durational measures, in line with previous studies showing the importance of dynamic information for AusE vowels in particular (Elvin et al. 2016; Escudero et al. 2018; Williams et al. 2018). It seems that the nature of the non-native categorisation task and the differing language backgrounds complicate any conclusions that could be drawn with regard to which measure is better. For now, it is sufficient to state that both acoustic and perceptual similarity predict performance on discrimination. Further investigation would be required to identify whether one measure is better than the other, or whether different measures are more suited to different languages.

Finally, according to the L2LP model, perception is linked to spoken word recognition and production, thus future studies should compare data from the tasks of the present study with the same listeners' spoken word recognition and non-native vowel production data. Indeed, some prior studies (e.g., Broersma 2002; Escudero et al. 2013; Escudero et al. 2008; Pallier et al. 2001; Weber and Cutler 2004) have shown that vowel contrasts that are difficult to perceive are also difficult in spoken word learning and word recognition tasks. Therefore, we would expect that these same listeners' patterns of discrimination difficulty could be used to predict difficulty in a spoken word learning and word recognition task containing the same pseudo-words used in the present study. Furthermore, the findings from the present study could also apply to non-native speech production. In particular, the SLM and L2LP theoretical models claim that listeners' production of non-native or L2 sounds is influenced by their perception of these sounds in the L1. However, to date, most studies (e.g., Diaz Granado 2011; Flege et al. 1999) have used the theoretical framework of SLM to test L2 production. Considering that the L2LP model posits a direct link between nonnative and L2 production, it is perhaps surprising that very few studies have tested this claim (e.g., Rauber et al. 2005). Thus, acoustic analyses and in particular the acoustic and perceptual overlap scores could be used to predict the patterns in the same listeners' non-native vowel productions. In conclusion, future research is required that adequately tests the L2LP model predictions for the role of perception in both word recognition and non-native production, and compares them to other models of non-native and L2 speech perception and production.

**Author Contributions:** J.E., P.E., J.A.S. and C.T.B. conceived and designed the experiments. J.E. ran all experiments. J.E. and D.W. conducted all the analyses. J.E. wrote the original draft. J.E., D.W., J.A.S., C.T.B., and P.E. reviewed and edited this manuscript. All authors have read and agreed to the published version of this manuscript.

**Funding:** This work was funded by an Australian Postgraduate Award, the MARCS Research Training Scheme and the Australian Research Council (ARC) Centre of Excellence for the Dynamics of Language (CE140100041). PE's work was supported by an ARC Future Fellowship (FT160100514).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the National Ethics Application Form and approved by the Human Research Ethics Committee of Western Sydney University (H10427, approved 01/11/2013).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.


## *Article* **Cross-Linguistic Interactions in Third Language Acquisition: Evidence from Multi-Feature Analysis of Speech Perception**

#### **Magdalena Wrembel <sup>1,</sup>\*, Ulrike Gut <sup>2</sup>, Romana Kopečková <sup>2</sup> and Anna Balas <sup>1</sup>**


Received: 1 September 2020; Accepted: 28 October 2020; Published: 3 November 2020

**Abstract:** Research on third language (L3) phonological acquisition has shown that Cross-Linguistic Influence (CLI) plays a role not only in forming the newly acquired language but also in reshaping the previously established ones. Only a few studies to date have examined cross-linguistic effects in the speech perception of multilingual learners. The aim of this study is to explore the development of speech perception in young multilinguals' non-native languages (L2 and L3) and to trace the patterns of CLI between their phonological subsystems over time. The participants were 13 L1 Polish speakers (aged 12–13), learning English as L2 and German as L3. They performed a forced-choice goodness task in L2 and L3 to test their perception of rhotics and final obstruent (de)voicing. Response accuracy and reaction times were recorded for analyses at two testing times. The results indicate that CLI in perceptual development is feature-dependent with relative stability evidenced for L2 rhotics, reverse trends for L3 rhotics, and no significant development for L2/L3 (de)voicing. We also found that the source of CLI differed across the speakers' languages: the perception accuracy of rhotics differed significantly with respect to stimulus properties, that is, whether they were L1-, L2-, or L3-accented.

**Keywords:** multilingualism; third language acquisition; speech perception; rhotics; final obstruent devoicing

#### **1. Introduction: Bilingual vs. Multilingual Perspective**

In this contribution, we explore cross-linguistic interactions between phonological subsystems in third language acquisition, based on evidence from multi-feature analysis of speech perception. Research on third language (L3) phonological acquisition has shown that Cross-Linguistic Influence (CLI) plays a role not only in forming the newly acquired language but also in reshaping the previously established ones (cf. Wrembel and Cabrelli 2018). There is a scarcity of evidence from perceptual studies, however, which seems unfortunate considering that speech perception has been seen as driving the process of non-native phonological acquisition. The most influential second language (L2) phonology models use (cross-language phonetic) perception to explain the outcomes of L2 speech learning, e.g., the Speech Learning Model (Flege 1995; Flege and Bohn 2020) and the Perceptual Assimilation Model (Best 1995; Tyler 2019). It would, therefore, seem beneficial and necessary for L3 phonology research to complement its findings by examining cross-linguistic interactions in multilingual perceivers, in order ultimately to be in a more favorable position both to explicate their production and to gain a more complete picture of multilingual phonological acquisition.

Third language acquisition (TLA) has recently gained recognition as a field of enquiry independent from second language acquisition (SLA). Scholars working from this new perspective maintain that the former is inherently more complex than the latter, as it involves a qualitative change in language learning and processing (e.g., Cenoz et al. 2001; De Angelis 2007). They imply that the process of learning the first foreign language (L2) is fundamentally different from the process of learning a third or additional language (L3/Ln), mainly because of enhanced language awareness, language learning strategies, and increased potential for cross-linguistic interactions between L1, L2, L3, or Ln that occur in additional language acquisition. A number of linguistic and psycholinguistic studies support these claims by providing evidence for the existence of qualitative and quantitative differences in processing the third language as compared to the first or second language (Cenoz and Jessner 2000; Cenoz et al. 2001; Hufeisen and Lindemann 1997). From a theoretical linguistic perspective, Flynn et al. (2004) argue that the study of L3 acquisition can offer new insights into the process of language learning that exceed those offered by investigations of the first or the second language.

One of the major differences between the acquisition of a second and a third language is that L3 learners have already acquired their first foreign language, and, thus, they can resort to some conscious linguistic knowledge as well as language-learning experience and strategies (cf. De Angelis 2007). Multilingual learners, thus, have at their disposal a broadened phonetic repertoire, a raised level of metalinguistic awareness, and potentially enhanced perceptual sensitivity, which may facilitate the learning of a subsequent phonological system (cf. Gut 2010; Wrembel 2015). In a dedicated volume on "Universal or diverse paths to English phonology", Gut et al. (2015) attempt a comprehensive comparison between the acquisition of phonology from a SLA vs. TLA perspective, showing that L3 learners' development of perception and production differs sharply from that of L2 learners' in being more differentiated and constrained by a greater number of factors.

The extant findings from L3 phonology research suggest that any of the previously or currently acquired languages can serve as a source for CLI in the perception and production of target segments and suprasegmentals, and that this phenomenon is multidirectional (cf. Cabrelli and Wrembel 2016). We have a growing understanding of the combination of factors conditioning the different types of phonological CLI in L3 learning, such as proficiency in the respective languages, (psycho)typology as well as the type of phonological task performed (for an overview, see Wunder 2014). However, investigations so far have mostly been limited to a single feature and/or one testing time. Exploring this question with more phonetic features and longitudinally thus seems paramount for our understanding of the relative effect of cross-linguistic processes in non-native speech learning, and speech perception in particular.

In the present paper, we examine L2 and L3 speech perception of two phonetic features, which have a different standing in the phonological repertoire of the multilinguals of this study, over the course of the first year of their instructed L3 learning. We seek to investigate how and to what extent phonological CLI may change over time in multilingual perceivers.

#### **2. Non-Native Speech Perception**

Considering that only a few studies to date have examined speech perception of multilingual learners (cf. Balas et al. 2019; Wrembel et al. 2019; Nelson 2020), models of L2 speech perception may serve as an informative starting point for the formulation of predictions for L3 learners, taking into consideration the learners' enlarged phonological repertoire as well as greater language learning experience. Most L2 speech perception models have predicted accuracy of perception on the basis of similarities and differences between L1 and L2 sounds. Starting with Lado's (1957) Contrastive Analysis Hypothesis, L2 phonemes that are similar to L1 phonemes were considered easy to perceive and L2 sounds that are different from the L1 sounds difficult. Eckman's (1977) Markedness Differential Hypothesis proposed that target language structures that are both different and more marked should prove difficult for learners, whereas structures that are different but less marked should not pose difficulties. The Speech Learning Model (SLM; Flege 1995) predicts that it is the L2 sounds fairly similar to their L1 counterparts that are most challenging for L2 learners to acquire, as they are subject to equivalence classification, i.e., they are perceptually equated with existing L1 categories. Conversely, the sounds that do not resemble any of the L1 categories may enhance the process of category formation, and, hence, be perceived accurately. Similarly, the Perceptual Assimilation Model (PAM; Best 1995; Best and Tyler 2007) presupposes that not all target language sounds are equally challenging for learners, but it focuses on non-native contrasts rather than on individual phonemes. Discrimination of non-native sounds varies depending on how a non-native contrast is assimilated and goodness-rated to native language phonological categories, resulting in at least four different assimilation patterns for each non-native sound contrast (Best 1995, pp. 194–98).

Most relevantly for the present study, PAM predicts a continuous refinement of L2 learners' speech perception as a function of their extended experience with learning the L2 (PAM-L2; Best and Tyler 2007). With time, learners are likely not only to enjoy more L2 input but also to gain greater experience in producing the target contrasts and to increase their knowledge of L2 (minimal pair) vocabulary (Bundgaard-Nielsen et al. 2011). According to the model, L2 learners are, thus, expected to start perceiving within-category differences and develop new categories for the non-native sounds and contrasts. The way this category refinement may be reshaped in the context of L3 learning, particularly when the L2 continues to develop too, is still to be examined (for a first attempt, see Wrembel et al. 2019).

As non-native speech perception is characterized by considerable inter-listener variation, the Second Language Linguistic Perception Model (L2LPM; Escudero and Boersma 2004; Escudero 2005, 2009) concentrates on individual developmental paths on the basis of a detailed acoustic comparison of the production of L1 and L2 sounds. Two main learning scenarios are present for L2 learners, according to this model: When two L2 sounds are categorized to the same native language category, the learner needs to create a new category for one of the L2 sounds or split the existing category. When two L2 sounds are heard as separate L1 categories, the learner's task is to shift category boundaries to accommodate the L2 sounds. The latter scenario in which an L2 sound is perceived as more than one native category may be challenging as it may lead to overdifferentiation in the L2. The speed of perceptual learning in this model is, thus, predicted to depend on the particular learning scenario and richness of both L1 and L2 input that an individual learner enjoys in their learning environment.

#### *2.1. Development of Non-Native Speech Perception*

Previous research on the role of experience in the perception of non-native sounds and contrasts has yielded mixed results. Whereas Flege (1991), Baker et al. (2002), Kopečková (2012), and Rallo Fabra and Romero (2012) reported (immersion) experience effects on the discrimination and identification of at least some L2 English vowels and consonants by speakers of diverse L1 backgrounds, Cebrian (2006) found no significant differences between experienced and inexperienced Catalan-Spanish bilinguals in categorizing the English vowels /iː/ and /ɪ/. The former group of English learners had resided in Canada for an average of 25 years, while the latter group consisted of undergraduate students of English philology living in Barcelona. Cebrian (2006) reported that both learner groups relied on duration rather than spectral cues in the perception of the target contrast. In a similar vein, Broersma (2005) showed that highly experienced Dutch learners of English can accurately categorize word-final lenis-fortis contrasts, but do not use native-like weighting of cues for voicedness for this familiar contrast (present in Dutch) in an unfamiliar coda position.

Mixed findings have also been reported in perception training studies. For instance, Bradlow et al. (1999) found a long-term increase in identification accuracy of English liquids by L1 Japanese speakers. Anderson (2011), in a study with American English learners of Spanish, showed that after about three weeks of identification training, some learners at first perceived the Spanish tap-trill contrast highly variably, but their perception then stabilized with time; others perceived the acoustic differences rather well in the beginning, but showed little change and no bifurcation of the existing phoneme category; and, finally, there were also "non-learners" who showed no progress in the perception of this novel contrast. The question of how non-native categories are refined for diverse phoneme types and, most crucially, under what type of learning experience this happens thus remains unanswered at present.

#### *2.2. Previous L3 Speech Perception Studies*

As argued in previous sections, one type of learning experience that may offer important insights into the process of phonological learning in general and cross-linguistic interaction in particular is that of additional/L3 phonological learning. In one of the first studies examining phonological CLI in L3 acquisition, Wrembel et al. (2019) showed that beginner L3 Polish learners perceptually assimilate L3 sibilants to both their L1 German and L2 English categories, with preference for the latter. They can perceive subtle differences between highly similar vowel sounds across the three languages and seem to develop separate L3 categories for them. Beginner L3 learners were, thus, theorized in this study to behave similarly to experienced L2 learners thanks to their extended prior linguistic and learning experience. These are important initial insights, yet longitudinal studies examining the development of speech perception beyond only the L3 are needed to gain a more holistic picture of cross-linguistic mapping processes in multilingual learners, and possible changes thereof over time.

Some first attempts at this methodologically challenging endeavor appeared in Balas et al. (2019) and Nelson (2020). Although an examination of category formation in multilingual speech perception was the main aim of neither of these longitudinal studies, the reported findings on the development of L2 and L3 perception jointly shed at least some light on the process. In a study that stems from the same research project as the present paper, Balas et al. (2019) examined the perception of L2 and L3 rhotic sounds in two groups of young multilinguals five and nine months into their first year of L3 learning. Both L1 Polish and L1 German speakers were found to perceive L2 English rhotics highly accurately and consistently after about five years of learning the language, suggesting fairly stable phonetic categories for this novel sound (in relation to their L1) and no perceptual change as a result of the one year of additional language learning. L1 German speakers were further found to perceive the novel L3 Polish alveolar trills and taps highly accurately, and significantly better and more consistently than L1 Polish speakers did in perceiving L3 German uvular fricatives; the accuracy in perceiving the novel sound further dropped significantly between the two testing times for the latter learner group. The findings were interpreted as suggesting a joint effect of the learner's L1, but not L2, markedness and L2/L3 proficiency in the perception of rhotic sounds by multilingual learners. The present contribution expands on and refocuses this study.

Nelson (2020) examined young and adult L3 learners' perception of the /v-w/ contrast, present in their L2 but not L1, reporting more accurate and faster discrimination ability in the L3 than in the L2 after only a few hours of L3 input. The author hypothesized a positive 'novelty effect' for the L3 learners, maintaining that very initial learners may not automatically assimilate novel sounds to their pre-existing categories (whether those of L1 or L2) but rather resource acoustic cues available to them and tap possible yet different processing and phonological skills at that stage of L3 phonological learning. With respect to their L2 perception development, the young learners evidenced a drop in accuracy after around 10 weeks of their L3 learning, which was interpreted as suggesting a reverse cross-linguistic effect in the form of a temporary 'perceptual confusion'. However, after ten months of learning the L3, the novelty effect as well as the negative cross-linguistic effect disappeared for the young L3 learners, who perceived the contrast in their L2 and L3 similarly (67% and 74% accuracy levels).

To sum up, a common denominator for the existing L3 perception studies is that the phonological space of multilinguals seems to be reshaped relatively early in the course of learning the new L3, and that category boundaries can be expanded to accommodate L1, L2, and L3 categories of similar phonetic types, while new L3 categories for novel phonetic types may be formed. Initial sensitivity to phonetic contrasts may also deteriorate with time as a result of language interactions and be modulated by the status of various contrasts in L3 acquisition, including that of markedness. In the present paper, we attempt to contribute to these emerging findings by examining the perception of novel rhotic sounds (both in the L2 and L3 of the multilinguals, and more marked in their L3) and the perception of final obstruent (de)voicing (more marked in their L2) in the first months of L3 learning.

#### **3. The Present Study**

The aim of this study is to explore the development of speech perception in young multilinguals' non-native languages (L2 and L3) and to trace the patterns of cross-linguistic mappings over the first year of L3 learning. This study forms a part of the international MULTI-PHON research project, in which speech perception and production were investigated with a battery of tests in two parallel groups of young adolescents in Polish and German schools.

#### *3.1. Participants*

The participants were 13 L1 Polish speakers (aged 12–13) who had been learning English as their L2 at school for five years (pre-intermediate level) and who had just started to learn German as their L3 in an instructed setting. They were observed over the first year of L3 learning. Our strict inclusion criteria featured no prior command of German, only Polish as an L1, no additional languages, and data availability at all testing times. For the sake of the present analysis, the number of participants was thus reduced from a larger participant pool (initially 24) to 13 speakers with a homogeneous profile (see Table 1).

**Table 1.** Participant profiles.


\* Self-evaluation of proficiency was assessed on a 5-point scale (1 = very poor, 5 = very good).

Informed consent was obtained from all the subjects who participated in the study, their parents, and the authorities of the school where the data were collected. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ministry of Education in Brandenburg on 17/07/2017 (ref. number 51/2017).

Language background interviews were conducted in the participants' L1 Polish at the very onset of the project in order to collect information about the individual learners' language backgrounds, including information about their language learning history (i.e., age of learning, length and intensity of instruction), language use (declared percentage in varied situations/contexts), self-evaluation of proficiency (at the onset of instructed L3 learning), and attitudes towards foreign language learning.

#### *3.2. Features under Investigation*

Two phonetic features were selected for investigation, rhotics and final obstruent devoicing, since they have a relatively different standing in the phonological repertoire of the L3 learners in this study (see Table 2). The former sounds are realized differently in each of the speakers' three languages, whereas the latter process is productive in their L1 and L3 but not L2.


**Table 2.** Selected features under analysis in the present study.

#### 3.2.1. Rhotics

In spite of belonging to a phonological natural class, for which there are more phonological than phonetic arguments (cf. Ladefoged and Maddieson 1996), rhotics exhibit large interlanguage variability. In the three languages under investigation in this paper, the distribution of rhotics is as follows: Polish has the alveolar trill, which may be produced as a tap intervocalically or in fast speech (Jassem 2003). In standard German, the conservative uvular trill /ʀ/, occurring in word-initial or in stressed positions, is usually produced as the uvular fricative /ʁ/ (Kohler 1999). English rhotics include British English postalveolar approximant /ɹ/ and prevocalically [/ɹ/ ˙ ], and an American English retroflex approximant (Ladefoged and Maddieson 1996) articulated either with tongue retroflexion or bunching (Ladefoged 2001). The English rhotic is generally voiced except when adjacent to a voiceless obstruent. It occurs in syllable-initial (e.g., run /ɹʌn/), and syllable-final position (e.g., poor /pɔɹ/) (not in British English), both as singletons and in clusters (e.g., tree /tɹi/; heart /haɹt/). Both English rhotic sounds are continuants, as opposed to the 'interrupted' variants such as taps or trills in Polish. Worthy of note is the fact that in all three languages, the rhotic sounds are represented orthographically using the <r> letter. This suggests that orthography may promote multiple and multidirectional phonological transfer (cf. Rafat 2011).

#### 3.2.2. Final Obstruent Devoicing

The three languages under investigation differ in the realization of coda obstruents. While English retains a voicing contrast in a syllable-final position, this opposition is neutralized in German and Polish (Gonet 2001; Smith et al. 2009). Although both German and Polish manifest final obstruent devoicing, Polish additionally applies the rule of regressive voicing assimilation (Rubach 1984). Both languages have also been associated with less than a total neutralisation of the underlying voicing contrast, in that small differences in one or more acoustic properties, such as the length of the preceding vowel, have been reported when compared to underlying voiceless counterparts (Slowiaczek and Dinnsen 1985). English, in contrast to German and Polish, typically manifests the marked voiced/voiceless contrast among word final obstruents, even though individual variation has also been reported, as well as the effect of phonological environment on the production of specific word-final obstruents (Gonet 2001; Smith et al. 2009). Finally, English voiced word-final obstruents have primarily been characterized by longer duration of the preceding vowel and not necessarily by glottal pulsing (Krause 1982).

#### *3.3. Research Questions and Hypotheses*

In order to investigate cross-linguistic interactions in multilinguals' speech perception, the following research questions were posed in the study:

1. Is there evidence of CLI in the perception of L2 English and L3 German?

It is hypothesized that cross-linguistic interactions in the two foreign languages may differ and result in variable performance on the measures of perception accuracy and reaction time (RT), depending on the language status (L2 vs. L3) as well as the investigated feature (rhotics vs. final obstruent devoicing). Better performance on both measures and, thus, less CLI is expected for the more established L2 as compared to the newly acquired L3.

**Hypothesis 1 (H1).** *Both phonological feature and language determine perception accuracy and reaction times. There will be less CLI in the learners' L2 English than their L3 German.*

2. Is there a perceptual development over time caused by a change in CLI? Does the perceptual development in L3 parallel that in L2?

In this study, cross-linguistic interactions were operationalized as the respondents' preferences for L1-, L2-, L3-accented stimuli in the performed forced-choice goodness task. We expect different patterns to hold for the two foreign languages acquired. We expect to observe a change in CLI patterns as a function of the testing time (T1 vs. T2).

**Hypothesis 2 (H2).** *There will be changes in CLI across time. The developmental patterns of CLI differ between the learners' L2 English and L3 German*.

#### *3.4. Materials and Methods*

The participants performed perceptual tasks in both their L2 English and L3 German, respectively, to test their perception of rhotics and final obstruent (de)voicing. Response accuracy and reaction times were recorded for analyses at two testing times (T1, after 5 months and T2, after 10 months of L3 learning). To create appropriate language modes, the data collection for each of the languages was carried out on two separate days with L1 speakers of the respective languages as instructors.

A forced-choice (FC) goodness task was selected for the present study as an alternative to more traditional perceptual paradigms such as discrimination or identification. Perception discrimination tasks, in which the listener decides whether two stimuli are the same or different, seemed to be of little use as the aim was to test the association of a given variant of a sound with a chosen language in the multilingual's repertoire. Identification tasks, in turn, are notoriously problematic with respect to specifying response alternatives (including difficulties concerning non-transparent orthography), a problem that is magnified in the case of three phonological systems in interaction. Moreover, identification tasks are not useful for testing allophonic differences across languages. Overly complex perception tasks needed to be avoided, too: when task complexity increases, perceivers have been found to switch to a primarily phonological level of reasoning (Strange 2009). Therefore, a forced-choice goodness task was selected for the present research, which allowed for elicitation of an association of a given allophone across multiple languages while the complexity of stimulus identification was avoided.

More specifically, the participants in this study heard two renditions of the same phrase, differing only in the final stimulus item embedded in a carrier phrase. By pressing one of two buttons (marked 1 and 2) on a button box, they had to decide which phrase sounded more natural (i.e., more target-like) to them. One rendition was a target realization and the other was an accented realization in which only the investigated feature was manipulated. For example, for rhotics, in the English version of the task, the stimuli included the target-like phrase "You will hear the word ring /ɹiŋ/" followed by the Polish-like realization of the rhotic sound "You will hear the word ring /riŋ/".

For rhotic sounds, this included two trials of pair items as the target item was positioned next to two other possible realizations, while for obstruent (de)voicing, it featured a single trial as the target was presented in opposition to voiced or devoiced/voiceless. The order of presentation of target and non-target stimuli was counterbalanced across trials.

Thus, in the English version, there were stimuli with English target rhotics as well as with Polish and German rhotics. Likewise, in the German version, the stimuli included German target rhotics embedded in a carrier phrase as well as Polish- and English-accented manipulated rhotics in the target words. In case of obstruent (de)voicing, the stimuli in the English version included the target-like phrase "You will hear the word have" /hæv/, followed by a manipulated realization of the final obstruent /hæf/. Similarly, in the German version, the target words embedded in a carrier phrase ("Du hörst das Wort Hand" /hant/) included final obstruents that were either voiceless (thus target like) or voiced (i.e., L2-accented).

The stimuli in each language version involved 10 pair items containing rhotics, 13 to 14 pair items featuring final (de)voicing, and three training pair items that preceded the testing blocks. In total, the FC task, thus, included 26 English and 27 German pair items for the participants to respond to.
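The reported totals can be verified with a short arithmetic check. The sketch below is illustrative only: the variable names are ours, and the assignment of 13 devoicing pairs to English and 14 to German is inferred from the reported totals rather than stated explicitly in the text.

```python
# Pair items per block, as reported in the text.
RHOTIC_PAIRS = 10
TRAINING_PAIRS = 3
# Assumption: the 13/14 split is inferred from the reported totals (26 vs. 27).
DEVOICING_PAIRS = {"English": 13, "German": 14}

totals = {
    language: RHOTIC_PAIRS + devoicing + TRAINING_PAIRS
    for language, devoicing in DEVOICING_PAIRS.items()
}
print(totals)  # {'English': 26, 'German': 27}
```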

The target rhotics occurred either in word-initial or medial position and included:

• For English: *ring, rabbit, red, round, giraffe* (with the manipulated items realized as having an L1-Polish-accented alveolar trill or an L3-German-accented uvular fricative).

• For German: *rot, Regen, Reise, Fahrrad, verloren* (with the manipulated items realized as having an L1-Polish-accented alveolar trill or an L2-English accented post-alveolar approximant).

The final obstruent (de)voicing stimuli were in coda positions and featured as follows:


The stimuli were randomized across trials in E-Prime. The inter-stimulus interval was set at 500 ms and the participants had a 3000 ms response limit; the task was thus timed. The participants' performance on the timed forced-choice goodness task was examined in terms of accuracy and reaction time (RT). The latter was included as a proxy for the perceptual difficulty of the tested stimuli.
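The trial structure described above (target and foil renditions, counterbalanced order, randomized presentation, fixed timing) can be sketched in a few lines. The study itself used E-Prime; the Python below is only an illustration under stated assumptions. The function `build_trials`, the dictionary keys, and the mapping of button 1 to the first rendition and button 2 to the second are hypothetical and not taken from the original procedure.

```python
import random

ISI_MS = 500              # inter-stimulus interval reported in the text
RESPONSE_LIMIT_MS = 3000  # response limit reported in the text


def build_trials(words, foil_accents, seed=None):
    """Return a shuffled list of forced-choice trials (hypothetical helper).

    Each target word is paired once with every foil accent. The position of
    the target rendition (first or second) alternates so that it is balanced
    across the list and is not tied to a single foil accent.
    """
    rng = random.Random(seed)
    trials = []
    for w_idx, word in enumerate(words):
        for a_idx, accent in enumerate(foil_accents):
            target_first = (w_idx + a_idx) % 2 == 0
            trials.append(
                {
                    "word": word,
                    "foil_accent": accent,
                    "order": ("target", "foil") if target_first else ("foil", "target"),
                    # Assumption: button 1 selects the first rendition, button 2 the second.
                    "correct_button": 1 if target_first else 2,
                    "isi_ms": ISI_MS,
                    "response_limit_ms": RESPONSE_LIMIT_MS,
                }
            )
    rng.shuffle(trials)  # randomize presentation order across trials
    return trials


# English rhotic block: five target words and two foil accents give 10 pair items.
english_rhotics = build_trials(
    ["ring", "rabbit", "red", "round", "giraffe"],
    ["Polish", "German"],
    seed=1,
)
print(len(english_rhotics))
```

With five words and two foil accents, the target rendition comes first in exactly half of the ten trials, which is one simple way to realize the counterbalancing mentioned in the text.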

The stimuli were recorded by three female native speakers of the respective languages, who were fluent advanced speakers of the other two languages in the triad of languages. The stimuli were produced naturalistically to avoid artificial concatenation. To ensure naturalness, several recordings of the same items were performed and validated by selecting the ones in which the performed accented manipulation sounded the most acceptable to the researchers. The process of stimulus validation was based on the perceptual assessment of each stimulus by native speakers of the respective languages. We adopted a perceptual 'category goodness' criterion, which was deemed to have the best ecological validity given the nature of the FC goodness task administered to the participants.

As far as the three speakers who produced the stimuli are concerned, their stay in a foreign country ranged from a few months to a few years. While we acknowledge the fact that their L1 production could be affected by a highly proficient knowledge of the L2/Ln, it is debatable if the prototypical monolingual rendition should be sought as the target production of the stimuli, in the light of the recent discussions on the native monolingual norm in research on multilingual acquisition (see e.g., Sorace 2020; Kroll 2020). Moreover, monolingual speakers of German, Polish, or English are increasingly impossible to find. Therefore, it was not our goal to search for a native monolingual rendition of the target items, but rather to allow for a potential variation represented by native speakers of particular languages who are multilingual speakers themselves.

#### **4. Results**

Due to violation of the assumption of normality and homogeneity of variance of the present dataset, nonparametric tests were used for between-subjects (Mann–Whitney *U*-test) and within-subjects (Wilcoxon signed-rank test) comparisons. The statistical tests were run using STATISTICA 10. The performed analyses included perceptual development over time, feature comparison, language comparison, individual variability, and CLI analysis, which will be presented in the following subsections.
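For readers unfamiliar with how a Z value is obtained from a rank-based test, the following sketch computes the Mann–Whitney *U* statistic and its normal approximation with a tie correction. It is a minimal pure-Python illustration, not the STATISTICA routine used in the study; the function names and the sample values are hypothetical, and no continuity correction is applied, so the output need not match what a statistics package reports.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_z(sample_a, sample_b):
    """Return (U for sample_a, Z) via the tie-corrected normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    combined = list(sample_a) + list(sample_b)
    ranks = average_ranks(combined)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    n = n_a + n_b
    # Tie correction: sum of (t^3 - t) over each group of t tied values.
    tie_term = sum(
        count ** 3 - count
        for count in (combined.count(v) for v in set(combined))
    )
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance)
    return u_a, z


# Hypothetical accuracy proportions for two conditions (not data from the study).
u, z = mann_whitney_z([0.9, 1.0, 0.8, 1.0], [0.5, 0.6, 0.4, 0.7])
print(u, round(z, 2))
```

The sign of Z depends only on which sample is entered first, which is worth keeping in mind when comparing the positive and negative Z values reported in the following subsections.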

#### *4.1. Nonparametric Tests of Perception Accuracy and RT*

#### 4.1.1. Perceptual Development over Time: Perception Accuracy at T1 and T2

The performed across-time comparison did not show much development in perception accuracy for the multilingual learners. The only statistical difference between the two testing times in the performance in L2 and L3 for the two features under investigation was found for L3 German rhotics (and not in the expected direction), in which case the perception accuracy was higher at T1 than at T2 (Z = 4.5, *p* < 0.05) (see Table 3).


**Table 3.** Perception accuracy for second language (L2) and third language (L3) at testing time one (T1) and testing time two (T2).

#### 4.1.2. Perceptual Development over Time: RT at T1 and T2

The performed Wilcoxon matched pairs signed rank test for the comparison of reaction times (RT) at two testing times (T1 vs. T2) did not show much development over time either. The only statistically significant result was found for L3 German obstruent devoicing (Z = 2.14, *p* < 0.05), with the processing time being longer at T1 than at T2 (see Table 4).


**Table 4.** Reaction time (RT) for L2/L3 at T1 and T2.

On the whole, the results did not demonstrate much development over time in perception accuracy and processing speed as measured by means of an FC task. It appears, however, that the L2 English is the more established phonological system, while L3 German is more susceptible to changes over the two testing times (i.e., a significant change in the perception accuracy of rhotics and in processing speed for obstruent devoicing). The observed developmental changes are not consistent, though: the decrease in RT for the perception of obstruent devoicing is as expected, whereas the decrease in accuracy of rhotics perception appears counterintuitive.

#### 4.1.3. Feature Comparison: Perception Accuracy

In the performed feature comparison, the Mann–Whitney *U*-test for perception accuracy demonstrated statistical differences in three out of four conditions: L2 English rhotics were perceived with greater accuracy than obstruent devoicing both at T1 (Z = −6.18, *p* < 0.05) and T2 (Z = −6.51, *p* < 0.05), while for L3 German the same held true at T1 (Z = −5.19, *p* < 0.05) (see Table 5).


**Table 5.** Comparison of perception accuracy of features at T1 and T2.

#### 4.1.4. Feature Comparison: RT

When the two features were compared in terms of reaction time, only one statistical difference was attested, namely for L3 German at T1, where RTs were longer for final devoicing than for rhotics (Z = 2.98, *p* < 0.05). Otherwise, the processing speed did not differ across features (see Table 6).


**Table 6.** Comparison of RT to features at T1 and T2.

#### 4.1.5. Language Comparison: Perception Accuracy

To compare the perception performance across languages, a Mann–Whitney *U*-test was performed, which demonstrated statistically significant differences for three out of four conditions, i.e., the perception accuracy was higher for rhotics in L2 English than in L3 German at both T1 (Z = 4.0, *p* < 0.05) and T2 (Z = 7.63, *p* < 0.05), and for obstruent devoicing at T1 (Z = 2.7, *p* < 0.05). A higher proficiency in the more established L2 was reflected in better accuracy performance in perception (see Table 7).


**Table 7.** Perception accuracy comparison between L2 English and L3 German.

#### 4.1.6. Language Comparison: RT

A Mann–Whitney *U*-test for reaction time comparison between L2 English and L3 German demonstrated statistically significant differences for three out of four conditions, i.e., RTs were longer in L3 German than in L2 English for the perception of obstruent devoicing at both T1 and T2 and for the perception of rhotics at T2. On the whole, it took longer to process the perception task in the L3 than in the L2 (see Table 8).


**Table 8.** Reaction time comparison between L2 English and L3 German.

#### 4.1.7. Correlation: Perception Accuracy and RT

No statistically significant correlations were found between perception accuracy and reaction time for L2 English and L3 German performance in the perception of rhotics and final devoicing at either T1 or T2 (see Table 9).

**Table 9.** Correlation between perception accuracy and RT at T1 and T2.


#### *4.2. GLM Modelling*

We fitted our data to a generalized linear model (GLM), with the dependent variable being perception accuracy and independent variables including RT, testing time (T1 and T2) and feature (obstruent devoicing and rhotics). The analysis was performed separately for each language and based on the number of token items rather than participants.

The GLM analysis for L2 English revealed a significant effect of feature on perceptual accuracy [F(1, 522) = 92.79, *p* < 0.05], while testing time and RT were not significant predictors (see Table 10).


**Table 10.** Results of a linear model for the dependent variable—Accuracy for L2 English.

The Bonferroni pairwise comparisons confirmed that there were statistically significant differences (*p* < 0.001) between perception accuracy for rhotics and obstruent devoicing, with the former feature generating higher accuracy rates (see Table 11).
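As a reminder of what the Bonferroni adjustment does, each unadjusted *p*-value is multiplied by the number of comparisons and capped at 1. The sketch below is illustrative only: the helper name and the input values are hypothetical and are not results from this study.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p-values for three pairwise comparisons.
raw = [0.004, 0.020, 0.400]
adjusted = bonferroni(raw)
print(adjusted)
```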

**Table 11.** Mean Accuracy with respect to Feature for L2 English.


The GLM analysis for L3 German failed to find a significant effect of RT. However, the remaining variables proved to be significant predictors of perceptual accuracy in L3 German, namely testing time [F(1, 516) = 11.85, *p* < 0.001], feature [F(1, 516) = 10.55, *p* = 0.001], and the Time\*Feature interaction [F(1, 516) = 18.05, *p* < 0.001] (see Table 12).

**Table 12.** Results of a linear model for the dependent variable—Accuracy for L3 German.


The Bonferroni pairwise comparisons pointed to a statistically significant difference (*p* = 0.017) between the two testing times in L3 German, with higher perception accuracy observed at T1 (see Table 13).


**Table 13.** Mean Accuracy with respect to Testing Time for L3 German.

Bonferroni correction confirmed a statistically significant difference between perceptual accuracy of the two investigated features (*p* = 0.0008), with rhotics being perceived more accurately than obstruent devoicing in L3 German (Table 14).

**Table 14.** Mean Accuracy with respect to Feature for L3 German.


The Bonferroni pairwise comparisons confirmed that there were statistically significant differences for perceptual accuracy in L3 German between the following variables: (1) obstruent devoicing at T1 and rhotics at T1 (*p* < 0.0001); (2) obstruent devoicing at T2 and rhotics at T1 (*p* < 0.0001); (3) rhotics at T1 and rhotics at T2 (*p* < 0.0001) (see Table 15).

**Table 15.** Mean Accuracy with respect to the Time\*Feature interaction for L3 German.


#### *4.3. Individual Differences*

Figures 1–8 show that, in general, more inter- and intraspeaker variability occurs in L3 German than in L2 English, in which individual perceptual performance seems more homogeneous across learners. This is especially true for the perception of the English rhotic, where six learners show ceiling performance at both testing times. Pronounced changes in perception accuracy across time are, however, apparent for individual learners. For instance, Subject 20's perception of both L2 English obstruent voicing and rhotics drops drastically from T1 to T2, and their perception accuracy for the L3 German rhotic falls from well above chance to well below it over the same period (see Figures 1–4, 7 and 8). Subject 12, in turn, is consistently accurate in their perception of the L2 sounds under examination (Figures 1–4), whereas their perception of the L3 counterparts drops between the two testing times, most dramatically for rhotics (Figures 5–8). Subject 6 shows some increase in L2 English perception of final obstruents (Figures 1 and 2) together with a dramatic improvement in L3 German perception of the same feature (Figures 5 and 6). Figures 1–8 illustrate the perception accuracy of individual subjects in L2 English and L3 German at T1 and T2 for obstruent devoicing and rhotics (with group means marked as horizontal black lines on the graphs).

**Figure 1.** Perception accuracy in L2 English for obstruent devoicing at T1.

**Figure 2.** Perception accuracy in L2 English for rhotics at T1.

**Figure 3.** Perception accuracy in L3 German for obstruent devoicing at T1.

**Figure 4.** Perception accuracy in L3 German for rhotics at T1.

**Figure 5.** Perception accuracy in L2 English for obstruent devoicing at T2.

**Figure 6.** Perception accuracy in L2 English for rhotics at T2.

**Figure 7.** Perception accuracy in L3 German for obstruent devoicing at T2.

**Figure 8.** Perception accuracy in L3 German for rhotics at T2.

#### *4.4. CLI*

In order to explore cross-linguistic mappings in the perception of the multilingual learners of this study, we further explored their perception accuracy (as the dependent variable) with respect to the different stimulus properties of the perception task employed (i.e., L1-accented, L2-accented, L3-accented) as independent variables.

For rhotics, the performed ANOVA (with L2 and L3 treated jointly) demonstrated that there was a statistically significant difference in perception accuracy between these three conditions (F(2, 24) = 46.38, *p* < 0.05). The post-hoc Scheffé test for multiple comparisons showed that the differences between all pairs of differently accented stimuli were significant (*p* < 0.05). The accuracy of perceiving the correct rhotic stimuli in L2 English was the highest when the other manipulated stimulus was L3-(German) accented, while it was the least accurate when the unnatural stimulus was L2-(English) accented in L3 German (see Figure 9). Interestingly, however, when we compared the latencies of responses in all these conditions, there were no statistical differences found in RT for rhotics irrespective of the source of accent in the manipulated stimuli.

**Figure 9.** Perception accuracy of rhotics according to stimuli types (L1-accented, L2-accented, L3-accented), with 0.95 confidence intervals as whisker bars.

For final devoicing, given the binary response option as well as difficulty in strictly disentangling L1-based source of CLI in the perception of this feature from arguably the lack of it (L3-target stimuli), the results evidence CLI primarily from L1 and/or L3 in the case of perceiving L2 final obstruent voicing (accuracy levels at chance levels, with acceptance of L1/L3-based and L2-based stimuli to comparable levels). However, in the case of L3 final obstruent devoicing, L1-based CLI prevailed (L1-accented/L3-based stimuli were generally perceived as being more natural than L2-accented stimuli) (t = 4.12, *p* < 0.05).

As far as the reaction time is concerned, none of the independent variables (i.e., feature, stimulus type) entered into the GLM analysis proved to be significant, nor did the interaction between feature and stimulus type (*p* > 0.05). It follows that no statistical differences were found in RT, irrespective of the source of accent in the manipulated stimuli, in the perception of either of the investigated features, although there was a visible trend for the L2-accented stimuli in the perception of L3 obstruent devoicing to take longer to process than the L2-accented stimuli in L3 rhotics.

#### **5. Discussion**

Our results show that the effects of CLI on multilinguals' perception differ across both their two languages and the two features under investigation, thus confirming Hypothesis 1. Overall, perception accuracy is higher in their L2 English than in their L3 German and processing speed is faster, as predicted by Hypothesis 1. Moreover, perception accuracy in the L2 English, which they had been learning for 5–6 years, is more stable across time than for the L3 German, confirming Hypothesis 2. Our results, thus, suggest that CLI is lowest for the perception of the L2 English, especially for rhotics, where most of the investigated learners seem to have established stable perceptual categories. However, we did not test learners' perception in their L2 English after a few weeks of learning the new language German, and, thus, might have missed the short-term effect of influence from the new L3, the 'perceptual confusion' found by Nelson (2020). In fact, one individual learner did show a drop in L2 perception accuracy even after ten months of learning the L3, which might have the same underlying cause.

Our results further show that overall perception accuracy is higher for rhotics than for final obstruent (de)voicing in both languages. Perception of final obstruent devoicing in both the learners' L2 and L3 is at chance level at both testing times, evidencing no improvement for any of the learners, while perception accuracy for the rhotics is significantly higher in both languages, with individual speakers reaching ceiling performance. Contrary to the predictions of our Hypothesis 1, this suggests a high level of CLI for the former feature, even in the L2 English, for which learners had been attending school lessons for 5–6 years. One explanation might be the lower perceptual saliency of final obstruent (de)voicing compared to the different articulations of the rhotics in the three languages under investigation. Moreover, the phonological process of obstruent voicing in coda position is characterized by a complex interaction of phonetic cues beyond that of glottal pulsing (Krause 1982). As shown in Broersma (2005), even highly proficient learners of English do not use native-like weighting of cues for the perception of voicedness in an unfamiliar position. Our learners may, thus, have had difficulty attending to the relevant phonetic cues, in particular the longer duration of the preceding vowel, when distinguishing between the pairs of tested stimuli.

By the same token, evidence for CLI was found in the learners' perception in their L3 German: their accuracy of perceiving the German rhotic /R/ was higher after 5 months of learning than after 10 months. It appears that some restructuring of perceptual categories is still under way in the first ten months of exposure to a new language, thus echoing findings by Balas et al. (2019). However, again, this restructuring seems to be feature-dependent rather than a general mechanism as these changes were found only for the perception of the rhotics but not for the perception of final obstruent devoicing. Our findings thus appear to partially contradict the predictions of PAM-L2 (Best and Tyler 2007), which would expect a continuous refinement of the learners' speech perception as a function of their extended experience with learning the language. Possibly, this refinement only takes place after more input than our learners had enjoyed in their L3 after 10 months of learning. Not incompatible with this line of reasoning, it might be that the L3 learners in this study had been increasingly exposed to foreign-accented realizations of the German rhotic sound in their classroom environment, whether from their peers or their Polish teacher of German, thus, developing a nontarget representation of naturalness for it. Their own experience with producing the articulatorily challenging sound in the first year of learning German may have also contributed to the process of their category formation for the sound (cf. Bundgaard-Nielsen et al. 2011). As an alternative explanation for the drop in perceptual performance, one could point to a possible decreased attention to the task at the second testing point as compared with the novelty of the first testing time that triggered more focused interest and auditory processing in the participants. 
This finding would be in line with Nelson's (2020) observations concerning the initial 'novelty effect' in the perceptual performance of her child and adult L3 learners.

The source of cross-linguistic influence on the perception accuracy of the multilingual learners was found to vary: the accuracy of perceiving the L2 English rhotic /ɹ/ was higher when it was contrasted with an L3-German accented stimulus than with an L1-Polish accented stimulus in the FC task. This would point towards a stronger influence of the L1 than the L3 in the perception of rhotics in the L2, although recall that overall L2 rhotics were perceived highly accurately by the learners. On the other hand, in L3 German, the accuracy of perceiving the rhotic was lowest when it was contrasted with an L2 English accented stimulus rather than with an L1-Polish stimulus, which leads to the conclusion that the L2 rather than the L1 was a stronger source of CLI for the L3 perception of this feature. This would seem to suggest initially a greater influence of the L2 than the L1 on the perception of L3 rhotics, a finding that was also reported in Wrembel et al. (2019). Indeed, initial L3 learners appear to map new non-native phones to both their L1 and L2, which may be interpreted as aligning with the general reasoning of most L2 speech perception models: non-native phones are perceived in relation to previously established (or currently being established) categories depending on the degree of perceived cross-linguistic similarity between the phones concerned. The way in which such perceived cross-linguistic mappings are to be most effectively elicited in multilingual perceivers presents one of the greatest methodological challenges in future L3 speech research.

Regarding obstruent devoicing, it was not possible to disentangle the sources of CLI for L2 perception (due to the identical nature of L1-accented and L3-based stimuli). However, if we assume the existence of CLI at this stage of L3 learning, L3 perception of the devoicing feature was arguably influenced more significantly by the L1 than the L2, considering the more marked status of obstruent voicing in L2 as well as the similar standing of this feature in the L1 and L3.

Our results further showed that factors other than CLI might influence speech perception. Higher accuracy in the L2 than in the L3 and the fact that L2 is processed faster than the L3 are viewed as evidence that what also matters in non-native speech perception is experience. Our results corroborate the effects of language learning experience on non-native consonant perception, similarly to some previous studies (e.g., Bradlow et al. 1999; Rose 2010; Anderson 2011), which reported some improvement for more experienced participants or after perception training, but also considerable variation across subjects and the phones tested, as predicted by L2LPM (Escudero and Boersma 2004; Escudero 2005, 2009).

No correlation was found between the learners' perception accuracy and reaction time in the perception of rhotics and final devoicing in either language and at either observation point. This suggests that processing speed is quite independent of the degree of establishment of perceptual categories and may not be the most informative proxy for evaluations of the learnability of different sounds, at least for L3 learning contexts.

As for the role of markedness, in the present study we tested one feature which was more marked in the L2 than in the L1 and L3 (i.e., final obstruent devoicing) and one feature which was more marked in the L3 than in the L2 (i.e., German uvular vs. English postalveolar rhotics). L2 English rhotics were more accurately chosen when contrasted with L3 German stimuli, possibly suggesting a stronger influence of the less marked L1 rhotic than of the more marked L3 rhotic on the L2 perception of a relatively unmarked rhotic variant. By contrast, in L3 German, the less marked L2 rhotic influenced perception to a greater extent than the more marked L1 rhotic. In final obstruent devoicing, the accuracy was around or below the chance level, and it seems the more marked L2 variant has not been internalized by the learners at all. Therefore, in order to further disentangle the influence of language status from the markedness of the tested feature, more studies using various combinations of markedness and language status are needed.

#### **6. Conclusions**

The overall results indicate that CLI in perceptual development is feature-dependent, with relative stability evidenced for L2 rhotics, reverse trends for L3 rhotics, and no significant development for L2/L3 (de)voicing. We also found that perception accuracy of rhotics differed significantly with respect to stimulus properties (i.e., whether they were L1-accented, L2-accented, or L3-accented) and that it took longer to process the perception task in the L3 than in the L2. On the whole, major findings include a nonlinear development of foreign language phonology, diverse CLI patterns that are feature-dependent, and differential learnability of phonetic features. We hope the present findings will be an incentive to extend current theoretical frameworks beyond L2 speech perception models to account for these phenomena in multilingual speech perception.

**Author Contributions:** Conceptualization, M.W., R.K., A.B., and U.G.; methodology, M.W., R.K., A.B. and U.G.; formal analysis, M.W. and R.K.; investigation, M.W., R.K., A.B., and U.G.; data curation, M.W., R.K. and A.B.; writing—original draft preparation, M.W., R.K., A.B., and U.G.; writing—review and editing, M.W., R.K., A.B., and U.G.; visualization, M.W.; project administration, U.G.; funding acquisition, M.W., R.K., A.B., and U.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Polish-German Foundation of Science [project no. 2017-10].

**Acknowledgments:** We wish to acknowledge the assistance of Iga Krzysik and Halina Lewandowska in the data collection process.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Short-Term Sources of Cross-Linguistic Phonetic Influence: Examining the Role of Linguistic Environment**

#### **Daniel J. Olson**

School of Languages and Cultures, Purdue University, West Lafayette, IN 47907, USA; danielolson@purdue.edu

Received: 16 August 2020; Accepted: 13 October 2020; Published: 24 October 2020

**Abstract:** While previous research has shown that bilinguals are able to effectively maintain two sets of phonetic norms, these two phonetic systems experience varying degrees of cross-linguistic influence, driven by both long-term (e.g., proficiency, immersion) and short-term (e.g., bilingual language contexts, code-switching, sociolinguistic) factors. This study examines the potential for linguistic environment, or the language norms of the broader community in which an interaction takes place, to serve as a source of short-term cross-linguistic phonetic influence. To investigate the role of linguistic environment, late bilinguals (L1 English—L2 Spanish) produced Spanish utterances in two sessions that differed in their linguistic environments: an English-dominant linguistic environment (Indiana, USA) and a Spanish-dominant linguistic environment (Madrid, Spain). Productions were analyzed at the fine-grained acoustic level, through an acoustic analysis of voice onset time, as well as more holistically through native speaker global accent ratings. Results showed that linguistic environment did not significantly impact either measure of phonetic production, regardless of a speaker's second language proficiency. These results, in conjunction with previous results on long- and short-term sources of phonetic influence, suggest a possible primacy of the immediate context of an interaction, rather than broader community norms, in determining language mode and cross-linguistic influence.

**Keywords:** bilingualism; phonetics; language mode; cross-linguistic influence; transfer; voice onset time; global accent rating

#### **1. Introduction**

Research has shown that bilinguals, including both early bilinguals (e.g., MacLeod and Stoel-Gammon 2005) and late second language learners (e.g., Schmid et al. 2014), can effectively maintain two separate phonetic systems for their two languages. However, these two phonetic systems are not fully independent, and cross-linguistic influence, in which the phonetic system of one language is influenced by the competing language, has been evidenced across a range of bilingual populations and contexts. Importantly, there are a variety of both long-term and short-term sources of cross-linguistic influence (i.e., transfer), impacting both a bilingual's first (L1) and second (L2) languages. Broadly, short-term refers to contexts in which production or perception may be altered for a single speaker in response to immediate or momentary changes in the linguistic situation (e.g., bilingual language mode and code-switching), while long-term refers to sustained influences over longer periods of time (e.g., acquisition and immersion). While some long-term sources of phonetic cross-linguistic influence, such as immersion (e.g., Casillas 2020) and instruction (e.g., Lee et al. 2015), are well-studied, less research has focused on potential short-term (i.e., transient) sources of cross-linguistic influence (Simonet 2014).

Given the previous focus on long-term sources of cross-linguistic phonetic influence, and the emerging research showing the relevance of a number of short-term sources, the current study examines the potential impact of a novel source of such influence: linguistic environment. Linguistic environment is broadly defined as the language norms of the broader community in which an interaction or experimental paradigm is conducted. Bilinguals naturally move from one linguistic environment to another for work, travel, and social interaction, and such shifts in context or environment may serve to foster cross-linguistic influence at the phonetic level. This study adds to our theoretical understanding of the organization of bilingual phonetic systems, and highlights both the sources of and limits on cross-linguistic phonetic influence.

#### **2. Literature Review**

Cross-linguistic influence at the phonetic level can be described as the way in which the phonetic system of one of a bilingual's languages impacts the production and perception of speech sounds in their other language (Jarvis and Pavlenko 2008). Within studies of bilingual phonetics, and as a condition for examining cross-linguistic influence or transfer, early research sought to establish that bilinguals are indeed able to produce and maintain separate phonetic norms in each of their two languages (e.g., Caramazza et al. 1973). While bilinguals produce different phonetic categories for their two languages, the relationship of these categories to the monolingual norms may depend on a variety of factors. For example, some research has shown that bilinguals, particularly early bilinguals, may show little to no deviance from the monolingual targets in each of their languages (e.g., Flege et al. 1999; Guion et al. 2004; Piske et al. 2002), while others have found that late bilinguals (e.g., Flege and Eefting 1987; Flege and Port 1981) and even some early bilinguals (e.g., Flege et al. 1995; Fowler et al. 2008) produce phonetic categories that deviate from those produced by monolingual speakers.

#### *2.1. Long-Term Sources of Cross-Linguistic Phonetic Influence*

A significant body of research on cross-linguistic phonetic influence has examined the impact of L1 phonetic systems on L2 phonetic categories. Broadly, this line of research has established that the extant L1 system exerts influence over the L2 system, shaping both production and perception of L2 phonetics. Several L2 phonological models provide theoretical accounts for the mechanisms that govern the acquisition of new phonetic categories, and there is broad agreement that the ability to acquire a new L2 category depends on its relationship to existing L1 sounds (e.g., Speech Learning Model (Flege 1987, 1988, 1991, 1995); Native Language Magnet theory (Kuhl 1992, 1993a, 1993b); Perceptual Assimilation Model-L2 (Best and Tyler 2007)). Moreover, as the L2 phonetic system develops as the result of engagement with the L2, either through immersion (for review of study abroad and at-home instruction, see Casillas (2020); for longer-term immigration, see Piske et al. (2001); for study abroad, see Solon and Long (2018))<sup>1</sup> or instructed acquisition (for review of L2 phonetic instruction, see Lee et al. (2015)), the cross-linguistic influence of the L1 on the L2 diminishes and the L2 becomes more native-like. As shown in research on study abroad, such effects may be observed following stays of just a few weeks (e.g., Lord 2010), although Solon and Long (2018) note significant individual variation. Following long-term exposure, ultimate phonetic attainment may (e.g., Schmid et al. 2014) or may not (e.g., Flege 1987, 1991) match native speaker norms (for variability in attainment, see Simon (2009)).

Beyond the expected influence of long-term exposure to the L2 on L2 phonetic production, there is also evidence of influence of the L2 on the L1 phonetic system. Several studies have found a positive relationship between the length of residence in an L2 linguistic environment or L2 proficiency and the degree of phonetic influence of the L2 on the L1 (e.g., Bergmann et al. 2016; Major 1992; Stoehr et al. 2017). Speakers that had spent longer in the L2 linguistic environment, or those with greater proficiency in the L2, evidenced greater degrees of L2 phonetic transfer to the L1 (e.g., Major 1992), leading Major (1992) to claim that the L1 phonetic system "is not a fixed and stable system but rather a fluid and changeable one that is highly subject to the influence of a well-developed second system" (Major 1992, p. 204). Other research has shown a degree of bidirectional cross-linguistic influence (e.g., Fowler et al. 2008), in which the L1 influences the L2 and the L2 influences the L1. Again, cases of L2 to L1 transfer have most often been found following long-term engagement with the L2.

<sup>1</sup> For a discussion on the interaction between length of residence and other variables, notably age of acquisition, see Piske et al. (2001). As Piske et al. (2001) note, length of residence "only provides a rough index of overall L2 experience" (p. 197). While many studies talk about immersion, immigration, and length of residence, these may be used as a proxy for overall L2 experience. For the current study, it is relevant that these factors, whether conceptualized as L2 experience or length of residence, function as long-term sources of linguistic change and interaction.

Although much of this research has focused on a linear trajectory, such as tracking the shift in L2 production towards monolingual norms over time, it is clear that such long-term shifts are dynamic. In their seminal work, Sancier and Fowler (1997) tracked the phonetic production of a single Portuguese-English bilingual speaker as they moved between multiple linguistic environments. Their results showed that, following a stay of several months in a Portuguese linguistic environment, voice onset times (VOTs) for both Portuguese and English productions became shorter and more Portuguese-like, a phenomenon referred to as "gestural drift" (Sancier and Fowler 1997, p. 422). In contrast, following a stay of several months in an English linguistic environment, VOTs for both languages became longer and more English-like. These results demonstrate that a long-term change in linguistic environment can promote a degree of phonetic interaction, whereby both of the bilingual's languages are impacted by the language of the ambient environment. In short, cross-linguistic phonetic influence, as evidenced by bilingual gestural drift, appears to be dynamic and responds long-term to factors in the ambient linguistic environment.

The existing studies on long-term immersion, study abroad, and gestural drift generally consider changes in L1 and L2 phonetic production following a significant length of stay in the L2 linguistic environment, from weeks or months (e.g., Díaz-Campos 2004; Lord 2010; Nagle et al. 2016) to years (Piske et al. 2001). As noted by Simonet (2014), "The vast majority of work on interlingual or cross-linguistic phonetic influence in bilingualism does not explicitly distinguish between long-term and transient interference. Albeit implicitly, most studies explore features that are attributed to long-term interference" (p. 27). Yet more recent research has shown that the degree of cross-linguistic phonetic influence may also be subject to short-term variables.

#### *2.2. Short-Term Sources of Cross-Linguistic Phonetic Influence*

While the previous research detailed above highlights the notion that phonetic influence between a bilingual's two languages may occur as the result of long-term factors, more recent research has begun to examine short-term sources of cross-linguistic phonetic influence. As Simonet (2014) notes, other authors use somewhat different terminology to refer to the same or similar phenomena (p. 27). Grosjean (2012) differentiates between static and transient sources of cross-linguistic influence, while Paradis (1993) differentiates between competence- and performance-related cross-linguistic interference. For reasons noted in Simonet (2014), the long-term vs. short-term distinction is used henceforth. Here, *short-term* is used in contrast to long-term, broadly referring to situations or contexts in which phonetic production or perception is altered for a single speaker in response to immediate changes in the broader linguistic situation. These short-term factors may include a shift in the languages used in an interaction, also called the language context (e.g., Olson 2016), changes in external sociolinguistic factors, and even use of cognate tokens.

Several studies have begun to examine the potential of the language of a given interaction or experimental session as a source of short-term cross-linguistic phonetic influence. Simonet (2014), for example, examined the production of Catalan vowels by Spanish-Catalan bilinguals across two different sessions: unilingual and bilingual. In the unilingual session, all stimuli—utterances containing the target Catalan tokens—were drawn from Catalan, as were the session instructions. In contrast, stimuli in the bilingual session included utterances containing the target words from both Spanish and Catalan, in random order. It is important to note that the target tokens were embedded in meaningful utterances, and switches between Catalan and Spanish took place only at the utterance level. As such, the tokens under consideration were all non-switched tokens. Results from the production experiment revealed that Catalan vowels differed between the monolingual and bilingual sessions, with Catalan vowels becoming more Spanish-like in the bilingual session. Similarly, in a cued picture-naming paradigm, Olson (2013) compared the VOT of English and Spanish tokens produced in a monolingual session (i.e., 95% English—5% Spanish or 95% Spanish—5% English) with tokens produced in a bilingual session (i.e., 50% English—50% Spanish). Results showed differences in phonetic production of VOT depending on the nature of the session. Non-switched productions in the bilingual contexts, specifically in a participant's L1, shifted in the direction of the opposite language, with English VOTs becoming more Spanish-like and Spanish VOTs becoming more English-like. A similar effect may be seen in perception, in which the auditory context (i.e., English-like or Spanish-like acoustic features) surrounding a given ambiguous token serves to engage a language-specific perceptual system (Gonzales and Lotto 2013; for perceptual boundaries and language modes, see Casillas and Simonet (2018)). 
Further research has shown that while bilingual experimental sessions may foster cross-linguistic phonetic influence, the nature of this transfer may be dependent on a speaker's proficiency or dominance (Amengual 2018). Notably, Amengual (2018) found that bilinguals were more likely to experience cross-linguistic transfer resulting from a bilingual paradigm in their less dominant language.

These results from experimental paradigms also find some preliminary support from a more naturalistic paradigm. In her study of bilingual English-Arabic speaking children, Khattab (2002) collected naturalistic data in both English and Arabic-oriented language sessions. The results showed that while children clearly differentiated between the two phonetic systems, particularly with respect to /r/, English tokens produced during the Arabic sessions underwent a degree of phonetic transfer, becoming decidedly more Arabic-like. Subsequent analysis suggests that such transfer may relate to the language dominance of the interlocutor (Khattab 2009), with Arabic-accented English used during interactions with Arabic-dominant listeners. These results suggest that the use of two languages in the same interaction may serve to promote a degree of cross-linguistic influence.

Further evidence for cross-linguistic phonetic influence arising from the use of two languages in a single interaction can be seen in work on the phonetics of code-switching. *Code-switching* refers to the alternation between two or more languages or language varieties in a single discourse (e.g., Myers-Scotton 1997). As such, code-switching represents a clear point in an interaction in which both languages are simultaneously (or nearly simultaneously) activated and serves as a potential short-term source of cross-linguistic phonetic influence. Unlike the bilingual sessions detailed above (e.g., Khattab 2002; Olson 2013; Simonet 2014), which varied the ratio of the language used in a given session or block but examined non-switched productions, research focused on the phonetics of code-switching has generally focused on the potential for phonetic transfer at or near the point of switch. A growing body of research has begun to establish that code-switching impacts phonetic production at the segmental level, most notably inducing a degree of phonetic transfer (although, for a lack of transfer, see Grosjean and Miller (1994) and Muldner et al. (2019)). The exact nature of the phonetic influence found has varied. Several studies have found evidence of unidirectional transfer at the point of switch (Antoniou et al. 2011; Balukas and Koops 2015; Bullock et al. 2006), with Language A shifting in the direction of Language B, but Language B failing to show evidence of transfer. Other studies have found bidirectional transfer (Bullock and Toribio 2009; González-López 2012; Olson 2016; Schwartz et al. 2015), with Language A shifting in the direction of Language B and Language B shifting in the direction of Language A (for an account of unidirectional vs. bidirectional transfer, see Olson (2019)). This cross-linguistic phonetic influence has been found across a variety of paradigms, including naturalistic (e.g., Balukas and Koops 2015) and read speech (e.g., Antoniou et al. 
2011), and for different types of code-switches (e.g., for single-word insertions, see Olson (2016); for alternational code-switching, see Bullock and Toribio (2009)). These shifts are largely phonetic in nature, rather than phonological, and bilinguals generally do not implement phonological categories of the opposite language. Thus, code-switching, which activates both of a bilingual's languages in a compressed timeframe, appears to serve as a short-term source for bilingual phonetic influence.

An additional case for short-term phonetic influence driven by linguistic factors can be seen in the production of cross-linguistic cognates. Cognates, words that have significant cross-linguistic overlap in meaning, phonology, and orthography (Amengual 2018), may result in the activation of both language systems, and as such represent a possible short-term source of cross-linguistic influence. Cognates have been shown to be produced with a degree of phonetic transfer. Amengual (2012), for example, showed that Spanish-English bilinguals produced longer (i.e., more English-like) VOTs in Spanish for cognate words than non-cognate words, a finding that held for heritage speakers (i.e., early bilinguals) and both late L1 English—L2 Spanish and L1 Spanish—L2 English bilinguals (for Spanish-Catalan bilinguals, see Amengual (2016)). Again, as cognates may activate both languages, they can be seen as a short-term source of cross-linguistic phonetic influence.

Changes in the language of the paradigm or interaction (e.g., Simonet 2014), code-switching (e.g., Olson 2016), and cognate status represent cases in which a short-term or immediate shift in the linguistic content of an interaction favors a degree of cross-linguistic phonetic influence. Yet, there is some evidence that non-linguistic changes in the external environment may also impact linguistic behavior. Hay and Drager (2010), for example, found that including region-specific objects in the experimental environment, such as a stuffed kangaroo (i.e., Australia-specific) or kiwi (New Zealand-specific), impacted vowel perception. The authors suggest that objects in the "ambient environment" can impact participant phonetic perception (p. 889). Other studies have found that visually salient characteristics of a speaker may influence phonetic perception (e.g., for intelligibility and accentedness, see Babel and Russell (2015)). For example, in a perceptual experiment, Koops et al. (2008) found that listeners' phonetic perception of stimuli reflecting an on-going, age-graded phonetic change (i.e., PIN-PEN unmerger in Houston, TX, USA) depended on perceived speaker age. Paralleling the change in progress in the local community, listeners were more likely to assume merged phonetic categories for older speakers and unmerged categories for younger speakers. The impact of such external factors on phonetic perception also extends to non-visible social information, such as supposed geographic origin (e.g., Niedzielski 1999). In these cases, it is not the linguistic content of an interaction or paradigm that shifts, but rather the surrounding environment and/or perceived interlocutor.

It is worth considering these short-term sources of cross-linguistic influence within the framework of a bilingual's language modes (e.g., Grosjean 1998, 2001, 2008). Bilinguals have the ability to operate along a linguistic continuum from operation entirely in Language A (i.e., monolingual mode) to operation in Language B (i.e., monolingual mode), including a variety of bilingual modes in which each of the two languages may be used to differing degrees. Language mode has been described in terms of the relative activation of each of the bilingual's two languages, with monolingual mode involving the activation of only (or predominantly) one language and a balanced bilingual mode involving the roughly equal activation of the two languages (Grosjean 2008). Grosjean (2008) notes that a speaker's language mode may be impacted by a variety of factors, including the "form and content of the message," the language act, the interlocutors, and the general situation of the interaction. All of these factors may be considered "short-term," in that they are subject to change from day-to-day or even interaction-to-interaction for a single bilingual speaker. Moreover, shifts in a speaker's (or listener's) position on the language mode continuum may impact their language production (or perception) patterns (e.g., Soares and Grosjean 1984). The previous findings of differing levels of cross-linguistic phonetic influence presented above can be reconceptualized as resulting from differing language modes. Short-term sources of cross-linguistic influence can be seen as variables that cause an immediate shift in a bilingual's position along the language mode continuum. 
Thus, short-term variables, including the language(s) of a given interaction or paradigm, the use of code-switching, and changes in the surrounding environment, may effectively serve to manipulate the relative activation (or suppression) of a bilingual's two languages, with more equal activation resulting in greater degrees of cross-linguistic phonetic influence.

#### *2.3. Research Questions*

Previous research has established that bilinguals, including late L2 learners, establish different phonetic norms for their two languages. Moreover, there appear to be both long-term (e.g., acquisition and immersion) and short-term (e.g., bilingual mode, code-switching) sources of phonetic cross-linguistic interaction. While long-term immersion in a given linguistic environment has been shown to impact the degree of cross-linguistic influence, the current study examines the potential for linguistic environment to act as a short-term source of cross-linguistic phonetic influence. That is, it asks whether a change in linguistic environment results in an immediate shift in the degree of cross-linguistic phonetic influence. Two specific research questions are addressed:

RQ1: Does a short-term change in the linguistic environment impact phonetic production? For this study, a short-term change in linguistic environment is operationalized as a single speaker moving from an English-dominant linguistic environment to a Spanish-dominant linguistic environment (or vice versa).

**Hypothesis 1.** *Drawing on previous research that has shown an impact of both long-term and short-term sources of cross-linguistic phonetic influence, it was anticipated that a shift in linguistic environment would result in a corresponding shift in phonetic production. Specifically, L1 English—L2 Spanish speakers would produce Spanish tokens with more English-like phonetic features in an English-dominant linguistic environment and more Spanish-like phonetic features in a Spanish-dominant environment.*

RQ2: Does proficiency in the L2 interact with linguistic environment?

**Hypothesis 2.** *Given the previous finding that as L2 experience and proficiency increase, L2 phonetic production shows less evidence of L1 to L2 cross-linguistic influence, it was anticipated that speakers with greater proficiency in the L2 may show smaller effects of a change in linguistic environment on their phonetic production.*

To assess the role of linguistic environment on phonetic production, the same participants were tested in both an English-dominant (Indiana, USA) and Spanish-dominant environment (Madrid, Spain). To focus squarely on the potential for linguistic environment to serve as a short-term source of cross-linguistic influence and limit the long-term effects of acquisition and immersion, participants were tested immediately prior to leaving one environment (i.e., less than 72 h pre-departure) and immediately upon arrival in the second environment (i.e., in the first 72 h in the new environment). The order of sessions was counterbalanced, such that one group was tested first in the English-dominant linguistic environment and then in the Spanish-dominant linguistic environment, and the other group received the opposite session order (i.e., Spanish-dominant environment then English-dominant linguistic environment).

Two different levels of phonetic analysis were conducted: an acoustic analysis of the voice onset time cue (VOT) and a global accent rating (GAR). These two measures were chosen to provide both a fine-grained measure of a relevant segment that differs cross-linguistically between English and Spanish (i.e., VOT), as well as a more global measure of perceived accent to capture potential shifts in other features of production, such as vowels and suprasegmental features (i.e., GAR).

#### **3. Methodology**

#### *3.1. Participants*

Twenty English-speaking (L1) learners of Spanish (L2) participated in the current study. All participants were enrolled in a six-week immersive study abroad program in Madrid, Spain (Spanish-dominant environment) during the summers of either 2015 or 2018. The study abroad program included host family stays, classes at the local university, and several day trips to surrounding cultural sites, all of which were conducted in the L2. Participants were all learners at the intermediate to advanced level, enrolled in 3rd or 4th year university language courses. All participants gave informed consent prior to beginning the task and the protocol was approved by the Institutional Review Board at Purdue University (Protocol #: 1303013396).

Immediately prior to the main task, participants completed the Bilingual Language Profile (BLP; Birdsong et al. 2012). The BLP relies on self-assessments of a participant's language history, language use, language proficiency, and language attitudes in each of the relevant languages to provide a composite language score. All participants are native speakers of English, having learned English from birth (*M* age of acquisition = 0.0, *SD* = 0.0) and Spanish after the age of 5 (*M* = 12.1, *SD* = 3.0). Across all components, participants self-rated English as higher than Spanish (Table 1).


**Table 1.** Unweighted results of the Bilingual Language Profile (BLP) subcomponents.

<sup>a</sup> For each scale, higher ratings correspond to more engagement with that component of a given language.

Following the BLP scoring procedures (for details, see Birdsong et al. (2012)), a composite language score was computed for each participant in each language, giving equal weight to each subcomponent. The possible composite language score, henceforth referred to as proficiency, ranges from 0, corresponding to no proficiency in the language, to 218, indicating high proficiency in the language. As expected, participants reported high language scores for English (*M* = 204.6, *SD* = 7.9). In contrast, participants reported lower and more varied Spanish language scores (*M* = 77.5, *SD* = 18.4). Figure 1 illustrates the distribution of participants with respect to their Spanish language score. All participants are considered to be English-dominant. All participants reported normal speech and hearing.
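The equal weighting described above can be illustrated with a short sketch. It is not the official BLP scoring procedure (see Birdsong et al. (2012) for that); it only assumes that each of the four subcomponents is rescaled to an equal share of the 218-point maximum and that the shares are summed. The function name `composite_score` and the raw module maxima are hypothetical placeholders.

```python
# Illustrative sketch only; NOT the official BLP scoring script.
# Assumption: each subcomponent is rescaled to an equal share of the
# 218-point maximum reported in the text, and the shares are summed.

MAX_COMPOSITE = 218.0  # upper bound of the composite score given in the text


def composite_score(raw, raw_max, max_composite=MAX_COMPOSITE):
    """Return an equal-weight composite from raw subcomponent scores.

    raw      -- dict mapping subcomponent name to the participant's raw score
    raw_max  -- dict mapping subcomponent name to the maximum possible raw score
    """
    if raw.keys() != raw_max.keys():
        raise ValueError("raw and raw_max must cover the same subcomponents")
    share = max_composite / len(raw_max)  # e.g., 218 / 4 = 54.5 points each
    return sum(share * raw[name] / raw_max[name] for name in raw_max)


# Placeholder maxima chosen purely for illustration (hypothetical values).
EXAMPLE_MAX = {"history": 100, "use": 100, "proficiency": 100, "attitudes": 100}
```

Under these assumptions, a participant at ceiling on every subcomponent receives 218 and a participant at floor receives 0, which matches the range reported above.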

**Figure 1.** BLP Spanish language score by participant.

To assess the effect of linguistic environment while counterbalancing session order, participants (*N* = 20) were divided into two mutually exclusive groups. The first group (*n* = 12) was tested first in the English-dominant environment and then in the Spanish-dominant environment. The second group (*n* = 8) was tested first in the Spanish-dominant environment and then in the English-dominant environment. Again, to limit any possible long-term sources of cross-linguistic influence, participants were tested immediately prior to departure from Environment A (< 72 h) and immediately upon arrival in Environment B.<sup>2</sup>

<sup>2</sup> In short, one group of participants was tested en route to the host country of the study abroad program (i.e., prior to leaving Indiana, USA, and upon arrival in Madrid, Spain) and one group of participants was tested en route to the home country (i.e., prior to leaving Madrid, Spain, and upon arrival in Indiana, USA). While this presents a potential confound, such that one group may have had a six-week "advantage" by participating following six weeks of immersion in the host country, statistical analysis controlled for between-participant differences with the random effects structure, effectively comparing each participant to her or himself.

#### *3.2. Stimuli*

Stimuli for the read-aloud task were modified Spanish versions of utterances (*N* = 5) used in a global accent rating task (e.g., Flege 1988; Riney and Flege 1998). Relevant for the VOT analysis, embedded within these utterances were a number of tokens with word-initial voiceless stops. English and Spanish both employ a bi-partite distinction between voiceless and voiced stops in word-initial position. VOT, defined as the temporal difference between the release of the oral closure and the onset of vocal fold vibration, has been shown to be a reliable cue to this voicing distinction (e.g., Lisker and Abramson 1964). Although both English and Spanish make use of this phonological distinction between voiceless and voiced phonemes, the phonetic cues differ. Specifically, English voiceless stops are produced with long-lag VOT (30–100 ms), while Spanish voiceless stops are produced with short-lag VOT (0–30 ms). Given this cross-linguistic difference, English-speaking learners of Spanish are tasked with acquiring and maintaining separate, Spanish-like short-lag VOT norms. A number of authors have noted that English-speaking learners of Spanish may produce Spanish voiceless stop consonants with English-like VOTs (e.g., Hammond 2001). Figure 2 shows the spectrogram and waveform for the word <*calle*> [kaʝe] 'street,' produced by a native English speaker (left) and native Spanish speaker (right). While the use of English-like VOT in Spanish voiceless stops is unlikely to cause issues of intelligibility (Lord 2005; Munro and Derwing 1995), it may impact the perception of speaker accentedness. Given both the gradient nature of VOT, and the previous evidence that bilinguals, including L2 learners, are able to effectively distinguish between English and Spanish VOT, VOT may serve as a sensitive measure of cross-linguistic influence between the two phonetic systems. 
A total of eight words contained word-initial voiceless stops, with the following distribution: /p/ = 2, /t/ = 2, /k/ = 4.
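The VOT definition and the two lag ranges given above can be made concrete with a minimal sketch. The function names are hypothetical, and because the text gives overlapping endpoints (0–30 ms and 30–100 ms), the choice to assign exactly 30 ms to the short-lag category is an assumption made only for illustration.

```python
# Minimal sketch of the VOT definition and lag ranges described in the text.
# Assumption: a VOT of exactly 30 ms is treated as short-lag, since the text
# gives overlapping endpoints (0-30 ms and 30-100 ms).

SHORT_LAG_MAX_MS = 30.0   # upper edge of the Spanish-like short-lag range
LONG_LAG_MAX_MS = 100.0   # upper edge of the English-like long-lag range


def vot_ms(release_s, voicing_onset_s):
    """VOT in milliseconds: onset of voicing minus release of the closure.

    Both arguments are time stamps in seconds, e.g., as read off a waveform.
    """
    return (voicing_onset_s - release_s) * 1000.0


def lag_category(vot):
    """Label a VOT value (ms) using the ranges reported in the text."""
    if vot < 0:
        return "voicing lead"          # voicing begins before the release
    if vot <= SHORT_LAG_MAX_MS:
        return "short-lag"             # Spanish-like voiceless stop
    if vot <= LONG_LAG_MAX_MS:
        return "long-lag"              # English-like voiceless stop
    return "beyond long-lag range"
```

For instance, a release at 0.100 s followed by voicing at 0.165 s yields a VOT of roughly 65 ms, which falls in the long-lag range.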

**Figure 2.** Spectrogram and waveform of <*calle*> 'street' produced by a native English speaker (**left**) and a native Spanish speaker (**right**). The difference in VOT is visible in the phoneme /k/.

Relevant for the GAR task, the five utterances contained a wide variety of segments (i.e., consonants and vowels). Several segments were included that differ significantly in English and Spanish phonetic production (e.g., word-initial <r>, which is produced as /ɹ/ in English and /r/ in Spanish), which could potentially serve as markers of English-accented Spanish. In addition, several segments were included that differ across different dialects of Spanish (e.g., <z>, <ci>, and <ce> are produced as /s/ in most dialects of Spanish, but as /θ/ in Peninsular Spanish), which could potentially serve as markers of the local (Peninsular) dialect of Spanish.

Example (1) provides several sample stimuli. Target phonemes for the VOT analysis are underlined.


#### *3.3. Procedure*

To assess the potential impact of linguistic environment, each speaker participated in two experimental sessions, each conducted in a different linguistic environment. One experimental session was conducted in an English-dominant linguistic environment (Indiana, USA), and one session was conducted in a Spanish-dominant linguistic environment (Madrid, Spain). In Indiana, USA, English is the home language of approximately 91.1% of the population, with Spanish spoken in the home by only 4.7% of the population (U.S. Census Bureau 2018). In Spain, 90% of the population speaks Spanish as a native language, while only 2.2% of the population speaks English as a native language (Instituto Nacional de Estadística 2019), although statistics were not available for the Community of Madrid. Moreover, both Indiana, USA, and Madrid, Spain, are considered to be "single language" environments in which little code-switching is present (Green and Abutalebi 2013). While the experience of each individual participant may vary, a finding captured by the language background questionnaire, the two environments are clearly distinguishable by the language of the broader environment.

Interaction with the experimenter was intentionally conducted in both languages in each session. The experimenter was a native speaker of midwestern American English who had spent several years living in Madrid, Spain, and was proficient in the local Spanish dialect. Written instructions, provided before the start of the oral production task, were presented in both English and Spanish. With the exception of the location of the two sessions, other experimental factors were held as constant as possible: both sessions used identical equipment, instructions, and consent forms and were conducted by the same experimenter.

Following the collection of language background information, target utterances were presented visually using SuperLab v.5 (Cedrus Corporation 2015) and each utterance was repeated three times during each session. Utterances were recorded in a quiet room with a head-mounted microphone using Audacity v. 2.2.2 recording software. Both the instructions and the recording equipment were the same in each of the two different environments.

#### *3.4. Voice Onset Time Analysis*

A total of 960 tokens were considered in the initial VOT analysis (20 speakers × 8 tokens × 3 repetitions × 2 sessions = 960 tokens). Twenty-five tokens were classified as missing, and an additional 29 tokens were eliminated because of speech errors (e.g., a false start on the target word) or recording errors (e.g., a noisy recording). Lastly, outliers were eliminated (*n* = 5), defined as those tokens with VOT values more than 3 SD above or below the mean. A total of 901 tokens were included in the final VOT analysis.
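The token counts above can be checked with simple arithmetic, and the outlier criterion can be sketched as a filter. The sketch assumes that the mean and standard deviation are computed over all remaining tokens pooled together and that the sample standard deviation is used; the text does not specify either point, and the function name `drop_outliers` is hypothetical.

```python
from statistics import mean, stdev

# Counts reported in the text.
SPEAKERS, TOKENS, REPETITIONS, SESSIONS = 20, 8, 3, 2
MISSING, ERRORS, OUTLIERS = 25, 29, 5

initial_tokens = SPEAKERS * TOKENS * REPETITIONS * SESSIONS   # 960
final_tokens = initial_tokens - MISSING - ERRORS - OUTLIERS   # 901


def drop_outliers(values, n_sd=3.0):
    """Keep values within n_sd standard deviations of the mean.

    Assumption: mean and (sample) SD are computed once over the pooled
    values; the study may have grouped tokens differently.
    """
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) <= n_sd * spread]
```

The two products reproduce the totals reported above (960 initial tokens and 901 tokens in the final analysis).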

VOT was defined as the temporal difference between the release of the oral closure and the onset of vibration of the vocal folds (e.g., Lisker and Abramson 1964). Tokens were measured using Praat (Boersma and Weenink 2018), with particular attention to the waveform. Tokens were coded blindly by a trained research assistant who was unaware of the linguistic environment in which the utterance was produced.

Statistical analysis was conducted using R statistical software (R Core Team 2013) and the lme4 package (Bates et al. 2015). Following recommendations by Barr et al. (2013), the maximal random effects structures that permitted model convergence were used. The significance criterion was set at |t| > 2.00. Power analysis was conducted with the simr package (Green and MacLeod 2016).

#### *3.5. Global Accent Rating Analysis*

Five native Spanish speakers from the target region (Madrid, Spain) were recruited as raters for the GAR task. All raters were native speakers of the target dialect and, based on their BLP results (Birdsong et al. 2012), were considered highly dominant in Spanish. The rating procedure was largely based on work by Riney and Flege (1998).

Two of the three repetitions per utterance were selected from each session for presentation to the native speakers. When possible, preference was given to the second and third repetitions of the stimuli. For productions containing speech errors, such as pauses or fillers, the first repetition was substituted. Raters could listen to each presentation multiple times, if needed. The intensity of each utterance was scaled via script in Praat (Boersma and Weenink 2018) to 65 dB. The order of presentation was fully randomized. A total of 400 learner-produced utterances were selected for presentation to the native raters.

Raters were asked to provide accent ratings for each utterance on a 9-point Likert scale in which 1 corresponded to a "very strong non-native accent" and 9 a "native accent." Each accent rating was converted to a z-score on a by-rater basis to normalize for the different ranges of values used by each rater. A total of 2000 ratings were provided by native speakers (5 utterances × 2 repetitions × 2 sessions × 20 participants × 5 raters).
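The by-rater normalization described above can be sketched as follows. Each rating is expressed relative to the mean and standard deviation of the rater who produced it, so that raters who use different portions of the 9-point scale become comparable. The use of the population standard deviation is an assumption (the text does not say which estimator was used), and the function name `zscore_by_rater` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Design counts reported in the text.
N_UTTERANCES = 5 * 2 * 2 * 20        # utterances x repetitions x sessions x participants
N_RATINGS = N_UTTERANCES * 5         # each utterance rated by 5 raters


def zscore_by_rater(ratings):
    """Convert raw ratings to z-scores computed separately for each rater.

    ratings -- list of (rater_id, utterance_id, score) tuples
    Returns a list of (rater_id, utterance_id, z) tuples in the same order.
    Assumption: population SD; a rater with zero variance gets z = 0.0.
    """
    scores_by_rater = defaultdict(list)
    for rater, _, score in ratings:
        scores_by_rater[rater].append(score)
    stats = {r: (mean(s), pstdev(s)) for r, s in scores_by_rater.items()}

    normalized = []
    for rater, utterance, score in ratings:
        centre, spread = stats[rater]
        z = (score - centre) / spread if spread > 0 else 0.0
        normalized.append((rater, utterance, z))
    return normalized
```

Under this scheme, a rater who uses only the values 1 to 3 and a rater who uses only 7 to 9 yield the same z-scores for their lowest, middle, and highest ratings.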

Statistical analysis was again conducted using R statistical software (R Core Team 2013) and the lme4 (Bates et al. 2015) and simr (Green and MacLeod 2016) packages. The maximal random effects structures that permitted model convergence were used (Barr et al. 2013).

#### **4. Results**

#### *4.1. Voice Onset Time*

An initial mixed effects model was conducted with VOT (ms) as the dependent variable and linguistic environment (English-dominant environment vs. Spanish-dominant environment) as the fixed effect. Random effects included participant and item (i.e., word), with random slopes and intercepts. Examination of a Q-Q plot confirmed that the residuals of the model were normally distributed. Contrary to the initial hypotheses, results from this initial model demonstrated no significant effect of linguistic environment on VOT, with similar VOTs produced in the English- (*M* = 44.2 ms, *SD* = 19.4 ms) and Spanish-dominant (*M* = 43.6 ms, *SD* = 18.8 ms) environments. The results for the fixed effects are available in Table 2 (for random effects and model equation, see Appendix A). Figure 3 shows the VOTs produced in each linguistic environment, separated by initial phoneme. Again, while we expect some differences in VOT across place of articulation (Cho and Ladefoged 1999), the key comparison is between VOT in the English and Spanish environments.


**Table 2.** Voice Onset Time (VOT) Model Fixed Effects.

**Figure 3.** VOT (ms) by linguistic environment and place of articulation.

To ensure that the lack of a significant effect of linguistic environment on VOT production was not the result of an underpowered study, a power analysis was conducted using the simr package (Green and MacLeod 2016). Results of a simulation-based power analysis, with a medium effect size (*d* = 0.5) and based on 500 simulations, showed that the current study design surpassed the 80% power threshold (power for predictor linguistic environment = 99.8%, CI = [98.9, 100]). The outcome of the power analysis suggests that the lack of a significant effect of linguistic environment is not likely to be the result of an underpowered study design.
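The logic of a simulation-based power analysis can be illustrated with a deliberately simplified sketch. It does not use simr or a mixed effects model: it substitutes a one-sample t-test on simulated by-participant difference scores and applies the |t| > 2.00 criterion mentioned earlier. The function name, the standardized effect size on difference scores, and the default seed are assumptions made for illustration, so the estimates it produces are not expected to match the power values reported above, which reflect the full design with multiple tokens per participant.

```python
import math
import random
import statistics


def simulate_power(n_participants, effect_size, n_sims=500, t_crit=2.0, seed=1):
    """Estimate power by Monte Carlo simulation (simplified, hypothetical).

    Each simulated participant contributes one difference score drawn from
    a normal distribution with mean `effect_size` and unit variance. Power
    is the share of simulated data sets in which |t| exceeds `t_crit`.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        diffs = [rng.gauss(effect_size, 1.0) for _ in range(n_participants)]
        mean_diff = statistics.fmean(diffs)
        std_err = statistics.stdev(diffs) / math.sqrt(n_participants)
        if abs(mean_diff / std_err) > t_crit:
            significant += 1
    return significant / n_sims


# Illustrative call: 20 participants (as in the study), medium effect size.
power_n20 = simulate_power(20, 0.5)
```

The design choice is the same as in any simulation-based approach: generate many data sets under an assumed effect, analyze each with the planned test, and count how often the effect is detected.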

Related to the second research question, namely whether the effect of linguistic environment is conditioned by a given participant's proficiency in the target language, a second mixed effects model was conducted with the dependent variable of VOT. Fixed effects included linguistic environment (English-dominant environment vs. Spanish-dominant environment), proficiency, and their interaction. Proficiency was included as a continuous variable, with proficiency values determined by each participant's overall BLP language score for Spanish (Birdsong et al. 2012).<sup>3</sup> Random effects included participant and item, with random intercepts and random slopes by linguistic environment. More complex random effects structures, specifically random slopes by each of the two fixed effects, did not permit model convergence. Examination of a Q-Q plot confirmed that the residuals of the model were normally distributed. Results of this second model (see Table 3) showed that there was no significant effect of either linguistic environment or proficiency on VOT production. Moreover, there was no significant interaction between these two fixed effects, suggesting that linguistic environment was not a factor, regardless of a given participant's level of proficiency (for random effects and model equation, see Appendix B). Figure 4 illustrates these results. Again, lower VOT corresponds to more native-like pronunciation. Note that, while proficiency was included as a linear predictor in the model, Figure 4 groups participants by relative proficiency for the purposes of visualization. Notably, the generally expected trend is visible in Figure 4, with participants with higher proficiency in Spanish showing lower, more Spanish-like VOTs. However, it is clear that there is no effect of linguistic environment.<sup>4</sup>



<sup>3</sup> Following suggestions by an anonymous reviewer, subsequent analysis was conducted with proficiency as a three-way categorical variable. Two different group cut-offs were considered. First, parallel to the subgroups in Figure 4, three approximately equal-sized proficiency groups were considered: low (*n* = 6), mid (*n* = 7), and high proficiency (*n* = 7). Second, three proficiency groups were identified using visual analysis of participant BLP Spanish score distributions: low (*n* = 5), mid (*n* = 12), and high proficiency (*n* = 3). Model comparison, following the procedure outlined below, showed that neither categorical approach to proficiency significantly improved model fit for either the VOT (equal-sized groups: χ2(4) = 1.380, *p* = 0.848; unequal-sized groups: χ2(4) = 7.427, *p* = 0.115) or the GAR analysis (equal-sized groups: χ2(4) = 3.517, *p* = 0.475; unequal-sized groups: χ2(4) = 5.690, *p* = 0.224) relative to a model without proficiency. As such, proficiency, regardless of operationalization, does not appear to significantly influence VOT or to interact with linguistic environment for this group of learners.

<sup>4</sup> As the main goal of this project was to examine the effect of linguistic environment, the main analysis compares the productions of participants in two different linguistic environments. The two groups (i.e., US-Spain and Spain-US) served to counter-balance session order. As such, some group differences are possible as the Spain-US group was tested following six weeks in the Spanish linguistic environment. To address the possible effect of group, a subsequent model was conducted with linguistic environment and group as fixed effects. Model comparison showed that the inclusion of group significantly improved model fit (χ2(2) = 7.395, *p* = 0.025). This is not unexpected, given that overall, the Spain-US group (*M* = 39.6 ms, *SD* = 18.3 ms) produced significantly shorter VOTs than the US-Spain group (*M* = 46.8 ms, *SD* = 19.1 ms), *t*(787) = 5.693, *p* < 0.001. To confirm that the impact of linguistic environment was similar for each group, separate models were conducted for each group. The model structure was parallel to the main model above. Results suggested that linguistic environment did not significantly impact VOT for either the US-Spain group (*b* = −2.616, *t* = −1.136) or the Spain-US group (*b* = 4.799, *t* = 1.923).

**Figure 4.** VOT (ms) by linguistic environment and proficiency. For the purposes of illustration only, participants were grouped into low (*n* = 6), mid (*n* = 7), and high proficiency (*n* = 7) groups by the continuous Spanish language component score of the BLP.

Finally, for completeness, a model comparison was done to assess the contribution of each of the variables of interest. Model comparison was conducted by comparing the model involving the two fixed effects (i.e., linguistic environment and proficiency), with the random effects structure detailed above, to submodels created by dropping one of the fixed effects but maintaining a similar random effects structure. Results of the model comparison show that there was no significant difference between the most complex model (log likelihood = −3706.4) and the submodel without the fixed effect of linguistic environment (log likelihood = −3706.4, χ2(2) = 0.041, *p* = 0.979). Similarly, there was no difference between the most complex model and the submodel without the fixed effect of proficiency (log likelihood = −3707.2, χ2(2) = 1.700, *p* = 0.428). As such, neither fixed effect contributed significantly to improving the model fit.
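The model comparisons reported here appear to be likelihood-ratio tests, in which the test statistic is referred to a chi-square distribution. As a minimal sketch (in Python, not the R code used for the analysis), the *p*-values can be recovered from the reported chi-square statistics. The function names are hypothetical, and the closed-form expression used below holds only for even degrees of freedom, which covers the df = 2 and df = 4 tests reported in this section.

```python
import math


def chi2_sf_even_df(chi2, df):
    """Upper-tail probability of a chi-square statistic for even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only holds for positive even df")
    half = chi2 / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


def likelihood_ratio_statistic(loglik_full, loglik_reduced):
    """Twice the log-likelihood difference between two nested models."""
    return 2.0 * (loglik_full - loglik_reduced)


# Chi-square values reported in the text for the VOT model comparison.
p_environment = chi2_sf_even_df(0.041, 2)
p_proficiency = chi2_sf_even_df(1.700, 2)
```

Because the reported log likelihoods are rounded to one decimal place, recomputing the statistic from them (e.g., 2 × (−3706.4 − (−3707.2)) = 1.6) only approximates the reported value of 1.700.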

Taken as a whole, the results of the VOT analysis suggest that linguistic environment does not significantly impact the production of VOT. Moreover, for this group of learners, proficiency does not seem to be a relevant factor in the production of VOT.

#### *4.2. Global Accent Rating*

While the VOT analysis focuses on a specific segment, limited to only the voiceless stop consonants, the GAR analysis provides a more holistic metric of participant phonetic production. That is, while linguistic environment may not play a role in the production specifically of VOT, it is possible that other phonetic components, relevant to and noticeable by native speakers, may be modulated by environment.

Again, an initial mixed effects model was conducted with z-scored accent ratings as the dependent variable and linguistic environment as the fixed effect. Random effects included participant and item (i.e., utterance), with random slopes and intercepts. More complex random effects structures, particularly including rater as a random effect, did not permit model convergence. Again, each rating from the 9-point Likert scale was converted into a z-score on a by-rater basis. A visual analysis of the Q-Q plot confirmed that the residuals of the model were normally distributed. Results from this initial model closely parallel the results from the VOT analysis. Specifically, there was no significant impact of linguistic environment on accent ratings, with accent ratings for utterances produced in the English-dominant linguistic environment (*M* = −0.027, *SD* = 1.004) similar to those produced in the Spanish-dominant linguistic environment (*M* = 0.038, *SD* = 1.001). Full fixed effects results are seen in Table 4 (for random effects and model equation, see Appendix C).<sup>5</sup> Figure 5 illustrates the global accent ratings by linguistic environment. Again, a higher accent rating corresponds to a more native-like production.

<sup>5</sup> Parallel to the by-group analysis for VOT, a mixed effects model was conducted on GAR with linguistic environment and group as fixed effects. Model comparison showed that the inclusion of group did not significantly improve model fit (χ2(2) = 5.143, *p* = 0.076). As with VOT, results demonstrated that linguistic environment did not significantly impact the GAR for either the US-Spain group (*b* = 0.13, *t* = 1.169) or the Spain-US group (*b* = −0.03, *t* = −0.572).

**Table 4.** Global Accent Rating Model Fixed Effects.

**Figure 5.** Global accent rating (z-scored) by linguistic environment.

To confirm that the lack of effect of linguistic environment on GAR was not due to an underpowered study, a power analysis was conducted using a simulation-based approach with the simr package (Green and MacLeod 2016). Results, based on 500 simulations with a medium effect size (*d* = 0.5), showed that the experiment exceeded the 80% threshold (power for predictor linguistic environment = 99.2%, CI = [97.9, 99.8]).

Considering the role of proficiency, a second model was conducted with linguistic environment and proficiency, as well as their interaction, as fixed effects. Participant and utterance were included as random effects, with random intercepts and slopes by linguistic environment, which was the maximal random effects structure that permitted model convergence. Examination of a Q-Q plot confirmed that the residuals of the model were normally distributed. Results from the model with proficiency (for fixed effects, see Table 5; for random effects and model equation, see Appendix D) showed no significant effect of either linguistic environment or proficiency, and no significant interaction between the two fixed effects. Notably, the effect of proficiency trended in the expected direction, with participants who had higher self-rated Spanish language skills being rated as having more native-like accents.

Finally, a model comparison was conducted to examine the contribution of each of these fixed effects. The most complex model, with linguistic environment and proficiency as fixed effects, was compared to submodels created by dropping one of the fixed effects but maintaining the same random effects structure. Results from the model comparison showed that there was no significant difference between the complex model (log likelihood = −2407.7) and either the submodel without linguistic environment (log likelihood = −2408.0, χ2(2) = 0.7496, *p* = 0.687) or the submodel without proficiency (log likelihood = −2409.3, χ2(2) = 3.315, *p* = 0.191).

As with the analysis for VOT, the analysis from the GAR data suggests that linguistic environment did not impact native speaker ratings of learner productions. Moreover, while proficiency trended in the expected direction, it did not significantly interact with linguistic environment.

#### **5. Discussion**

The findings of this study add to the discussion on short-term sources of cross-linguistic influence and interaction. With respect to the first research question, and contrary to the initial hypothesis, the results showed that there was no significant effect of linguistic environment on cross-linguistic phonetic influence. The language of the broader community where an interaction took place had no relation to a bilingual's phonetic production. This result was found at both the fine-grained phonetic level, through an analysis of the VOT associated with Spanish voiceless stop consonants, and the more global level, as shown by the native speaker global accent ratings. Neither VOT nor perceived accent differed based on the linguistic environment of the session. Power analyses suggested that this lack of significant results is unlikely to be attributable to an underpowered study.

With respect to the second research question, namely the potential role of proficiency in modulating cross-linguistic phonetic influence, the results showed no significant role of proficiency for this particular group of speakers, and importantly, no interaction with the variable of linguistic environment. For all participants, regardless of proficiency, linguistic environment did not play a significant role in determining the degree of cross-linguistic influence or transfer. Again, this finding is contrary to the initial hypotheses.

The first hypothesis, specifically that linguistic environment would impact phonetic production, was driven by a robust body of research that has shown that bilingual phonetic production, and the degree of cross-linguistic phonetic interaction, is impacted by both long-term and short-term factors. Directly related to this hypothesis, we have seen that long-term immersion, either through immigration (Piske et al. 2001), travel (Sancier and Fowler 1997), or study abroad (Casillas 2020; Lord 2010; Solon and Long 2018), impacts phonetic production. Broadly, this research has shown that, over time, both a speaker's L2 and L1 (Bergmann et al. 2016; Major 1992; Stoehr et al. 2017) shift in the direction of the language of the broader community. Considering short-term sources of cross-linguistic phonetic influence, previous work has highlighted several short-term sources, including the language of a given interaction (e.g., Amengual 2018; Olson 2013; Simonet 2014), the use of code-switching (Antoniou et al. 2011; Balukas and Koops 2015; Bullock et al. 2006; Olson 2016), the presence of salient region-specific extra-linguistic cues in the interactional environment (Hay and Drager 2010), and visible (e.g., Babel and Russell 2015; Koops et al. 2008) and non-visible (Niedzielski 1999) social information about an interlocutor. Considering this line of research, it was anticipated that linguistic environment would impact production in the short-term, with phonetic targets shifting in the direction of the broader linguistic environment. Namely, it was anticipated that tokens produced in Madrid, Spain, would become more Spanish-like and tokens produced in Indiana, USA, would become more English-like. Given the lack of support for this hypothesis, it is worth considering several possible explanations.

One possible explanation for the lack of a short-term impact of linguistic environment is that the immediate local context of an interaction may be more relevant than the broader environment in which an interaction takes place. In much of the previous research on short-term sources of cross-linguistic phonetic influence, the source of the influence is present either in the interaction itself, whether real (i.e., the language(s) required by the paradigm (e.g., Simonet 2014)) or imagined (i.e., visible or non-visible sociolinguistic cues (e.g., Babel and Russell 2015)), or in the physical environment that immediately surrounds participants (i.e., region-specific cues in the experimental setting (Hay and Drager 2010)). In each of these cases, the source of the short-term phonetic influence is in the speaker's immediate context. As such, the findings of the current study suggest a possible primacy of this immediate context relative to the broader linguistic environment. In short, the immediate context of the interaction may be more relevant for phonetic production and perception than the broader linguistic environment, and short-term sources of cross-linguistic phonetic influence may be local, rather than global. In the current study, the immediate context was kept as similar as possible across the two experimental sessions. The same experimenter greeted participants, and the language of both the interaction with the experimenter (i.e., bilingual/code-switched interaction) and the written instructions was the same in both sessions. If there exists a primacy of the immediate context for short-term sources of cross-linguistic phonetic influence, these local characteristics of the interaction may have been more relevant than the linguistic environment of the broader community.

A second possible explanation for the lack of an impact of linguistic environment on phonetic production in the current study is related to the study population. While previous research has shown that L2-learners' phonetic productions are impacted by both long-term and short-term sources of cross-linguistic phonetic influence, the participants in the current study have relatively low proficiency in the L2. Directly related to their phonetic systems, the mean VOT values produced by these participants (for all productions *M* = 43.9 ms, *SD* = 19.1 ms) remain well outside the norms for native speakers (e.g., Lisker and Abramson 1964) and both early and late bilinguals (Amengual 2012). As such, it is possible that these participants are not sufficiently proficient in the L2 to respond to short-term sources of cross-linguistic influence at the phonetic level. The possible role of proficiency as a mitigating factor is echoed in previous work on L2 development during long-term engagement with a given language (e.g., study abroad), in which lower-proficiency speakers evidence less change in the L2 than higher-proficiency speakers during the immersion experience (for discussion of a threshold hypothesis in study abroad, see Lafford and Collentine (2006)). As such, a speaker's proficiency level may serve to modulate the impacts of short-term sources and effectively limit the role of linguistic environment in the current population. Moreover, the population for this study was fairly homogenous at the phonetic level, as illustrated by the minimal differences in VOT between the highest and lowest proficiency groups (see Figure 4; mean difference = 6.1 ms). This degree of homogeneity may further explain the lack of an impact of proficiency and the failure to support the second hypothesis.

Finally, it is worth returning to the language mode framework to provide an account for the role of both long-term and short-term sources of cross-linguistic influence. Grosjean (2001) provides a variety of factors that influence language mode: the interlocutors, the situation and physical location, the function of the language act, the type of stimuli and task, etc. It is worth noting that all of these factors can be considered as short-term sources of cross-linguistic influence and are subject to change within and between different interactions. Grosjean (2001) notes that language mode "concerns the level of activation of two languages" (p. 42), and the short-term variables may be the primary drivers of shifts in language mode. In contrast, the long-term sources of linguistic interaction, including acquisition, changes in proficiency, and immersion, are not listed as factors that impact language mode. As language mode has been described as a continuum, from monolingual operation in Language A to monolingual operation in Language B, short-term factors may serve to adjust a participant's position along their existing continuum. Long-term factors, such as proficiency, may serve to manipulate the nature of the endpoints of this continuum. Additional support for this interpretation comes from work in bilingual lexical access, notably from picture-naming tasks. This line of work has shown that short-term factors, such as the ratio of one language to another, impact lexical access in both production (e.g., Gollan and Ferreira 2009; Olson 2015) and perception (e.g., Olson 2017), but that these effects are modulated by more long-term oriented factors like proficiency and language dominance (e.g., Olson 2015; Schwieter and Sunderman 2008). In terms of activation, long-term factors may ultimately manipulate a given language's baseline activation level or range of possible activation, while short-term factors manipulate the comparative level of activation of the two languages within their possible ranges.

#### *Future Directions*

Future research should continue to systematically examine the differential impacts of both long- and short-term sources of cross-linguistic phonetic influence. Building directly on the current study, and particularly in light of the failure of the results to support the initial hypotheses, future research may seek to expand upon the current study with a more heterogeneous population. Of particular interest may be to examine participants across a wide range of proficiencies, from learners to highly proficient early bilinguals, as they move through different linguistic environments. Moreover, developing a better understanding of other individual factors, such as participants' engagement with the local context (i.e., whether they maintain significant use of the L1 or immediately begin to engage in the L2), may serve to further our understanding of variability in cross-linguistic phonetic interaction. Second, it is acknowledged that the data set for the acoustic analysis is limited, both in terms of the number of tokens and the variety of features examined. Future research may seek to replicate these findings with a larger data set and a variety of phonetic features. For example, a more robust analysis across different places of articulation, precluded here by the size of the data set, may also be of interest. Notably, there appears to be some slight advantage for /t/ relative to the other places of articulation (see Figure 3), which would suggest that different phonemes may undergo different levels of cross-linguistic influence (for a discussion of potential differences in the perceptual prominence of voiceless stops by place of articulation, see Ruch and Peters (2016, p. 28)). Furthermore, research may seek to disentangle the possible effects of the immediate interactional context and the broader environment of that context, exploring the possible notion of a primacy of immediate or local factors as short-term sources of cross-linguistic influence.

#### **6. Conclusions**

This study examined the potential for linguistic environment, conceptualized as the language norms of the broader community in which an interaction or experiment takes place, to serve as a short-term source of cross-linguistic influence. To assess the role of linguistic environment, bilinguals (i.e., English-speaking learners of Spanish) produced Spanish utterances in two sessions: an English-dominant linguistic environment (Indiana, USA) and a Spanish-dominant linguistic environment (Madrid, Spain). Productions were analyzed at the fine-grained acoustic level, through an acoustic analysis of voice onset time, as well as more holistically through native speaker global accent ratings. Results showed that linguistic environment did not significantly impact either measure of phonetic production. Moreover, there was no interaction of proficiency with linguistic environment, suggesting that the linguistic environment was not a relevant factor, regardless of participant proficiency in their second language.

The current findings, notably the lack of an impact of the broader linguistic environment in determining cross-linguistic phonetic influence, may suggest a primacy of the local factors (i.e., characteristics of the interaction and the immediately surrounding area) over broader, global factors (i.e., the linguistic environment of the broader community surrounding the interaction) as sources of short-term cross-linguistic interaction. Further research is needed to confirm these results and continue to explore the role of both long-term and short-term sources of cross-linguistic phonetic influence.

**Funding:** This research was funded in part by the Purdue Research Foundation.

**Acknowledgments:** I am grateful for the technical work of Serae Neidigh on this project. All errors are my own.

**Conflicts of Interest:** The author declares no conflict of interest. The funding sponsors had no role in the design of the study; in the data collection, analyses, or interpretation of the data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**


**Table A1.** Voice Onset Time Model Random Effects.

Equation:

$$\mathit{VOT}_{ijk} = \beta_0 + \beta_1 \ast \mathrm{I}(\text{Environment}_{ijk} = B) + \text{Participant}_j + \text{Item}_k + \epsilon_{ijk}$$

#### **Appendix B**

**Table A2.** Voice Onset Time Model with Proficiency Random Effects.


Equation:

$$\begin{array}{lcl} \mathit{VOT}_{ijk} &=& \beta_0 + \beta_1 \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} + \beta_2 \ast \text{Proficiency} + \beta_3 \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} \ast \text{Proficiency} \\ && +\, \gamma_j \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} + \eta_k \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} + \text{Participant}_j + \text{Item}_k + \epsilon_{ijk} \end{array}$$

#### **Appendix C**

**Table A3.** Global Accent Rating Model Random Effects.


Equation:

$$\mathit{GAR}_{ijk} = \beta_0 + \beta_1 \ast \mathrm{I}(\text{Environment}_{ijk} = B) + \text{Participant}_j + \text{Item}_k + \epsilon_{ijk}$$

#### **Appendix D**

**Table A4.** Global Accent Rating Model with Proficiency Random Effects.


Equation:

$$\begin{array}{lcl} \mathit{GAR}_{ijk} &=& \beta_0 + \beta_1 \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} + \beta_2 \ast \text{Proficiency} + \beta_3 \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} \ast \text{Proficiency} \\ && +\, \gamma_j \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} + \eta_k \ast \mathrm{I}\{\text{Environment}_{ijk} = B\} + \text{Participant}_j + \text{Item}_k + \epsilon_{ijk} \end{array}$$

#### **References**

Amengual, Mark. 2012. Interlingual influence in bilingual speech: Cognate status effect in a continuum of bilingualism. *Bilingualism: Language and Cognition* 15: 517–30.


Jarvis, Scott, and Aneta Pavlenko. 2008. *Crosslinguistic Influence in Language and Cognition*. New York: Routledge.


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Does Teaching Your Native Language Abroad Increase L1 Attrition of Speech? The Case of Spaniards in the United Kingdom**

#### **Robert Mayr <sup>1,</sup>\*, David Sánchez <sup>1</sup> and Ineke Mennen <sup>2</sup>**


Received: 30 August 2020; Accepted: 16 October 2020; Published: 22 October 2020

**Abstract:** The present study examines the perceived L1 accent of two groups of native Spaniards in the United Kingdom, Spanish teachers, and non-teachers, alongside monolingual controls in Spain. While the bilingual groups were carefully matched on a range of background variables, the teachers used Spanish significantly more at work where they constantly need to co-activate it alongside English. This allowed us to test the relative effect of reduced L1 use and dual language activation in first language attrition directly. To obtain global accentedness ratings, monolingual native Spanish listeners living in Spain participated in an online perception experiment in which they rated short speech samples extracted from a picture-based narrative produced by each speaker in terms of their perceived nativeness, and indicated which features they associated with non-nativeness. The results revealed significantly greater foreign-accent ratings for teachers than non-teachers and monolinguals, but no difference between the latter two. Non-native speech was associated with a range of segmental and suprasegmental features. These results suggest that language teachers who teach their L1 in an L2-speaking environment may be particularly prone to L1 attrition since they need to co-activate both their languages in professional settings and are regularly exposed to non-native speech from L2 learners.

**Keywords:** L1 attrition; speech; foreign accent; accent perception; Spanish; English; bilingual; teacher

#### **1. Introduction**

A growing body of research on speech development in early and late bilinguals has documented changes occurring in a speaker's first language (L1) pronunciation that affect areas, such as vowels, consonants, and intonation patterns (e.g., (de Leeuw 2019; Fowler et al. 2008; Mayr et al. 2019; Mennen 2004; Nodari et al. 2019)). Such changes can take place rapidly, affect novice second language (L2) learners (Chang 2012, 2013; Kartushina et al. 2016a), and may be fully (Kartushina and Martin 2019) or partially (Chang 2019) reversed. Alternatively, they may occur over time in proficient L2 learners who are long-term residents in an L2-speaking environment (e.g., (de Leeuw et al. 2018a; Dmitrieva 2019; Mayr et al. 2012; Ulbrich and Ordin 2014)). Only the latter scenario is usually referred to as "L1 attrition", that is, the non-pathological and non-age related decrease in an individual's proficiency in a previously learnt language ((Köpke and Schmid 2004; Schmid 2010), but see (Schmid and Köpke 2017) for a broader definition).

While L1 attrition of speech has been widely documented, not all individuals who are long-term residents in an L2-speaking environment exhibit observable changes to their native accent (de Leeuw et al. 2018a; Major 1992; Mennen 2004). The specific factors that facilitate or hinder attrition of speech are, however, still poorly understood, and few studies have systematically investigated relevant predictor variables (but see (Hopp and Schmid 2013)). One of the ongoing debates in this context is whether attrition is predominantly caused by reduced L1 use or by cross-linguistic interactions arising from contexts of dual language activation (de Leeuw et al. 2010; Schmid 2007; Stoehr et al. 2017). The present study aims to contribute to this issue by examining the perceived L1 accent of two groups of native Spanish speakers in the United Kingdom: (1) Spanish language teachers, who use their L1 regularly in professional settings and frequently need to switch between Spanish and English, and (2) non-teachers, who virtually never use the L1 in the workplace, with the two groups exhibiting similar use of the L1 in social situations. This design allowed us to test the role of low L1 use and regular dual language activation in L1 attrition of speech directly.

#### *1.1. Plasticity in Native Speech, Phonetic Drift and L1 Attrition*

Until recently, research into late bilingualism was primarily concerned with L2 acquisition, where the prevailing notions have been that a critical period (Lenneberg 1967) and processes of fossilization (Selinker 1972) constrain ultimate attainment in the L2. Whether this putative end state is maturationally constrained or conditioned by increasing entrenchment is still subject to ongoing debate (e.g., (Bylund et al. 2013; Piske et al. 2001)); nonetheless, traditional perspectives on bilingualism have largely ignored the L1, assuming it to be stable and unlikely to undergo significant development (e.g., (Gregg 2010)).

Such suggestions, however, are not supported by empirical findings which show that bilinguals' L1 speech patterns typically differ from those of monolinguals (see (Kartushina et al. 2016b) for an overview, and below). Moreover, they are at odds with a holistic view of bilingualism, which argues that the L1 and L2 do not exist in isolation but constantly interact with each other (Grosjean 1989). In line with this account, the Speech Learning Model (SLM) (Flege 1995; Flege and Bohn 2020) posits that the L1 and L2 share a common phonological space and influence each other, which may lead to cross-linguistic assimilation and dissimilation patterns, both of which will differ from those of monolinguals. Moreover, the SLM claims that the more experienced an L2 learner is, the greater the effect of the L2 will be on the L1 (Flege 1995).

A static view of the L1 has also been challenged by the advocates of Dynamic Systems Theory (e.g., (de Bot et al. 2007)). According to this account, language constitutes a system with multiple components that are continually in a state of flux. These components are interconnected and sensitive to feedback, both from internal stimuli (i.e., other components within the system), and social and environmental factors. Thus, native speech patterns are dynamic and subject to change throughout the lifespan. Indeed, there is widespread evidence from longitudinal studies that show that even monolinguals modify their L1 accent in response to changes in the norms of their speech community (Harrington 2006; Harrington et al. 2000; Sankoff and Blondeau 2007). Amongst these, a particularly well-known example is the work of Harrington and his associates which showed systematic changes over several decades in the Queen's vowel realizations during her annual Christmas address.

Changes in L1 accent have also been widely documented in longitudinal work on bilinguals. For example, Sancier and Fowler (1997) present the case study of a Brazilian Portuguese-English bilingual who regularly travelled between Brazil and the United States (see also (Tobin et al. 2017) for a recent extension to Spanish-English bilinguals). They found that her voice onset time (VOT) values in both languages were longer after several months in the United States and shorter after months in Brazil, a change to which native Portuguese listeners were receptive. The authors ascribe the observed variation to what they call "gestural drift" (more recently "phonetic drift" (Chang 2012, 2013)), suggesting that L1 phones begin to adopt characteristics of the ambient language as a result of their similarity to L2 phones, the speakers' propensity to unintentionally imitate what they hear, and the effect of recency on memory. Since phonetic drifts of this kind do not coincide with a decline in L1 proficiency, they are not considered instances of attrition (see also (Chang 2012, 2013; Kartushina et al. 2016a)).

In contrast, an extensive body of literature has documented pervasive changes in L1 accent in bilinguals who are long-term residents in an L2-speaking environment. At the phonetic level, such instances of L1 attrition have been shown to affect the production of VOT in plosives (Flege 1987; Major 1992; Mayr et al. 2012; Stoehr et al. 2017), formant frequencies in vowels (Bergmann et al. 2016; Guion 2003; Mayr et al. 2012), laterals (de Leeuw et al. 2013; de Leeuw 2019) and rhotics (de Leeuw et al. 2018b; Ulbrich and Ordin 2014), and the realization of tonal alignment (de Leeuw et al. 2012; Mennen 2004). Attrition has also been shown to affect L1 perception (Ahn et al. 2017; Dmitrieva 2019) and may result in the neutralization of native phonological contrasts (Cho and Lee 2016; de Leeuw et al. 2018a).

Moreover, there is ample evidence that listeners are receptive to changes in L1 accent and may perceive speakers as foreign accented in their native language (Bergmann et al. 2016; de Leeuw et al. 2010; Hopp and Schmid 2013). For example, de Leeuw et al. (2010) examined the global foreign accent in the L1 of native German speakers who were long-term residents in Anglophone Canada or the Netherlands. The results revealed that they were perceived to be significantly less native-like than native control speakers in Germany, irrespective of geographical setting. Similarly, the native German speakers in Anglophone North America in Bergmann et al. (2016) were perceived as significantly less native-like in their L1 than control speakers in Germany, with 40% of attriters rated below the monolingual range.

Together, the extant literature hence suggests that L1 attrition of speech is widespread and may be observed both in the productions of bilinguals and in their global foreign accent ratings. Nevertheless, attrition is not inevitable since not all individuals who are long-term residents in an L2-speaking environment end up with changes to their L1 accent. For example, de Leeuw et al. (2018a) showed that while one of their Albanian-English bilinguals completely neutralized the L1 phonemic contrast between light and dark laterals, and two additional ones did so only in coda position, others produced their laterals entirely like Albanian monolinguals. Similarly, in Mennen's (2004) study of tonal alignment, four out of five of her Dutch learners of Greek exhibited changes in their L1 alignment patterns, but one speaker did not, producing tonal alignment entirely natively in both languages (see also (de Leeuw et al. 2013; Major 1992)). Finally, instances of individual variation were found in studies of accent perception (Bergmann et al. 2016; de Leeuw et al. 2010). Thus, while 14 bilinguals in de Leeuw et al. (2010) received a clear non-native rating, 20 were consistently perceived as native.

#### *1.2. L1 Use and Dual Language Activation in L1 Attrition*

One of the variables that may account for such individual variation in L1 attrition of speech is language use. For example, Flege et al. (1997) showed that Italians in the United States had stronger foreign accents in L2 English if they used Italian a lot than if they used it rarely. Similarly, Lloyd-Smith et al. (2020) found a strong effect of Italian use scores on the perceived nativeness in Italian heritage speakers in Germany, while the age at which the heritage language was introduced was inconsequential. Stangen et al. (2015), in turn, found high non-native accents in the majority language German for Turkish heritage language speakers in Germany with high use of Turkish (see also (Kupisch et al. 2014)).

Similar effects of language use have also been documented in attrition contexts. Thus, Stoehr et al. (2017) examined VOT production in two groups of late Dutch-German bilinguals living in the Netherlands, L1 German speakers and L1 Dutch speakers. Native German speakers were exposed to their L1 only at home, whilst speaking Dutch in other environments, whereas the native Dutch speakers had more contact with their L1 given its status as the majority language, only coming into contact with L2 German at home. The study found that L2-immersed bilinguals produced nativelike L2 plosives, yet also exhibited L2-like characteristics in their L1 productions. Conversely, bilinguals living in the L1 environment did not produce nativelike L2 plosives but maintained nativelike L1 VOTs. Together, the results suggest that being immersed in an L2-speaking environment can be advantageous for L2 speech learning, but reduced L1 use may increase the likelihood of L1 attrition.

The idea that low L1 use should lead to attrition is based on the premise, consistent with exemplar theoretic and usage-based approaches, that language use reinforces memory representations, and that its absence may lead to retrieval difficulties (Bybee 2001). Nevertheless, the role of L1 use in attrition is not straightforward. First, a number of studies have shown that changes to L1 accents can occur despite continued high L1 use (Chang 2012; Mayr et al. 2012; Mennen 2004). For instance, Mayr et al. (2012), who investigated L1 attrition of speech in Dutch-English twin sisters, documented changes in L1 accent in the L2-immersed twin despite regular high use of her native Dutch. Mennen (2004), in turn, showed in her study of Dutch-Greek bilinguals in the Netherlands that L1 phonetic changes can even occur in an L1-speaking environment provided the frequency of L2 use is high. Second, L1 use and exposure must be seen as distinct from L2 immersion, in that residence in an L2-speaking environment can co-occur with wide and varied patterns of L1 communication. As such, simple measures of frequency and quantity of L1 contact may not be sufficient, since "[ ... ] among bilinguals, L1 use does not necessarily equal L1 use" (Schmid 2007, p. 137). That is to say, L1 use encompasses a diverse range of situations that do not fit comfortably within a single definition, and therefore cannot be considered a single predictor of attrition.

One of these concerns situations that require co-activation of the L1 and L2. Thus, in de Leeuw et al.'s (2010) study, native German speakers in Anglophone Canada and the Netherlands were more likely to be perceived as foreign-accented in the L1 if they used German in contexts in which code-switching was likely to occur. Bilinguals who reported a high amount of L1 contact in situations with minimal expected code-switching, on the other hand, were less likely to be perceived as non-native, suggesting that L1 contact of this type may promote stability of pronunciation. Note, however, that in this study, participants were not directly asked whether they code-switched in specific settings. Rather, the authors postulated ex post facto that code-switching was more likely to occur in certain settings. These included L1 use with family members and friends in Canada and the Netherlands and use in church settings; in contrast, code-switching was deemed less likely to occur in work settings, during visits to Germany, and during telephone conversations and written correspondence with native German speakers.

These findings are consistent with a large body of evidence that has shown cross-linguistic interactions to occur in contexts of dual language activation, such as code-switching, where inhibition of the non-target language is particularly difficult (Green 1998). The state of activation of a bilingual's two languages at a given point in time is referred to as language mode (Grosjean 2001) and can range from bilingual mode, where both languages are fully activated, to monolingual mode, where the non-target language is inhibited as much as possible, although never entirely, based on sociolinguistic factors. Studies of phonetic code-switching have shown unidirectional interactions, in which the speech patterns of only one language are affected by those of the other one (e.g., (Muldner et al. 2019; Olson 2013)) as well as bidirectional interactions, in which both languages mutually affect each other's speech patterns (e.g., (Bullock and Toribio 2009; Piccinini and Arvaniti 2015)), with few studies revealing no effect of switching (but see (Grosjean and Miller 1994)).

#### *1.3. The Present Study*

The present study sought to build on previous work that has examined the role of L1 use and dual language activation in L1 attrition by investigating the perceived L1 accent of two groups of native Spanish speakers in the United Kingdom, (1) Spanish language teachers, and (2) non-teachers, alongside monolingual controls in Spain. As such, it is the first to examine L1 attrition of speech across specific professional groups. To the best of our knowledge, only one other study on L1 speech production has included individuals who teach their native language in an L2-speaking environment, that is, Chang (2019). However, unlike the present study, the speech of the L1 English speakers in that study, who taught their native language to L2 learners in Korea, was not compared to that of a group of non-teachers. Moreover, the focus of that study was the effect of bilinguals' L2 use on L1 pronunciation patterns.

The case of teachers is particularly pertinent, given the high proportion of foreign citizens who work teaching their native languages: Of an estimated 116,000 Spaniards in the United Kingdom between 2013 and 2015, nearly 10% were working in education (Office for National Statistics 2017). While other migrants may also have frequent L1 contact, the experience of language teachers is quite distinct, given their high levels of L1 exposure and use under specific circumstances. Thus, language teaching is one of the few professions in which language is not merely a medium of communication, but also its object. As such, individuals who teach their native language to L2 learners may have what Chang (2019, p. 108) refers to as an "instructional orientation" towards the L1, which would typically encompass "high metalinguistic awareness and explicit knowledge of rules, norms, and standards" (ibid.). Moreover, the need for them to provide a clear, carefully articulated model for their students' pronunciation patterns means that they may be particularly concerned about retaining a native-like accent. Finally, teaching one's native language necessitates sustained high use of the L1. Together, these factors suggest that the L1 accent of individuals who teach their native language may be especially protected from attrition.

On the other hand, teaching one's L1 in an L2-speaking environment requires regular use of the L2, not only in social contexts but also professionally. Thus, even if foreign language teachers aim to maximize the use of the target language in the classroom, regular recourse to the ambient language, and the use of both languages in alternation, is virtually inevitable (Littlewood and Yu 2011; Turnbull and Dailey-O'Cain 2009). Moreover, recent pedagogical approaches, notably "translanguaging" (Cenoz and Gorter 2019), have moved away from strict adherence to monolingualism and actively embrace the use of more than one language in language classrooms, in line with Cook's (2008) notion of "multicompetence" (see also (Illman and Pietilä 2018)). Individuals teaching their native language, therefore, need to keep both their L1 and L2 fully activated for extended periods, and hence operate in a sustained bilingual language mode in the classroom (Grosjean 2001). As discussed previously, this has been shown to enhance the likelihood of cross-linguistic interactions, and as a result changes to individuals' L1 accent (cf. (de Leeuw et al. 2010)).

In addition, language teachers are regularly exposed to L1-influenced pronunciations in their students' L2 productions. However, the effect that foreign-accented input has on their native speech patterns is unclear. On the one hand, experimental studies examining phonetic convergence in native–non-native dyads have failed to document instances of native speaker accommodation towards the accents of non-native speakers (Kim 2009; Kim et al. 2011), suggesting that language teachers may be impervious to the influence of their students' accented speech patterns. On the other hand, in these studies, accommodation is based on singular events during which rapid phonetic adjustments are assessed in conversations with unfamiliar individuals, and hence they do not allow conclusions to be drawn about the effects of repeated exposure to, and interaction with, familiar foreign-accented speakers in professional educational settings. It is certainly plausible that sustained accented input of this kind may affect the representations of teachers' L1 speech sounds, in line with Chang's (2019) Incidental Input Hypothesis, which argues that ambient input is incidentally processed and cannot be ignored. Moreover, evidence from both adults who were raised in bilingual homes (Bosch and Ramon-Casas 2011) and bilingual children in immersion school settings (Caldas 2006; Mayr and Montanari 2015) supports the idea that foreign-accented input may affect L1 pronunciation patterns. Thus, Bosch and Ramon-Casas (2011) showed that Catalan-Spanish bilinguals who were raised with both languages and received inconsistent phonetic input produced Catalan /e-ε/ less accurately as adults than bilinguals raised in Catalan-only homes. Caldas (2006), in turn, reported that his daughters' L1 French was English-accented, which he attributed to their exposure to non-native speech at their dual language school in Louisiana. In contrast, his son, who was solely educated through the medium of English, but like his sisters received native French input in the home, had a native-like accent in French. Similarly, Mayr and Montanari (2015) found that the two Italian-English-Spanish trilingual children in their study had native-like VOT patterns in Spanish, but English-accented ones in Italian, even though both languages contain a prevoiced–short lag VOT contrast. The authors attributed this finding to the fact that the children were regularly exposed to English-accented Italian from their classmates in their Italian-English dual language school in Los Angeles, while they only learnt Spanish from their monolingual Mexican nanny.

Based on these considerations, the present study sought to answer three inter-related research questions. First, it aimed to find out whether Spanish speakers who teach their native language in an L2-speaking environment are perceived as more or less native-like in their L1 than non-teachers in the same L2 environment who rarely use it. Second, it sought to determine to what extent perceptions of non-nativeness are characterized by individual variation. Finally, it attempted to identify the specific accentual features that are associated with non-native speech in native Spanish teachers and non-teachers who are long-term residents in an L2-speaking environment.

#### **2. Method**

An accent rating experiment was carried out in which monolingual Spanish listeners, resident in Spain, were exposed to short extracts of Spanish speech from a picture-based narrative produced by two groups of native Spanish speakers in the United Kingdom, language teachers, and non-teachers, alongside monolingual controls in Spain. Listeners were asked to state whether they detected a non-native accent in the speech samples and to provide an indication of their level of confidence in their judgement. Moreover, if they considered a sample to sound non-native, they were prompted to identify the accentual features that had led them to this conclusion.

#### *2.1. Participants*

Two groups of consecutive bilingual Spaniards living in the United Kingdom were recruited to participate in the study: (1) Spanish language teachers (BIL-T, *N* = 10, 9 females), and (2) non-teachers (BIL-NT, *N* = 9, 5 females). Those in the latter group practise a diverse range of professions, ranging from social work to accountancy and nursing, and none habitually use Spanish in their communication at work or at home. The participants in BIL-T, in turn, were either employed as Spanish teachers in schools (*N* = 5) or in university settings (*N* = 5). Further to being long-term residents—that is, having lived continuously in the UK for at least five years—an inclusion criterion for both of these groups was that migration took place after the age of 18. In this way, any differences identified in their speech can be attributed to attrition as opposed to incomplete L1 acquisition (Schmid 2014).

In addition to the two bilingual groups, a group of monolingual Spaniards residing in Spain participated in the study (MON, *N* = 8, 7 females). The speakers in this group had never lived anywhere other than Spain, had never spoken a language other than Spanish at home, as a medium of education or at work, and reported low levels of proficiency in English or any other language. As such, they meet Best and Tyler's (2007) definition of functional monolinguals as "not actively learning or using an L2" (p. 16).

Participants were recruited through ELE-UK (www.eleuk.org) and the Instituto Cervantes (www.cervantes.es), both of which are institutions dedicated to the teaching of the Spanish language, and through Spanish departments at English universities as well as via existing networks in the United Kingdom and in Spain. They came from a range of regions in Spain with no systematic differences across the groups: Andalusia (BIL-T: 2, BIL-NT: 1, MON: 1), Asturias (BIL-NT: 1, MON: 1), Castile-La Mancha (BIL-NT: 1, MON: 1), Catalonia (BIL-T: 2, BIL-NT: 1, MON: 1), Galicia (BIL-T: 3), Murcia (BIL-T: 1), Madrid (BIL-T: 1, BIL-NT: 1, MON: 1), Basque Country (BIL-T: 1), Valencia (BIL-NT: 4, MON: 4).

All subjects gave their informed consent for inclusion before they participated in the study. The research reported in this manuscript was reviewed and approved by the Cardiff School of Health Sciences Research Ethics Committee, Cardiff Metropolitan University, United Kingdom (ethics reference number: UG-265).

Initial contact was established by email and, in order to ensure groups were matched for key variables, demographic and linguistic background information was collected by means of an online questionnaire created using Qualtrics XM (Qualtrics 2019). A summary of participant characteristics is included in Table 1. Comparisons on all variables in the table were made across the two bilingual groups, while comparisons across all three groups were only made on the first three variables in the table, that is, education, English proficiency, and chronological age, as well as on gender distributions.


**Table 1.** Participant characteristics.

Notes: AOA = age of arrival in the UK; LOR = length of residence.

#### 2.1.1. Comparisons across the Two Bilingual Groups

The two bilingual groups were carefully matched on a range of background variables<sup>1</sup>. Thus, they did not differ from each other in gender distribution (Chi-squared test: χ<sup>2</sup>(1) = 2.898, *p* = 0.089), chronological age (BIL-T (mean: 41.60, SE: 3.11); BIL-NT (mean: 33.56, SE: 2.16); Independent *t*-test: *t*(17) = 2.08, *p* = 0.053), age of arrival in the UK (BIL-T (mean: 28.20, SE: 2.78); BIL-NT (mean: 24.44, SE: 0.93); Independent *t*-test: *t*(17) = 1.223, *p* = 0.238) or length of residence (BIL-T (mean: 13.10, SE: 2.18); BIL-NT (mean: 8.89, SE: 1.84); Independent *t*-test: *t*(17) = 1.458, *p* = 0.163). Moreover, they were matched in terms of their highest level of education (BIL-T (median: 6.00, min-max: 5.00–7.00); BIL-NT (median: 5.00, min-max: 4.00–7.00); Mann–Whitney test: *U* = 27.500, *p* = 0.126), measured on a seven-point Likert scale ranging from 1 (less than secondary school education) to 7 (doctorate), as well as their self-reported competence in English (BIL-T (median: 4.00, min-max: 3.00–5.00); BIL-NT (median: 3.00, min-max: 3.00–5.00); Mann–Whitney test: *U* = 35.000, *p* = 0.374), measured on a six-point Likert-type scale ranging from 1 (less than basic knowledge of English) to 6 (native or near-native proficiency) in line with the classifications of the Common European Framework of Reference for Languages (Council of Europe 2001).
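As an illustration of how the independent *t*-tests above relate to the reported descriptives, the following minimal sketch recomputes the test statistic for chronological age from the group means, standard errors, and group sizes alone. It assumes that each reported SE is the standard error of the mean (SD divided by the square root of *N*) and that equal variances are pooled; the function name is hypothetical and this is not the authors' analysis script. Small discrepancies for other variables may arise from rounding in the reported values.

```python
import math

def pooled_t_from_summary(mean1, se1, n1, mean2, se2, n2):
    """Student's t (equal variances assumed) from means, standard errors and group sizes."""
    # Assumption: SE = SD / sqrt(n), so each group's variance is SE^2 * n.
    var1 = se1 ** 2 * n1
    var2 = se2 ** 2 * n2
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df

# Chronological age: BIL-T (N = 10) vs. BIL-NT (N = 9), values as reported in the text.
t_age, df_age = pooled_t_from_summary(41.60, 3.11, 10, 33.56, 2.16, 9)
print(f"t({df_age}) = {t_age:.2f}")
```

Under these assumptions the sketch gives a value that rounds to the reported *t*(17) = 2.08.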

The bilingual groups were also matched on some of their language use patterns. Thus, they did not differ in their estimated use of Spanish and English in social situations outside their home and work in the UK (Spanish: BIL-T (mean: 33.00, SE: 5.16); BIL-NT (mean: 23.78, SE: 6.96); Independent *t*-test: *t*(17) = 1.079, *p* = 0.296; English: BIL-T (mean: 63.30, SE: 5.72); BIL-NT (mean: 76.22, SE: 6.96); Independent *t*-test: *t*(17) = 1.446, *p* = 0.166), the amount of time they spent in Spain per year (BIL-T (median: 1.00 (<1 month), min-max: 1.00 (<1 month) to 2.00 (1–3 months)); BIL-NT (median: 1.00 (<1 month), min-max: 1.00 (<1 month) to 2.00 (1–3 months)); Mann–Whitney test: *U* = 43.500, *p* = 0.879), the frequency of spoken contact with family and friends in Spain, for example, via telephone conversations (BIL-T (median: 1.00 (once or twice a week), min-max: 1.00 (once or twice a week) to 3.00 (less than once a month)); BIL-NT (median: 1.00 (once or twice a week), min-max: 1.00 (once or twice a week) to 3.00 (less than once a month)); Mann–Whitney test: *U* = 38.500, *p* = 0.492), or the frequency of written contact with family and friends in Spain, for example, email correspondence (BIL-T (median: 2.00 (once or twice a day), min-max: 1.00 (multiple times a day) to 4.00 (once or twice a month)); BIL-NT (median: 2.00 (once or twice a day), min-max: 1.00 (multiple times a day) to 3.00 (once or twice a week)); Mann–Whitney test: *U* = 27.500, *p* = 0.129).

<sup>1</sup> To compare groups on scalar variables, such as chronological age, we ran parametric tests (independent *t*-test; one-way ANOVAs); for comparisons on ordinal variables and Likert scales, we ran non-parametric tests (Mann–Whitney test; Kruskal–Wallis test); the relation between nominal variables, in turn, was explored using chi-squared tests. When running independent samples *t*-tests across the two bilingual groups on the use of English and Spanish at work as well as on the use of Spanish at home, the variances turned out not to be equal based on Levene's tests. In these cases, the *t*-values, *p*-values, and degrees of freedom were adjusted accordingly.

In contrast, and crucially, the two groups differed from each other in their language use patterns at work and, to a lesser extent, at home. Thus, the BIL-T group used English significantly *less* than the BIL-NT group both at work (BIL-T (mean: 54.20, SE: 7.48); BIL-NT (mean: 95.11, SE: 2.77); Independent *t*-test: *t*(11.390) = 5.130, *p* < 0.0005) and at home (BIL-T (mean: 41.50, SE: 12.26); BIL-NT (mean: 79.67, SE: 10.96); Independent *t*-test: *t*(17) = 2.30, *p* = 0.034), and used Spanish significantly more at work (BIL-T (mean: 40.60, SE: 6.66); BIL-NT (mean: 4.33, SE: 2.69); Independent *t*-test: *t*(11.862) = 5.047, *p* < 0.0005). On the other hand, the two groups did not differ significantly from each other in their use of Spanish at home (BIL-T (mean: 33.50, SE: 12.03); BIL-NT (mean: 9.22, SE: 4.72); Independent *t*-test: *t*(11.674) = 1.878, *p* = 0.086). Note that two of the BIL-T speakers and one of the BIL-NT speakers live by themselves and therefore indicated no use of any language in the home. Note also that the BIL-T speakers, but not the BIL-NT speakers, indicated occasionally using a language other than Spanish or English that was not specified further. This accounted for circa 5% of the use patterns at work and 7% at home.
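The fractional degrees of freedom reported above (e.g., *t*(11.390)) are characteristic of a *t*-test adjusted for unequal variances. The sketch below shows one common form of such an adjustment, the Welch test with Welch–Satterthwaite degrees of freedom, computed from the reported means and standard errors for English use at work. That the authors' software applied exactly this correction is an assumption, as is the reading of SE as the standard error of the mean; the function name is hypothetical.

```python
import math

def welch_t_from_summary(mean1, se1, n1, mean2, se2, n2):
    """Welch's t and Welch-Satterthwaite df from means, standard errors and group sizes."""
    # Assumption: SE^2 equals the group variance divided by n.
    v1, v2 = se1 ** 2, se2 ** 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# English use at work: BIL-NT (N = 9) vs. BIL-T (N = 10), values as reported in the text.
t_work, df_work = welch_t_from_summary(95.11, 2.77, 9, 54.20, 7.48, 10)
print(f"t({df_work:.2f}) = {t_work:.2f}")
```

With the rounded descriptives as input, the result lies close to the reported *t*(11.390) = 5.130; the remaining difference is consistent with rounding of the published means and standard errors.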

#### 2.1.2. Comparisons across the Monolingual Group and the Two Bilingual Groups

Finally, comparisons were made across all three groups, that is, BIL-T, BIL-NT, and MON. They differed significantly on self-rated competence in English (MON (median: 1.00, min-max: 1.00–2.00); BIL-T (median: 4.00, min-max: 3.00–5.00); BIL-NT (median: 3.00, min-max: 3.00–5.00); Kruskal–Wallis test: χ<sup>2</sup>(2) = 10.16, *p* = 0.006), with a Dunn's post-hoc test revealing significantly lower scores for the MON group than for BIL-T (*p* = 0.002) and BIL-NT (*p* = 0.016). Moreover, while the MON speakers did not differ from the other two groups in terms of gender distribution (Chi-squared test: χ<sup>2</sup>(2) = 3.873, *p* = 0.144), they differed in chronological age (MON (mean: 31.63, SE: 1.29); BIL-T (mean: 41.60, SE: 3.11); BIL-NT (mean: 33.56, SE: 2.16); One-way ANOVA: *F*(2,24) = 4.810, *p* = 0.018) and formal education level (MON (median: 4.50, min-max: 2.00–6.00); BIL-T (median: 6.00, min-max: 5.00–7.00); BIL-NT (median: 5.00, min-max: 4.00–7.00); Kruskal–Wallis test: χ<sup>2</sup>(2) = 6.74, *p* = 0.034), with the MON group significantly younger (*p* = 0.030) and less well educated (*p* = 0.029) than the BIL-T group, but not the BIL-NT group (*p* > 0.05).
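The chi-squared tests on gender distribution can be illustrated with the group sizes and female counts given in Section 2.1, from which the male counts follow by subtraction. The sketch below computes a Pearson chi-squared statistic without continuity correction; that this is the variant the authors used is an assumption, and the function and variable names are hypothetical.

```python
def pearson_chi_square(table):
    """Pearson chi-squared statistic (no continuity correction) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and gender.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Each row is one group as [females, males]; male counts are N minus the reported females.
bil_t, bil_nt, mon = [9, 1], [5, 4], [7, 1]
chi2_two, df_two = pearson_chi_square([bil_t, bil_nt])
chi2_three, df_three = pearson_chi_square([bil_t, bil_nt, mon])
print(f"two bilingual groups: chi2({df_two}) = {chi2_two:.3f}")
print(f"all three groups:     chi2({df_three}) = {chi2_three:.3f}")
```

Both statistics agree with the values reported in the text (2.898 with 1 degree of freedom and 3.873 with 2 degrees of freedom).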

#### *2.2. Speech Materials*

Participants audio-recorded themselves telling the story "I will help you" (Abbott et al. 2015) in Spanish. To do this, they were given access to an adapted version of the picture book online, which contained 17 pictures, but with all words removed. Participants could view the pictures as many times as they wished to ensure they understood the story before recording. Recordings were completed with a mobile phone or computer in a quiet environment, avoiding background noise, to promote optimum quality for subsequent use in the accent rating experiment. They were asked not to plan the exact wording beforehand and to imagine telling the story to a monolingual Spanish child. This approach was chosen to obtain quasi-spontaneous speech, whilst ensuring comparable samples in terms of lexical and grammatical content, and thus minimizing the likelihood of judgements resulting from differences in linguistic complexity (Schmid and Hopp 2014).

From each of the 27 narratives, a randomly selected speech sample of approximately 15 s was extracted in PRAAT (Boersma and Weenink 2019). This duration was considered sufficient for listeners to make a reliable judgement (de Leeuw et al. 2010; Flege 1984; Schmid and Hopp 2014). In order to minimize the likelihood that the listeners' judgements were based on areas other than pronunciation, samples were carefully screened to ensure they contained no lexical or grammatical errors and constituted grammatically complete utterances. Long pauses and hesitations were also avoided. A one-way ANOVA revealed no statistically significant difference in sample duration across groups (Mean BIL-T: 16.33 (SD: 1.14); Mean BIL-NT: 15.53 (SD: 0.763); Mean MON: 15.89 (SD: 0.524); *F*(2,24) = 1.979, *p* = 0.160), nor in terms of speaking rate, as measured in syllables per second (Mean BIL-T: 5.21 (SD: 0.649); Mean BIL-NT: 5.54 (SD: 0.852); Mean MON: 5.70 (SD: 0.502); *F*(2,24) = 1.201, *p* = 0.318). To further reduce variability across samples, peak intensity was normalized using PRAAT (Boersma and Weenink 2019).
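The two sample-level controls described above, speaking rate and peak normalization, can be sketched as follows. This is an illustration only and not the PRAAT procedure the authors ran: the target peak of 0.99, the toy waveform, the syllable count, and the function names are all assumptions made for the example.

```python
def speaking_rate(n_syllables, duration_s):
    """Speaking rate in syllables per second."""
    return n_syllables / duration_s

def normalize_peak(samples, target_peak=0.99):
    """Rescale a waveform so that its largest absolute amplitude equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent signal: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

# Hypothetical 16 s sample containing 80 syllables.
rate = speaking_rate(80, 16.0)

# Toy waveform with amplitudes in [-1, 1]; the loudest sample is -0.5.
normalized = normalize_peak([0.1, -0.5, 0.25])
print(rate, max(abs(s) for s in normalized))
```

Scaling every sample by the same gain preserves the shape of the waveform and only equalizes the maximum amplitude across recordings, which is why it reduces loudness differences between samples without altering segmental or prosodic detail.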

#### *2.3. Listeners*

The samples were presented as part of an online questionnaire created in Qualtrics XM software (Qualtrics 2019), which was distributed via an anonymous link to students at the Faculty of Education, University of A Coruña, as well as to existing networks across Spain. A total of 28 native Spanish listeners (20 females) with a mean age of 32 (SD: 11.25) completed the online accent rating experiment. Competence in English was controlled for, with none of the listeners reporting higher than intermediate proficiency (mean: 2.5, SD: 0.75), comparable to the MON speakers' scores (cf. Table 1). Like the MON speakers, the listeners had never lived outside Spain and had never spoken a language other than Spanish at home, as a medium of education, or at work. Like the speakers, they came from a variety of regions, including Andalusia (*N* = 6), Castile-La Mancha (*N* = 1), Catalonia (*N* = 3), Extremadura (*N* = 1), Galicia (*N* = 3), Madrid (*N* = 1) and Valencia (*N* = 13).

#### *2.4. Experimental Procedure*

As the experiment was conducted online, listeners were given detailed written instructions regarding the task at hand. They were asked to use headphones, and an audio test was incorporated into the questionnaire to ensure adequate browser and volume settings had been selected. Participants were informed they would hear samples from fluent Spanish speakers, though no indication of whether they were native or not was given. Following the method established by Moyer (1999) and adopted in various studies on bilingual populations since then (e.g., (Bergmann et al. 2016; de Leeuw et al. 2010; Lloyd-Smith et al. 2020)), samples were played in random order and after each recording listeners were instructed to give a binary rating of the speaker's accent (native/non-native), indicating subsequently their degree of confidence (confident/neither confident nor not confident/not confident). They were further instructed to select "non-native" in the event they detected a non-native accent, however slight. Listeners heard each sample only once and were asked to guess if unsure, indicating their lack of confidence accordingly.

For samples rated "non-native", a follow-up question was included immediately after the rating was given, requesting details of what aspects of pronunciation had created a perception of non-nativelikeness, as well as any specific words that sounded non-native. In addition to the rating task, the questionnaire contained a range of demographic and language background questions to ensure the listeners met the inclusion criteria.

No time limit was imposed for responding and listeners controlled the pace at which they progressed through the samples. They were encouraged to take as many breaks as they deemed necessary. The average duration for the experiment was 25 min.

#### *2.5. Analysis*

In line with previous accent rating experiments (Bergmann et al. 2016; de Leeuw et al. 2010; Moyer 1999), listeners' responses were converted to a six-point scale in which a "native" rating marked as "confident" appeared at one end of the scale (1) and a "confident" rating as "non-native" at the other (6). As such, the lower the numerical foreign accent rating (FAR), the nearer to nativelike the speaker was perceived to be. The experimental data were subsequently transferred to a CSV file for statistical analysis. In order to assess whether the groups differ in their FAR, linear mixed-effects models were run in R (R Core Team 2018) using the lmerTest package (Kuznetsova et al. 2017). To analyze the features identified by the listeners, content analysis was used (Krippendorff 2018). This first involved screening responses for relevant phonetic information. Comments that did not relate to accentual features were disregarded. Items referring to accentual features, in turn, were initially coded as relating to either segmental or suprasegmental phenomena before being assigned to more specific subcategories. These were then quantified. As a measure of reliability, coding was repeated on all 174 comments, yielding an agreement score of 95.98%. Divergences between the two sets of analyses only concerned a small number of comments with unclear or ambiguous meanings. For example, reference to "una pronunciación muy marcada, muy fuerte" (a very marked pronunciation, very strong) was coded as referring to rhythm/stress in the first analysis, but as being too general to include in the re-analysis. These comments were discarded from further analysis.
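The conversion of a binary rating plus a confidence level into a FAR score can be illustrated with a minimal Python sketch. Only the two endpoints of the scale (1 and 6) are stated in the text; the ordering of the four intermediate values is an assumption based on the usual practice in such rating experiments, and the responses shown are hypothetical, not data from the study.

```python
from statistics import median

# Assumed mapping: the text only fixes the endpoints (1 and 6).
# The intermediate values are ordered by confidence as an assumption.
FAR_SCALE = {
    ("native", "confident"): 1,
    ("native", "neither"): 2,
    ("native", "not confident"): 3,
    ("non-native", "not confident"): 4,
    ("non-native", "neither"): 5,
    ("non-native", "confident"): 6,
}


def to_far(rating: str, confidence: str) -> int:
    """Convert a binary rating and a confidence level to a 1-6 FAR score."""
    return FAR_SCALE[(rating, confidence)]


# Hypothetical responses from four listeners to one speaker.
responses = [
    ("native", "confident"),
    ("native", "neither"),
    ("non-native", "not confident"),
    ("native", "confident"),
]
scores = [to_far(rating, confidence) for rating, confidence in responses]
speaker_median = median(scores)
```

A lower median in this sketch corresponds to a speaker perceived as closer to native-like, matching the interpretation of the FAR given above.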

#### **3. Results**

#### *3.1. Accent Rating*

To assess inter-rater reliability, we ran a Cronbach's alpha analysis across the ratings made by the 28 listeners. The results revealed a value of 0.81, which suggests a high degree of homogeneity. Figure 1 depicts the distribution of FAR scores across the three groups.
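For readers unfamiliar with the statistic, Cronbach's alpha for a set of raters can be sketched as follows. This is a generic textbook formulation in which each listener is treated as an "item"; it is not the authors' script, and the toy ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(ratings: list[list[float]]) -> float:
    """Cronbach's alpha, where ratings[i][j] is rater j's score for sample i."""
    k = len(ratings[0])  # number of raters, treated as items
    rater_columns = list(zip(*ratings))
    # Sum of the variances of each rater's scores across samples.
    item_variances = sum(variance(column) for column in rater_columns)
    # Variance of the per-sample total scores.
    total_variance = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - item_variances / total_variance)


# Hypothetical toy data: 4 samples rated by 3 listeners.
toy_ratings = [
    [1, 2, 1],
    [2, 2, 3],
    [5, 6, 6],
    [1, 1, 2],
]
alpha = cronbach_alpha(toy_ratings)
```

Values close to 1 indicate that raters order the samples in a very similar way; the study's 28 listeners and 27 samples would form a 27 × 28 input to a function of this kind.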

Inspection of the figure suggests that the samples were predominantly perceived to be native-like, with median scores of "1" for the participants in BIL-NT and MON, and of "2" for the participants in BIL-T, although the scores in all groups exhibited a certain degree of variation. Overall, a total of 221 of the 28 × 27 = 756 samples were rated as non-native, that is, 29.23%, with 107 (i.e., 14.15%) attracting the highest FAR score of "6", that is, "non-native with certainty". To examine whether the FAR scores differed across the groups, linear mixed-effects models were run in R (R Core Team 2018), with "group" as fixed effect and "participant" as random intercept. Using the lmerTest package (Kuznetsova et al. 2017), the Satterthwaite approximation was used to obtain degrees of freedom, from which *p*-values could be calculated.

**Figure 1.** Distribution of FAR scores by group.

Our initial model was run on all 756 ratings and across the three groups. The results, depicted in Table 2, revealed highly significant between-group differences (*p* < 0.001). This analysis was subsequently followed up with pairwise comparisons, with a Bonferroni-adjusted α-level of 0.0167.


**Table 2.** Results of linear mixed-effects models: FARs.

The results revealed significantly higher and thus less native-like FAR scores for the participants in BIL-T than in BIL-NT (*p* < 0.001) and MON (*p* < 0.001). The difference between the latter two, in contrast, was not significant (*p* = 0.053). Together, these results suggest that the L1 accent of Spaniards in non-teaching professions in the UK was perceived as equally native-like as that of monolinguals resident in Spain. Spaniards teaching their L1 in educational settings in the UK, in contrast, whilst also attracting relatively low FAR scores, were perceived as significantly less native-like, suggesting a certain degree of L1 attrition.
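The Bonferroni decision rule behind these pairwise comparisons can be sketched in a few lines of Python. The family-wise level of 0.05 is an assumption (the text reports only the adjusted level of 0.0167), and *p*-values reported as "< 0.001" are represented here by their upper bound.

```python
# Assumed family-wise alpha; the text reports only the adjusted level.
FAMILY_ALPHA = 0.05

# Pairwise p-values as reported; "< 0.001" is stored as its upper bound.
pairwise_p = {
    ("BIL-T", "BIL-NT"): 0.001,
    ("BIL-T", "MON"): 0.001,
    ("BIL-NT", "MON"): 0.053,
}

# Bonferroni: divide the family-wise alpha by the number of comparisons.
adjusted_alpha = FAMILY_ALPHA / len(pairwise_p)

# A comparison counts as significant only if p falls below the adjusted level.
significant = {pair: p < adjusted_alpha for pair, p in pairwise_p.items()}
```

With three comparisons, the adjusted level rounds to 0.0167, so the BIL-NT versus MON difference (*p* = 0.053) falls above it.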

#### *3.2. Perceived Non-Native Features*

All 28 listeners provided comments on the samples they deemed non-native; however, this was only the case for 174 of the 221 samples (i.e., 78.73%), while 47 of the non-native ratings were left uncommented. Following a careful screening, 71 of the 174 comments were removed from the analysis as they were too general, referring, for example, just to "pronunciation of some words" or "the speaker's accent", and an additional three were removed that referred to features unrelated to pronunciation, for example, lexical or grammatical choice. The remaining 100 comments were analysed further; of these, 84 referred to a single feature, while 13 referred to 2 features, and 2 to 3 features, for a total of 116 feature tokens. Table 3 shows a breakdown of the features identified, alongside illustrative examples. Since they did not exhibit any systematic differences between the speakers in BIL-T and BIL-NT, the data were pooled.


**Table 3.** Perceived non-native features.



Inspection of the table shows that judgements of non-nativeness were based on both segmental and suprasegmental features, albeit with a preponderance of the former. Amongst segments, listeners most commonly perceived consonantal items as non-native, notably realizations of /s/ and rhotic consonants, but some also referred to vowel deviations and phoneme omissions. Comments on suprasegmental items predominantly referred to intonation, mostly expressed in terms of "melodía" (melody) or "musicalidad" (musicality), but there were also some mentions of rhythm/stress and speaking rate.

#### *3.3. Individual Variation*

Finally, in addition to the analysis at the group level, we investigated individual variation. This was done by converting median FARs into a categorical rating of "clearly native" (between 1.0 and 2.5), "uncertain" (greater than 2.5 but less than 4.5), and "clearly non-native" (between 4.5 and 6.0) following de Leeuw et al.'s (2010) approach. The categorizations for the participants in the three groups are shown in Table 4.
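The thresholds just described can be expressed as a small Python function. This is a minimal sketch of the stated cut-offs only; the function name is hypothetical.

```python
def categorize_nativeness(median_far: float) -> str:
    """Map a median FAR (1.0-6.0) to one of the three categories in the text."""
    if not 1.0 <= median_far <= 6.0:
        raise ValueError("median FAR must lie between 1.0 and 6.0")
    if median_far <= 2.5:
        return "clearly native"
    if median_far < 4.5:
        return "uncertain"
    return "clearly non-native"
```

For instance, a speaker with a median FAR of 2.5 is still classed as "clearly native", whereas a median of 4.5 already counts as "clearly non-native".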


**Table 4.** Categorization of nativeness by group.

Inspection of the table shows that, as one would expect, all MON speakers were consistently classed as "clearly native". In contrast, in line with previous work on attrition (e.g., (de Leeuw et al. 2010, 2018a; Mennen 2004)), the results for the two bilingual groups were more varied. Thus, although the BIL-NT speakers were not found to differ from the MON ones at the group level, as we have seen, the analysis of individual classifications shows that one BIL-NT speaker was considered "uncertain" and another one "clearly non-native". At the same time, while four speakers in the BIL-T group were classed as "clearly non-native" and one as "uncertain", half of them were considered "clearly native". Teaching one's native language in an L2-speaking environment therefore does not automatically lead to perceived attrition in L1 speech; it merely appears to increase its likelihood. Table 5 displays the characteristics of the participants identified as non-native.

**Table 5.** Characteristics of participants perceived as non-native.


Note: AN = Andalusia; CT = Catalonia; GA = Galicia; PV = Basque Country; the figures in parenthesis denote the number of comments per feature.

As the table shows, all participants considered "clearly non-native" were female, aged between 27 and 48 years, and considered their English competence as upper intermediate to near-native. They had moved to the United Kingdom in their twenties or thirties and had been living there between 5 and 18 years. Of the 28 listeners, 20 or more considered the four BIL-T speakers as non-native; slightly fewer listeners, that is, 17, classified BIL-NT\_1 as non-native. The latter also received a slightly lower FAR and was hence perceived as less clearly non-native than the four BIL-T speakers. Finally, the table shows that the Spanish accent of each of these participants was associated with multiple non-native features. All were perceived to produce their L1 with non-native intonation patterns and realizations of /s/, and all but one, that is, BIL-T\_2, were perceived to realize Spanish rhotics non-natively.

#### **4. Discussion**

This study aimed to gain a better understanding of the role of L1 use and dual language activation in the perceived attrition of native speech patterns. To this end, we examined the L1 Spanish accent of two groups of native Spanish speakers who are long-term residents in the UK, Spanish language teachers and non-teachers, alongside monolingual Spanish speakers in Spain in an accent perception experiment. The results revealed significantly greater non-native ratings for the teachers than the non-teachers and the monolinguals, but no difference between the latter two, with listeners' impressions of non-nativeness based on a range of segmental and suprasegmental features. An analysis of individual patterns, in turn, showed a fair amount of variation, with half of the speakers in BIL-T perceived as "clearly native" and one of the BIL-NT speakers as "clearly non-native". In what follows, the implications of these findings will be discussed.

To begin with, let us consider why the participants in BIL-T were perceived as significantly more foreign-accented in their L1 than monolinguals in Spain. At first glance, this finding is surprising. After all, they regularly use their L1 both in work and outside of it, and Spanish plays an essential role in their professional identity. As Chang (2019, p. 108) states, being a language teacher typically comes with an "instructional orientation", and likely coincides with a particular concern for retaining native-like proficiency in the L1, including its accent, although this was not formally assessed here. One might expect these factors to provide a certain degree of protection from attrition. However, this was not the case in the present study, at least not at the group level.

The likely reason for the perceived attrition in the teachers' L1 accent is dual language activation, which, in turn, is a direct consequence of the specific professional setting in which they operate. In other words, it is essentially impossible for foreign language teachers who teach their L1 in an L2-speaking environment to activate only their L1 during classroom activities and only their L2 outside of it, and hence function in alternate monolingual language modes (Grosjean 2001). Instead, both their languages need to be highly active for most or all of the time, resulting in them operating in a sustained bilingual language mode. This will be true even if the extent of dual activation varies somewhat from context to context. For example, it is likely to be particularly high during activities that actively encourage a bilingual approach, such as translanguaging (Cenoz and Gorter 2019), while it will be comparatively lower during activities in which sole use of the target language is encouraged, in particular in students with high L2 proficiency levels. Nevertheless, whatever the specific circumstances, the very nature of foreign language classroom settings makes dual activation inevitable.

Crucially, dual activation has been shown to lead to cross-linguistic interactions in speech patterns. Such interactions have been widely attested in contexts of phonetic code-switching (Amengual 2018; Bullock and Toribio 2009; Muldner et al. 2019; Piccinini and Arvaniti 2015), where cognitive demands to inhibit the non-target language are particularly high. While they may initially occur in such circumstances, that is, during ad hoc dual language activation, over time they may give rise to more persistent accentual changes and become entrenched. This is likely to have happened to the teachers in the present study and is consistent with de Leeuw et al.'s (2010) finding that L1 attrition was more common in native German speakers in Anglophone Canada and the Netherlands who regularly used their L1 in contexts of code-switching than those who did not.

In addition, unlike the non-teachers, the teachers will have been systematically exposed to non-native Spanish accents via their students' productions. These may have either independently caused the observed changes in their L1 accent or enhanced the effects of their own concurrent use of the two languages, thereby reinforcing deviations from monolingual Spanish patterns. While the direct effect of sustained English-accented input in Spanish cannot be isolated in the present context, it will have led to an additional burden on teachers' inhibitory control mechanisms. The suggestion that foreign-accented input can increase the likelihood of non-native speech patterns is certainly consistent with evidence from adults raised in bilingual homes (Bosch and Ramon-Casas 2011) as well as bilingual and multilingual children in immersion school settings (Caldas 2006; Mayr and Montanari 2015), although its role in L1 attrition of speech needs to be explored further in future research. Taken together, the results for the participants in BIL-T suggest that, ironically, it is the very nature of the professional context in which teachers operate, with its requirement to keep both languages active and the need to switch between them, that enhances the likelihood of L1 attrition.

The participants in BIL-NT, in contrast, do not face these cognitive demands in a professional setting. While they work in a diverse range of areas, such as nursing, social work, and accountancy, none of them involve professional use of Spanish. As a result, the BIL-NT speakers virtually exclusively use their L2 in work, and hence operate in a consistent monolingual English language mode. Their lower overall amount of L1 use (and greater amount of L2 use), compared with the BIL-T group, in turn, did not lead to perceived attrition since they were rated the same as monolingual controls in Spain. Previous research suggests a somewhat ambiguous role for overall amount of language use in L1 attrition: while some studies have shown an effect of reduced L1 contact on attrition of speech patterns (e.g., (Stoehr et al. 2017)), others either revealed no effect (e.g., (Hopp and Schmid 2013)), or exhibited mixed results. For example, Chang (2019) showed no greater overall persistence in L1 phonetic drift in English-Korean bilinguals with high L2 use compared to those with low L2 use—only one of three areas investigated yielded a significant effect. While it is conceivable that the complete lack of L1 use over many years may cause attrition, independent of other factors, due to the gradual loss of long-term memory representations, this was not the case here. After all, even though the participants in BIL-NT hardly ever used their L1 in work contexts, they indicated using it regularly in social interactions outside of work as well as in written and spoken forms of remote communication with family and friends in Spain. The reduction in L1 use that typically occurs in L2 immersion contexts is hence unlikely to cause L1 attrition of speech in and of itself. It appears that what is critical is the contexts in which the L1 is used (cf. (Schmid 2007)). 
In the present study, it may well be the absence of L1 use in the kinds of contexts in which the teachers use their native language professionally that has protected the BIL-NT speakers' speech from attriting. At the same time, their high L2 competence will have protected them from experiencing L1 phonetic drift as a result of a novelty effect (Chang 2012, 2013).

These considerations notwithstanding, the results of the present study also show a fair amount of individual variation, with half of the participants in BIL-T being perceived as "clearly native-like" and one participant in BIL-NT as "clearly non-native". Moreover, while the BIL-T participants were rated as significantly more non-native than those in BIL-NT and MON, their median FAR was "2", that is, "native-like with medium confidence". This suggests that L1 attrition in the context of teaching one's L1 in an L2-speaking environment is by no means inevitable. Perhaps the five teachers in BIL-T who were rated as "clearly native" in Spanish developed enhanced inhibitory control which allowed them to counteract cross-linguistic interactions from dual language activation and exposure to foreign-accented speech by their students. This may have coincided with a range of factors relating to individual differences, such as attitudinal, socio-psychological, and cognitive ones. For example, they may ascribe particular importance to the retention of a native accent in Spanish. Or they may have a particular phonetic talent (e.g., (Jilka 2009; Lewandowski and Jilka 2019)). Moreover, they may actually be perceived as non-native, but only in settings not assessed here, for example, in casual encounters (Major 1992). By the same token, the absence of particular skills or attitudes may explain attrition in BIL-NT\_1's L1 accent. However, explanations of this nature remain wholly speculative as these variables were not investigated in the present study. Suffice it to say that L1 attrition of speech is a complex multi-factorial phenomenon (cf. (Kartushina et al. 2016b)) and that the patterns observed here must have been caused, in part, by factors other than dual language activation and language use. 
Although challenging, future work based on a larger sample of potential attriters is needed that systematically teases apart the various predictor variables for L1 attrition of speech and includes a more sophisticated approach to the assessment of L1 use. In the context of language teachers, this could involve obtaining details on interaction patterns with students of varying levels of proficiency during different types of classroom activity, but also language use and code-switching patterns with fellow foreign language teachers outside the classroom.

While we have so far discussed differences between the teachers' and non-teachers' use of languages at work, their language use patterns at home also need to be considered. Our results showed that BIL-T participants differed from BIL-NT participants not only in their language patterns in the workplace, but also in their language patterns at home. Crucially though, the language differences at home only pertained to the use of the L2, which was used more frequently by the non-teachers than the teachers. In contrast, no differences were found between the two groups in their use of Spanish at home. This shows that the perceived attrition in the teachers' L1 accent cannot be explained by a reduction in L1 use at home, given that their amount of L1 use was similar to that of the non-teachers.

Finally, let us consider the features that the listeners associated with non-native speech. They encompass a range of consonants, vowels, and prosodic phenomena, in particular realizations of /s/, rhotics and intonation patterns, in line with evidence that perceptions of non-nativeness arise from the interplay between segmental and suprasegmental characteristics (Ulbrich and Mennen 2016). Importantly, there were no systematic differences in the features associated with non-nativeness in the BIL-T and BIL-NT speakers. Moreover, the speech of all speakers who were identified as "clearly non-native" was characterized by multiple non-native features at both segmental and suprasegmental levels. This suggests that listeners did not mistake them for non-native speakers due to their unfamiliarity with individual features that are associated with native dialectal variation, such as the phenomenon of seseo/ceceo in the context of /s/ (Martínez-Celdrán et al. 2003). While the features identified must have been perceptually salient for the listeners, their relative importance to the impression of non-nativeness remains unclear. Moreover, the listeners' judgements may have been influenced by accentual patterns that they were not consciously aware of or that they were unable to verbalize. It is also difficult to ascribe the features to specific types of interaction with L2 English, for example, assimilation or dissimilation patterns (cf. SLM (Flege 1995; Flege and Bohn 2020)), due to a lack of detail in the comments provided. Future research exploring the salience of features in global accent ratings is needed that extends the work presented here, using a more sophisticated methodology, such as an interactive interview-based approach (Mayr et al. 2020) or one that allows listeners' judgements to be linked directly to specific items in the speech samples (Montgomery and Moore 2018).

#### **5. Conclusions**

The present study examined the role of L1 use and dual language activation in L1 attrition by investigating the perceived L1 accent of two groups of native Spanish speakers in the United Kingdom: (1) Spanish language teachers, who use their L1 regularly in professional settings that require frequent switching between Spanish and English, and (2) non-teachers, who virtually never use their L1 in the workplace. In addition, the study included a control group of monolingual speakers in Spain. As such, this study is the first to examine L1 attrition of speech systematically in a specific professional group. The results of a global accent rating experiment revealed significantly greater non-native ratings for the teachers than the non-teachers and the monolingual controls, but no difference between the latter two. Listeners' impressions of non-nativeness, in turn, were based on a range of segmental and suprasegmental features, notably /s/, rhotics and intonation. These results suggest that language teachers who teach their L1 in an L2-speaking environment may be particularly prone to L1 attrition. This is likely due to a need to co-activate both their languages in professional settings as well as regular exposure to non-native speech from L2 learners. In contrast, low L1 use was not associated with non-native features in the non-teachers' Spanish accents. Together, the findings hence suggest that cross-linguistic interaction is more likely to lead to L1 attrition of speech than reduced L1 use in and of itself. However, since not all teachers were perceived as non-native, future research based on a larger sample is needed that further assesses the factors that facilitate or hinder L1 attrition in such educational settings.

**Author Contributions:** Conceptualization, R.M., D.S. and I.M.; methodology, R.M. and D.S.; formal analysis, R.M. and D.S.; writing—original draft preparation, R.M. and D.S.; writing—review and editing, R.M., D.S. and I.M.; visualization, D.S. and R.M.; supervision, R.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


Bylund, Emanuel, Kenneth Hyltenstam, and Niclas Abrahamsson. 2013. Age of acquisition effects or effects of bilingualism in second language ultimate attainment? In *Sensitive Periods, Language Aptitude, and Ultimate L2 Attainment*. Edited by Gisela Grañena and Mike Long. Amsterdam: John Benjamins, pp. 69–101.

Caldas, Stephen J. 2006. *Raising Bilingual Biliterate Children in Monolingual Cultures*. Clevedon: Multilingual Matters.


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The Effect of Instructed Second Language Learning on the Acoustic Properties of First Language Speech**

**Olga Dmitrieva <sup>1,\*</sup>, Allard Jongman <sup>2</sup> and Joan A. Sereno <sup>2</sup>**


Received: 1 September 2020; Accepted: 20 October 2020; Published: 26 October 2020

**Abstract:** This paper reports on a comprehensive phonetic study of American classroom learners of Russian, investigating the influence of the second language (L2) on the first language (L1). Russian and English productions of 20 learners were compared to those of 18 English monolingual controls, focusing on the acoustics of word-initial and word-final voicing. The results demonstrate that learners' Russian was acoustically different from their English, with shorter voice onset times (VOTs) in [−voice] stops, longer prevoicing in [+voice] stops, more [−voice] stops with short lag VOTs and more [+voice] stops with prevoicing, indicating a degree of successful L2 pronunciation learning. Crucially, learners also demonstrated an L1 phonetic change compared to monolingual English speakers. Specifically, the VOT of learners' initial English voiceless stops was shortened, indicating assimilation with Russian, while the frequency of prevoicing in learners' English was decreased, indicating dissimilation with Russian. Word-finally, the duration of preceding vowels, stop closures, frication, and voicing during consonantal constriction all demonstrated drift towards Russian norms of word-final voicing neutralization. The study confirms that L2-driven phonetic changes in L1 are possible even in L1-immersed classroom language learners, challenging the role of reduced L1 use and highlighting the plasticity of the L1 phonetic system.

**Keywords:** American English; Russian; voicing; classroom learning; second language acquisition; first language drift

#### **1. Introduction**

Cross-linguistic phonetic interaction in bilingualism and language learning is believed to be bidirectional: the earlier acquired, more established language (L1) can be affected by the later acquired, often non-dominant, language (L2). This type of crosslinguistic interaction is known by many names: back-transfer, reverse interference, phonetic drift, and language attrition, to name a few. We define this type of interaction as phonetic changes in speakers' L1 brought about by use of L2 and refer to these changes primarily as L2-to-L1 (phonetic) effects or L1 drift.

A few prominent lines of inquiry dominated the previous research on L2-to-L1 effects, leaving the full scope of this phenomenon under-explored. Specifically, the majority of previous work has focused on proficient bilinguals or advanced second language learners and most speakers were studied in the situation of L2 immersion (Baker and Trofimovich 2005; Barlow et al. 2013; Bergmann et al. 2016; Caramazza et al. 1973; Chang 2012; De Leeuw 2019; De Leeuw et al. 2018; De Leeuw et al. 2010; Flege 1987; Fowler et al. 2008; Guion 2003; Harada 2003; Hopp and Schmid 2013; Kartushina and Martin 2019; Lang and Davidson 2019; Lev-Ari and Peperkamp 2013; MacLeod and Stoel-Gammon 2005; Major 1992; Mayr et al. 2012; Mora and Nadeu 2012; Mora et al. 2015; Sancier and Fowler 1997; Simonet 2010; Tobin et al. 2017; Ulbrich and Ordin 2014). Moreover, language pairings often involved Western European languages, such as English, Spanish, French, German, and Dutch, which tend to be relatively similar phonologically and share the Latin alphabet. Finally, these studies on cross-language interaction have typically focused on sound classes that have distinct phonetic realizations in the respective languages, such as oral stops, distinguished across languages via voice onset time (VOT), or oral vowels, distinguished via formant frequencies.

The current study expands the scope of previous work on L2-to-L1 effects by examining a population of relatively inexperienced learners of a rarely studied Slavic second language (Russian). The L2 learners in the present study reside in the home country and are immersed in their native language (American English). Moreover, in addition to inquiring how comparable phonological categories can affect one another's acoustic realization across languages, the present study also considers the transferability of phonological processes from L2 to L1. We target the acoustic realization of word-initial voiced and voiceless stops in native speech of American learners of Russian, to determine whether it has been affected by exposure to Russian. In addition, we investigate word-final stops, fricatives and affricates in learners' English to establish whether their productions show an effect of the Russian final devoicing rule.

In the following sections, we discuss the theoretical underpinnings of the L2-to-L1 phonetic effects (Section 1.1), provide a brief overview of the previous literature on the topic (Section 1.2), and introduce the details of the present study (Section 1.3).

#### *1.1. Mechanism of L2-to-L1 Effects*

Among the theoretical models put forth to account for the production of second language speech, the speech learning model (or SLM/SLM-r; Flege 1995, 2003; Flege and Bohn 2020) explicitly predicts bidirectional phonetic interactions and outlines their general mechanism.

SLM postulates that sound categories of learners' first and second languages coexist in the same phonological space, which a priori creates a possibility for mutual influence. Moreover, SLM proposes that a mechanism of 'equivalence classification' affects the perception of L2 sounds that are acoustically non-identical but similar to existing L1 categories. As a result, corresponding L1 and L2 sounds are joined under the same category and their acoustic properties are predicted to affect each other, such that L2 sounds are realized in an L1-like manner, and L1 sounds are produced similarly to L2 ones, a situation known as category assimilation.

Flege's own work (e.g., Flege 1987) and much of the subsequent research, however, demonstrated that phonetically separate sound categories are nevertheless maintained across languages in the speech of bilinguals, with one or both deviating from the monolingual norm in the direction of assimilation to the other language (Baker and Trofimovich 2005; Caramazza et al. 1973; Chang 2012; Flege and Eefting 1987a, 1987b; Fowler et al. 2008; Harada 2003; Major 1992; Sancier and Fowler 1997; Sundara and Baum 2006). This cross-language separation suggests that bilinguals are able to discern the acoustic–phonetic differences between the cross-language equivalents even when they are merged under the same category. Moreover, this ability is an important condition of the L2-to-L1 effects. If bilinguals perceived L2 sounds as indistinguishable from the ones in their L1, there could not be any influence of L2 on the production of L1.

This brings us to the importance of sufficient experience with L2 required for L2-to-L1 effects to take place. The assumption that L2 experience plays an important role has dominated most of the literature on L2-to-L1 phonetic effects. A large proportion of previous cross-sectional studies reported that the L1 was affected by the L2 either exclusively or to a greater degree in participants with longer L2 exposure and/or higher L2 proficiency (Baker and Trofimovich 2005; Bergmann et al. 2016; Dmitrieva et al. 2010; Flege 1987; Guion 2003; Herd et al. 2015; Huffman and Schuhmann 2016; Lang and Davidson 2019; Major 1992; Peng 1993; Schmid 2013; Schuhmann and Huffman 2015; Tobin et al. 2017).

A reduction in L1 use may also be responsible for the observed L2-assimilatory changes in the L1 speech of bilinguals (De Leeuw et al. 2010; Kartushina and Martin 2019; Mora and Nadeu 2012; Mora et al. 2015; Sancier and Fowler 1997). Indeed, phonetic changes in L1 were typically detected in circumstances that simultaneously provided greater L2 exposure and limited the use of L1, i.e., in bilinguals who were immersed in the L2-dominant environment. Moreover, in some cases it was proposed that continuous L1 use prevented L1 drift despite intensive L2 exposure (De Leeuw et al. 2010; Tobin et al. 2017).

To summarize, current theoretical models predict the emergence of L2-to-L1 phonetic effects in experienced L2 speakers, with a possible added condition of reduced L1 use. In the following section, we review relevant studies which serve to refine these general predictions.

#### *1.2. Previous Research on L2-to-L1 Phonetic Effects*

Research consistently demonstrated that greater L2 experience and proficiency lead to a greater likelihood of L2-to-L1 phonetic effects for sequential bilinguals and adult language learners. For example, Flege (1987) demonstrated that the VOT of English [t] was significantly more French-like in the speech of Americans residing in Paris, compared to American students and teachers of French who were residing domestically. The French [t] of speakers of French residing in Chicago was also significantly different from that of French monolinguals, in the direction of assimilation to English, indicating the combined effect of proficiency, experience, and immersion. Later, Lang and Davidson (2019) showed that only Americans residing in Paris, but not American students on a short-term study abroad in France, experienced a drift in native vowel acoustics in the direction of L2 norms, confirming the important role of long-term immersion.

L2 pronunciation proficiency has also been linked more directly to changes in L1 phonetics. Major (1992) reported a positive correlation between L2 proficiency and L1 drift for American immigrants to Brazil: the closer they approximated Portuguese VOT norms in their production of L2 voiceless stops, the more they deviated from native norms in their L1 productions, in the direction of assimilation to L2 (although see Kartushina and Martin (2019) who report a negative L2 proficiency–L1 drift correlation).

The age of L2 acquisition also plays a role in promoting L2-to-L1 phonetic effects. In Guion (2003), only early and mid but not late Quechua–Spanish bilinguals revealed an effect of L2 (Spanish) on native vowel acoustics. Similarly, in Baker and Trofimovich (2005) an L2-to-L1 cross-language influence in vowels was uncovered only for early but not late Korean–English bilinguals, suggesting the importance of accumulated L2 experience.

Estimating L2 experience via length of residence in an L2-dominant environment, Dmitrieva et al. (2010) showed that Russian speakers of English with greater L2 experience were more likely to realize final obstruents in Russian in a more English-like manner, with less devoicing. Similarly, Bergmann et al. (2016) established that native speech of long-term German immigrants to Canada was perceived as more accented by their monolingual compatriots as a function of a longer residence abroad.

While the effects of L2 exposure and L2 proficiency are almost inevitably conflated in research on long-term immigrants, Chang (2012, 2013) was able to disentangle the two. Chang (2012) demonstrated L1 drift in several phonetic parameters, including VOT and vowel spectrum, in beginner American learners of Korean after only a short immersion in Korean during a study abroad program. Crucially, these participants achieved only elementary proficiency in Korean by the end of the six-week program, while L1 drift was observed already at week two. This work suggests that L2 proficiency by itself is not a necessary condition of L2-to-L1 phonetic effects, but L2 exposure which comes about due to L2 immersion may be.

Chang's work does share the element of L2 immersion, providing a level of L2 input that is both abundant and authentic, even if it is primarily overheard, with much of the previous literature. There is evidence that even overhearing-type exposure to another language may have important and long-lasting consequences. Au et al. (2002) and Knightly et al. (2003) showed that individuals who, early in life, were exposed to Spanish without learning it ('overhearers'), upon enrolling in an L2 Spanish class, demonstrated near-native like VOTs in Spanish, compared to learners in the same class who did not overhear Spanish earlier in life. Moreover, Chang (2019b) demonstrated that L1 drift persisted for L2 learners immersed in L2 even when they no longer actively used the second language, suggesting that ambient language exposure in adulthood as well as in early childhood is an important factor affecting language production. Caramazza et al. (1973) also reported an interaction between English and French, exclusively in the direction of English affecting French, for residents of Canada who spoke only one of the two languages but were presumably exposed to both.

This work raises an important question about the minimum amount of L2 exposure required to trigger L2-to-L1 phonetic effects. Clearly, the intensive and authentic exposure provided by L2 immersion can be sufficient. However, what about non-immersion-type exposure to L2? Research on L2 learners in non-immersion situations is relatively scarce but it provides some indication that an even more fleeting introduction to another language may trigger phonetic changes in L1.

Traditional classroom learners of additional languages have been largely overlooked when it comes to L2-to-L1 effects. The studies that have examined non-immersion populations were often conducted with small numbers of participants and thus arrived at somewhat inconclusive results. Huffman and Schuhmann (2016) examined four beginner American learners of Spanish and reported little evidence of L2-to-L1 phonetic effects. Between weeks 2 and 6 of language instruction, learners demonstrated no changes in the VOT of native voiced or voiceless stops. Only the frequency of prevoicing in English suggested a tendency to dissimilate away from Spanish: three participants decreased or eliminated prevoicing from their English voiced stops. Schuhmann and Huffman (2015) did show that after a period of explicit phonetic training, three out of five learners of Spanish shortened their English voiceless stops' VOT, indicating assimilation to Spanish. Herd et al. (2015), the only large-scale (N = 40) cross-sectional study of classroom learners known to us, demonstrated that near-native and advanced learners of Spanish produced English voiced stops with more negative VOTs than beginner learners. The near-native, advanced, and intermediate learners also produced more peripheral English vowels than beginner learners did, a difference also compatible with the effect of Spanish on English. This study indicates that L1 drift is possible in more experienced classroom L2 learners but, in the absence of a monolingual control group, it was not established whether beginner learners also modified the acoustics of their native speech in the direction of second language norms. Overall, the available studies on L2-to-L1 effects in classroom learners provide limited evidence that L2 exposure in the classroom may be sufficient to trigger L1 drift.

Another reason to study classroom learners is the fact that they continue to reside in the home country while acquiring their L2. Most foreign language courses in US colleges provide active instruction 3–5 h a week. For the remainder of the time, learners use their L1. The amount of reduction in L1 use and exposure is most likely negligible in these circumstances.

This aspect of classroom language learning is important because the reduction in L1 use associated with L2 immersion could play an important role in creating conditions for the L2-to-L1 phonetic effects. Conversely, continued L1 use has been suggested to promote and protect the 'authenticity' of L1 speech (Kartushina et al. 2016b). For example, Bergmann et al. (2016) demonstrated a negative correlation between the amount of L1 use and the degree of perceived non-native accent in the L1 speech of long-term German immigrants to North America. De Leeuw and colleagues also showed that the German of immigrants to Canada and The Netherlands was less likely to be perceived as non-native sounding if they had a high amount of contact with other Germans in a monolingual mode (De Leeuw et al. 2010). Moreover, Mora and colleagues (Mora and Nadeu 2012; Mora et al. 2015) reported that greater use of L1 Catalan promoted more monolingual-like Catalan vowels in Catalan–Spanish bilinguals. Although Tobin et al. (2017) did not detect any L1 drift in the native speech of Spanish learners of English after a 3–4-month period of L2 immersion in the United States, they attributed this result to the lack of a sufficient reduction in L1 use.

The dominant, and thus more frequently used, language is also believed to be protected from cross-linguistic influence. For example, Kartushina and Martin (2019) showed that, in balanced Catalan–Spanish bilinguals, both languages were affected by immersion in English, but, in Spanish-dominant bilinguals, only Catalan vowels drifted towards English (see also Caramazza et al. (1973) and Mack (1989)).

To summarize, much previous research indicates that while advanced L2 proficiency is not a necessary condition for L2-to-L1 phonetic effects, greater L2 exposure and experience promote L1 drift. Immersion-type exposure to L2 is particularly conducive to L1 drift. Moreover, the reduction in L1 use, which typically co-occurs with L2 immersion and L2 dominance, is another possible condition for L2-to-L1 phonetic effects.

The population of classroom learners, which has not been widely studied with respect to L1 drift, provides an essential complement to previous work on immersed learners: comparing the two populations leads to a better understanding of the role of L2 immersion and reduced L1 use in bidirectional cross-language interaction. The following section describes the present study, designed to address the question of L1 drift in classroom language learners.

#### *1.3. Present Study*

The present study aims to determine whether exposure to a second language via classroom learning can lead to phonetic changes in the native speech of the learners. The second language studied by our participants is Russian.

Russian is a relatively unusual choice for American learners and a comparatively difficult language to acquire for English speakers. In a ranking of languages encompassing four different difficulty categories, the US Foreign Service Institute placed Russian in category III, among 'hard' languages with significant linguistic and/or cultural differences from English (https://www.state.gov/foreignlanguage-training/), and specified that approximately 1100 class hours are required to reach general professional proficiency in speaking and reading (S3 and R3). This amounts to 14 semesters of study, assuming a fairly typical five hours per week over a 16-week semester study pattern. Thus, although participants for the present study were recruited from the second through to the sixth semesters of Russian study, it is reasonable to assume that most had not managed to reach advanced proficiency in this amount of time.
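The semester estimate above follows from simple arithmetic, which can be sketched as follows. The function name `semesters_needed` is hypothetical, and rounding up to whole semesters is an assumption about how the figure of 14 was obtained.

```python
import math


def semesters_needed(total_class_hours, hours_per_week, weeks_per_semester):
    """Whole semesters needed to accumulate the given number of class hours."""
    hours_per_semester = hours_per_week * weeks_per_semester
    # Rounding up to whole semesters is an assumption of this sketch.
    return math.ceil(total_class_hours / hours_per_semester)


# 5 h/week over 16 weeks gives 80 h per semester; 1100 / 80 = 13.75
print(semesters_needed(1100, 5, 16))  # → 14
```

By the same arithmetic, the second through sixth semesters correspond to at most 480 class hours at five hours per week, well short of the 1100 hours cited.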

Unlike more frequently studied languages such as French, German, Italian, and Spanish, Russian does not share the same writing system with English. This makes L1 English–L2 Russian a qualitatively different and novel language pairing to consider. In particular, we ask whether L1 drift is as likely in pairings of languages with fewer linguistic, orthographic, and cultural similarities as among more similar languages.

We consider the voice onset time of word-initial voiced and voiceless stops as the phonetic aspect potentially subject to L1 drift. In addition to this commonly studied parameter, we examine onset f0—pitch at the beginning of the post-consonantal vowel—as a secondary correlate of voicing. Secondary correlates have rarely been studied in L2 learners and we know little about their propensity to drift towards L2 in L1 speech.

Russian realizes its initial prevocalic [+voice] stops as robustly prevoiced (with negative VOT) and its initial prevocalic [−voice] stops as voiceless unaspirated (short lag VOT) (Ringen and Kulikov 2012). English realizes its initial prevocalic [+voice] stops as a combination of weakly prevoiced (about 30% for the population, Dmitrieva et al. 2015) and voiceless unaspirated stops (70%), and its initial prevocalic [−voice] stops as voiceless aspirated (long lag VOT) (Lisker and Abramson 1964). This phonetic difference between Russian and English stop voicing is usually not taught explicitly in Russian language courses, as was confirmed by Purdue University Russian language instructors.

The expected pattern of L1 drift, based on previous research, includes a well-documented tendency towards VOT shortening in voiceless English stops. It is also possible that the prevoicing period in English [+voice] stops could be lengthened under the influence of Russian. Finally, the proportion of prevoiced to voiceless unaspirated stops among English [+voice] segments could change towards a greater frequency of prevoicing, in assimilation with Russian.

With respect to onset f0, the two languages demonstrate a congruent covariation of f0 with phonological categories (lower f0 after [+voice] stops) but an incongruent covariation with phonetic VOT categories: first, f0 is lower after prevoiced stops than after voiceless unaspirated stops in Russian but there is no such difference in English because short lag and lead VOT stops are variants of the same phonological category (Kulikov 2012; Dmitrieva et al. 2015). Thus, exposure to Russian could lead to f0 lowering after prevoiced stops in participants' English speech. Second, English voiceless unaspirated stops are characterized by low onset f0, as they are phonologically voiced, while Russian voiceless unaspirated stops are characterized by high onset f0, as they are phonologically voiceless. Thus, an L2-to-L1 effect in this case would involve the relative raising of onset f0 after voiceless unaspirated stops in the English of Russian learners.

Finally, we investigate the temporal indices of voicing in word-final obstruents: preceding vowel duration, consonant constriction duration, and duration of voicing during constriction. This additional area of interest was selected because of important differences between English and Russian in the way phonological and phonetic voicing is treated in final obstruents. English, for the most part, maintains phonetic differences between phonologically voiced and voiceless final obstruents, although there is a gradient tendency to devoice in this position, especially for fricatives (Davidson 2016). Russian, on the other hand, features categorical devoicing in word-final position. We aim to investigate the possibility of L2-to-L1 influence on the basis of phonological rules which apply in the L2. We hypothesize that learners' L1 may adopt this phonological process from the L2 (Barlow et al. 2013; Simonet and Amengual 2020).

We further hypothesize that such influence may be especially likely for areas of L1 phonology that trend towards change, in particular if change is in the direction of the L2 process, in this case, devoicing (see Barlow et al. (2013) and Bullock and Gerfen (2004) for similar reasoning). Thus, English speakers exposed to Russian may be expected to demonstrate a stronger tendency to devoice in word-final position than is observed for monolingual English speakers.

To summarize, the present study examines L1-immersed classroom language learners in order to extend previous investigations of L2-to-L1 effects to populations not characterized by extensive L2 exposure and reduced L1 use due to L2 immersion. To establish the phonetic effects of their L2, Russian, on their L1, English, we examine the acoustic properties of word-initial stops (VOT and onset f0), and word-final obstruents (temporal indices of final voicing).

Following previous research, we conduct two types of comparisons: that between learners' L1 and L2, in order to determine whether the two systems are distinct or merged with respect to the select acoustic properties (a within-subject comparison) and those between learners' and monolinguals' L1s, in order to determine whether L2-to-L1 effects have taken place in learners' speech (a between-subject comparison). We believe that it is important to conduct both comparisons in order to demonstrate that a degree of phonetic learning has taken place in these speakers' L2, and that L1 drift, if present in their speech, is consistent with the nature of phonetic learning they achieved in their L2 speech. By establishing the degree of L2 phonetic learning for our participants, we further our understanding of the conditions under which L1 drift can be expected to occur. Moreover, the cooccurrence of L1 drift and L2 phonetic learning for the same features supports the notion that L2 phonetic learning is what triggers L1 drift.

To determine whether L1 drift is a relatively stable feature of learners' native speech, as opposed to a short-term effect of producing speech in the two languages in immediate succession, we analyze the effect of the order of language elicitation.

We also examine the relationship between the extent of individual L1 drift and L2 proficiency in order to test the hypothesis that magnitude of drift in L1 is linked to the degree of pronunciation gains in L2.

Thus, the three main objectives of the present research are: (1) to determine whether phonetic learning has taken place in the Russian speech of learners; (2) to determine whether L1 drift has taken place in the English speech of learners; and (3) to determine whether the degree of phonetic learning (pronunciation gains) is correlated with the degree of L1 drift.

#### **2. Materials and Methods**

#### *2.1. Participants*

Twenty native speakers of American English learning Russian as a second language participated in the study: eleven men and nine women, between the ages of 19 and 24 years (M = 20.6, SD = 1.3). They were recruited and recorded in two locations: Purdue University (14 participants) and the University of Kansas (6 participants). Participants filled out a language background questionnaire after the recording. All reported English as their first and native language. All participants reported learning Russian mainly through college classroom instruction, and only four had participated in a 2–4-month-long Russian study abroad program at some time during the year preceding their enrollment in the study. On average, they had studied Russian for 5 semesters by the time of participation (SD = 3, R = 1.5–12). The amount of class time varied by level, e.g., from five hours a week for semesters 1 through 4 of Russian to three hours a week starting from the 5th semester (Purdue campus).

Participants reported using Russian mostly in class or with classmates, on average for four hours per week (R = 1–6 h/week). Four participants reported using Russian with a family member but only up to one hour a week. The most commonly reported type of engagement with Russian was reading (M = 2 h/week, R = 1–6 h/week). Writing in Russian was the second most common activity (M = 2 h/week, R = 0.5–4 h/week). Only about half of the participants reported listening to Russian radio or watching Russian TV (M = 3 h/week, R = 1–6 h/week).

Participants' average self-reported Russian fluency was 'fair' ('3' on a 7-point scale), and the degree of accentedness in Russian was 'moderate' ('3' on a 7-point scale). All participants studied additional modern languages in classroom settings (the majority of participants studied only one additional language per person), most commonly Spanish, French and German (for 5 semesters on average, across these three languages). Achieved proficiency was 'fair' on average ('3' on a 7-point scale). Only three participants reported 'good' or 'very good' knowledge of an additional language (German and Spanish).

Eighteen native speakers of American English from the same dialectal area (Midwest) participated in the study as the control group: four men and fourteen women, between the ages of 18 and 57 (M = 25.8, SD = 9.8). These participants were recruited at Purdue University from the same undergraduate student population. They self-identified as native and monolingual speakers of Midwestern English without significant knowledge of other languages. Although all had some experience of learning a second language in instructional settings (most often Spanish or French), this experience was current or recent for only three participants.

None of the participants in either experimental or control group reported a hearing or speech impairment, and all were compensated for participation with course credit or cash. The study was approved by the Purdue University and University of Kansas Institutional Review Boards, protocols 1409015219 and 00003743, respectively.

#### *2.2. Elicitation Materials*

Elicitation materials consisted of English and Russian minimal and near-minimal monosyllabic pairs contrasting word-initial and word-final voicing.

The 44 English pairs consisted of 18 stop-initial (e.g., cap–gap), 18 stop-final (e.g., mop–mob), 6 fricative-final (e.g., safe–save), and 2 affricate-final (e.g., rich–ridge) pairs. There was a total of 75 experimental items (some words were used in the word-initial and the word-final condition). Bilabial, alveolar, and velar stops were represented in equal numbers and final fricatives were labiodental (2 pairs) and alveolar (4 pairs). Preceding and following context was largely limited to the vowels [æ], [ɑ], and [ʌ]. There was no significant difference in lexical frequency between voiced and voiceless members of the pairs (COCA Corpus, Davis (2008)). Forty-eight mono- and disyllabic distractor items were also included. A complete list of English target stimuli is provided in Appendix A, Tables A1 and A2.

The 42 Russian pairs consisted of 18 stop-initial (e.g., [kostʲ]–[gostʲ] 'bone'–'guest'), 18 stop-final (e.g., [xrʲip]–[grʲib̥] 'wheeze'–'mushroom'), and 6 fricative-final pairs (e.g., [rʲis]–[prʲiz̥] 'rice'–'prize'), for a total of 84 experimental items. Bilabial, dental, and velar stops were represented in equal numbers and final fricatives were labiodental (1 pair) or alveolar (4 pairs). Preceding and following vowels were mid-low [e], [a], and [o] in about two-thirds of cases; the rest were high vowels [i], [u], or [ɨ]. There were no significant differences in lexical frequency between voiced and voiceless stimuli (Russian National Corpus 2003). Forty-five mono- and disyllabic distractor items were also included. A complete list of target stimuli is provided in Appendix A, Tables A3 and A4.

#### *2.3. Procedure*

Participants recorded at Purdue University were seated in front of the computer screen in a double-walled sound-attenuated booth. E-prime 2.0 (Psychology Software Tools, Pittsburgh, PA) was used to display the words for elicitation. The words appeared on the screen one by one, in a random order. Each word stayed on the screen for 2 s and was followed by 0.5 s of blank screen. Participants were instructed to pronounce each word the way they speak normally. The whole list was presented three times to each participant with short breaks offered between the blocks. The recording was performed using an Audio-Technica AE4100 cardioid microphone and a TubeMP preamp connected directly to a PC.

For participants recorded at the University of Kansas, a similar procedure was used. PowerPoint software was used to present the prompts on the screen, in a random order for each participant, with each word displayed on the screen for 1.5 s, followed by 1.5 s of blank screen. Recordings were performed in an anechoic chamber, using an Electro-Voice N/D 767a microphone and Marantz PMD671 digital recorder.

This computer-controlled stimulus presentation elicits an appropriately consistent rate of speech across and within participants. The order of Russian and English conditions was counterbalanced across participants, with a brief break between conditions. Due to technical issues, only one repetition of each item was recorded for one experimental participant, and only English data were collected from another experimental participant.

#### *2.4. Measurements*

For initial stops, voice onset time (VOT) and onset f0 were measured. For final obstruents, preceding vowel duration, duration of consonantal constriction, and duration of voicing during constriction were measured. Segmentation was performed manually on the basis of waveform and spectrogram representations in Praat (Boersma and Weenink 2018), using standard segmentation criteria. Measurements were collected using custom-written Praat scripts.

VOT was measured from the onset of consonantal release until the onset of voicing. Onset f0 was measured at the vowel onset as soon as the Praat autocorrelation algorithm detected periodicity. Obtained f0 values were examined for algorithm errors and corrected manually if necessary. Normalization was performed by converting f0 values to semitones relative to each participant's individual mean onset f0, using the formula 12ln(x/individual mean onset f0)/ln2, based on the semitone normalization procedure in Boersma and Weenink (2018). After normalization, outliers more than two standard deviations away from the normalized grand mean onset f0 were removed from further analysis (97% of onset f0 measurements were retained). The resulting values represented the deviation of each onset f0 value, on the logarithmic scale, from each participant's mean, now represented as 0.
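The normalization and outlier-trimming steps just described can be illustrated with a minimal sketch. It is not the authors' Praat script: the function names are hypothetical, the use of the sample standard deviation is an assumption, and the f0 values in the demo are invented for illustration only.

```python
import math
import statistics


def to_semitones(f0_values_hz):
    """Convert one speaker's onset f0 values (Hz) to semitones
    relative to that speaker's own mean onset f0."""
    speaker_mean = statistics.fmean(f0_values_hz)
    # 12 * ln(x / mean) / ln(2), as in the formula given in the text
    return [12 * math.log(x / speaker_mean) / math.log(2) for x in f0_values_hz]


def drop_outliers(values, n_sd=2.0):
    """Keep values within n_sd standard deviations of the grand mean."""
    grand_mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD; the text does not specify
    return [v for v in values if abs(v - grand_mean) <= n_sd * sd]


# Hypothetical onset f0 values (Hz) for two speakers, for illustration only.
speakers = {"S01": [110.0, 120.0, 130.0], "S02": [200.0, 210.0, 250.0]}
pooled = [st for hz in speakers.values() for st in to_semitones(hz)]
kept = drop_outliers(pooled)
print(f"{len(kept)} of {len(pooled)} normalized values retained")
```

A value equal to the speaker's mean maps to 0 semitones, and a doubling of f0 corresponds to a 12-semitone difference, which is what makes values comparable across speakers with different pitch ranges.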

Duration of the preceding vowel, duration of the closure for stops/affricates, frication portion for fricatives/affricates, and duration of voicing during constriction were measured for final obstruents.

#### **3. Results**

All the reported Linear Mixed Models (LMM) were implemented in SPSS 26.0 with the same random effects structure: a random intercept for subject and for item. Significance of the fixed factors and interactions was assessed via ANOVA tests. All pairwise comparisons were performed with Sidak correction. To avoid averaging across positive and negative VOT values, separate statistical models were fit to stops with prevoicing and stops with positive VOT.

We report results for initial stops first, followed by results for final obstruents. Within each of those sections, we begin by reporting the comparison between learners' Russian and English speech, to determine whether the two languages were produced by learners in a phonetically distinct way and to establish the degree of phonetic learning in their L2. We then proceed to report the comparison between learners' English speech and English speech of monolingual controls to test for L1 drift in learners' speech. We finish by reporting the correlations between the degree of phonetic learning in each learner's L2 and the magnitude of L1 drift in his or her English speech.

#### *3.1. Initial Stops*

#### 3.1.1. Learners' Russian vs. English

The goal of this analysis was to establish whether learners' Russian productions were acoustically distinct from their own English speech.

Positive VOT: Positive VOT of initial stops was analyzed using an LMM with Language (Russian vs. English), Voicing, Place of Articulation (included to account for systematic variability in VOT duration as a function of place of articulation) and the two-way Language by Voicing interaction as fixed factors. The results demonstrated a significant effect of Language, F(1, 66.85) = 12.45, *p* = 0.001, Voicing, F(1, 66.65) = 690.39, *p* < 0.001, Place of Articulation, F(2, 66.46) = 34.13, *p* < 0.001, and a significant Language by Voicing interaction, F(1, 66.58) = 24.70, *p* < 0.001.

The effects of Voicing and Place of Articulation were due to a longer positive VOT for voiceless than for voiced stops and an increase in VOT in the following order: labial < coronal < dorsal, where every pairwise comparison was statistically significant. The effect of Language was due to longer VOTs in English (M = 45 ms, SD = 30 ms) than in Russian (M = 43 ms, SD = 28 ms). This effect was mostly driven by differences between the voiceless stops, while voiced stops were produced with more comparable VOTs across languages (see Figure 1). The significant Language by Voicing interaction confirms this asymmetry. The shortened VOT of initial voiceless stops produced in Russian indicated that learners were in the process of acquiring the phonetics of the Russian voiceless category, by targeting shorter lag productions. However, with a mean of 57 ms, their Russian realizations were only 13 ms away from their English long lags (M = 70 ms) and still far from being true Russian-like voiceless stops.

Given that Russian voiceless productions were, on average, shorter in VOT than English ones, another important question is how many instances of learners' Russian stops could be categorized as 'short lag'. Using a relatively generous cut-off of 40 ms (to accommodate the lower rate of speech in isolated word production and the longer VOT of velar stops) to demarcate the boundary between short lag and long lag voiceless stops (Lisker and Abramson 1964), we calculated the proportions of such realizations in participants' Russian and English speech. Figure 2 shows the distribution. While only 5% of English voiceless stops were short lags, in Russian the proportion rose to 28%, indicating that appreciable VOT shortening affected almost a third of Russian productions. This asymmetry was significant in a chi-square test, χ²(1, N = 2155) = 200.61, *p* < 0.001.
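The short-lag classification and the chi-square comparison can be sketched as follows. This is an illustration under stated assumptions, not the analysis actually run: the function names are hypothetical, treating values strictly below 40 ms as short lag is an assumption, no continuity correction is applied, and the VOT values in the demo are invented.

```python
def short_lag_counts(vots_ms, cutoff_ms=40.0):
    """Return (short_lag, long_lag) counts for positive-VOT voiceless stops."""
    # Assumption: values strictly below the cut-off count as short lag.
    short = sum(1 for v in vots_ms if v < cutoff_ms)
    return short, len(vots_ms) - short


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (df = 1, no continuity correction)
    for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical VOT values (ms) for voiceless stops, for illustration only.
english_vots = [72.0, 65.0, 81.0, 38.0, 90.0, 58.0]
russian_vots = [35.0, 61.0, 28.0, 70.0, 39.0, 55.0]

en_short, en_long = short_lag_counts(english_vots)
ru_short, ru_long = short_lag_counts(russian_vots)
statistic = chi_square_2x2(en_short, en_long, ru_short, ru_long)
print(f"English short lag: {en_short}/{len(english_vots)}, "
      f"Russian short lag: {ru_short}/{len(russian_vots)}, "
      f"chi-square = {statistic:.2f}")
```

Rows of the table are languages and columns are lag categories, so the test asks whether the proportion of short-lag realizations differs between learners' English and Russian.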

**Figure 1.** Mean positive voice onset time (VOT) of voiced and voiceless initial stops in learners' English and Russian.

**Figure 2.** Percentage of short lag productions among [−voice] stops in learners' English and Russian.

Negative VOT: Negative VOT of initial [+voice] stops was analyzed using an LMM with Language and Place of Articulation as fixed factors to establish whether the duration of prevoicing differed between learners' Russian and English speech. The results showed a significant effect of Language, F(1, 101.85) = 21.10, *p* < 0.001. Russian [+voice] stops were characterized by a longer prevoicing period (M = 100 ms, SD = 37 ms) than English initial [+voice] stops (M = 78 ms, SD = 27 ms). Thus, although both English and Russian license prevoiced stops as representatives of the [+voice] category, they were phonetically distinct in the realizations of these American learners of Russian.

The frequency of prevoicing is another cross-linguistically distinguishing aspect, since all Russian [+voice] stops are supposed to be produced with prevoicing. To determine the extent to which learners reached this objective, we calculated the proportion of prevoiced realizations among [+voice] Russian and English stops, shown in Figure 3. While only 6% of English voiced stops were realized with prevoicing, 33% of Russian realizations were prevoiced, suggesting that learners were producing prevoiced realizations of Russian [+voice] stops, although well below the rate of native speakers.

**Figure 3.** Percentage of prevoiced productions among [+voice] stops in learners' English and Russian.

Onset f0: Figure 4 demonstrates the distribution of normalized onset f0 values in the English and Russian speech of the same participants. A few differences between the languages are apparent. First, prevoiced stops form a more substantial category in Russian than in English, with the distribution visibly shifted towards lower f0 values, compared to the English prevoiced distribution. Second, the Russian [−voice] distribution is less compact than the English one in terms of VOT range, encompassing a span of shorter values (up to 0 ms VOT) and, as a result, overlapping with [+voice] stops produced at short lags. The two distributions (Russian [+voice] short lags and Russian [−voice]) nevertheless maintain a separation in terms of f0 values, with visibly lower f0 of [+voice] short lags.

**Figure 4.** Scatterplots of onset pitch at the beginning of the post-consonantal vowel (f0) and VOT values for learners' English and Russian initial stops.

To examine the alignment of onset f0 values with both the phonological voicing and VOT categories in the two languages of learners, their normalized f0 values were analyzed in an LMM with Language, Voicing, and Language by Voicing interaction as fixed factors. In this analysis, Voicing was a hybrid category with three levels ([+voice] stops were split into those with prevoicing and those without): [+voice] prevoiced, [+voice] short lag, and [−voice]. This was motivated by the fact that the two types of [+voice] stops have very distinct VOT implementations and may be expected to behave differently in Russian where prevoiced stops form a separate phonological category to the exclusion of short lag stops.
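The three-level hybrid Voicing factor described above can be illustrated with a minimal sketch. The function name, the category labels, and the convention that negative VOT marks prevoicing while zero or positive VOT counts as short lag are assumptions made for illustration; they are not taken from the authors' analysis scripts.

```python
def voicing_category(phonological_voicing: str, vot_ms: float) -> str:
    """Assign a token to one of the three hybrid Voicing levels.

    phonological_voicing: "+voice" or "-voice" (the intended category).
    vot_ms: measured VOT in ms; negative values are taken to mean prevoicing
            (an assumption of this sketch).
    """
    if phonological_voicing == "-voice":
        # Voiceless stops form a single level regardless of VOT.
        return "[-voice]"
    if phonological_voicing != "+voice":
        raise ValueError(f"unknown voicing label: {phonological_voicing}")
    # [+voice] stops are split by VOT implementation.
    return "[+voice] prevoiced" if vot_ms < 0 else "[+voice] short lag"


# Hypothetical tokens, not data from the study.
tokens = [("+voice", -85.0), ("+voice", 12.0), ("-voice", 64.0)]
levels = [voicing_category(v, vot) for v, vot in tokens]
```

Splitting only the [+voice] tokens keeps the phonological category intact while letting the model treat the two VOT implementations as separate levels.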

The results demonstrated a significant effect of Language, F(1, 104.85) = 25.52, *p* < 0.001, and Voicing, F(2, 192.07) = 104.79, *p* < 0.001, and a significant Language by Voicing interaction, F(2, 190.10) = 4.48, *p* = 0.013. Onset f0 was significantly lower in Russian than in English. The effect of Voicing was driven by significantly higher onset f0 after voiceless stops compared to either prevoiced or short lag [+voice] stops, without a significant difference between the latter. The interaction between Language and Voicing was triggered by the divergent behavior of f0 after prevoiced stops. As shown in Figure 5, Russian prevoiced stops lowered f0 even more than English prevoiced stops.

**Figure 5.** Mean onset f0 across the three categories of initial stops in learners' English and Russian.

To investigate this tendency further, we compared prevoiced and [+voice] short lag stops in English and in Russian separately, in LMM analyses with a single fixed effect: VOT category. The difference was significant only in Russian, F(1, 1068.69) = 25.63, *p* < 0.001, where prevoiced stops triggered lower onset f0 than short lag stops, despite both being members of the same phonological category. This result suggests that learners of Russian were developing an awareness of prevoicing as a separate phonological category in Russian and attempting to single it out with a distinct f0 pattern.

We were further interested in onset f0 of [−voice] short lags. The question of interest here is whether, when producing Russian short lags, learners transfer all the co-varying properties of English initial short lags, including low f0.

We compared f0 values of Russian [+voice] short lags, Russian [−voice] short lags (<40 ms), and Russian [−voice] long lags (>40 ms) in an LMM with a single fixed factor with these three levels. The effect was significant, F(2, 70.33) = 34.83, *p* < 0.001, and the results of pairwise comparisons demonstrated that f0 was significantly higher after both voiceless categories than after the voiced one, without a significant difference between the two voiceless categories.

Interim summary: The results demonstrated that learners were attempting to approximate Russian phonetic norms by producing (a) shorter VOTs in Russian [−voice] stops, (b) longer prevoicing in Russian [+voice] stops, (c) more instances of [−voice] stops with short lag VOT in Russian than in English and (d) more instances of [+voice] stops with prevoicing in Russian than in English. The acoustics of learners' Russian stops were significantly different from their English stops. However, they were clearly not reaching native-like phonetic norms (all short lag [−voice] stops and all prevoiced [+voice] stops).

Onset f0 findings indicate that learners were able to manipulate the two correlates of voicing—VOT and onset f0—separately from each other. In particular, they did not transfer the low onset f0 associated with initial short lag stops in English to Russian when producing Russian short lags. Instead, they assigned onset f0 values in accordance with the phonological membership of the intended stop, equally successfully in Russian and in English. One result that deserves special notice is the significantly lower onset f0 assigned to Russian prevoiced [+voice] stops compared to Russian [+voice] short lags. These two sets of realizations were not distinguished via onset f0 in native English; thus, the difference is specific to the Russian productions of learners.

#### 3.1.2. Learners' English vs. Monolingual Controls' English

The goal of this analysis is to determine whether learners' productions of initial English stops were affected by exposure to Russian. This effect would be revealed if significant differences were demonstrated in the acoustic realization (in terms of VOT and onset f0) of initial English stops by the two speaker groups (learners' English vs. monolingual English).

Positive VOT: Positive VOT of initial stops was analyzed using an LMM with Group (Learners vs. Monolinguals), Voicing, Place of Articulation, and Group by Voicing interaction as fixed factors. The results demonstrated a significant effect of Group, F(1, 35.87) = 6.73, *p* = 0.014, Voicing, F(1, 32.64) = 2364.47, *p* < 0.001, PA, F(2, 32.43) = 33.75, *p* < 0.001, and a significant Group by Voicing interaction, F(1, 3828.06) = 182.54, *p* < 0.001. The effects of Voicing and Place of Articulation demonstrated a longer VOT for voiceless stops and an increase in VOT in the following order: labial < coronal < dorsal, where every pairwise comparison was statistically significant.

The effect of Group demonstrated a longer overall VOT produced by monolingual participants (M = 59 ms, SD = 35 ms) than by learners (M = 45 ms, SD = 30 ms) (this effect was driven primarily by voiceless stops). The significant interaction between Group and Voicing was due to the fact that voiced stops were produced with comparable VOT values across the two groups, while voiceless stops had longer VOT for monolinguals than for learners. Moreover, as Figure 6 shows, learners' mean voiceless VOT in English was situated between that of monolinguals and their own Russian productions. The shortened voiceless VOT of learners' English is compatible with an influence of Russian, where the voiceless category is realized via short lag VOT.

To assess the possible role of elicitation order, an LMM was conducted on English data from learners only, with Order of language elicitation (Russian first or English first), Voicing, and Voicing by Order as fixed effects (all subsequent analyses of Order were conducted with the same model structure). The results confirmed the effect of Voicing, F(1, 34.075) = 658.189, *p* < 0.001, but showed no main effect of Order. The Voicing by Order interaction was significant, F(1, 2027.94) = 15.89, *p* < 0.001, due to the fact that in the Russian-first condition, learners' [+voice] English stops were pronounced with longer VOT than in the English-first condition. This result agrees with the observation that learners produced relatively long VOTs for short lag [+voice] stops in Russian (see Figure 6); thus, their Russian pronunciation tendencies for these types of stops spilled over into English when it was spoken next. The results therefore revealed no evidence that drift towards shorter VOTs in learners' English was triggered by speaking Russian immediately prior to speaking English.

We were also interested in assessing how many of the English voiceless stops produced by learners could be categorized as 'short lag' as a result of this drift. We used a cut-off value of 40 ms, which categorized 99% of the voiceless stops produced by control speakers as long lags. Interestingly, as shown in Figure 7, the proportion of short lags was slightly higher for learners than for monolinguals. This asymmetry was significant in a chi-square test, χ2(1, N = 2028) = 30.24, *p* < 0.001. Thus, about 5% of stops produced by learners were on the margins of the long lag category, moving into the short lag territory.
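The categorization and group comparison described above can be sketched as follows. The 40 ms cut-off comes from the text; everything else is an assumption of this sketch: a token at exactly 40 ms is treated as long lag, the chi-square statistic is computed without a continuity correction (the text does not say which variant was used), and the example values are invented.

```python
CUTOFF_MS = 40.0  # cut-off value reported in the text


def count_lags(vots_ms):
    """Return (short_lag_count, long_lag_count) for a list of voiceless VOTs.

    Assumption: VOT < 40 ms counts as short lag; 40 ms and above as long lag.
    """
    short = sum(1 for v in vots_ms if v < CUTOFF_MS)
    return short, len(vots_ms) - short


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts (short lag, long lag) per group, not the study's data.
monolingual_counts = (10, 90)
learner_counts = (30, 70)
chi2 = chi_square_2x2(*monolingual_counts, *learner_counts)
```

Rows of the table are the two speaker groups and columns are the two lag categories, so the statistic tests whether the proportion of short lag tokens differs between groups.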

**Figure 6.** Mean positive VOT of voiced and voiceless initial stops in the English of monolingual controls and learners; learners' Russian is provided for comparison.

**Figure 7.** Percentage of short lag productions among [−voice] stops in monolingual controls' and learners' English; learners' Russian is provided for comparison.

Negative VOT: Negative VOT of initial stops was analyzed using an LMM with Group and Place of Articulation as fixed factors to determine whether the duration of prevoicing in learners of Russian differed from that of monolingual controls. Neither factor was a significant predictor of prevoicing duration.

VOT categories in voiced stops: Figure 8 shows that monolingual controls produced prevoicing about as frequently as learners did in their Russian productions (about 30%). In comparison, learners' English was almost devoid of prevoicing, with only 6% prevoiced stops. This asymmetry between controls and learners' English was significant in a chi-square test: χ2(1, N = 2230) = 158.38, *p* < 0.001.

**Figure 8.** Percentage of prevoiced productions among [+voice] stops in monolingual controls' and learners' English; learners' Russian is provided for comparison.

Moreover, when learners' English productions were split by order of language elicitation, the Russian-first condition resulted in only 3% prevoicing in English, compared to 9% in the English-first condition. These results point towards a possibility of divergence from Russian in learners' English speech; a divergence which can be amplified if Russian is elicited first. Exposure to Russian, where prevoicing marks a separate phonological category, led learners to decrease the incidence of allophonic prevoicing in their English speech.

Onset f0: Figure 9 demonstrates the distribution of onset f0 values and VOT values in English speech of learners and monolingual controls. The two groups present relatively comparable pictures, with the exception of a more substantial prevoiced distribution in controls than in learners and a greater separation between positive VOT categories in monolinguals than in learners. As discussed above, these tendencies are likely due to learners' drift towards shorter VOT in their voiceless productions due to convergence with Russian, and a decrease in the incidence of prevoicing as an expression of divergence from Russian. Learners also demonstrate greater variability in onset f0 values, especially in [−voice] long lag stops.

Onset f0 values in English productions of learners and monolingual controls were compared in an LMM analysis with Group and Voicing (three categories: [+voice] prevoiced, [+voice] short lag, and [−voice]) and Group by Voicing interaction as fixed factors. The results showed a significant effect of Voicing, F(2, 108.63) = 108.10, *p* < 0.001, due to a significant difference between [−voice] stops and the two [+voice] categories, with no significant difference between the latter two. No other effects or interactions reached significance. This result suggests that experience with Russian did not significantly affect the way learners realized onset f0 in their English speech.

Interim summary: The results revealed significant differences between learners and monolinguals which can be attributed to convergence and divergence with Russian in learners' English. These differences affected only VOT, while English onset f0 was not affected by exposure to Russian.

**Figure 9.** Scatterplots of onset f0 and VOT values in English productions of learners and monolingual speakers.

Specifically, VOT of learners' voiceless stops was shortened, moving towards their own Russian productions. Learners' voiced stops, in contrast, tended towards divergence: not in terms of VOT values (VOT of [+voice] stops, including duration of prevoicing, was not affected) but in terms of prevoicing frequency. Learners produced significantly fewer prevoiced stops in English than monolingual controls. Combined with the fact that learners' Russian prevoiced stops were marked by extra-low onset f0, these findings suggest that learners were targeting prevoiced stops as a distinct non-native category.

There was some evidence for the role of elicitation order: A decrease in the number of prevoiced stops in English was amplified if Russian elicitation occurred immediately prior.

#### 3.1.3. Individual Variability in Drift

The initial stop measures, which showed evidence of L1 drift, included positive VOT of voiceless stops and the frequency of prevoicing in voiced stops. Therefore, we focused on these parameters in evaluating individual drift and its covariation with subjective and objective measures of Russian pronunciation proficiency.

To estimate the magnitude of L1 drift in each participant, we subtracted each learner's mean English voiceless VOT from the grand mean voiceless VOT of all monolingual participants. The resulting value represented how much each learner deviated in their voiceless VOT from the average monolingual norm (greater values represent greater deviation from English monolingual long lag norms and greater approximation of Russian short lag norms).
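As a minimal sketch of this calculation, the drift value for one learner is the monolingual grand mean minus that learner's own mean. The function name and the example numbers are hypothetical, and the monolingual grand mean is passed in as a given value because the text does not specify whether it was computed over tokens or over speaker means.

```python
from statistics import mean


def voiceless_vot_drift(monolingual_grand_mean_ms, learner_vots_ms):
    """Drift = monolingual grand mean minus the learner's mean English voiceless VOT.

    Larger values mean a shorter learner VOT, i.e. greater deviation from the
    monolingual long lag norm in the direction of Russian short lag.
    """
    return monolingual_grand_mean_ms - mean(learner_vots_ms)


# Hypothetical values in ms, not data from the study.
drift = voiceless_vot_drift(80.0, [60.0, 70.0])
```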

These values were checked for correlation with each learner's average Russian voiceless VOT. The prediction tested is that learners who were more successful in shortening their Russian voiceless VOT are also expected to show a greater amount of drift in their English voiceless VOTs.

The results of the two-tailed Pearson correlation analysis showed a significant negative correlation between the individual drift and individual Russian voiceless VOT (r = −0.613, *p* = 0.005). Figure 10 shows that participants with the shortest Russian voiceless VOTs demonstrated the greatest L1 drift in the direction of Russian short lag norms.
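For readers who want to see the statistic itself, the Pearson coefficient can be computed from per-participant pairs as in the sketch below. This is a plain textbook implementation, not the authors' analysis code; it returns only r (the two-tailed *p* value would additionally require the t distribution), and the example values are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical pairs: Russian voiceless VOT (ms) and English drift (ms).
russian_vot = [1.0, 2.0, 3.0]
english_drift = [6.0, 4.0, 2.0]
r = pearson_r(russian_vot, english_drift)
```

A negative r here corresponds to the pattern reported above: the shorter a participant's Russian voiceless VOT, the larger that participant's drift value.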

We also checked the magnitude of drift parameter for correlations with self-estimated Russian speaking proficiency and self-estimated accentedness (subjective measures), where proficiency and accentedness scores on a seven-point scale were treated as continuous parameters, but neither revealed significant co-dependencies. A similar analysis was conducted for individual frequency of prevoicing across learners' two languages, with no significant results.

Interim summary: Correlation analyses of initial stop data indicated that for VOT of voiceless stops, the magnitude of individual L1 drift was significantly correlated with pronunciation proficiency in Russian, when the latter was measured objectively via acoustic analysis.

**Figure 10.** Correlation between individual L1 drift in English voiceless VOT (*y*-axis; larger values indicate greater drift towards Russian-like short lag VOT) and individual Russian voiceless VOT (*x*-axis).

#### *3.2. Final Stops*

#### 3.2.1. Learners' Russian vs. English

Similar to the analysis of initial stops, the examination of the acoustic correlates of final voicing assessed whether and to what extent the learners approximated the goal of neutralizing the voicing distinction in word-final position in Russian, while maintaining the contrast in English.

Vowel duration: Duration of the vowel preceding final obstruents was analyzed using an LMM with Language, Voicing, and Language by Voicing as fixed factors. The results demonstrated a significant effect of Voicing, F(1, 90.19) = 10.01, *p* = 0.002, and a significant Language by Voicing interaction, F(1, 90.18) = 13.20, *p* < 0.001. Vowels were significantly longer before voiced (M = 226 ms, SD = 67 ms) than before voiceless obstruents (M = 159 ms, SD = 52 ms). The difference was greater in learners' English than in their Russian (see Figure 11), explaining the Language by Voicing interaction, and indicating that learners were approximating a reduction in the voicing distinction in Russian, by shortening vowels in the voiced context. While this modification is consistent with partial devoicing of [+voice] final obstruents in Russian, complete neutralization was clearly not achieved.

Constriction duration: Stop closure duration and fricative frication duration were analyzed in two separate LMMs with Language, Voicing, and Language by Voicing as fixed factors. Affricates were absent from this analysis because none were included among the Russian stimuli.

The results for closure duration demonstrated a significant effect of Voicing, F(1, 69.67) = 411.59, *p* < 0.001, Language, F(1, 75.94) = 61.85, *p* < 0.001, and a significant Language by Voicing interaction, F(1, 69.67) = 6.05, *p* = 0.016. Stop closure was significantly longer in voiceless (M = 119 ms, SD = 41 ms) than in voiced stops (M = 76 ms, SD = 28 ms). Closures were also significantly longer in learners' Russian (M = 104 ms, SD = 40 ms) than in their English (M = 94 ms, SD = 41 ms). Finally, as Figure 12 demonstrates, the difference between voiced and voiceless closures was smaller in Russian than in English, at the expense of longer [+voice] closures in Russian, explaining the significant Language by Voicing interaction. Again, longer [+voice] Russian closures are compatible with partial devoicing of those stops in participants' Russian speech.

**Figure 11.** Mean vowel duration before voiced and voiceless final obstruents in learners' English and Russian.

**Figure 12.** Mean voiced and voiceless final stop closure duration in learners' English and Russian.

Results for frication duration showed no significant effects beyond that of Voicing, F(1, 18.99) = 35.36, *p* < 0.001: voiceless fricatives were significantly longer (M = 273 ms, SD = 60 ms) than voiced ones (M = 210 ms, SD = 58 ms).

Voicing duration: The duration of the closure or frication portion of the final obstruent characterized by laryngeal voicing (glottal pulsing), in ms, was submitted to an LMM analysis with Language, Voicing, and Language by Voicing as fixed factors. The results demonstrated a significant effect of Language, F(1, 95.53) = 7.28, *p* = 0.008, and Voicing, F(1, 99.99) = 678.34, *p* < 0.001, and a significant Language by Voicing interaction, F(1, 99.99) = 6.47, *p* = 0.013. English final obstruents contained more voicing (M = 48 ms, SD = 59 ms) than Russian final obstruents (M = 36 ms, SD = 48 ms). Voiced final obstruents had more voicing (M = 83 ms, SD = 55 ms) than voiceless ones (M = 6 ms, SD = 16 ms). The interaction was due to the fact that the difference between voiced and voiceless obstruents was greater in participants' English than in their Russian speech, as shown in Figure 13. The voicing contrast between final obstruents in Russian was reduced via partial devoicing of the voiced obstruents.

**Figure 13.** Mean duration of laryngeal voicing during closure or frication portion of the final voiced and voiceless obstruents in learners' English and Russian.

Interim summary: The results for final obstruents overall demonstrated that learners attempted to implement voicing neutralization in their Russian productions. The preceding vowel duration, stop closure duration, and voicing during constriction demonstrated subphonemic but statistically significant tendencies towards partial devoicing of final [+voice] obstruents, although complete neutralization was not achieved. Thus, similarly to initial obstruents, we observed that participants aimed to implement appropriate phonetic differences between English and Russian final obstruents.

#### 3.2.2. Learners' English vs. Monolingual Controls' English

Similar to initial stops, the goal of the analysis was to determine whether learners' English productions were affected by exposure to Russian. If significant differences are detected between experimental and control groups in the acoustic implementation of final obstruents, these could suggest a 'drift' towards Russian norms.

Figure 14 shows the distribution of voiced and voiceless final obstruents, based on the constriction duration (closure duration for stops and frication duration for fricatives) and preceding vowel duration, in the English speech of monolingual controls and learners of Russian. Although this display does not contain all datapoints (affricates were excluded due to extra-long constrictions and a relatively small number of tokens), it provides a representative picture of the differences between monolingual speakers and learners. First, the learners' distribution demonstrates a greater amount of variability in terms of constriction duration. However, most importantly, learners' data demonstrate visibly less separation between the two voicing categories in both dimensions. This suggests that learners' English is drifting towards a reduction in the final voicing contrast, at least in these two acoustic dimensions. The direction of the drift is consistent with an influence of Russian final voicing neutralization. To determine whether these tendencies were statistically significant, each dimension of contrast was subjected to statistical analysis.

**Figure 14.** Scatterplot of constriction duration and preceding vowel duration for final stops and fricatives in English speech of monolingual controls and learners.

Vowel duration: Vowel duration was analyzed in an LMM with Group (Learners vs. Controls), Voicing, and Group by Voicing as fixed factors. The results demonstrated a significant effect of Group, F(1, 36.01) = 4.24, *p* = 0.047, Voicing, F(1, 802.35) = 8.51, *p* = 0.004, and a significant Group by Voicing interaction, F(1, 5961.38) = 140.31, *p* < 0.001. Vowels were significantly longer before voiced (M = 250 ms, SD = 68 ms) than before voiceless obstruents (M = 166 ms, SD = 51 ms). Monolingual speakers produced longer vowels (M = 219 ms, SD = 75 ms) than learners (M = 199 ms, SD = 70 ms). The interaction revealed that the vowel duration difference as a function of consonant voicing was greater for monolingual speakers than for learners. This suggests that learners experienced drift towards Russian norms evidenced by partial devoicing of [+voice] obstruents in terms of preceding vowel duration (see Figure 15).

To address the role of elicitation order in the emergence of L1 drift, we conducted an LMM analysis of vowel duration within the group of learners with fixed factors of Voicing, Order (Russian-first or English-first), and Voicing by Order. The results confirmed a significant effect of Voicing, F(1, 641.98) = 16.09, *p* < 0.001, but showed no other significant effects.

Constriction duration: Duration of closure (for stop consonants and affricates) and duration of frication (for fricatives and affricates) were analyzed in two separate LMM analyses with Group (Monolinguals vs. Learners), Voicing, and Group by Voicing as fixed factors.

Analysis of closure duration demonstrated a significant effect of Voicing, F(1, 40.98) = 28.08, *p* < 0.001, Group, F(1, 35.82) = 9.26, *p* = 0.004, and a significant Group by Voicing interaction, F(1, 4457.49) = 82.83, *p* < 0.001. Voiceless stops and affricates had significantly longer closures (M = 125 ms, SD = 56 ms) than voiced ones (M = 80 ms, SD = 44 ms), but this difference was considerably more pronounced in monolinguals' than in learners' English, explaining the interaction. As Figure 16 shows, the average difference in closure duration between voiced and voiceless categories was smaller in learners' than in monolinguals' English, and was more comparable to the amount of contrast realized in learners' Russian speech. The reduction in contrast in learners' speech occurred by shortening voiceless closures.

**Figure 15.** Mean vowel duration before voiced and voiceless obstruents in monolingual controls' and learners' English; learners' Russian productions are provided for comparison.

**Figure 16.** Mean voiced and voiceless closure for final stops and affricates in monolingual controls' and learners' English productions; learners' Russian productions are provided for comparison.

The Order analysis for closure duration revealed no effects beyond the expected effect of Voicing, F(1, 38.01) = 68.86, *p* < 0.001.

Analysis of frication duration demonstrated a significant effect of Voicing, F(1, 65.38) = 9.80, *p* = 0.003, and a significant Group by Voicing interaction, F(1, 1756.68) = 32.18, *p* < 0.001. Voiceless fricatives and affricates presented significantly longer frication duration (M = 257 ms, SD = 62 ms) than voiced ones (M = 198 ms, SD = 54 ms) and the difference was more pronounced in monolinguals' than in learners' English (Figure 17).

**Figure 17.** Mean voiced and voiceless frication for final fricatives and affricates in monolingual controls' and learners' English productions.

The Order analysis showed no significant effects of Voicing, Order, or Voicing by Order. This result indicates that the order of language presentation did not affect the magnitude of drift in learners' L1 for fricative duration. The absence of significant Voicing effect also suggests that voicing-dependent differences in final frication duration could be completely neutralized in learners' English speech.

Voicing duration: The duration of glottal pulsing during the frication or closure portion of stops, affricates, and fricatives was analyzed in an LMM with Group, Voicing, and Group by Voicing as fixed factors. The results showed an effect of Voicing, F(1, 57.38) = 321.56, *p* < 0.001, and a Group by Voicing interaction, F(1, 5970.59) = 19.89, *p* < 0.001. Significantly more laryngeal voicing was detected during the constriction portion of phonologically voiced (M = 90 ms, SD = 57 ms) than voiceless (M = 7 ms, SD = 17 ms) final obstruents. This difference was more pronounced in the speech of monolingual participants than learners, explaining the interaction. As Figure 18 demonstrates, less laryngeal voicing was found in the learners' final [+voice] obstruents than in those of monolinguals, although the reduction was not quite as great as in learners' Russian speech.

The Order analysis for voicing duration confirmed the effect of Voicing, F(1, 57.38) = 321.56, *p* < 0.001, and revealed a Voicing by Order interaction, F(1, 5970.59) = 19.89, *p* < 0.001, which, unexpectedly, was due to greater neutralization of the final voicing contrast in the English-first condition.

Interim summary: The results demonstrated that learners' English has drifted towards Russian norms of final voicing neutralization in terms of preceding vowel duration, closure duration, frication duration, and voicing during constriction of the final obstruents. In most cases, the magnitude of contrast was reduced compared to monolingual controls but the contrast itself was nevertheless maintained. Only in the case of frication duration did the contrast between voiced and voiceless consonants approach complete neutralization in the English speech of learners.

The effects of order of language elicitation were few and inconsistent. In the case of preceding vowel duration, eliciting Russian first increased the drift effect in English, while in the case of voicing during constriction, eliciting English first led to a greater drift effect in English.

**Figure 18.** Mean duration of voicing during constriction of final obstruents in monolingual controls' and learners' English; learners' Russian is provided for comparison.

#### 3.2.3. Individual Variability and Drift

The final obstruent measures which showed evidence of drift included preceding vowel duration, closure duration, frication duration, and duration of voicing during closure/frication. These parameters were evaluated for correlations between degree of individual drift and Russian pronunciation proficiency.

Russian pronunciation proficiency was evaluated objectively, as an individual acoustic difference between average voiced and average voiceless productions, and subjectively, as a self-reported speaking proficiency score and accentedness score.

The individual degree of drift was estimated by subtracting, for each participant, the average voiced-voiceless difference from the grand average monolingual voiced-voiceless difference (thus obtaining a 'difference of differences'). Similar calculations were conducted for vowel duration, closure duration, frication duration, and duration of voicing during closure/frication (in all calculations of the voiced-voiceless difference, a smaller value was always subtracted from a larger one, e.g., vowel duration before voiceless obstruents was subtracted from vowel duration before voiced obstruents but voiced closure was subtracted from the voiceless closure).

Calculated in this manner, the individual values for drift were larger if a participant produced a smaller amount of durational difference in a given parameter between the voiced and voiceless obstruent, i.e., if they drifted more in the direction of Russian final voicing neutralization, as compared to an average monolingual difference.
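The 'difference of differences' can be written out as a short sketch. The direction of subtraction for vowel duration and closure duration follows the text; the directions for voicing duration and frication duration are inferred from the group means reported earlier and are assumptions of this sketch, as are the function names and example values.

```python
# Parameters where the voiced context yields the larger value.
VOICED_MINUS_VOICELESS = {"vowel_duration", "voicing_duration"}
# Parameters where the voiceless context yields the larger value.
VOICELESS_MINUS_VOICED = {"closure_duration", "frication_duration"}


def contrast(parameter, voiced_mean_ms, voiceless_mean_ms):
    """Voiced-voiceless difference, oriented so the expected value is positive."""
    if parameter in VOICED_MINUS_VOICELESS:
        return voiced_mean_ms - voiceless_mean_ms
    if parameter in VOICELESS_MINUS_VOICED:
        return voiceless_mean_ms - voiced_mean_ms
    raise ValueError(f"unknown parameter: {parameter}")


def individual_drift(parameter, mono_voiced, mono_voiceless,
                     learner_voiced, learner_voiceless):
    """Grand monolingual contrast minus the learner's own English contrast.

    Larger values mean a smaller contrast for the learner, i.e. more drift
    towards Russian-like neutralization.
    """
    return (contrast(parameter, mono_voiced, mono_voiceless)
            - contrast(parameter, learner_voiced, learner_voiceless))


# Hypothetical means in ms, not data from the study.
vowel_drift = individual_drift("vowel_duration", 260.0, 160.0, 230.0, 170.0)
closure_drift = individual_drift("closure_duration", 80.0, 130.0, 85.0, 115.0)
```

Fixing the direction per parameter, rather than taking an absolute value per participant, keeps the sign informative: a learner whose contrast is larger than the monolingual average receives a negative drift value.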

Two-tailed Pearson correlational analyses revealed a significant negative relationship between individual drift in vowel duration (r = −0.539, *p* = 0.017), closure duration (r = −0.666, *p* = 0.002), and voicing duration (r = −0.463, *p* = 0.046) and the voiced-voiceless difference in these parameters in learners' Russian speech. In other words, those participants who drifted the most towards Russian neutralization norms in their English were also the ones who neutralized the same parameter more successfully in their Russian speech. Figure 19 illustrates the relationship for closure duration. Interestingly, in all three analyses, the participants who demonstrated the strongest drift, the greatest pronunciation gains, or both were consistently drawn from the four learners who had lived in Russia (#6, 21, 22, and 23; e.g., Figure 19).

The amount of drift in all of these parameters, except voicing duration, also correlated significantly with self-estimated Russian speaking proficiency (r = 0.526, *p* = 0.017 for vowel duration and r = 0.514, *p* = 0.020 for closure duration), such that participants with high self-estimated Russian proficiency were the ones who drifted towards Russian the most in their English. No measure of drift correlated with the self-estimated degree of accentedness in Russian.

**Figure 19.** Correlation between average individual L1 drift towards the reduction in voiceless-voiced closure duration difference in final stops/affricates and average individual difference between final voiceless and voiced closures in Russian (*y*-axis: larger values indicate greater drift towards Russian-like contrast neutralization; *x*-axis: larger values indicate greater distinction between voiced and voiceless obstruents). Labelled datapoints are participants who had lived in Russia.

Interim summary: For three acoustic correlates of the final voiced-voiceless distinction, correlational analyses revealed a significant dependency between the amount of L1 drift and the degree of pronunciation proficiency in L2, as evaluated objectively via acoustic measures. In two of these cases, a subjective measure of pronunciation proficiency, namely self-evaluated speaking proficiency in Russian along a one- to seven-point Likert scale, also correlated with the magnitude of drift, suggesting that, at least in some cases, such simple self-reported data align with the objective acoustic measures.

#### *3.3. Summary of the Results*

The comparisons between Russian and English productions of L2 learners demonstrated that learners, as a group, were attempting to implement the phonetic differences between the two languages (as they pertain to the implementation of voicing in particular) in their Russian productions. At the same time, although learners produced acoustically distinct targets in their two languages, their Russian productions were still quite distinct from the Russian norms.

Despite the relatively modest gains in Russian pronunciation abilities, learners also demonstrated discernable effects of exposure to Russian in their English speech. Comparisons with a monolingual control group revealed significant differences in the majority of examined acoustic parameters. Most differences were compatible with the effect of Russian in the direction of convergence. Additionally, the individual magnitude of L1 drift in the direction of Russian norms was, in many cases, correlated with the degree of pronunciation gains in Russian speech: participants who could be considered more proficient speakers of Russian (as evaluated using acoustic measurements or self-reported speaking proficiency scores) demonstrated the greatest degrees of L1 drift.

#### **4. Discussion**

The present study examined native and second language speech of American learners of Russian in order to determine whether classroom exposure to L2 can lead to phonetic changes in learners' L1.

First, to confirm that classroom exposure to L2 resulted in phonetic learning, as evidenced by effective separation of L1 and L2 systems in the speech of the participants, and to evaluate the degree of this learning, we conducted a comparison between the acoustics of learners' Russian and English sounds. The results indicated that, for almost every measure taken, learners produced statistically distinct values in Russian and English. In Russian, their word-initial voiceless stops had shorter VOTs, their word-initial voiced stops had longer prevoicing, the frequency of prevoiced stops was higher, and prevoiced stops were characterized by extra-low onset f0, compared to English. Learners' word-final voiced obstruents were also partially devoiced in Russian, compared to English. All of these modifications were in the direction of approximating native Russian norms: short lag voiceless stops, robustly and near-exclusively prevoiced voiced stops, lower onset f0 after prevoiced stops, and word-final devoicing.

The fact that distinct productions were obtained across learners' L1 and L2 indicates that even at these relatively early stages of learning, taking place while immersed in L1, participants grasped the phonetic differences between similar phones across the languages and were attempting to implement them in their L2 speech. These results fit in with a wide array of similar findings for bilinguals and L2 learners, demonstrating not merged but distinct productions across the two languages (Baker and Trofimovich 2005; Flege and Eefting 1987a, 1987b; Fowler et al. 2008).

At the same time, it is clear that these learners were not near-native-like in their L2 pronunciation by any measure. Their initial [−voice] stops were too aspirated and their initial [+voice] category was still dominated by short lag productions, instead of prevoiced ones. Their final [+voice] obstruents were also only slightly devoiced, rather than fully devoiced, as would be expected in Russian speech. Therefore, using these acoustic measures, we can conclude that, at least with respect to pronunciation, these learners were not highly proficient/advanced in their L2.

The question remains whether we can expect L2-driven phonetic changes in the learners' native speech for these non-advanced speakers immersed in the L1. The answer given by the present results is 'yes'. A comparison of learners' English to the native English monolingual group revealed acoustically subtle but statistically significant differences, all compatible with the influence of Russian. Learners' initial [−voice] stops were characterized by shortened VOTs, indicating assimilation with Russian. Comparison of learners of Russian with the monolingual group also revealed a tendency to reduce the magnitude of the phonetic contrast between final voiced and voiceless obstruents. This reduction was implemented via partial devoicing of [+voice] final obstruents and is compatible with the effect of the Russian final devoicing process.

This finding suggests that L2 phonological rules can trigger phonetic changes in speakers' L1. The present outcome may also have been helped by the fact that American English is already gravitating towards variable final devoicing, most strongly in fricatives (Davidson 2016), thus facilitating this particular back-transfer from Russian. It is interesting that there was no significant frication duration difference between voiced and voiceless fricatives and affricates in learners' English. Thus, if learners have 'drifted' all the way into Russian-like complete neutralization (with respect to this parameter), then it was only in segments that are especially prone to devoicing in their L1. This finding warrants further research in order to learn more about the conditions under which phonological processes may seep from bilinguals' L2 into their L1.

Interestingly, English [+voice] stops were not affected, neither short lag nor prevoiced ones. Only the frequency of prevoicing showed a change, notably, in the direction of dissimilation from Russian. A similar effect was reported by Huffman and Schuhmann (2016), who indicated that some American learners of Spanish produced fewer or no prevoiced stops in their English after 6 weeks of classroom Spanish instruction. This result merits attention because it demonstrates that L1 phonetic changes in the direction of dissimilation from L2 are possible even at the beginning stages of L2 acquisition (a possibility not provided for in SLM, which predicts that only advanced learners will dissimilate after having created separate categories for L2 sounds). Moreover, this finding indicates that the dissimilatory or assimilatory direction of crosslinguistic interaction can be determined not only by the stage of L2 acquisition but also by the sound category itself. Specifically, the present data suggest that L1 may tend towards dissimilation with L2 when L1 offers a choice of sub-phonemic variants of the same category, only one of which is used in L2 to represent the same phonological category.

Overall, phonetic parameters affected by L1 drift were a subset of those used by participants to differentiate their L1 and L2, suggesting that L2 phonetic learning is a natural precursor of L1 drift.

The present evidence of L2-to-L1 phonetic effects in American learners of Russian indicates that even relatively dissimilar language pairings are subject to such phonetic interactions. Assimilatory changes in the acoustics of English obstruents suggest that despite great linguistic differences between the two languages overall and the use of different orthographic symbols for these sounds, Russian and English segments influenced each other. Orthography has been shown to play a powerful role in adult second language learning, which relies greatly on literacy and orthographic input, unlike first language acquisition (Bassetti and Atkinson 2015; Bassetti et al. 2015; Hayes-Harb et al. 2018). Nevertheless, the present study shows, in agreement with previous research on dissimilar language pairs such as English and Korean, that orthographic differences are not insurmountable obstacles for equivalence classification even in highly literate adult language learners. Equivalence classification between English and Russian obstruents, leading to cross-language assimilatory changes, could be facilitated by similarities in the phonological functioning of these segments in the respective languages. The two languages have similar sets of voiced and voiceless obstruents, which contrast initially and intervocalically but assimilate in voicing when in clusters and devoice, to different degrees, when in final position. Additionally, equivalent phonological categories across these two languages can be realized in phonetically identical ways, albeit in different contextual environments, e.g., English [−voice] stops can be implemented with short lag VOT word-medially before unstressed vowels, similarly to Russian [−voice] stops.

The fact that L2-to-L1 phonetic effects were detected for traditional classroom learners indicates that L2 immersion is not a necessary condition and that the amount of L2 experience and exposure received via classroom instruction can be sufficient to trigger such changes. Our participants did produce an acoustic distinction between L1 and L2 obstruents as a group, which suggests that this degree of phonetic learning may be required for L1 drift to initiate. Conversely, advanced pronunciation proficiency is most likely not a prerequisite for L1 drift. Nevertheless, prior L2 immersion, which was not ongoing at the time of participation and was limited to 2 to 4 months, appeared to intensify the degree of L1 drift for these participants in some measures, in parallel with improving the authenticity of their L2 pronunciation.

Moreover, these findings strongly suggest that a significant reduction in the use of L1 is also not a necessary condition for L1 phonetic drift. It is very unlikely that participants in the present study experienced a substantial reduction in the amount or quality of L1 use as a result of studying Russian at the university. It is also unlikely that they were exposed to much spoken Russian through overhearing (concomitantly, also reducing the 'overhearing' of English), the way immersed learners are. Thus, even in the absence of a considerable reduction in L1 use, the first language can and does drift towards the phonetics of L2 in comparable sound categories. To complement this finding, other work has demonstrated that even in bilingual or immigrant settings, where L1 use reduction is likely, L1 use does not always correlate with the quality of L1 pronunciation (Guion et al. 2000; Hopp and Schmid 2013). The diversity of results with regard to the effect of L1 use on L1 speech indicates that its role is not fully understood and merits further attention.

Additionally, L1 drift in the present work was not the result of an immediate 'spill-over' effect from one language to another. The order of Russian and English elicitation was counterbalanced, and the analysis of elicitation order effects showed that order did not condition the presence of the L1 drift, although it could sometimes increase its magnitude.

Although it is unlikely that learners' L1 use was substantially decreased by their enrollment in Russian courses, another possibility is that L1 inhibition, not L1 use reduction, is what paved the road for L2-to-L1 phonetic effects. Some authors have argued that successful L1 inhibition is important for effective L2 learning. For example, Linck et al. (2009) showed that immersed L2 learners fared better in acquired L2 proficiency but worse in L1 lexical retrieval than classroom learners, and argued that greater L1 inhibition in immersion settings was responsible. Moreover, Levy et al. (2007) demonstrated that even a short laboratory training session was enough to trigger L1 inhibition effects in lexical retrieval. This suggests that a relatively limited time of classroom L2 learning may also be sufficient to trigger L1 inhibition and, therefore, L1 drift. Furthermore, if laboratory training can induce L1 inhibition and L1 inhibition can trigger L1 drift, we could expect L2-to-L1 phonetic effects under laboratory conditions. This is precisely what was demonstrated by Kartushina et al. (2016a), who showed drift in L1 vowels towards similar non-native ones after short-term visual articulatory feedback training. Nevertheless, further research is necessary to fully understand the role of L1 inhibition in the susceptibility of the L1 sound system to the influence of the L2.

A related issue is the longevity of L1 drift observed in laboratory conditions, after a short L2 immersion, or in the course of classroom L2 acquisition. It is rather plausible that such effects may be short-lived. In fact, Kartushina and Martin (2019) showed that, four months after studying abroad, participants experienced a 'return drift' towards native-like phonetics in their L1 (see also Chang (2019b) for similar findings). It is possible that our learners would lose the effects of Russian on their L1 if or when they discontinue their Russian studies. Such short-term phonetic changes in L1 may be qualitatively different from language attrition, which is believed to develop over longer periods of time and have greater strength of 'inertia' in resisting the 'rebound' back towards native-like values when speakers return to L1 immersion and no longer actively use their L2 (Chang 2019a). Additionally, the 'novelty' effect may boost the cross-language drift at the early stages of language acquisition (Chang 2013). Ultimately, the observation that the L1 can respond flexibly and intricately to the changing circumstances of learners' language use and environment demonstrates its great plasticity and adaptability and argues against maturation-related limits on phonetic learning.

Finally, there are a number of factors that we could not address in the present study but that merit serious attention in future research. Among those is the role of exposure to a non-native-like L1 in triggering L1 phonetic drift, as well as factors such as motivation for learning and attitudes towards the L2 and its associated culture.

An implicit assumption in previous research examining L2-to-L1 phonetic effects has been that it is the exposure to and use of L2, per se, that triggers L1 phonetic drift. This assumption is supported in the current study by the observation that L1 drift co-occurred with a degree of phonetic learning in L2 sufficient to produce two distinct phonetic systems. However, under many scenarios of L2 learning and use, learners are also exposed, to varying degrees, to a non-native-like and accented L1. In the present case, all instructors in the Russian courses attended by our learners were native speakers of Russian. An informal survey of Russian instructors at Purdue University indicated that during class students may be addressed in English anywhere between 15% and 70% of the time, depending on the course level and individual proficiencies of class participants. This suggests that, especially at the beginner level of Russian instruction, learners are exposed to a considerable amount of Russian-accented English speech. Some acoustic characteristics of Russian-accented English would be very similar to the ones observed in native English that has drifted towards Russian (e.g., shorter voiceless VOTs of initial stops, partial or complete devoicing of final obstruents).

At present, we know very little about the role that such exposure may play in the development of apparent L1 drift. Nevertheless, some research suggests that L2-accented L1 input may contribute to non-native-like L1 productions in bilinguals. For example, Mora and Nadeu (2012) and Mora et al. (2015) suggest that the partial merging of Catalan /e/ and /ε/ vowels in the speech of their participants could be due, in part, to their exposure to Spanish-accented Catalan (see also Sebastián-Gallés et al. 2005). If L1 drift in classroom language learners is guided, in part or primarily, by the Russian-accented English input provided by their instructors, drift may develop as L1 phonetic accommodation to the instructor. In this case, it could be impacted by factors shown to mediate accommodation, such as considerations of dominance, prestige, and speakers' disposition and attitudes towards each other, and language distance (Babel 2012; Babel et al. 2014; Kim et al. 2011; Pardo 2006; Pardo et al. 2012; Pardo et al. 2013; Yu et al. 2013).

Additionally, learners' motivations for studying the chosen language and their global attitudes towards the associated culture and native speakers of their L2 could further mediate the propensity to drift. Previous research has shown that language attitudes and considerations of prestige can influence cross-language interaction in bilinguals (Gatbonton et al. 2005; Gatbonton et al. 2011; Giles et al. 1977; Law et al. 2019), but considerably more work is needed to determine the role of such factors in L2-to-L1 phonetic effects.

#### **5. Conclusions**

The study presented here examined the acoustic characteristics of word-initial and word-final obstruents in the English and Russian speech of Americans studying Russian as a second language in the traditional college classroom. The results demonstrated that learners' native English productions were acoustically distinct from those of monolingual speakers of American English, primarily in the direction of assimilation to Russian acoustics. We interpret these results to indicate that neither advanced mastery of another language nor extensive long-term exposure to L2 is necessary for the phonetics of a second language to affect the first language of learners. The current results also suggest that a reduction in the use of L1, which typically accompanies L2 immersion, is likewise not a necessary condition for L1 drift. Overall, the results demonstrate that the acoustics of L1 production can undergo subtle but systematic adjustments as a result of new linguistic experiences such as second language learning in classroom settings.

**Author Contributions:** Conceptualization, O.D., A.J. and J.A.S.; methodology, O.D.; formal analysis, O.D.; investigation, O.D., A.J. and J.A.S.; data curation, O.D.; writing—original draft preparation, O.D.; writing—review and editing, A.J. and J.A.S.; visualization, O.D.; supervision, O.D.; project administration, O.D., A.J. and J.A.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to acknowledge the contributions of undergraduate research assistants at Purdue University: Alyssa Nymeyer, Audrey Bengert, Bethany Sexton, Emilie Watson, and Alexis Tews. We would also like to thank Russian language instructors at Purdue University, Olga Lyanda-Geller and Amina Gabrielova, for assistance in recruiting participants and providing information about teaching practices.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** English minimal pairs with voiced and voiceless plosives in word-initial position.


**Table A2.** English minimal pairs with voiced and voiceless obstruents in word-final position.



**Table A3.** Russian minimal pairs with voiced and voiceless plosives in word-initial position.

**Table A4.** Russian minimal pairs with voiced and voiceless obstruents in word-final position.


#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
