*Article* **Extraction from Present Participle Adjuncts: The Relevance of the Corresponding Declaratives**

**Andreas Kehl**

English Department, Universität Tübingen, Wilhelmstraße 50, 72074 Tübingen, Germany; andreas.kehl@uni-tuebingen.de

**Abstract:** In this article, I will argue that many of the theoretical approaches to extraction from participle adjunct islands suffer from the fact that the focus of investigation lies on perceived grammaticality differences in interrogative structures. Following approaches which make an explicit connection between extraction asymmetries and properties of the underlying proposition, I will argue that there is good evidence for the existence of similar differences in declarative adjunct constructions which can explain most of the grammaticality patterns observed for interrogatives. A crucial difference from the majority of previous theories is the focus on acceptability rather than grammaticality, and the assumption that acceptability in declaratives is determined by a variety of semantic and syntactic complexity factors which do not influence how strongly extraction degrades the structure. This line of argumentation is more compatible with approaches to island phenomena that explain the low acceptability of some extractions by independent effects such as processing complexity and discourse function instead of syntactic principles blocking the extraction. I will also discuss a partially weighted, multifactorial model for the acceptability of declarative and interrogative participle adjunct constructions, which explains the judgment patterns in the literature without the need for additional, complex licensing conditions for extraction.

**Keywords:** adjunct islands; *wh*-extraction; locality; present participle; gradient acceptability; acceptability model

#### **1. Introduction**

Since the formulation of the Condition on Extraction Domain (CED, Huang 1982), more and more apparent counterexamples to this strict locality condition have surfaced, including extractions from subjects and adjuncts that are judged as grammatical. Compare the ungrammatical extraction from an adverbial clause in (1a) with the extraction from an adjunct headed by a present participle in (1b), which is considered grammatical; the (participial) adjunct predicate is shown in square brackets in most of the examples used in this article. An acceptable extraction from subject is shown in (2).

	- b. What*<sup>i</sup>* did John arrive [whistling t*i*]? (Borgonovo and Neeleman 2000, p. 200)

(Hofmeister and Sag 2010, p. 370)

Attested examples of extraction from participle adjuncts, as in (1b), are often found in the form of relativization, as in (3) from Santorini (2019) and (4) from a news article; in these cases, it is a nominal element that is associated with a gap site in the complement position of a participle adjunct instead of a *wh*-pronoun.

**Citation:** Kehl, Andreas. 2022. Extraction from Present Participle Adjuncts: The Relevance of the Corresponding Declaratives. *Languages* 7: 177. https://doi.org/ 10.3390/languages7030177

Academic Editors: Anne Mette Nyvad and Ken Ramshøj Christensen

Received: 22 December 2021 Accepted: 15 June 2022 Published: 8 July 2022


**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

	- b. "This is the game I grew up [watching \_\_ ]," Wilson added.

(Santorini 2019)

(4) Already uncomfortable with the policy—the USD 1.3 trillion legislation includes massive spending hikes that contradict prior GOP complaints about the debt—many Republicans were left dumbfounded by [a process]*<sup>i</sup>* that looked a lot like one they had won office [criticizing \_\_*<sup>i</sup>* ]. (Washington Post, 22 March 2018)

I follow Truswell (2011, p. 30) in referring to adjunct constructions, such as (1b), (3), and (4) as Bare Present Participle Adjuncts (BPPA); they are characterized by an untensed present participle as the head of the adjunct predicate, as well as the absence of an explicitly encoded subject or subordinators.

However, not all BPPA constructions allow extraction as easily as (1b): minimally different examples, such as (5), are reported to block the extraction.

(5) \*What*<sup>i</sup>* did John dance [imagining t*i*]? (Borgonovo and Neeleman 2000, p. 199)

There are, thus, two problems to be addressed: (i) apparently grammatical extractions from adjuncts that should be excluded by the CED (1a vs. 1b), and (ii) variation within an adjunct type where some extractions are allowed while others are not (1b vs. 5). The first problem has been addressed in the Minimalist literature in two ways: first, by abandoning or modifying the original formulation of the CED to accommodate such cases,<sup>1</sup> and second, by reconsidering the adjunct status of apparent counterexamples to the CED (e.g., Graf 2015). These approaches still assume that there is a syntactic principle at work which determines when extraction is possible. A major alternative to this syntactic perspective is taken by approaches which do not assume a syntactic principle behind extraction asymmetries, but rather appeal to more general principles. This includes approaches based on processing (e.g., Sag et al. 2008; Hofmeister and Sag 2010), information structure (Goldberg 2006, 2013), pragmatics (Chaves and Putnam 2020), and discourse functions (Abeillé et al. 2020). Such approaches line up with the Radical Unacceptability Hypothesis proposed in Culicover et al. (2022), to which I return at the end of this article.

I will focus on the second problem in the rest of this article.<sup>2</sup> For reasons of space, the discussion focuses on examples with *wh*-extraction, but it should be kept in mind that there is growing evidence that different types of filler–gap dependencies yield different effects, so that so-called island constraints do not appear to be cross-constructionally active (Liu et al. 2022); see also Sprouse et al. (2016) and Abeillé et al. (2020) for findings and discussion, as well as Kehl (2021, experiment 1) for a comparison between declarative, interrogative, and relativized BPPA constructions. Differences between types of extractions become all the more relevant since much of the existing literature focuses on *wh*-extraction, whereas many attested examples are instances of relativization (Chaves and Putnam 2020; Santorini 2019). I will briefly address other dependency types at the end of Section 5.

The variation in the extraction behavior of interrogative BPPA constructions has resulted in several approaches that try to find an explanation for such patterns; the influential theoretical approaches in Borgonovo and Neeleman (2000) and Truswell (2007, 2011) propose licensing conditions to accommodate this island-internal variation. In this article, I will follow the discussion in Brown (2017) and Kehl (2021), arguing that the *grammaticality* patterns observed for extraction from BPPA constructions are actually a reflection of different degrees of *acceptability* which are already observable in the declarative counterparts and evoke the impression of grammaticality differences in interrogatives. Thus, I assume that the acceptability difference between (1b) and (5) is equal to that between the declaratives in (6):

	- b. John danced [imagining the Gobi Desert]. (Borgonovo and Neeleman 2000, pp. 199–200)

By extension, I also assume that the same acceptability contrast between verbs such as *arrive* and *dance* is visible in other dependency types, such as the relative clauses in (7), which are more similar to the attested data in (3) and (4) above:

	- b. This is [something]*<sup>i</sup>* that John danced [imagining \_\_*i*].

The basic idea behind this assumption, which I will argue for in the remainder of this article, is that differences between the two main verbs *arrive* and *work* result in different degrees of acceptability independently of whether the sentence form is declarative, relativization, or interrogative. In other words, the relative acceptability of the declaratives is a good predictor for the relative acceptability of the other sentence forms; see also Chaves and King (2019), who find a relation between judgments of relevance and acceptability of subextraction from subjects. This line of research shifts the focus of attention to the semantic and/or pragmatic factors which affect acceptability in the underlying declarative structures. This comparison of extraction from island constructions to possible differences in the underlying declaratives ties into the growing body of research that does not focus on extraction constructions alone (among others, Abeillé et al. 2020; Brown 2017; Chaves and King 2019; Chaves and Putnam 2020). The relevance of drawing on more subtle differences in declaratives to explain differences at the fringes of grammaticality in extraction structures goes back to at least Kuno (1987); the idea is picked up prominently in the pragmatic approach to extraction asymmetries in Chaves and Putnam (2020), as well as in the discussion of complexity differences in Culicover and Winkler (2022).

The discussion in this article centers on the question of whether extraction asymmetries observed for BPPAs need to be captured by a grammatical principle, as proposed in Borgonovo and Neeleman (2000) and Truswell (2007, 2011), or whether these asymmetries can be explained independently. I argue for the second position and discuss the possibility of capturing judgment differences, such as (1b) vs. (5); the underlying idea is that the semantic compatibility between the two predicates in this construction affects acceptability both in the presence and absence of a dependency such as *wh*-extraction. The discussion of this narrow set of examples is closely related to the more general proposal in Culicover and Winkler (2018, 2022) and Culicover et al. (2022) that many instances of such judgment differences in extraction phenomena can be accounted for without the need to introduce grammatical principles.

This article is structured as follows: Section 2 provides a short summary of the grammaticality patterns reported in Borgonovo and Neeleman (2000) as well as Truswell (2007) as a basis for the remainder of the discussion. Section 3 discusses the relation between the concepts of grammaticality and acceptability, as well as the potential mapping problems between gradient and binary judgments; it also suggests a factorial design for the detection of island-internal variation that allows for an experimental validation of factors that are assumed to influence how strongly extraction affects different types of declaratives. Section 4 examines previous experimental studies which compare declarative and interrogative BPPA constructions and asks whether their results speak for or against the conclusions in the theoretical literature. Section 5 then discusses factors which influence the acceptability of declarative BPPA constructions independently of extraction and combines these factors into an acceptability model for declarative and interrogative BPPA constructions. Section 6 takes a brief look at evidence from related phenomena which also suggests that differences in declaratives have an impact on theory development. Section 7 concludes this article.

#### **2. Reported Grammaticality Patterns**

In this section, I will summarize the reported grammaticality patterns for extraction from participle adjuncts in two influential accounts: Borgonovo and Neeleman (2000) and Truswell (2007).<sup>3</sup> Both accounts share the intuition that different grammaticality patterns exist in interrogatives which are not present in declaratives; this leads them to propose additional licensing mechanisms for extraction to accommodate these interrogative patterns. I will not go into the technical details of these accounts for reasons of space and because the focus of this article is on the relation between declaratives and interrogatives instead of the licensing mechanisms they propose. As I will show in Section 3, such a comparison uncovers problematic aspects of these accounts.

#### *2.1. Transparency Depends on Verb Types*

Borgonovo and Neeleman (2000) report on a grammaticality pattern that allows extraction from participial adjuncts modifying unaccusative and reflexive transitive main verbs, as in (8a) and (8b); in contrast, extraction from adjuncts modifying unergative and non-reflexive transitives, as in (8c) and (8d), results in ungrammaticality.


The main proposal resulting from this pattern is that some verb types are able to L-mark adjuncts by means of a syntactic reflexivity relation where the internal argument DP binds both the *θ*-roles of the adjunct predicate and the main verb. The adjunct then counts as L-marked and satisfies the CED because it is properly governed. Only unaccusatives and reflexive transitives are able to L-mark the adjunct because the right structural configuration is only possible with an internal argument that is also the external argument of the adjunct predicate. Unergatives fail to L-mark the adjunct because they do not have an internal argument and do not project the necessary V′-layer (Borgonovo and Neeleman 2000, pp. 212–13); L-marking is not possible for non-reflexive transitives because the external argument of the adjunct is also the external argument of the main verb. In both cases, extraction is banned by the CED because the adjunct is not L-marked.

Crucially, L-marking is a condition that is specific to the licensing of extraction: it does not have an effect in declaratives because it is irrelevant there. Therefore, the declarative sentences in (9) underlying the interrogatives in (8c) and (8d) are completely unmarked.

	- b. John hurt Bill [trying to fix the roof].

(Borgonovo and Neeleman 2000, pp. 199–200)

Because declarative BPPA constructions are unconstrained in terms of grammaticality differences, the ungrammaticality of the interrogatives is caused by the extraction operation itself, which fails to be licensed if L-marking cannot be established for unergatives and non-reflexive transitives. The required adjustments to subjacency-based locality theory are modest and can be expressed in core-syntactic terms, even if the theory requires ternary branching to establish syntactic reflexivity between the verb, its internal argument, and the adjunct predicate. Still, a major problem with this account is that it does not consider any potential variation in the declarative counterparts and exclusively relies on extraction-related factors to explain the pattern in interrogative BPPA constructions.

#### *2.2. Transparency Depends on Telicity*

A slightly different pattern is described in Truswell (2007), who focuses on the event structure of the BPPA construction. The key proposal is that extraction from an adjunct predicate is only licensed in the grammar if the adjunct fills an open or underspecified event position in the event structure of the matrix predicate. This means two things: (i) the matrix predicate needs to encode at least two subevents, and (ii) one of these is underspecified by the lexical semantics of the matrix predicate. The two event types that encode more than one subevent are achievements and accomplishments in terms of Vendler (1957); they are composed of a culmination point and a durative subevent leading up to this endpoint, which is optional for achievements; see Rothstein (2004). States and activities, on the other hand, either encode no event at all (states) or only a single subevent (activities). In case the adjunct can be interpreted as supplying more information about the underspecified subevent, the two predicates describe facets of a single event, mirroring the lexical semantics of a maximally complex verb (Truswell 2007, p. 1369). This amounts to the generalization that extraction from the adjunct is only possible if the matrix predicate is telic; this derives the predictions for the contrast in (10) with the atelic verb *work* (10a) and the telic *arrive* (10b):


These predictions are similar to those in Borgonovo and Neeleman (2000), but formulated in event-semantic terms, which are not exclusively tied to extraction. In addition to achievement matrix predicates, such as (10b), extraction is also possible from structures with accomplishment main verbs, such as in (11), provided that the adjunct can describe the cause of the matrix predicate:

(11) What did John drive Mary crazy [trying to fix t]? (Truswell 2007, p. 1356)

Like Borgonovo and Neeleman (2000), Truswell (2007) concludes that the corresponding declaratives in (12) do not show a similar pattern and that the grammaticality pattern in interrogatives is the result of extraction. Both accounts do consider declarative counterparts with respect to their grammaticality, but do not observe significant differences in acceptability.<sup>4</sup>

	- b. John arrived [whistling a song].

(see Truswell 2007, pp. 1369, 1373)

This means that the syntactic extraction operation needs to be sensitive to the distinctions between different event types, but also to the lexical semantics of the two predicates, as well as potential causal chains between them. Unless information about the aspectual type and causality is directly encoded syntactically, as, for example, in Borer (2005) and Ramchand (2008), this extraction pattern is impossible to explain in core syntactic terms. It is not an immediate problem that Truswell (2007) considers both sentences in (12) grammatical, but this focus on grammaticality requires the formulation of extraction conditions in event-semantic terms (or a post-syntactic event-semantic output filter, as suggested in Truswell 2011).

Both accounts sketched in this section agree that declarative BPPA constructions are relatively unconstrained with respect to grammaticality differences and that the pattern in interrogatives is a direct result of failures in the licensing mechanism for extraction. I will argue in the following section that this perspective overestimates the reported grammaticality differences in interrogatives, and at the same time underestimates potential differences in the declarative counterparts. The main reason for these problematic aspects is rooted in the distinction between the concepts of grammaticality and acceptability, as well as the relation between gradient and binary judgments.

#### **3. Grammaticality, Acceptability, and the Relation between Declaratives and Interrogatives**

In this section, I will discuss problematic aspects of the exclusive focus on *grammaticality* differences in interrogatives without also considering potential *acceptability* differences in the declarative counterparts. The problem is one of mapping relations between binary grammaticality judgments and gradient acceptability judgments, because sentences that receive the same binary grammaticality marking may still show significant differences in acceptability that are not properly represented in all grammaticality judgments. For example, it is reasonable to consider both examples in (13) grammatical, but experimental evidence suggests that (13a) is less acceptable than (13b). Among others, the lower acceptability and negative impact on online sentence processing of additional arguments are shown in Jurka (2010, 2013), Polinsky et al. (2013), Brown (2017), and Culicover and Winkler (2022). An additional issue in (13a) is that there is a degree of ambiguity as to whether the adjunct refers to John or Bill. In connection with syntactic dependencies, the greater processing cost and, thus, reduced acceptability are predicted by Dependency Locality Theory (Gibson 1998, 2000); see also Section 4.<sup>5</sup>

	- b. John arrived [whistling the Blue Danube].

(Borgonovo and Neeleman 2000, p. 200)

Subsection 3.1 describes the contrast between binary grammaticality and gradient acceptability judgments, as well as their relation in more detail; the focus here is on which conclusions can be drawn from these two measurements and the risk of not distinguishing between them properly. Subsection 3.2 proposes an adapted factorial experiment design that allows for the investigation of island-internal variation which includes a comparison to the declarative base position. Subsection 3.3 emphasizes the usefulness of including standardized reference fillers in acceptability judgment tasks for conceptual and methodological reasons.

#### *3.1. Gradient and Binary Judgments*

One of the core issues in the evaluation of the theoretical approaches in Borgonovo and Neeleman (2000) and Truswell (2007, 2011) lies in the distinction between the concepts of *grammaticality* and *acceptability* discussed in Chomsky (1965). Chomsky (1965) models this distinction as one between *competence* and *performance*: the former refers to those aspects of language that are part of a speaker's grammar, whereas the latter reflects the use of language that is also affected by other factors. Grammaticality is seen as a measure of whether a sentence is licensed by the grammar; this evaluation has often been considered a categorical distinction, even though Chomsky (1965, p. 11) already notes that it is probably "a matter of degree". Acceptability as a measure of naturalness and comprehensibility does not solely depend on grammaticality, but grammaticality is one of the factors that determine acceptability: a sentence that is considered grammatical can still show low acceptability because it is semantically or pragmatically anomalous, or because it is difficult to process (Chomsky 1965, p. 11). Ungrammaticality refers to the fact that a given structure cannot be computed by the grammar, or runs afoul of conditions at the interfaces, for example because not all uninterpretable features are checked and deleted during the derivation. Acceptability is partially fed by grammaticality, but also affected by additional factors that are independent of grammaticality: as is well known, there are sentences which can be generated by the grammar but are semantically and/or pragmatically anomalous, or pose processing difficulties that impact acceptability judgments. On the other hand, there are sentences which are grammatically ill-formed but appear intuitively acceptable, a phenomenon called 'illusions of grammaticality' in Phillips (2013, p. 106).

There is, thus, a mapping problem between grammaticality and acceptability because not all sentences that are considered grammatical are necessarily equally acceptable, as also noted in Chomsky (1965, p. 11). Especially problematic are cases where acceptability lies near the threshold of grammaticality: minimally different acceptable sentences run the risk of being assigned opposite grammaticality judgments, even if the relative distance in acceptability between them is smaller than the distance between two fully grammatical or ungrammatical sentences. I will elaborate on this problem in the remainder of this subsection.

Consider the two declarative BPPA constructions in (14), with an unaccusative (14a) and an unergative (14b) matrix predicate. The predictions of Truswell (2007) and Borgonovo and Neeleman (2000) agree on the fact that extraction from the adjunct in (14a) will be grammatical, whereas extraction from (14b) will not.

	- b. John worked whistling a funny song.

Let us assume a gradient Likert-type judgment scale with seven discrete points, and a binary categorization into grammatical and ungrammatical sentences. Let us also assume that gradient judgments on or above the middle of the gradient scale, i.e., every gradient judgment ≥ 4, will be mapped to the binary judgment 'grammatical', and that gradient judgments < 4 will be mapped to 'ungrammatical'. Thus, if (14a) is assigned a gradient judgment of 7 and (14b) a judgment of 5, both will be mapped onto a grammatical binary judgment; this is shown in (15).

	- a. (14a) → gradient judgment: 7 → binary judgment: grammatical
	- b. (14b) → gradient judgment: 5 → binary judgment: grammatical

This is, in essence, what Borgonovo and Neeleman (2000) and Truswell (2007, 2011) assume about declarative BPPA constructions, with a focus on the outcome of the binary grammaticality judgment. For now, it is not immediately relevant why (14b) should be less acceptable on a gradient scale compared to (14a). The data reported in Brown (2017) and Kehl (2021) support the assumption that there is a statistically significant acceptability difference between the two, even if this difference might not be as pronounced as in this hypothetical example.
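The hypothetical threshold mapping can be made concrete with a short sketch; the function name, the 1–7 scale, and the midpoint threshold are illustrative assumptions mirroring the example above, not part of any cited experimental procedure:

```python
# Illustrative sketch (hypothetical, not from the cited studies): mapping
# gradient 1-7 Likert judgments onto binary grammaticality via a threshold
# at the midpoint of the scale.

def to_binary(gradient_judgment, threshold=4):
    """Judgments >= threshold map to 'grammatical', all others to 'ungrammatical'."""
    return "grammatical" if gradient_judgment >= threshold else "ungrammatical"

# The hypothetical judgments of 7 for (14a) and 5 for (14b) both map to
# 'grammatical'; the two-point gradient difference is lost in the mapping.
print(to_binary(7), to_binary(5))  # grammatical grammatical
```

The sketch makes the information loss explicit: any gradient contrast between two judgments on the same side of the threshold disappears in the binary outcome.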

When extraction takes place from the adjunct, the gradient judgment will decrease for both structures because the formation and resolution of filler–gap dependencies is a cognitively costly operation and because interrogatives are semantically more complex than declaratives (Chaves and Putnam 2020; Hofmeister and Sag 2010; Wagers 2013). Since the extraction domain is an adjunct, this judgment decrease will probably be larger compared to extraction from a subcategorized complement, as predicted by the CED.<sup>6</sup> The interrogative counterparts of (14) are shown in (16), without judgment marks.

	- b. What did John work whistling?

A final assumption made here, again supported by the experimental evidence in Brown (2017) and Kehl (2021), is that both structures are affected to the same degree by extraction, meaning that the decrease in the gradient judgment is identical; the gradient judgment for (16b) will then fall below the threshold in the middle of the scale, resulting in an ungrammatical binary judgment. For (16a), the gradient judgment remains on or above the threshold, yielding a grammatical binary judgment; this is shown schematically in (17).

	- a. (16a) → gradient judgment: 5 → binary judgment: grammatical
	- b. (16b) → gradient judgment: 3 → binary judgment: \*ungrammatical

On the surface, this results in exactly the grammaticality patterns constructed in Borgonovo and Neeleman (2000) and Truswell (2007). However, what we are mostly interested in is whether the relative differences in gradient judgments between the two sentence pairs are identical or whether one is larger than the other. To check for this, we subtract the two declarative judgments from one another and compare the result to the same difference between the interrogative counterparts. If the two differences are equal (or at least not significantly different), then there is no need for additional licensing mechanisms for extraction because the gradient judgment differences in interrogatives can be predicted from the differences in declaratives in a linear additive way. This is shown in (18a) and is the simpler case: the only explanation required concerns what causes the differences in declaratives, plus the independent decrease caused by extraction. If, on the other hand, the relative differences are of different magnitudes, as shown in (18b) and (18c), then this additional difference requires an explanation that cannot be predicted from the gradient contrasts in declaratives. These patterns can be referred to as superadditive and subadditive. Depending on whether the difference between declaratives is smaller than that between interrogatives, as in (18b), or the other way round, as in (18c), this leads to the need for additional licensing mechanisms or a repair mechanism, respectively.


This metric is similar to the differences-in-differences score employed in Sprouse et al. (2012, 2013), which isolates the effect sizes of individual factors and evaluates whether the combination of two factors negatively impacts acceptability scores to a greater (or lesser) degree than the two individual factors.
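The subtraction logic behind this score can be sketched numerically; all judgment values and the function name are invented for illustration and do not correspond to reported experimental results:

```python
# Hypothetical sketch of the differences-in-differences comparison described
# above; none of the numbers are experimental data.

def dd_score(decl_high, decl_low, int_high, int_low):
    """Compare the declarative contrast with the interrogative contrast.
    0 -> linear additive (18a); > 0 -> superadditive (18b); < 0 -> subadditive (18c)."""
    return (int_high - int_low) - (decl_high - decl_low)

# Linear additive: extraction lowers both structures by the same two points.
assert dd_score(7, 5, 5, 3) == 0
# Superadditive: the dispreferred structure drops further under extraction.
assert dd_score(7, 5, 5, 2) > 0
```

A score of zero means the interrogative contrast is fully predictable from the declarative contrast plus a constant extraction cost, so no additional licensing mechanism would be required.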

Figure 1 illustrates the first two possibilities in (18): the left panel corresponds to (18a) where the gradient judgment differences are identical for both types of matrix predicate; the right panel shows the pattern where the decrease caused by extraction is larger for atelic matrix predicates than that for telic predicates (18b). I omit the case of (18c) for expository purposes. The shaded area in Figure 1 shows the range of the gradient scale that will be mapped onto an ungrammatical binary judgment. The experimental results in both Brown (2017) and Kehl (2021, experiment 2) correspond more closely to the pattern on the left rather than the one on the right, showing that the strength of the acceptability decrease in interrogatives is not influenced by the other factors they investigate. I will return to this discussion in Section 4 below.

**Figure 1.** Schematic illustration of the linear additive and superadditive gradient judgment patterns in (18); the shaded area shows the part of the scale that will be mapped to ungrammatical judgments.

The difference between these patterns is obscured if the sole focus is on the binary judgment because this ignores potential differences in gradient judgments for declaratives; in a sense, some information is lost in the mapping between gradient and binary judgments. On its own, this is not problematic, but it becomes so if used as a basis for the postulation of licensing mechanisms for extraction. One possibility to avoid the potential pitfalls of binary judgments is to broaden the data pool and gather binary judgments from multiple informants, which can then be converted to a gradient scale similar to Likert scales by calculating the ratio of grammatical-to-ungrammatical responses (Bader and Häussler 2010, 2019). This method has been shown to result in patterns similar to those obtained with judgments on discrete or continuous scales.
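The conversion from multiple binary judgments to a gradient score can be sketched as follows; the data and function name are hypothetical, and only the proportion-based logic follows the cited method:

```python
# Illustrative sketch (hypothetical data): converting binary judgments from
# several informants into a gradient score by taking the proportion of
# 'grammatical' responses, in the spirit of Bader and Haussler (2010, 2019).

def grammatical_ratio(responses):
    """responses: list of booleans, True = judged grammatical."""
    return sum(responses) / len(responses)

# Ten informants, seven of whom accept the sentence:
print(grammatical_ratio([True] * 7 + [False] * 3))  # 0.7
```

With enough informants, such proportions approximate a continuous scale, which is why the method yields patterns comparable to Likert-style ratings.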

Taking a step back, the binary judgment differences in Borgonovo and Neeleman (2000) and Truswell (2007) can be converted to acceptability measures, meaning that the grammatical extractions are more acceptable than the ungrammatical ones. However, the formulation as grammaticality judgments runs the risk of leading to proposals about the architecture of the syntactic component and its interfaces with semantics and pragmatics. Therefore, I think that it is advisable to focus on acceptability first and then reason about the model of grammar that best fits with these results.

#### *3.2. A Factorial Design for Island-Internal Variation*

The procedure described in the previous section represents a modification of the factorial design for island effects in Sprouse et al. (2012, 2013) and Kush et al. (2018, 2019). The original design compares conditions in a way that makes it possible to isolate the individual effects of two factors: the difference between extraction from matrix clauses vs. embedded domains and between extraction from non-island vs. island domains. See (19) for an illustration of the design:


This design allows quantifying three sets of contrasts and the respective effects they have on acceptability: (i) the contrast between (19a)–(19b) isolates a possible effect of extraction from the matrix clause vs. the embedded clause; (ii) the contrast between (19a)–(19c) detects whether the presence or absence of an island domain, in this case a *wh*-island introduced by *whether*, affects acceptability; and (iii) the contrast between (19b)–(19d) compares the cost of extraction from a non-island vs. from an island domain (see Sprouse et al. 2013, p. 25). Often, theoretical approaches will focus on the contrast between (19b) vs. (19d) and conclude that *whether*-clauses are islands if this extraction feels less acceptable than the non-island. However, this leaves unaccounted for the potential effect that the presence of a *whether*-clause has on acceptability independently of extraction.

To solve this, Sprouse et al. (2012, 2013) include this effect in the calculation of potential island effects: if the acceptability judgment for the 'worst' condition (19d) compared to the unmarked reference condition (19a) cannot be predicted from the differences between (19a)–(19b) and (19a)–(19c), then this additional acceptability decrease is called an 'island effect' which needs to be accounted for theoretically.

The same reasoning can be applied to investigate the validity of theoretical approaches such as those in Borgonovo and Neeleman (2000) or Truswell (2007): instead of comparing an island construction with a non-island, two instances of the same island type are tested in declarative and interrogative conditions. They differ minimally in one of the factors isolated in the literature, such as event structure or the verb type of the matrix predicate. This makes it possible to examine whether such factors determine how strongly extraction degrades acceptability, as well as whether there are acceptability differences in the declaratives that are the source of the reported differences in interrogatives. An example of such a design is given in (20), based on the example sentences discussed in the previous section. This relatively simple 2 × 2 design manipulates the matrix predicate as telic or atelic, as well as the difference between declarative and interrogative sentences. The manipulation of other factors, including factors with more than two levels, is of course also possible; for a more complex 2 × 2 × 2 design that crosses the factors TELICITY, TRANSITIVITY, and EXTRACTION, see Brown (2017). For example, the simple comparison between declarative and interrogative sentence forms can be augmented to also include relative clauses and topicalization.


The statistical analysis will then compare the effects of the two factors, in this case telicity and extraction, as well as the interaction between them. The absence of a significant interaction indicates that the strength of the extraction effect is not influenced by the factor that distinguishes the declarative conditions. If there is a significant interaction, additional licensing or repair mechanisms are called for, as explained in the previous section. As with the detection of island effects in the original factorial design in Sprouse et al. (2012, 2013), only a drop in acceptability under extraction from a 'suboptimal' adjunct island configuration that cannot be explained independently of extraction would motivate additional licensing requirements. Determining this need for licensing mechanisms should be at the core of investigations into island-internal variation and should be backed up with experimental data in addition to initial, intuitive judgments.
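The interaction test at the heart of this design can be illustrated as a difference-in-differences computation over the four cell means. The following sketch uses invented placeholder values (mean *z*-scores), not data from any of the studies cited:

```python
# Difference-in-differences (DD) estimate of the interaction in a
# 2x2 factorial design (TELICITY x EXTRACTION). All cell means below
# are hypothetical illustration values, not experimental results.

def dd_score(means):
    """Return the DD score: how much more extraction costs in the
    atelic condition than in the telic condition. A score near zero
    means the extraction cost is constant across declaratives,
    i.e., there is no interaction to explain."""
    cost_telic = means[("telic", "declarative")] - means[("telic", "interrogative")]
    cost_atelic = means[("atelic", "declarative")] - means[("atelic", "interrogative")]
    return cost_atelic - cost_telic

# Hypothetical pattern without an interaction: extraction costs 1.0
# in both telicity conditions, although the declaratives differ.
means = {
    ("telic", "declarative"): 0.75,
    ("telic", "interrogative"): -0.25,
    ("atelic", "declarative"): 0.25,
    ("atelic", "interrogative"): -0.75,
}
print(dd_score(means))  # 0.0 -> no super-additive effect
```

In an actual analysis, the same logic is implemented as the interaction term of a mixed-effects model over the raw judgments rather than over cell means; the sketch only makes the arithmetic of the design explicit.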

#### *3.3. The Use of Standardized Fillers*

The results of gradient judgment studies can sometimes be difficult to interpret. Typically, the experimental conditions are compared to each other in terms of significant differences between conditions in the data pool, or in terms of effect structures in the case of factorial designs. Although this is the main interest of an experimental study, i.e., to test hypotheses about acceptability contrasts and the influence of specific factors, it is also of interest to see where the experimental conditions are located on the continuum of gradient acceptability, regardless of whether this continuum is expressed in discrete Likert-type scales or truly continuous judgments as in Magnitude Estimation (Bard et al. 1996) or Thermometer judgments (Featherston 2020). One possibility is to add control conditions that are closely related to the construction under investigation, as implemented in Abeillé et al. (2020) with grammatical and ungrammatical controls.<sup>7</sup> In her experiment on extraction from adjuncts in English that is closely related to the design in (20), Brown (2017) includes grammatical and ungrammatical controls as in (21); as extraction from tensed adjuncts as in (21b) is not always considered unacceptable, extraction from a conjunct as in (22a) can also be used for unacceptable controls because there is general agreement in the literature that such extractions are ungrammatical (Liu et al. 2022). A resumptive pronoun at the gap site, as shown in (22b), can also be used to construct ungrammatical control conditions that remain close to the implemented design (Chaves and Putnam 2020, pp. 218–19).

(21) a. Which ice cream did Mary eat before she saw the celebrity? [grammatical control]

b. \*Which celebrity did Mary eat an ice cream before she saw? [ungrammatical control]

(Brown 2017, p. 120)

(22) b. \*What did Mary arrive at the office whistling it?

The set of standardized reference fillers developed for English in Gerbrich et al. (2019) is designed to provide anchor points along gradient or discrete judgment scales, ranging from a high level of acceptability to a low level; the idea of providing a standardized scale for acceptability is also found in Featherston (2009), who develops a set of German reference fillers.

The goal of the standardized fillers is to provide anchor points at the extremes of the rating scale with highly acceptable and highly degraded sentences, as well as a range of acceptability in between; ideally, this results in a reference scale with equal distances between the individual levels, so that the experimental items can be assigned a relative level of normed acceptability. Choosing very general levels of well-formedness along the spectrum of acceptability, rather than control items tied to the construction under investigation, has the advantage that the fillers can be re-used across multiple experiments and thus allows a more grounded discussion of acceptability across experiments. It is of course possible to include both the standard fillers and construction-specific control conditions in an experiment. A sample of the reference fillers is given in (23); the assignment of more traditional graded grammaticality judgment marks, such as '?' or '\*', is adapted from Gerbrich et al. (2019, p. 310).

(23) b. √B: Before every lesson the teacher must prepare their materials.

c. ?C: Hannah hates but Linda loves eating popcorn in the cinema.

d. ??D: Who did he whisper that had unfairly condemned the prisoner?

e. \*E: Historians wondering what cause is disappear civilization.

(see Gerbrich et al. 2019, p. 315)
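The anchoring idea can be sketched programmatically: given mean ratings for the filler levels on a 7-point scale, an experimental condition's mean can be located between two reference levels. The filler means below are invented for illustration and are not the values reported by Gerbrich et al. (2019):

```python
# Locate an experimental condition on the reference scale spanned by
# the A-E standard fillers. The filler means are hypothetical values
# on a 7-point scale, used only to illustrate the anchoring idea.

FILLER_MEANS = {"A": 6.5, "B": 5.5, "C": 4.5, "D": 3.0, "E": 1.5}

def relative_level(item_mean):
    """Return the two filler levels an item mean falls between,
    e.g. ('B', 'C') for a slightly marked condition, or a flag if
    the item falls outside the range of the fillers."""
    levels = sorted(FILLER_MEANS.items(), key=lambda kv: -kv[1])  # A..E
    if item_mean >= levels[0][1]:
        return ("above", levels[0][0])
    if item_mean <= levels[-1][1]:
        return ("below", levels[-1][0])
    for (hi, hi_mean), (lo, lo_mean) in zip(levels, levels[1:]):
        if lo_mean <= item_mean <= hi_mean:
            return (hi, lo)

print(relative_level(5.0))  # ('B', 'C')
print(relative_level(7.0))  # ('above', 'A')
```

The `('above', 'A')` and `('below', 'E')` cases correspond to the plausibility check discussed below: items outside the filler range deserve closer scrutiny.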

The two best levels A and B are usually not marked in such judgment schemes, and both are considered fully grammatical; still, Gerbrich et al. (2019) suggest that there are significant acceptability differences between these grammatical levels, which are difficult to detect in judgments with limited conventionalized markings. Judging from their experimental results with the standardized fillers, Gerbrich et al. (2019, p. 309) conclude that there may be even more distinguishable levels of well-formedness. Note that the E-level is still interpretable, but highly unnatural; it is possible to add a further level with low interpretability, as for example in the adaptation of the standard fillers in Brown et al. (2021). Brown et al. (2021, p. 10) refer to this as a "clearly ungrammatical level" with examples such as *The ink was for spilled* that are considered both unacceptable and uninterpretable.

Figure 2 illustrates the expected distribution of the five sets of standardized fillers on a 7-point acceptability scale; see also the discussion in Featherston (2020, pp. 168–72), which shows a similar distribution in *z*-scores based on an actual experiment. The exact values may vary from experiment to experiment, and the distance between the levels may not always be evenly distributed, especially if target conditions fall between two of the levels (Gerbrich et al. 2019, pp. 315–16). From these predicted values and the judgment marks in Gerbrich et al. (2019), it becomes apparent that the binary ungrammaticality marking may be limited to a rather small area of the gradient acceptability scale, contrary to the assumption above that the threshold for binary grammaticality judgments lies in the middle of the scale. I leave this point open for discussion here.

**Figure 2.** Expected rating distribution for the A–E levels of the standard fillers in (23) on a gradient judgment scale.

By comparing the experimental items in declarative and interrogative conditions relative to their location on the gradient acceptability continuum established by the reference fillers, more reliable conclusions about the relative acceptability of BPPA constructions can be drawn. Since the fillers leave enough room in the upper half of the scale (A–C) for highly acceptable to slightly marked levels of acceptability, even subtle differences in declarative BPPA constructions that are obscured in binary intuitive judgments can be detected. With respect to interrogative BPPA constructions, it is of interest whether they decline all the way to the bottom of the scale in suboptimal conditions and how large the difference is to conditions that the literature considers grammatical.

The use of the standardized fillers in an experimental setting also has two more mundane, methodological benefits: (i) a plausibility check for the target items, and (ii) a plausibility check for participant responses.

In a typical experiment, it is advisable to construct target items that avoid the extreme points of a closed scale to prevent ceiling and floor effects. It is also advisable to exclude target items that have no unique structural representation (*word salad*) because the researcher cannot determine which structural parse is being judged (Gerbrich et al. 2019, pp. 310–11). The E-standards are marked in several clearly determinable ways but still have a unique structural representation, whereas the A-standards contain no structural or semantic faults. This means that researchers should become skeptical if their target items fall outside the range of the standard fillers, i.e., if there are items averaging significantly better than the A-standards or significantly below the E-standards. There can be, of course, good reasons for such situations, but the results should be scrutinized closely. Target items that fall somewhere between the ranges of the standard fillers can be more clearly evaluated for their overall gradient acceptability.

The second point concerns the reliability of participant judgments. As these judgments are collected anonymously and there are no negative consequences for incoherent or blatantly random judgments, data quality needs to be ensured at some point. Experiments that offer compensation of some kind, be it monetary or in course credit, may especially create an environment where participants are not really engaged with the task and complete the experiment half-heartedly. Large crowdsourcing platforms, such as Amazon's Mechanical Turk, have on the one hand been shown to provide usable data (Gibson et al. 2011; Sprouse 2011), but on the other hand participants may try to complete as many tasks as possible for maximum compensation. Ethical payment is a first step to avoid this, but does not guarantee accurate data. To address this issue, the standard fillers can be evaluated for individual participants to see whether they reproduce the expected decline in mean acceptability from the A-standards to the E-standards. If the E-level averages higher than the C-level, for instance, this is a good indication that the experiment was not carried out diligently, providing a principled reason to exclude this participant from the statistical evaluation. Although the judgments for the standard fillers may not always exactly follow the expected pattern, as shown in Featherston (2020, p. 170), it is still possible to distinguish completely random judgments from those that are slightly off.
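A minimal sketch of such a per-participant check, assuming each participant's raw judgments are stored per filler level; the data layout and the exact exclusion criterion (mean E-rating at or above mean C-rating) are simplifying assumptions, not a prescription from the cited work:

```python
# Per-participant plausibility check using the standard fillers:
# if a participant's mean rating for the E-level reaches or exceeds
# their mean for the C-level, their judgments are likely unreliable.

from statistics import mean

def flag_participant(ratings):
    """ratings: dict mapping filler level ('A'..'E') to a list of
    one participant's raw judgments on a 7-point scale. Returns
    True if the participant should be inspected or excluded."""
    return mean(ratings["E"]) >= mean(ratings["C"])

diligent = {"A": [7, 6], "B": [6, 5], "C": [5, 4], "D": [3, 2], "E": [2, 1]}
random_like = {"A": [2, 5], "B": [7, 1], "C": [2, 3], "D": [6, 2], "E": [5, 6]}

print(flag_participant(diligent))     # False
print(flag_participant(random_like))  # True
```

Stricter variants are possible, e.g., requiring a monotone decline across all five levels or correlating each participant's filler means with the expected pattern; the E-vs-C comparison is merely the weakest useful criterion.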

These two methodological points have shown that the standardized fillers have a valid use in experimental judgment studies in addition to the better comparability with stable levels of acceptability. They provide a more fine-grained scale of well-formedness compared to binary judgments, and also allow for a more principled conversion to traditional judgment marks, such as the question mark or the asterisk.

#### *3.4. Interim Conclusion*

In this section, I have discussed three issues that should be considered in the analysis of island-internal variation, exemplified with an evaluation of theoretical approaches to BPPA constructions. First, I discussed the relation between grammaticality and acceptability and how this relation can become problematic for theoretical conclusions about locality operations such as *wh*-extraction. I have argued that there is nothing wrong with considering the two declarative sentences in (14) grammatical; it is, however, problematic to ignore potentially interesting differences in acceptability. Second, I have described the use of a factorial design to better describe island-internal variation in relation to the variation that is independent of extraction from the island. This design avoids potential confounds that arise if too much emphasis is placed on variation in interrogatives. Third, I have discussed how acceptability judgment tasks can benefit from the use of the standardized fillers in Gerbrich et al. (2019), from both conceptual and methodological perspectives. In combination with a factorial design that includes declarative base structures, this allows for a principled analysis of the effects operating in specific types of islands.

In the following section, I consider existing experimental work on the acceptability of declarative and interrogative BPPA constructions, how these results bear on the issue of gradient acceptability, and what ramifications they have for the construction of licensing mechanisms for extraction.

#### **4. Previous Experimental Investigations**

The idea that not all declarative BPPA constructions are equally acceptable because the adjunct predicate is not semantically licensed in all configurations was first proposed in Brown (2015, 2016, 2017). She argues that only low-merged VP adjuncts are in the right structural configuration to allow extraction, whereas high-merged *v*P adjuncts resist extraction.<sup>8</sup> By hypothesis, not all types of participle adjunct predicates qualify as low-merging adjuncts to all types of matrix predicate. This means that some participle adjuncts fail to be licensed in the configuration that would allow extraction, which leads to reduced acceptability that is independent of whether extraction takes place or not. Brown (2017) formulates this as a distinction between the semantic licensing conditions on the low-merging adjunct and the syntactic licensing conditions for extraction. For the semantic licensing conditions of low-merging adjuncts, she suggests that the temporal interval of the matrix predicate should be properly included in that of the adjunct predicate, which works best if the adjunct predicate is atelic and the matrix predicate telic; this is essentially the generalization formulated in Truswell (2007). Kehl (2021) goes in a similar direction by proposing a set of semantic compatibility and syntactic complexity criteria that determine the acceptability level of the declarative BPPA construction, taking into account the properties of the host predicate. Both approaches share the common assumption that there is a principled relation between acceptability differences in interrogatives and the corresponding declaratives.

Brown (2017) shows experimentally that there are significant effects of transitivity in declarative BPPA constructions, and that this effect does not interact with the presence vs. absence of a gap: thus, the relative acceptability difference between the intransitive (24a) and the less acceptable transitive (24b) is the same as that between the corresponding declaratives in (25a) and (25b).


(Brown 2017, p. 119)

What this means is that transitivity shows an effect on acceptability but does not determine how strongly extraction affects acceptability. This result is unexpected in the framework proposed by Borgonovo and Neeleman (2000).

In addition, Brown (2017) shows that telicity only has a significant effect for intransitive matrix predicates, i.e., for unergatives and unaccusatives. Transitive atelic activities and transitive telic accomplishments do not show a similar sensitivity. This is not predicted by the event-semantic account in Truswell (2007). For example, the telic transitive sentence in (26a) is as acceptable as the atelic transitive in (26b), but the telic intransitive in (27a) is more acceptable than the atelic intransitive in (27b); the same obtains for the corresponding interrogatives. Similar observations are found in Kehl (2021).

(26) a. Mary picked the candidates whistling the national anthem.

b. Sophie finished sketches whistling the national anthem.

(Brown 2017, p. 119)

(27) a. Lucy arrived whistling the national anthem. [more acceptable]

b. Lucy shivered whistling the national anthem.

(Brown 2017, p. 119)

Brown (2017) concludes that transitivity is a key factor in determining the acceptability of declarative and interrogative BPPA constructions;<sup>9</sup> she also concludes that the relation between acceptability contrasts in declaratives and interrogatives should be taken seriously. These results fit her two-component model with independent licensing conditions for the adjunct and extraction operations. The complex acceptability pattern observed for interrogative BPPA constructions in the literature can be traced back to similar differences in declaratives, obviating the need for additional licensing mechanisms that are tied to extraction.

Similarly, Kehl (2021) reports that telic matrix predicates have an advantage over atelic ones (experiments 1 and 2) and that unaccusative matrix predicates are judged as more acceptable compared to unergatives and transitives (experiment 4); in none of the experiments, however, do these factors interact with extraction, so that the acceptability differences in interrogatives can be reliably predicted from identical contrasts in declaratives. These results obviate the requirement for additional syntactic or semantic licensing conditions for extraction as postulated in Borgonovo and Neeleman (2000) and Truswell (2007). For example, there are already significant differences between declarative conditions with telic and atelic matrix predicates, respectively, seen in (28). To be precise, the relative difference is exactly the same as in the interrogatives in (29), as the telicity of the matrix predicate does not interact with the presence or absence of extraction.


Additionally, the same contrasts are obtained in relativizations such as (30), which are closer in form to the attested examples in Santorini (2019). A comparison of declarative, interrogative, and relativization BPPA constructions shows that the effect of telicity remains the same across these sentence types, but the overall acceptability is shifted: declarative BPPA constructions are generally more acceptable than relativizations, which, in turn, are more acceptable than interrogative BPPA constructions. This suggests that different types of long-distance dependencies require different degrees of processing effort.

(30) a. This is the song that John arrived whistling. [telic matrix predicate]

b. This is the song that John worked whistling. [atelic matrix predicate]

Similar results are obtained for the distinction between unaccusative, unergative, and transitive matrix predicates; this suggests that the factors identified in Borgonovo and Neeleman (2000) and Truswell (2007) are not tied to extraction from the adjunct. From an architectural perspective, it is easier to include a condition on the possibility of L-marking along the lines of Borgonovo and Neeleman (2000) than to make core syntactic operations sensitive to semantic factors. Whether an event-semantic approach to acceptability differences in declaratives fares better than one based on the grammatical verb type of the matrix predicate remains to be seen; both are most likely related to how complex the resulting BPPA construction is for the parser to interpret and how plausible the complex event described there is (see also Chaves and Putnam 2020 for similar points). Truswell (2007) is probably on the right track concerning the influence of event structure, even if this factor does not seem to depend on the presence or absence of extraction.

Several experiments in Kehl (2021) also show that there are considerable differences between declarative conditions, which are not directly predicted in Borgonovo and Neeleman (2000) or Truswell (2007), again pointing to the importance of considering the relative acceptability of the underlying declaratives instead of only their grammaticality. These differences in declaratives can be captured in the comparison with the standardized reference fillers from Gerbrich et al. (2019): in most of the reported experiments, there is a contrast between the more acceptable declarative conditions, which are located between the A- and the B-level of the reference fillers, and the less acceptable declarative conditions with judgments clearly below the B-level and sometimes closer to the C-level. This shows that these differences are not so subtle as to be irrelevant, or "unremarkable" as Truswell (2007, p. 1373) puts it.

The declarative counterparts of BPPA constructions are also compared to interrogatives in Kohrt et al. (2018), who find no evidence for the theoretical claims about the factor agentivity in Truswell (2011) and, crucially, also no interaction of their factor ±EXTRACTABLE with extraction (their experiment 1). Contrary to the predictions of Truswell (2007) and Truswell (2011), they do not find significant effects of the verb type distinction between extractable *arrive*-type verbs and non-extractable *work*-type verbs; see the example items in (31a). The only significant effect they find is between declaratives (31a) and interrogatives (31b), which is the predicted negative effect of extraction on acceptability.

(31) b. John wondered which coffee his best friend {worked/arrived} at the office drinking \_\_ late this afternoon.

(Kohrt et al. 2018)

The lack of a significant effect of the matrix predicate's suitability for extraction may partially be caused by their assignment of event types to the extractable and non-extractable conditions: they include states in the extractable category and accomplishments in the non-extractable category. This is in line with the claims about agentivity in Truswell (2011), but is problematic in light of the observations about telicity in Truswell (2007) and the possibility for accomplishments to allow extraction when the adjunct specifies the causal component of the accomplishment, which is explicitly acknowledged in Truswell (2011).

The experimental evidence provided by Brown (2017) and Kehl (2021) supports the hypothesis that the factors identified in the literature do not influence the strength of the extraction effect; there is thus no need to postulate additional licensing mechanisms to evade the CED. Both find systematic acceptability differences in declaratives that carry over to the interrogative structures without additional effects requiring an explanation.

#### **5. A Model for the Acceptability of Participle Adjuncts**

Once the focus of interest is shifted to a principled comparison between declarative base positions and *wh*-interrogatives, as well as the underlying acceptability differences in declaratives, the question is what causes the acceptability differences found in Brown (2017) and Kehl (2021). In this section, I will first discuss factors which influence the acceptability of (declarative) participle adjuncts; some, but not all, of these factors have been discussed in the previous literature. At the end of this section, I will combine the factors into a partially weighted model for predicting the acceptability of declarative and interrogative participle adjunct constructions. This model is conceptually based on graded and multifactorial models of acceptability such as the Decathlon Model (Featherston 2008, 2019) and the Cumulative Effect Hypothesis discussed in Haegeman et al. (2014) and Greco et al. (2017).<sup>10</sup> In these types of models, the violation of an individual constraint has a negative effect on acceptability; these constraint violations are cumulative, so that the violation of each additional constraint further decreases acceptability. I will argue that extraction from the adjunct is simply one additional negative effect that is added to the combined effects of the factors which influence acceptability in declarative BPPA constructions; crucially, the size of the extraction effect does not depend on whether other effects apply in the declarative or not.<sup>11</sup> This is precisely the fundamental assumption made in Brown (2017) and Kehl (2021), which differentiates these accounts from previous approaches to extraction from adjuncts.
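As a toy rendering of this cumulative logic, acceptability can be modeled as a baseline from which each violated constraint subtracts a fixed penalty, with extraction contributing one further constant penalty. All weights below are invented placeholders, not estimates from the experiments discussed:

```python
# Toy cumulative-penalty model: each violated constraint subtracts a
# fixed amount from a baseline; extraction is one additional penalty
# whose size does not depend on which other violations apply.
# All weights and the constraint inventory are illustrative only.

PENALTIES = {
    "transitive_matrix": 0.5,   # second argument to process
    "atelic_matrix": 0.25,      # dispreferred event structure
    "extraction": 1.0,          # filler-gap dependency into the adjunct
}
BASELINE = 6.5  # hypothetical ceiling on a 7-point scale

def predicted_acceptability(violations):
    return BASELINE - sum(PENALTIES[v] for v in violations)

# The extraction cost is purely additive: the drop from declarative
# to interrogative is 1.0 regardless of the other penalties, which
# mirrors the absence of an interaction in the experimental results.
print(predicted_acceptability(["atelic_matrix"]))                # 6.25
print(predicted_acceptability(["atelic_matrix", "extraction"]))  # 5.25
```

A partially weighted model in the sense intended here would additionally allow some penalties to be larger than others (as the placeholder values already are), while keeping each penalty independent of the rest.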

#### *5.1. Transitivity: Multiple Referents Incur Independent Processing Costs*

One of the factors that determines whether a BPPA construction is highly acceptable in declaratives is transitivity, i.e., whether the matrix predicate selects only one argument or more than one. Brown (2017) finds that transitivity is a relevant factor because it determines whether telicity has an effect at all, shown by an interaction of the two factors in her experiments. For transitive predicates, it does not matter whether the predicate is an atelic activity or a telic accomplishment, but intransitives are sensitive to the unergative–unaccusative distinction, with unaccusative achievements being more acceptable than unergative activities. This result is also found in Kehl (2021, experiment 4), where unaccusatives have a general advantage over unergatives and transitives, which are not further differentiated by telicity.

An additional observation made in Kehl (2021), based on the discussion in Borgonovo and Neeleman (2000), is that the nature of the second argument is important: reflexive objects as in (32a) and subjects of resultative constructions as in (32b) behave differently than prototypical transitive predicates with two distinct discourse referents, as in (32c).


Here I will not go into a detailed discussion of why resultative constructions differ from transitives; see Winkler (1997), Rothstein (2017), and Hu (2018) for discussion of how the subject of the resultative is assigned its *θ*-role. Incidentally, Borgonovo and Neeleman (2000, p. 212) observe that extraction from the adjunct in (32b) is ungrammatical, whereas Truswell (2007, 2011) considers this a prime example of transparent accomplishments; this emphasizes the need to investigate this type of matrix predicate in more detail.

In more general terms, a second argument increases complexity in the BPPA construction, also because potential control conflicts of the adjunct predicate need to be resolved: in a transitive sentence, the adjunct can be controlled by both the subject and the object of the matrix clause, which increases the processing effort needed to resolve this ambiguity. Some event types show restrictions in their control possibilities (Rapoport 2019; Simpson 2005), but then the parsing of the wrong control orientation should lead to even lower acceptability.<sup>12</sup>

The observation that transitivity in general incurs drops in acceptability independently of extraction operations is also made in Jurka (2010, 2013), Polinsky et al. (2013), and Konietzko (2021); they all find that predicates which select a second argument are slightly less acceptable than intransitives (unergatives and unaccusatives) in declarative structures. Polinsky et al. (2013, p. 296) refer to this as a 'transitivity penalty', which is probably caused by the processing effort to parse the second argument. Similar effects of transitivity are also discussed in relation to extraction in Dependency Locality Theory (Gibson 1998, 2000), which also offers an explanation for the behavior of transitives; I follow Polinsky et al. (2013) in assuming that the effects of transitivity are not exclusive to sentences with extraction.

The negative effects of transitivity make the prediction that the more arguments are selected by the matrix predicate, the higher the processing effort required of the parser, with at least some effect on acceptability. Thus, I predict a relative decline in the acceptability of the sentences in (33), even if all structures might receive a grammatical binary judgment:


The full paradigm of transitivity thus ranges from purely intransitive to reflexive transitive, resultative, transitive, and, finally, ditransitive. It is also possible that not only the number of arguments but also other factors play a role; this could be formulated in terms of the multi-faceted definition of the transitivity continuum in Hopper and Thompson (1980). An additional problem that arises in ditransitives is a potential orientation ambiguity for the participle adjunct depending on its lexical content: the orientation can be shifted towards the direct object, as in (34), and this is sometimes the preferred interpretation.

(34) John*i* gave Mary*j* a letter*k* [lying on the table]*k*.

In the interrogatives corresponding to (33), the contrast between the intransitive and the (di-)transitive structures is noticeable, but the ditransitive is even worse than the transitive. This is not directly reflected in the binary judgments in (35), but should be visible in a judgment study. The low acceptability of the ditransitive structure (35c) carries over to the alternative ordering in the double object construction in (35d).


Chaves and Putnam (2020, p. 15) point out that optional transitivity may confound the intended interpretation of interrogative BPPA constructions because the *wh*-phrase may be linked to a gap in the complement position of an optionally transitive matrix predicate instead of the complement position of the adjunct; see also Staub (2007) and Ness and Meltzer-Asscher (2019).<sup>13</sup> This ambiguity is shown in (36), where potential optional gap sites are indicated by underscores in parentheses.

(36) a. John walked the dog whistling.

b. John walked whistling a funny song.

An ambiguous parse with gap position after the matrix predicate can be avoided by restricting adjunct predicates to obligatorily transitive predicates, such as *proclaiming*, as in (37). Here the gap site after the main verb would trigger ungrammaticality because the gap after the adjunct is obligatory, here indicated by the lack of parentheses around the gap site following the adjunct predicate. This means that the *wh*-pronoun cannot associate with the optional potential gap site in the matrix clause. A parasitic gap reading is also possible here if the filler can be the object of both predicates; I do not discuss this possibility further here.

(37) a. \*John walked the dog proclaiming.

b. John walked proclaiming his love for Pam.

Yet another way to reduce gap site ambiguity is if a motion verb like *walk* is augmented with a directional phrase, as in (38); it is still possible that John walks his dog to the park, but this parse becomes less likely than in (37).

(38) What did John walk to the park whistling \_\_ ?

To sum up, transitivity, even if it is optional, increases the overall complexity of the BPPA construction and thus gradually builds up hurdles for extraction. Unambiguously intransitive predicates are predicted to have an advantage over potentially transitive and unambiguously transitive predicates; reflexive and resultative predicates occupy the middle ground because on the one hand they include a second argument, but this argument is either not directly selected by the main verb (resultatives) or is co-referential with the main verb's subject (reflexives).

#### *5.2. Event Structure: Durativity Instead of Telicity*

Another factor which has an effect on the acceptability of declarative and interrogative BPPA constructions is based on the observation that not all types of matrix predicate can be felicitously modified by an adjunct predicate. The restrictions on BPPA constructions resemble those that operate in depictive secondary predication, where likewise not all types of main verb accept depictives to the same degree (Rapoport 2019; Simpson 2005). There is an ongoing discussion as to whether complex adjuncts such as BPPAs can be analyzed as depictives, but I will assume so for the present discussion; see also Rothstein (2017, p. 3874). For example, permanent statives, as in (39a), are odd with a BPPA, whereas temporary statives, as in (39b), are more acceptable.

(39) a. ? John was blond [wearing his new sunglasses]. [permanent state]
	b. John lay in bed [wearing his new sunglasses]. [temporary state]

The difference between these types of states is that temporary states have an event variable, which permanent states lack (Rapoport 1993, p. 173). Permanent states are property ascriptions whereas temporary states are predicated of the subject for a temporal interval that allows delimitation. This distinction also shows up in the corresponding interrogatives in (40):

(40) a. \*What was John blond [wearing \_\_ ]?
	b. What did John lie in bed [wearing \_\_ ]?

Since both permanent and temporary states are atelic, these acceptability differences are problematic for the telicity-based account in Truswell (2007) and Brown (2017), as well as for the reflexivity account in Borgonovo and Neeleman (2000).

A telicity requirement is also problematic for purely punctual achievements like *appear*, which should be ideal candidates for a temporal inclusion relation in Brown (2017); still, these predicates are degraded in interrogatives, as seen in (41):

(41) a. What did John arrive [whistling t]?
	b. \*What did John appear [whistling t]?

(Truswell 2007, p. 1374)

Similar observations can be made for verbs such as *notice* and other perception verbs. The question is whether this carries over to the declarative counterparts; as far as I am aware, this has not been directly tested in a controlled experiment. What permanent states and purely punctual achievements have in common is that both fail to felicitously appear in the progressive, as seen in (42a) and (42b). Crucially, temporary states are fine with the progressive, shown in (42c).

(42) a. ? John is being blond.
	b. ? John is appearing.
	c. John is lying in bed.

In terms of Rothstein (2004), punctual achievements and many perception verbs such as *notice* fail to appear in the progressive because the progressive cannot target an interval preceding the culmination point. The situation is different in cases similar to *arrive*, where the preceding interval can be conceptualized as the path component that leads up to the arrival. With *appear*, the perspective is different: it is inherently external to the appearing entity, whereas *arrive* allows a conceptualization from the perspective of the arriving entity. This is a first indication that telicity alone makes the wrong predictions in these cases; rather, it seems that there is a certain correlation between the reported interrogative patterns and the ability to appear in the progressive.

Thus, the generalization about telicity in Truswell (2007) needs to be revised to exclude purely punctual achievements and to allow for temporary states. Instead of telicity, I argue that a first step towards a descriptive pattern is to consider the encoding of a durative subevent as relevant for acceptability, which is not the case for permanent states and punctual achievements.

#### *5.3. Incrementality: Themes, Paths, and Properties*

An exclusive focus on durativity leads to problems with the experimental results for activity main verbs in Brown (2017) and Kehl (2021): BPPA constructions with activity main verbs are less acceptable than those with achievement main verbs. To further constrain declarative BPPA constructions, a comparison with depictive secondary predicates shows that not all activity main verbs license a depictive, as shown in (43). The pattern is more difficult to capture than that of permanent and temporary states or punctual achievements, but if the BPPA construction can be analyzed as depictive secondary predication, similar effects can be expected there as well. It is also noteworthy that the addition of an object in (43c) ameliorates the modification of *draw* by a depictive.

(43) a. Jane danced the tango/lectured drunk.
	b. \* Jane laughed/drew drunk.
	c. Jane drew pictures drunk.

(Rapoport 2019, pp. 434–35)

The distinction between *draw* and *draw pictures* in (43b) and (43c) also shows up in BPPA constructions, where the bare form in (44) is degraded in the interrogative; as noted above, the experimental evidence in Brown (2017) and Kehl (2021) suggests that the declarative counterparts are also less acceptable than sentences with achievement main verbs.

(44) a. I work listening to music.

b. \*What do you work [listening to t]?

(Truswell 2007, p. 1373)

The sentences improve in the presence of a direct object, seen in (45). This is contrary to the expectations derived from transitivity in the previous subsection, but suggests that some form of temporal delimitation may be a factor contributing to acceptability, without leading to a telicity requirement.

(45) a. Mary worked on her thesis drinking coffee.
	b. What did Mary work on her thesis drinking \_\_ ?

All the acceptable depictive constructions in (43) involve an activity predicate that is in some sense delimited, but still atelic. A specific dance or a lecture has a specified duration, and the drawing of pictures can be measured by the number of pictures produced, whereas laughing and drawing in the sense of aimlessly doodling are not delimited in the same way. It could be argued that this type of delimitation is connected to the concept of incremental themes (Dowty 1979): a lecture, pictures, and working on a thesis can be measured against a scale of progress, similar to the incrementality of eating one, two, or three apples. The analogy to incremental themes also extends to the domain of motion, where events likewise come in incremental and non-incremental forms. As noted in Dowty (1979), Tenny (1995), and Borgonovo and Neeleman (2000), unergative manner of motion verbs like *walk* behave differently when they are followed by a directional PP like *to the station*; this PP introduces a path component that can be measured similarly to incremental themes. The effect is shown in (46):

(46) a. ? Mary walked whistling a funny song.
	b. Mary walked to the station whistling a funny song.

Incrementality also extends to properties, which captures cases such as (47), where the degree of being scared increases with the progress through the movie (the gradual reading of this sentence probably comes from the durative character of the adjunct predicate, but this discussion is outside the scope of this paper).

(47) John got scared watching a horror movie.

Similar effects of incrementality are seen with semelfactive main verbs such as *jump* in (48), where a particle inducing iterativity and thus durativity has a positive effect in interrogatives. A possible factor in addition to transitivity and durativity could thus be the potential of the event described by the main verb to be measurable or quantifiable in some sense.

(48) What did she jump \*(around) [singing t]? (Truswell 2007, p. 1361)

Taken together, there is at least some evidence that purely temporal inclusion of the matrix interval within the interval of the adjunct predicate is not able to account for the full data pattern, which casts doubt on the scale amalgamation process suggested in Brown (2017). The overall picture emerging from this discussion is that it is unlikely that there is a single factor which determines whether a given main verb will be highly acceptable with a BPPA. This bears close similarity to the multiple factors which influence performance and acceptability along the lines of Chomsky (1965), suggesting that the acceptability of declarative BPPA constructions is a matter of syntactic and semantic complexity and compatibility criteria instead of strict syntactic licensing requirements.

#### *5.4. Combining the Factors into an Acceptability Model*

Based on the theoretical discussion of the relation between acceptability in declarative and interrogative BPPA constructions in Brown (2017) and the evidence supporting it, Kehl (2021) develops a model that captures this relation; this model includes factors that differ from those in Brown (2017) and other approaches. The main focus is on the fact that the factors which operate in interrogatives are also visible in declaratives. Extraction simply acts as an additional factor that is independent of the individual decreases in acceptability resulting from other factors, such as transitivity or durativity. The model can be summarized as follows:

(49) i. Determine the acceptability of the declarative sentence; factors: transitivity, durativity, incrementality
	ii. Determine the acceptability of the interrogative sentence by adding the processing costs of extraction to the result of (i)

In the first stage of the model (49i), the factors discussed above influence the acceptability of the BPPA construction: transitivity decreases acceptability because more arguments require more processing effort. Durativity and incrementality work similarly: the absence of a durative subevent, as in permanent states and purely punctual achievements, decreases acceptability, as does the absence of a delimited or incremental meaning component. The effect of transitivity is most likely a result of increased processing effort, but durativity and incrementality are semantic factors which seem more related to the conceptual felicity of the situation described in the sentence. Kehl (2021) collects durativity and incrementality under the term *semantic compatibility*.<sup>14</sup> In contrast to these factors, transitivity can be captured in syntactic terms, but the reason that transitivity matters is more likely to be found in the ease of processing and the ambiguity between transitive and intransitive uses of the verb in question.

The second stage of the model (49ii) adds the cognitive cost of establishing a dependency (Wagers 2013); this cost is most likely higher for dependencies into adjuncts than into other domains, such as subcategorized complements, in line with the CED.<sup>15</sup> As this dependency formation is more demanding than processing a declarative sentence, it results in decreased acceptability. Crucially, the application of extraction and the resulting decrease in acceptability are independent of the factors which determine acceptability in the declarative: in a sense, extraction is blind to these factors. This is compatible with the independence of syntactic operations from purely semantic properties of the sentence (Brown 2017).

With respect to the relative weight of the factors that affect acceptability in declarative BPPA constructions, the previous experimental work on this construction in Brown (2017), Kohrt et al. (2018), and Kehl (2021) does not directly license conclusions. The negative effect of transitivity is observed and isolated as a key factor in Brown (2017) and is in agreement with the transitivity penalty discussed in Polinsky et al. (2013). Scalar change and durativity are more difficult to evaluate because the previous experimental work has focused on the telic–atelic distinction to test the predictions of Truswell (2007), and this distinction does not directly map to the factors discussed here. The complex interactions of these factors should be addressed in future experimental research. Based on the experimental results from Brown (2017) and Kehl (2021), it is possible to assign a preliminary weighting to the model: the effect of extraction is much stronger than that of durativity, incrementality, or transitivity. This observation connects to the discussion above about the subtle acceptability differences in declarative BPPA constructions, which run the risk of being considered irrelevant, especially if the focus of the approach in question is on grammaticality rather than acceptability. The acceptability model can be graphically represented as in Figure 3, taken from Kehl (2021).
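The additive, partially weighted character of the model can be sketched computationally. The following snippet is purely illustrative: all numeric weights are hypothetical placeholders chosen only to reproduce the qualitative ordering discussed here (extraction outweighing the declarative-level factors); the literature assigns no concrete values.

```python
# Illustrative sketch of the two-stage additive acceptability model.
# All weights are hypothetical; only their relative ordering matters.

BASELINE = 7.0  # hypothetical ceiling on a 1-7 acceptability scale

# Stage-one penalties (declarative level), each smaller than extraction.
PENALTIES = {
    "transitive": 0.5,       # extra argument -> more processing effort
    "non_durative": 1.0,     # permanent states, punctual achievements
    "non_incremental": 0.5,  # no path, theme, or property scale
}

EXTRACTION_COST = 2.0  # stage-two cost; deliberately the largest weight


def declarative_score(features):
    """Stage (49i): subtract each applicable factor penalty."""
    return BASELINE - sum(PENALTIES[f] for f in features)


def interrogative_score(features):
    """Stage (49ii): the declarative score minus a constant extraction
    cost, independent of which declarative-level factors apply."""
    return declarative_score(features) - EXTRACTION_COST


# 'arrive whistling': durative, incremental path, intransitive
# 'work listening':   durative but non-incremental
# 'appear wearing':   punctual, hence non-durative
for name, feats in [("arrive", []),
                    ("work", ["non_incremental"]),
                    ("appear", ["non_durative"])]:
    print(name, declarative_score(feats), interrogative_score(feats))
```

Because the extraction cost is a constant, the ranking of matrix predicates is the same in declaratives and interrogatives; extraction merely shifts every score downward, which is exactly the independence claim of the model.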

**Figure 3.** Acceptability model for BPPA constructions from Kehl (2021, p. 284); the factor *scalar change* corresponds to incrementality in this paper. Upwards arrows indicate a positive effect on acceptability, (double) downward arrows a negative effect.

This illustration shows the positive effects of durativity and incrementality with upward arrows, as well as the negative effect of transitivity with downward arrows; double downward arrows on the factor extraction indicate that this effect is stronger than the others. The central characteristic of this model is that it incorporates the relation between declarative and interrogative acceptability as formulated in Brown (2017), which is stated in Kehl (2021) as the independence of extraction from the factors operating in declaratives. This model accounts for the sometimes subtle acceptability differences in declarative BPPA constructions, as well as the central factors isolated for participle adjunct islands in Borgonovo and Neeleman (2000) and Truswell (2007). At the same time, however, this model is conceptually simpler because the extraction operation remains blind to semantic characteristics of the sentence in question.

The model captures the following judgment differences discussed in the literature: (i) the advantage of telic over atelic matrix predicates due to scalarity (50i), (ii) the oddity of punctual matrix predicates, which do not satisfy durativity (50ii), (iii) the improvement with path scales and incremental themes for atelic matrix predicates because they introduce a scalar meaning component (50iii), and (iv) the effect of the number of arguments selected by the matrix predicate as a reflex of transitivity (50iv). If these contrasts can be shown to be observable in declaratives and interrogatives alike, this supports the predictions of the factorial acceptability model.



Not all of these contrasts have been tested experimentally in the literature: the contrast in (50i) is the one that most of the existing literature focuses on, e.g., Brown (2017), Kohrt et al. (2018), and Kehl (2021). Likewise, transitivity effects as in (50iv) are to a certain extent explored in these studies, but further studies are required to see where reflexive and resultative matrix predicates lie in relation to intransitive and transitive sentences. The contrasts between purely punctual and extendable achievements in (50ii) as noted in Truswell (2007) and the precise effect of an added scalar meaning in cases like (50iii) also require additional work.

This acceptability model focuses on simple declarative and interrogative BPPA constructions, but it can also be modified to include other sentence forms, such as relativization or topicalization; these sentence forms also encode unbounded dependencies, but are not interrogative (Chaves and Putnam 2020). It can thus be expected that they do not show the same degree of decreased acceptability as the *wh*-interrogatives focused on in this article, which is also indicated in the data reported in Abeillé et al. (2020) and Liu et al. (2022). Compare the declarative BPPA construction in (51) with the different types of dependencies in (51a)–(51c).


Initial evidence that relativization leads to a generally smaller decrease in acceptability than bare *wh*-interrogatives is given in Kehl (2021, experiment 1). This might be related to a better match between the information-structural status of the adjunct constituent from which extraction takes place and the discourse function of relativization, as proposed in Abeillé et al. (2020). The visualization of the acceptability model in Figure 3 can be generalized by adding more extraction types than just *wh*-extraction, and by linking these different types of dependency formation to separate acceptability levels; this is shown in Figure 4, where relativization and topicalization are allowed to have negative effects on acceptability that are not necessarily identical to that of *wh*-extraction. I will have to leave the relative magnitude of these effects for future experimental research. The underlying hypothesis remains that the contrast between matrix verbs such as *arrive* and *work* can be observed equally across these different dependency types; this assumption follows the argumentation in Chaves and Putnam (2020) that the pragmatic felicity of the underlying proposition plays a strong role in island effects and extraction asymmetries.

**Figure 4.** Adapted acceptability model for BPPA constructions in different sentence forms; the acceptability of a dependency construction is modeled as the acceptability of the underlying simple declarative plus the effect of establishing a dependency construction. Upwards arrows indicate a positive effect on acceptability, downward arrows a negative effect.

Another important issue is how strongly the factors of the acceptability model are affected by variation in speaker judgments. So far, I am not aware of experimental studies that explicitly take this factor into account. There are studies on the related phenomenon of subject islands investigating whether judgments improve depending on presentation order: Chaves and Dery (2019) report that judgments improved if an item was presented later in the experiment, suggesting that there is a satiation effect and that the initially low acceptability improves with repeated exposure as a type of learning effect. If violations of the subject condition can improve over time, it seems plausible that the type of semantic mismatches resulting from scalarity and durativity can also improve with repeated exposure, but this requires further investigation.

To conclude the discussion of the factors related to acceptability in the BPPA construction and the model proposed in Kehl (2021): Truswell (2007) is not right in his claim that declarative BPPA constructions which do not meet his extraction condition are unremarkable. The exact opposite holds: acceptability differences in declaratives resulting from a variety of different factors are the key determinants of acceptability in interrogative BPPA constructions, and it is not the extraction operation that triggers these differences in interrogatives.

#### **6. Converging Evidence for the Relevance of Acceptability Differences in Declaratives**

More recent work agrees on the relevance of potential acceptability differences in declaratives for the acceptability of movement constructions. The proposals diverge slightly regarding the source of such differences, but the focus has shifted from purely syntactic explanations towards more interface-based ones. Transitivity as a processing-related complexity criterion and event structure as a semantic notion have been the focus of this article.

Similar conclusions about extraction-independent effects of processing-related complexity on acceptability are drawn for the apparent licensing of island-violating extractions in so-called parasitic gap (PG) environments in Culicover and Winkler (2022); they trace the ameliorating effect ascribed to parasitic gaps back to complexity differences between (declarative) PG and non-PG constructions. The former are more acceptable because they are less complex to process, since one fewer referentially distinct argument is encoded. In the contrast in (52), the additional gap in the matrix clause in (52b) has the effect that there is only one discourse referent in the sentence, whereas there are two in (52a). There is, thus, an underlying difference in complexity that pushes (52a) below a threshold for grammaticality, which is not the case in (52b). The parasitic gap is indicated by *pg* in this example.

	- b. a person who*<sup>i</sup>* [talking to *pgi*] about this would prove to *ti* that there is a problem (Culicover and Winkler 2022, p. 2)

Although the two corresponding non-extraction sentences in (53) are certainly both grammatical, (53a) is more complex than (53b) from a processing perspective because an additional discourse referent needs to be processed. Whether this results in noticeable acceptability differences is a question that is outside the scope of this paper, but such a complexity difference can explain the strong judgment difference reported for (52) by Culicover and Winkler.

(53) a. Talking to *person Y* about this would prove to *person X* that there is a problem.
	b. Talking to *person X* about this would prove to *person X* that there is a problem.

The conclusions in Culicover and Winkler (2022) are very similar to those drawn in this article: there is no need for a dedicated licensing or repair mechanism associated with parasitic gaps; a description of the underlying complexity differences is sufficient to explain why PG constructions are more acceptable than their non-PG counterparts. Culicover and Winkler (2022) also discuss the important distinction between grammaticality and acceptability, which can be used to provide a comprehensive explanation of the patterns detected for parasitic gaps in the literature.

Another set of factors comes from the interface of syntax with pragmatics: Chaves and Putnam (2020) point out that apparent grammaticality contrasts in syntactically marked constructions, such as *wh*-questions, often have their origin in sometimes subtle pragmatic differences that are unrelated to the formation of the marked construction. They propose a largely pragmatic approach to most island domains by arguing that the low acceptability can often be traced back to issues of relevance and salience: if the island domain is not salient or relevant, acceptability contrasts in unmarked constructions can arise and evoke the impression of stronger grammaticality contrasts in marked constructions. This is captured in the Relevance Presupposition Condition (RPC):

(54) RELEVANCE PRESUPPOSITION CONDITION: the referent that is singled out for extraction in a UDC must be highly relevant (e.g., part of the evoked conventionalized world knowledge) relative to the main action that the sentence describes. Otherwise, extraction makes no sense from a Gricean perspective, as there is no reason for the speaker to draw attention to a referent that is irrelevant for the main contribution of the sentence to the discourse. (Chaves and Putnam 2020, p. 206)

The contrast in (55) is given as an example of this, but the grammaticality difference is unrelated to extraction:

(55) a. What did you read a book about?

b. \*What did you drop a book about?

(Chaves and Putnam 2020, p. 207)

It has been noted as early as Kuno (1987) that the corresponding declaratives already show a noticeable acceptability difference; this is shown in (56):

(56) a. Speaking of Napoleon, I just read a book about him.
	b. ?Speaking of Napoleon, I just dropped a book about him.

(Chaves and Putnam 2020, p. 205)

The reasoning to explain these independent acceptability differences is along the following lines: verbs evoke certain conceptualizations when they are encountered by the parser, and some meaning components are more easily accessible than others. Reading a book evokes the concept of a topic covered by the book, which is relevant information. However, the topic is not as relevant and easily evoked when a book is dropped (Chaves and Putnam 2020, p. 207). This has clear ramifications for acceptability in marked constructions, but may not be as clear in unmarked ones.

For BPPA constructions, the RPC predicts that adjuncts which supply relevant information invoked by the event described by the matrix predicate can be targeted by extraction. This explains the relative acceptability of cases where the adjunct describes the cause of the matrix event, as in (57a). The distinction between the non-causal adjuncts discussed in Truswell (2007, 2011), as in (57b) and (57c), is less clear, but it could be argued that telic predicates like *arrive* are informationally light, so that the adjunct can be analyzed as relevant in the sense of Chaves and Putnam (2020); atelic predicates such as *work*, on the other hand, can be argued to compete with the adjunct in terms of which information is more relevant, so that the extraction is not licensed by the RPC.

(57) a. What did John drive Mary crazy [whistling \_\_ ]?
	b. What did Peter arrive whistling \_\_ ?
	c. \*What did Peter work whistling \_\_ ?

The acceptability model discussed in the previous section is not mutually exclusive with the RPC; the generalizations in the model could be seen as factors that influence the relevance of the adjunct compared to the matrix predicate and hence have an effect on the acceptability of extraction. I agree with Chaves and Putnam (2020, p. 230) that "extraction from such island environments is contingent on the proposition itself, rather than strictly on its syntax". This captures the idea in the model that the factors described by the generalizations show effects that are independent of extraction.

A number of experimental studies test the relation between declaratives and interrogatives in related phenomena:<sup>16</sup> for example, Chaves and King (2019) find a strong correlation between plausibility ratings for declaratives and the acceptability of subextraction from objects, indicating that plausibility acts as a predictor of acceptability that is not modulated by extraction. However, Chaves and Putnam (2020) report on another experiment investigating extraction from tensed adverbial clauses, where they do not find a correlation between declarative and interrogative acceptability, meaning that the latter is not reliably predicted by the former. In such cases, it is reasonable to assume that there is another factor which distorts the relation, similar to the factorial definition of island effects in Sprouse and Hornstein (2013). The effects of tensed adjuncts are also discussed from a theoretical perspective in Truswell (2011, pp. 175–79) and experimentally investigated in a cross-linguistic study in Müller (2019). Abeillé et al. (2020) examine relativization from subjects and objects, with the result that extraction from subjects is actually better than extraction from objects, contrary to the predictions of locality constraints such as the CED, which do not discriminate between different types of extraction; this points towards the conclusion that not all extractions function alike, and that the discourse functions of the extraction operation and the extracted element should be included in an analysis.

These brief glances beyond the scope of this paper show that theory development is well advised to take subtle acceptability differences in declaratives seriously in the discussion of licensing mechanisms for movement. Differences in processing complexity, semantic compatibility, and pragmatic characteristics can affect canonical word orders to such a degree that the application of movement operations evokes the impression of strong grammaticality differences.

#### **7. Conclusions**

In this article, I have emphasized the importance of the underlying declarative sentences in the discussion of extraction from participial adjunct islands. Once the distinction between grammaticality and acceptability is taken seriously, it becomes possible to explain the acceptability differences in interrogatives by examining potential acceptability differences in the declarative counterparts. The result is an approach to extraction from adjunct islands that does not require additional and complicated licensing machinery as in the theories presented in Borgonovo and Neeleman (2000) or Truswell (2007, 2011). The approaches in Brown (2017) and Kehl (2021) both emphasize the relevance of acceptability differences in declarative BPPA constructions and propose factors to capture the acceptability variation independently of extraction. I have discussed three factors that are of interest in these accounts: the notion of transitivity, expressed in the number of arguments directly selected by the main verb; the event structure of the main verb; and the encoding of an incremental measure scale in the matrix predicate. The effect of transitivity can be described as a processing advantage of verbs with lower transitivity: additional arguments incur processing costs that can be reflected in acceptability. As far as event structure is concerned, I have argued that a simple telicity requirement, as postulated in Truswell (2007) and Brown (2017), is insufficient to explain the low judgments observed in the literature (e.g., Truswell 2007, p. 1370) for extraction from BPPA constructions with purely punctual matrix predicates, such as *appear*, and the relatively acceptable judgments with temporary stative predicates, such as *lie in bed* (Truswell 2011, pp. 158–59).
One of the key components isolated in the discussion is durativity instead of telicity, even if further factors need to be taken into account in order to explain the low acceptability with activity matrix predicates. The last factor is that of incrementality, where the progression of the matrix predicate can be measured against an incremental scale, formulated either as paths, incremental themes, or property values. Together, these factors provide a first set of tools to capture the acceptability differences in declarative and interrogative BPPA constructions without the need for additional, complex licensing mechanisms.

A final, more programmatic note about the nature of so-called 'island constraints' such as the CED: there is recent evidence that not all extraction types show the same effects in CED-violating operations, and that the magnitude of the extraction effect also depends on other properties of the island domain. For example, Abeillé et al. (2020) have shown that relativization has a different effect than *wh*-extraction in subject islands, which is hard to explain in purely syntactic terms such as the CED; similar observations are reported for *wh*-extraction and relativization from BPPA constructions in Kehl (2021), who finds that relativization from BPPAs is more acceptable than *wh*-extraction, and that the aspectual classes of the matrix and adjunct predicates have identical effects in declaratives and interrogatives. Additionally, experimental work in Müller (2019) suggests that some adverbial clauses are harder to extract from than others, with factors such as adverbial clause type and tense marking involved. It would appear that the notion of categorical extraction constraints, such as the CED, should be critically evaluated: are such constraints really binary in core syntactic terms, meaning that the grammar can compute the extraction in one but not in another configuration? Or is this the same type of overgeneralization that has been shown here to be problematic for accounts like Borgonovo and Neeleman (2000) and Truswell (2007)?
This is a general problem faced by binary or categorical models of grammar because they are at risk of glossing over subtle acceptability differences in favor of broad general predictions; a graded model of grammar such as the Decathlon Model (Featherston 2008, 2019) has the flexibility of assigning individual decreases in acceptability to different operations from minimally different constructions, so that these effects can be individually quantified and summed up to predict acceptability in a wider range of configurations than the categorical predictions of the CED. The upshot from this brief

discussion is that there are good reasons to assume that extraction from some structural domains is harder than extraction from others, as captured in the original formulation of the CED; whether this is due to derivational or structural factors (competence-based) or the result of increased processing complexity (performance-based) is beyond the scope of this article. I leave the details of such an analysis of island constraints to future research and conclude here that BPPA constructions are an interesting showcase of island-internal variation that can be fruitfully employed to dive deeper into the nature of acceptability and its relation to intuitively observed grammaticality patterns in island constructions.

**Funding:** This article is the result of my work in the A7 project "Focus and extraction in complex constructions and islands" of the Collaborative Research Center SFB 833 at the University of Tübingen, which was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)— SFB 833—Project-ID 75650358.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This article greatly benefited from discussion with Susanne Winkler, Sam Featherston, Peter W. Culicover, and Andreas Konietzko. I am also grateful to the three anonymous reviewers for their detailed comments and criticism, relevant literature, and overall suggestions for improvement. All remaining errors and shortcomings are necessarily my own.

**Conflicts of Interest:** The author declares no conflicts of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Notes**


#### **References**

Abeillé, Anne, Barbara Hemforth, Elodie Winckel, and Edward Gibson. 2020. Extraction from subjects: Differences in acceptability depend on the discourse function of the construction. *Cognition* 204: 104293. [CrossRef] [PubMed]

Abels, Klaus. 2012. *Phases: An Essay on Cyclicity in Syntax*. Berlin and Boston: de Gruyter. [CrossRef]

Bader, Markus, and Jana Häussler. 2010. Toward a model of grammaticality judgments. *Journal of Linguistics* 46: 273–330. [CrossRef]


Borer, Hagit. 2005. *Structuring Sense, Vol. 2: The Normal Course of Events*. Oxford: Oxford University Press. [CrossRef]


Konietzko, Andreas. 2021. PP Extraction from Subject Islands in German. *Glossa*, accepted for publication.


Stepanov, Arthur. 2007. The end of CED? Minimalism and extraction domains. *Syntax* 10: 80–126. [CrossRef]

Tenny, Carol. 1995. How motion verbs are special: The interaction of semantic and pragmatic information in aspectual verb meanings. *Pragmatics & Cognition* 3: 31–73. [CrossRef]

Truswell, Robert. 2007. Extraction from adjuncts and the structure of events. *Lingua* 117: 1355–77. [CrossRef]

Truswell, Robert. 2011. *Events, Phrases, and Questions*. Oxford: Oxford University Press. [CrossRef]

Vendler, Zeno. 1957. Verbs and times. *The Philosophical Review* 66: 143–60. [CrossRef]

Wagers, Matthew W. 2013. Memory mechanisms for wh-dependency formation and their implications for islandhood. In *Experimental Syntax and Island Effects*. Edited by Jon Sprouse and Norbert Hornstein. Cambridge: Cambridge University Press, pp. 161–85. [CrossRef]

Winkler, Susanne. 1997. *Focus and Secondary Predication*. Berlin and New York: Mouton de Gruyter. [CrossRef]

## *Article* **On the Nature of Syntactic Satiation**

**William Snyder**

Department of Linguistics, University of Connecticut, Storrs, CT 06269-1145, USA; william.snyder@uconn.edu

**Abstract:** In syntactic satiation, a linguist initially judges a sentence type to be unacceptable but begins to accept it after judging multiple examples over time. When William Snyder first brought this phenomenon to the attention of linguists, he proposed satiation as a data source for linguistic theory and showed it can be induced experimentally. Here, three new studies indicate (i) satiation is restricted to a small, stable set of sentence types; (ii) after satiation on one sentence type (e.g., *wh*-movement across *...wonder whether...* or *...believe the claim...*), acceptability sometimes increases for distinct but syntactically related sentence types (*...wonder why...*; *...accept the idea...*); (iii) for sentence types susceptible to satiation, the difficulty of inducing it (e.g., number of exposures required) varies systematically; and (iv) much as satiation in linguists persists over time, experimentally induced satiation can persist for at least four weeks. These findings suggest a role for satiation in determining whether the perceived unacceptability of two sentence types has a common source.

**Keywords:** syntactic satiation; linguistic judgments; island effects; experimental syntax

#### **1. Introduction**

#### *1.1. Overview of the Project*

In generative linguistics, information about a person's mental grammar comes primarily from that person's judgments of acceptability: certain combinations of form and meaning are fully acceptable, while others are not. The standard idealization is that any given native-speaker consultant who is asked, on different occasions, to judge the same <form, meaning> pair will provide the same judgment on each occasion.

A systematic exception is presented by "satiation" effects: for certain initially unacceptable sentence types, after a linguist has judged multiple examples over a period of time, the perceived acceptability increases. Satiation calls out for investigation, not only because linguistic theories need to take account of its possible effects on the data they use but also because it may provide new insights into the basic phenomena that linguistic theories are meant to explain.

This article performs some of the necessary groundwork for linguistic investigation of satiation by providing evidence for the following points:

	- b. Satiation effects induced in the laboratory are replicable, in the sense that the set of sentence types that potentially satiate is consistent across studies (and for the majority of sentence types, satiation does not occur);
	- c. Satiation effects for different types of "satiable" grammatical violation have different signatures (e.g., in the number of exposures typically needed before satiation occurs and in the typical percentage of experimental participants whose judgment changes).

The objective will be to show that investigation of satiation can broaden the range of empirical phenomena (and, thus, sources of data) bearing on key linguistic issues, including in particular a whole range of issues concerning the nature and status of acceptability judgments.

**Citation:** Snyder, William. 2022. On the Nature of Syntactic Satiation. *Languages* 7: 38. https://doi.org/10.3390/languages7010038

Academic Editors: Anne Mette Nyvad and Ken Ramshøj Christensen

Received: 9 August 2021; Accepted: 1 February 2022; Published: 17 February 2022

#### *1.2. Overview of Satiation*

Building on earlier unpublished work by Karin Stromswold, Snyder (2000) published a squib drawing linguists' attention to the phenomenon Stromswold had termed "syntactic satiation". For a highly circumscribed set of sentence types, linguists sometimes experience a "shift" in their native-speaker acceptability judgments. The paradigm case is (2).

(2) Who does John wonder whether Mary likes? (*Answer: He wonders whether she likes Pat*.)

On first exposure to an example like (2), a linguist may have thought it sounded starkly unacceptable. Yet, by the time the linguist was teaching an introductory course on syntax and presented an example like (2) to the students, the perception of grammatical impossibility may have been weaker, or even absent altogether. If so, the linguist had experienced satiation on that sentence type.

The description in (Snyder 2000), together with anecdotal reports and personal experience, motivates the characterization in (3).

	- a. Lexical Generality: Satiation operates at the level of a grammatical structure. The increased acceptability of the structure is general, extending beyond the specific sentences that caused satiation—at a minimum, to sentences with different open-class lexical items.
	- b. Structural Specificity: Only a limited number of sentence structures (i.e., types of grammatical violation) are potentially affected by satiation.
	- c. Between-speaker Consistency: At least across native speakers of English, the same sentence types (notably sentences involving *wh*-extraction of an argument from a *Wh*-Island, Complex NP, or Subject Island) are the ones that are, at least in principle, susceptible to satiation.
	- d. Within-speaker Persistence: Once an individual has experienced satiation on a given sentence type, the increased acceptance persists for a considerable period of time, even in the absence of routine exposure to sentences of that type.

In judging whether an experimental effect qualifies as "satiation" in the relevant sense, the characteristics in (3) will serve as a guide.

In the sections that follow, three new experimental studies are presented and discussed in light of the following questions:

	- b. When satiation is induced in the laboratory, does it persist beyond the experimental session? (Section 4)
	- c. Does the difficulty of inducing satiation vary between different sentence types that are susceptible to the effect? (Section 5)
	- d. Does satiation on one sentence type ever carry over to judgments of related sentence types, for example, from *Whether*-Island violations to sentences violating another type of *wh*-island? (Section 6)
	- e. How sensitive is the satiation phenomenon to details of experimental methodology? What aspects of the methodology appear to matter? (Section 7)

Section 7 includes a survey of the literature on satiation. Section 8 turns to larger questions: the nature of satiation and its relevance to the objectives of generative linguistics.

#### **2. Review of the Original Findings**

An important component of (Snyder 2000) was the evidence suggesting satiation can be induced in the laboratory, under controlled experimental conditions, and measured objectively.<sup>1</sup> Yet, Sprouse (2009) raised an important concern: the findings might have been due to what he termed "response equalization", rather than syntactic satiation. To facilitate discussion of the issue, this section will describe Snyder's (2000) experiment in detail. Section 3 will present a new experiment based on that study but modified to preclude response equalization.

#### *2.1. Overview of Methodology and Findings*

The experimental task in (Snyder 2000) took the form of a lengthy, printed questionnaire. Native English speakers, with no prior exposure to linguistics, were required to provide acceptability judgments on sentence-meaning pairs. A certain number of (initially) unacceptable sentence structures were systematically repeated in the course of the questionnaire. Thus, the participant received a compressed version of the linguist's experience of judging structurally equivalent sentences on multiple, distinct occasions.

On each page there was a single item like (5) (Snyder 2000, p. 576).

(5) (Context: Maria believes the claim that Beth found a \$50 bill.) Test Sentence: "What does Maria believe the claim that Beth found?" Judgment: \_\_\_\_ (Y/N)

Prior to starting, participants were told they would be asked for a series of 60 judgments. On each page there would be a declarative sentence (the "context") and then an interrogative sentence (the "test sentence"). Participants were instructed to provide a Yes/No judgment: Is the test sentence grammatically possible in English, given the meaning that fits the context? In other words, could the test sentence have the intended meaning and still be accepted as "English", in their personal opinion? Participants were advised that many items would be similar to one another, but they should not look back to previous pages or try to remember previous answers. Given that no two items would be identical, and given that the differences might be important, they should simply provide an independent judgment on each new test sentence and then move on.

Fifty of the items corresponded to a series of five experimental blocks (although this structure was invisible to participants). Each block contained items of the following types, in pseudo-random order: three fully grammatical items and seven items that would typically be perceived as anywhere from mildly to severely unacceptable, namely one item of each type in (6).<sup>2</sup>

	- g. Whether-Island violation (Context: Dmitri wonders whether John drinks coffee.) Test Sentence: "What does Dmitri wonder whether John drinks?"

In addition to the 50 test items, the experimental materials included six practice items immediately prior to Block 1 and four post-test items immediately following Block 5. (The distinction between these items and the actual test items was invisible to participants.) No two test items were ever identical: even within a single sentence type, almost all the open-class lexical items differed across sentences. There were two exceptions: CNPC violations in the body of the experiment—but not the post-test—consistently used the phrase *believe the claim*, and *Whether*-Island violations in the body of the experiment—but not the post-test—consistently used the phrase *wonder whether*.

An informal poll of linguists (all of them native speakers of English) indicated to Snyder that the phenomenon of syntactic satiation was relevant (at least) to *wh*-extraction of an argument across a *Whether* Island (6g) and out of a complex noun phrase of the type in (6b). In contrast, there appeared to be no satiation on LBC violations (6c) or *That-trace* violations (6e).<sup>3</sup> Thus, Snyder reasoned that if syntactic satiation could indeed be induced by his task, there ought to be a systematic tendency for participants to become more accepting of *Whether*-Island violations and/or CNPC violations by the end of the experiment. There should not, however, be increased acceptance of LBC or *That-trace* violations. (For the other sentence types in (6), the possibility of satiation was treated as an open question.)

The findings were as follows: As predicted, for both *Whether* and CNPC items there was a significant increase in acceptance from the beginning (Blocks 1 and 2) to the end (Blocks 4 and 5) of the questionnaire (two-tailed *p* < .05 by Binomial Test).<sup>4</sup> In contrast, for LBC and *That-trace*, there was no appreciable change. Hence, the findings were broadly consistent with the possibility that the task was inducing the same kind of judgment change that linguists sometimes experience. (Of the other sentence types, only Subject Islands showed any appreciable increase, and it was only marginally significant; *p* < .07.)

The four post-test items following Block 5 were two fully grammatical fillers, plus the two items in (7).

	- b. Whether-Island violation, with *ask whether* (Context: Mildred asked whether Ted had visited Stonehenge.) Test Sentence: "What did Mildred ask whether Ted had visited?"

To check for a possible "carryover" effect from judging CNPC violations with *believe the claim* (as in Blocks 1–5; cf. 6b), to *accept the idea* (7a), Snyder focused on participants who had initially rejected both of the CNPC violations in Blocks 1 and 2. (The intention was to focus on individuals whose grammar had clearly excluded such sentences prior to the experiment.) These participants were first classified as satiating or not satiating on *believe the claim*, based on whether they accepted at least one of the two *believe the claim* items in Blocks 4 and 5. They were then cross-classified as accepting or rejecting the post-test item, (7a).

Both in (Snyder 2000) and in the new experiments reported below, a participant is classified as having "satiated" on a given sentence type if (and only if) one of the following three situations holds true: (i) the exemplars in the first two blocks of the study were both rejected and exactly one of the exemplars in the final two blocks was accepted, (ii) the exemplars in the first two blocks were both rejected and the exemplars in the final two blocks were both accepted, or (iii) exactly one of the exemplars in the first two blocks was accepted and both of the exemplars in the final two blocks were accepted.
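Stated procedurally, the three cases reduce to a single condition: at most one early acceptance, and strictly more late acceptances than early ones. A minimal sketch (the function name and encoding are mine, not from the article):

```python
def satiated(first_accepts, final_accepts):
    """Classify a participant as satiating on a given sentence type.

    first_accepts: exemplars accepted in the first two blocks (0-2)
    final_accepts: exemplars accepted in the final two blocks (0-2)

    The three listed cases -- (0 -> 1), (0 -> 2), and (1 -> 2) -- reduce
    to: at most one early acceptance, and strictly more late acceptances
    than early ones.
    """
    return first_accepts <= 1 and final_accepts > first_accepts
```

Enumerating all nine possible (first, final) pairs confirms that exactly the three configurations listed in the text count as satiation.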

The rate of acceptance of the post-test item among participants who had consistently rejected the CNPC violations, both in Blocks 1 and 2 and in Blocks 4 and 5, was calculated as a baseline. A binomial test was then used to assess the data from participants who had likewise rejected the CNPC violations in Blocks 1 and 2 but accepted at least one of the CNPC violations in Blocks 4 and 5 (i.e., had satiated), in order to answer the following question: What was the probability of obtaining, simply by chance, an acceptance rate for the post-test item that was as high as (or even higher than) the rate observed in these latter participants? "Simply by chance" meant that the probability was calculated under the null hypothesis that, in general (among participants who rejected both items in Blocks 1 and 2), the participants who satiated (i.e., accepted at least one of the items in Blocks 4 and 5) had the same probability of accepting the post-test item as the participants who had not satiated.

In Snyder's data, all 22 participants had rejected the CNPC violations (i.e., with *believe the claim*) in Blocks 1 and 2. Of those 22, 17 also rejected the CNPC violations in Blocks 4 and 5. Only four of these 17 "non-satiators" accepted the post-test item with *accept the idea*. In contrast, among the five satiators, four accepted the post-test item. Under the null hypothesis that the general acceptance rate for satiators was the same as for non-satiators, namely 4/17 = 23.5%, the likelihood of seeing acceptance by at least four out of five satiators simply by chance is given by the binomial test: in the present case, two-tailed *p* < .05. Hence, there was significant carryover.
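The carryover computation can be reproduced with an exact binomial upper tail using only standard-library arithmetic (an illustrative reimplementation; the helper name is mine):

```python
from math import comb

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# CNPC carryover: under the null hypothesis, satiators accept the post-test
# item at the non-satiator baseline rate of 4/17; 4 of the 5 satiators did.
p_cnpc = binom_upper_tail(4, 5, 4 / 17)      # one-tailed, roughly 0.012
# Doubling gives the two-tailed value reported in the text (< .05).

# Whether-Island carryover (next paragraph): baseline 3/7; 10 of 11 accepted.
p_whether = binom_upper_tail(10, 11, 3 / 7)  # one-tailed, roughly 0.0014
```

Both doubled values fall below the significance thresholds reported in the text (*p* < .05 and *p* < .005, respectively).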

In the case of *Whether*-Island violations, 18 of the 22 participants rejected the items (i.e., with *wonder whether*) in Blocks 1 and 2. Of these 18, seven also rejected the *wonder-whether* items in Blocks 4 and 5. Only three of these non-satiators accepted the post-test item (with *ask whether*). This provided a baseline acceptance rate of 3/7 = 42.9%. Of the 11 satiators, however, 10 accepted the post-test item (binomial *p* < .005). Hence, there was also significant carryover for *Whether* Islands.<sup>5</sup>

In sum, Snyder (2000) obtained statistically reliable satiation on argument *wh*-extraction from both the complex-NP (*believe the claim*) environment and the *wonder-whether* environment, although far fewer participants showed the effect with complex NPs (five out of 22, as opposed to 11 of 22 for *whether*). Moreover, the satiation overwhelmingly "carried over" from *believe the claim* to *accept the idea* and from *wonder whether* to *ask whether*: four of the five satiators on CNPC violations exhibited carryover, as did 10 of the 11 satiators on *Whether* Islands.

#### *2.2. Some Possible Concerns*

A few further details of methodology are important for the present discussion. A major issue in any type of work with acceptability judgments is the fact that many different factors can influence them. These include not only the grammatical structure of the sentence being judged, but also the choices of open-class lexical items and the characteristics of the test item that was judged immediately prior. Therefore, alongside satiation, one of the possible reasons for participants to become more accepting of a given sentence type, as they work their way through an experiment, is that the specific examples presented later in the experiment are somehow intrinsically more acceptable, for reasons independent of their grammatical structure (e.g., due to the open-class lexical items that they happen to contain). Another possibility is that the specific test items positioned immediately prior to the sentences of interest made the earlier sentences seem less acceptable, and/or the later ones seem more acceptable, than they would ordinarily.

A simple way to minimize these possibilities is to counterbalance, across participants, the order of presentation: half the participants receive the items in forward order, and the other half receive the same items but in reverse order. Snyder (2000) therefore gave half of his participants a questionnaire containing the 50 test items in the order "..., Item 1, Item 2, Item 3, ..." and gave the other half the same items but in the order "..., Item 50, Item 49, Item 48, *...*". Any items that were intrinsically more acceptable than others of the same type would yield an increase in acceptability for half the participants but an equally strong decrease for the other half. Similarly, if judging a certain test item had a special effect on the participant's next judgment, then this effect would apply to different "next judgments" in the different orders of presentation. Crucially, if the experiment induced actual satiation on sentences of a given grammatical type, then it should yield increasing acceptance not only overall but also both in the subset of participants who received a "forward" order of presentation and in the subset who judged the same items but in reverse order.<sup>6</sup>

#### *2.3. Response Equalization?*

Let's now consider Sprouse's (2009) Response Equalization Hypothesis (REH). One type of task effect that is not addressed simply by counterbalancing the order of presentation, and that must be addressed separately, is the following. Suppose that participants come to any Yes/No task, such as the one in (Snyder 2000), with an expectation that exactly half the test items will have an expected answer of "Yes". A problem, then, is that, for 50 of the 60 items in Snyder's experiment (i.e., the five blocks of ten mentioned above), there was a ratio of seven items with a grammatical violation for every three items that were fully grammatical. Now, blocking of the items was invisible to the participants, and exactly half of the other 10 items (i.e., practice and post-test items) were fully grammatical. Therefore, participants saw a single series of 60 items, and 40 of them (66.7%) contained a grammatical violation.

Assuming participants noticed the discrepancy between the expected frequency of "Yes" items (50.0%) and the actual frequency (presumably 33.3%, for an unchanging native-speaker grammar of English), the REH says participants should have become more willing to say "Yes" as the experiment progressed. To make sure that Snyder's (2000) findings were not simply due to response equalization, the best approach is to rerun the experiment with exactly one change: add enough fully grammatical items for a 1:1 balance. This will be the first of three new experiments reported below.<sup>7</sup>

#### **3. Experiment I: A Direct Test of the Response Equalization Hypothesis**

#### *3.1. Materials*

Experiment I was identical to the experiment in (Snyder 2000) except that 20 new, fully grammatical test items were added to the questionnaire, so as to create a perfect balance: 40 items that were fully grammatical and 40 that violated a grammatical constraint. For each of the experimental blocks in Version A, four of the new items were randomly selected and inserted among the original 10 items, as follows: 1 2\_3 4\_5 6\_7 8\_9 10 (each "\_" marks an insertion point). Following these additions, each of the five blocks contained seven fully grammatical items and seven grammatical violations (one item for each of the seven types in (6), above), and there were never more than two expected "NO" items in a row. A new Version B was created from Version A by reversing the order of the resulting 70 test items. Together with the six practice items and four post-test items, this yielded 80 items per participant.
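The item counts described above can be verified with a little arithmetic (a sketch, assuming the practice and post-test items kept the half-grammatical split of the original study, as the text implies):

```python
# Per-block composition of Experiment I:
blocks = 5
good_per_block = 3 + 4   # 3 original grammatical items + 4 new insertions
bad_per_block = 7        # one item per violation type in (6)
test_items = blocks * (good_per_block + bad_per_block)    # 70 test items

# Practice and post-test items (half grammatical, as in the original study):
practice, post_test = 6, 4
total_items = test_items + practice + post_test           # 80 per participant

# Overall 1:1 balance of expected "YES" and "NO" items:
good_total = blocks * good_per_block + (practice + post_test) // 2   # 40
bad_total = blocks * bad_per_block + (practice + post_test) // 2     # 40
```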

In keeping with the original materials of (Snyder 2000), the new grammatical items were designed to be comparable in their structural complexity to the ungrammatical items. Some representative examples are provided in (8b,d,f).

	- d. Grammatical item, with *wonder what*: (Context: Gina wonders whether Einstein discovered relativity.) Test Sentence: "Who wonders what Einstein discovered?"
	- e. LBC violation, with *how many ... books*: (Context: Edwin thinks Margaret read three books.) Test Sentence: "How many does Edwin think Margaret read books?"
	- f. Grammatical item, with *how many books*: (Context: Edward thinks that Anne read ten books.) Test Sentence: "How many books did Edward think that Anne had read?"

Under the REH account of Snyder's (2000) findings, the prediction for Experiment I (where participants now see the same number of expected "YES" and expected "NO" items) is that there will be no systematic tendency for any sentence type to be accepted more often at the end than at the beginning of the experiment. In contrast, what we might call the "Satiation Hypothesis" predicts an increased likelihood of "Yes" responses at later points in the experiment for *Whether*-Island and/or CNPC violations but no systematic tendency toward increasing acceptance of *That*-trace or LBC violations.

#### *3.2. Plan for Data Analysis*

In a yes–no task, the responses cannot be expected to obey a normal (Gaussian) distribution. Snyder (2000) therefore relied primarily on binomial tests and Fisher Exact Tests, which are both "non-distributional" in the sense of not assuming a normal distribution. Here, the approach to data analysis will again rely on non-distributional methods of two main types. First, for each of the initially unacceptable sentence types, a Wilcoxon Signed-Rank test will be used to assess whether "Yes" responses were significantly more frequent at the end of the experiment (in the final two blocks) than at the beginning (in the initial two blocks).
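For concreteness, the Wilcoxon signed-rank statistic can be sketched as follows (an illustrative standard-library reimplementation with a normal approximation for *Z*; not the analysis code actually used in the study):

```python
from math import sqrt

def wilcoxon_signed_rank(before, after):
    """Signed-rank statistic W and normal-approximation Z for paired data.

    before/after: per-participant acceptance counts (e.g., "Yes" responses
    in Blocks 0-1 vs. Blocks 3-4). Zero differences are dropped, and tied
    absolute differences share their average rank.
    """
    diffs = [y - x for x, y in zip(before, after) if y != x]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 0
    abs_sorted = sorted(abs(d) for d in diffs)

    def avg_rank(v):
        # Rank of v among the sorted absolute differences, averaged over ties.
        first = abs_sorted.index(v) + 1
        count = abs_sorted.count(v)
        return first + (count - 1) / 2

    w = sum((1 if d > 0 else -1) * avg_rank(abs(d)) for d in diffs)
    # Under the null hypothesis, E[W] = 0 and Var[W] = n(n+1)(2n+1)/6.
    z = w / sqrt(n * (n + 1) * (2 * n + 1) / 6)
    return w, z, n
```

Sign conventions vary across implementations; a negative *W* (as reported for the *whether* items below) simply reflects the direction in which the paired differences were taken.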

Second, whenever possible, mixed-effects (ME) logistic regression will be used as a follow-up test. The logistic extension to linear regression is in many ways ideal for the analysis of yes–no judgment data, but sometimes, there are difficulties in achieving convergence (i.e., in fitting a model to the dataset), especially if the number of participants is relatively small. Given that convergence is not always possible, the role of ME logistic regression will be secondary: in the event that convergence cannot be achieved, the results of the Wilcoxon tests will have to suffice. (In practice, such a situation will arise during the analysis of data from Experiment II, below.)

When applying ME logistic regression, the search for a model fit will always begin with a "maximally" specified model (cf. Barr et al. 2013), which will then be simplified if necessary in order to achieve convergence. A limit on simplification, however, will be that the Random Effects (RE) portion of the model must always include "random intercepts" for individual participants and for individual test items and must include by-participant "random slopes" for the effect of each of the major factors in the experiment. (For the present purposes, the major factors are the sentence type being judged and the block of the experiment in which the judgment is made.) This ensures that the model is appropriately adjusted for (i) variation in the overall willingness of a participant to say "yes" to test items in general (i.e., the by-participant random intercept), (ii) the participant's general willingness to say "yes" to each of the different types of sentence (i.e., the by-participant random slopes for Type), and (iii) the participant's general willingness to say "yes" in each successive Block of the experiment (i.e., the by-participant random slope for Block). It also ensures that the model adjusts for variation across the different sentences (i.e., "ItemCodes") that exemplify a particular sentence type (i.e., within any single experimental treatment).

A small change from (Snyder 2000) is that the blocks of Experiment I (as well as Experiments II and III below) will be numbered from 0 to 4, rather than 1–5. This has the desirable consequence that, for each of the (initially) unacceptable sentence types, the block number can be interpreted as the participant's number of previous exposures to that sentence type during the experiment.

One special strength of ME logistic regression is its ability to evaluate a given participant's response to a test sentence relative to that same participant's responses to control sentences. The control sentences (in all of Experiments I–III) will be grammatically well-formed sentences that are similar to the test sentences in their structural complexity and that are judged in the same block as the corresponding test item. If participants experience genuine satiation on sentences of type T, then we expect ME logistic regression to reveal a significant interaction between block number and sentence type, for sentence type T.

More precisely, ME logistic regression will be conducted with one level of each factor specified as a baseline for use in "treatment contrasts" (i.e., pairwise comparisons) with each of the other levels of that factor. For Type, the baseline level will be "Good" (i.e., within each block, the results for the seven fully grammatical items). A treatment contrast will then be calculated for each of the seven other (i.e., deviant) sentence types. Crucially, for each of these non-baseline levels, the analysis will check for an interaction effect: did the effect of "changing" from the grammatical items (the baseline) to an item of that type differ significantly, as a function of the experimental block in which the judgments were made?

Finally, evidence of increased acceptance at the end of the experiment (in the form of a significant Wilcoxon test and, when ME logistic regression converges, a significant interaction effect) is necessary, but not sufficient, for a claim of satiation on T. If participants exhibit genuine satiation of the kind characterized earlier in (3), then we expect some additional findings, and we need to confirm their presence. Specifically, the increased acceptability of a given sentence type following satiation should be evident regardless of the order in which sentences were presented. Hence, the next step will be to examine the data from Versions A and B separately. If genuine satiation occurred, we expect each version to show a statistically significant increase, from the beginning to the end of the experiment, in the frequency of acceptance.

#### *3.3. Experimental Participants and Procedure*

The participants in Experiment I were 22 undergraduate students, all native speakers of English, who were recruited by means of printed flyers posted on campus. Compensation was provided in the form of a \$5 gift card, redeemable at the university bookstore. Participants were brought into an individual testing room and given the instructions orally (the instructions were also provided in printed form). Participants then received the materials in the form of a printed booklet, exactly as in (Snyder 2000). Completion of the task took about 15 min.

#### *3.4. Checking for Outliers*

Prior to running inferential statistics, the data were checked for participants who fell more than two standard deviations from the group average on either expected "YES" or expected "NO" items, because any such participants may not have understood the instructions. Indeed, two participants were more than two standard deviations below the group mean on acceptance of expected "YES" items and were excluded from further analysis, leaving *N* = 20.<sup>8</sup>

#### *3.5. Primary Analysis: Wilcoxon Tests*

Wilcoxon Signed-Rank tests were used to assess the statistical reliability of changes in acceptance rate, for each sentence type, between the first two blocks (0 and 1) and the final two blocks (3 and 4). The main results were as follows. Acceptance in Blocks 3 and 4 was significantly greater for *whether* items (*W* = −93, ns/r = 14, *Z* = −2.9, *p* < .005), but there was no significant change for any other sentence type (all *p* > .10). The data are shown graphically in Figures 1–4.<sup>9</sup>

For *whether* items, when the 20 participants are viewed individually, some 14 showed a change between the initial two blocks and the final two, and in 13 cases, it was an increase. (Nine increased from 0/2 to 1/2, three increased from 0/2 to 2/2, and one increased from 1/2 to 2/2. The individual showing a decrease changed from 2/2 to 1/2.) Among the six participants whose level of acceptance was unchanged, three consistently rejected the sentences, and three consistently accepted them.

**Figure 1.** (**Left**) Percentage of participants in each block of Experiment I, who accepted the *Whether*-Island violation; Version A (*N* = 12) used forward presentation; Version B (*N* = 8) used reverse order; "All" indicates the total (*N* = 20). (**Right**) Mean percentage of the "Yes" items that were accepted; each participant judged seven items per block.

**Figure 2.** Experiment I, Adjunct-Island and CNPC violations.

**Figure 3.** Experiment I, LBC and Subject-Island violations.

**Figure 4.** Experiment I, *That*-trace and *Want-for* violations.

#### *3.6. Cross-Checking: ME Logistic Regression*

Mixed-effects modeling was performed using R (version 3.2.3; R Core Team 2015) and the lme4 package (version 1.1–11; Bates et al. 2015). An ME logistic model was constructed using lme4's "glmer" function. In the notation of the lme4 package, the model was specified as in (9).

(9) Response~Block\*Type + (1 + Block + Type|Participant) + (1|ItemCode) + (1|Version)

Thus, the software searched for an optimally specified logistic-regression model with which to "predict" each participant's yes/no response to each test item, based on (i) the (integer) number of the block (0–4) in which the test item appeared, (ii) the grammatical Type of the test item, and (possibly) (iii) an interaction effect between Block and Type for each (non-baseline) value of Type. As noted above, a significant interaction is precisely what we expect to see if participants experience satiation on a given sentence type.

The initial attempt to fit a model with structure (9) to the data from Experiment I was unsuccessful: the glmer program failed to converge. Inspection of the program's best attempt revealed two issues. First, in the RE structure, the random intercepts by Version explained none of the variation in the dataset.<sup>10</sup> Second, in the fixed-effects structure for the program's "best attempt" at a fit, the main effect of Block had an estimated coefficient (0.16) that was an order of magnitude smaller than the coefficients for the main effects of the different levels of Type (which ranged from 1.73 to 8.21). A difference in scale of one or more orders of magnitude can prevent convergence. Hence, two changes were made. First, Version was removed from the RE structure in (9). Second, the factor Block was re-scaled: the values of Block in the dataset were simply divided by 10 (so that Block ranged from 0.0 to 0.4).
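The rationale for the rescaling can be illustrated with a minimal, hand-rolled logistic fit on simulated data (no claim is made about lme4's internal optimizer): dividing a predictor by 10 multiplies its fitted coefficient by 10 while leaving the model itself unchanged, which brings a small Block coefficient onto the same scale as the larger Type coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated yes/no responses over blocks 0-4 (illustrative only);
# the true Block effect of 0.16 echoes the estimate in the text.
block = np.repeat(np.arange(5), 200).astype(float)
true_logit = -1.0 + 0.16 * block
y = (rng.random(block.size) < 1 / (1 + np.exp(-true_logit))).astype(float)

def fit_logistic(x, y, iters=50):
    """Unpenalized MLE for logit P(y=1) = b0 + b1*x via Newton's method."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ b))
        W = p * (1 - p)
        b = b + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return b

b_raw = fit_logistic(block, y)          # Block on its original 0-4 scale
b_scaled = fit_logistic(block / 10, y)  # Block rescaled to 0.0-0.4

# Rescaling the predictor by 1/10 multiplies its coefficient by exactly
# 10; the intercept and the fitted probabilities are unchanged.
print(b_raw[1], b_scaled[1] / 10)
```

The fits are equivalent; only the numerical scale of the coefficient changes, which is precisely what helps an optimizer that struggles with coefficients differing by an order of magnitude.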

Following these changes, the program converged on the model summarized in Table 1.<sup>11</sup> As expected, pairwise comparisons showed that each of the ungrammatical levels of Type differed significantly from the grammatical (baseline) items. There was no main effect of Block (*p >* .10), and there was exactly one significant interaction of Block with Type, namely for Type = *Whether*; acceptance of *Whether* items increased significantly, as the participants progressed from Block 0 to Block 4.


**Table 1.** Table of fixed effects for Experiment I.

Thus, the results of ME logistic regression are entirely consistent with the results from Wilcoxon tests: in Experiment I there was possible satiation on items with *wonder whether*, but (in contrast to Snyder 2000) there was no satiation on the complex-NP items with *believe the claim* (*p* > .10). Consistent with Snyder 2000, there was no satiation on any of the other sentence types tested.

#### *3.7. Follow-Up Testing*

The next question is whether the apparent satiation on *wonder whether* meets the additional criterion discussed above: Did acceptance increase, from the beginning to the end of the questionnaire, in both versions? Indeed, Versions A and B each showed the same general pattern as the full study. Overall (as noted above), 14 participants showed a change, and in 13 cases, it was an increase. On Version A, seven participants changed, and in all cases, it was an increase. On Version B, seven participants changed, and in six cases, it was an increase. Hence, the findings conform closely to what is expected under satiation.

Experiment I shows that satiation can indeed be obtained under laboratory conditions, at least for *wonder whether* items, even when participants judge a perfect balance of fully acceptable versus initially unacceptable sentences. The main difference from Snyder 2000 is the absence of a change for CNPC items. In Section 5, we will see evidence that the final sample size (*N* = 20) in Experiment I was far too small for reliable detection of satiation on CNPC sentences, but regardless, the specific sentence type (*wonder whether*) that showed satiation was also one of the types showing it in (Snyder 2000). Hence, the findings from Experiment I are fully in line with Between-speaker Consistency (3c) (as well as Generality and Structural Specificity). Next, we check for Within-speaker Persistence (3d).

#### **4. Experiment II: Persistence**

Did the increase in acceptance of *Whether*-Island violations observed in Experiment I persist beyond the time of the experiment? To find out, each participant was invited to return for testing one month later. Of the 20 participants whose data were included in the analyses for Experiment I, 15 agreed to return.

Each of these participants was tested again, 4 to 5 weeks later, in much the same way as the first time. In almost all cases, if a participant (for example) received Version A at Time 1, then Version B was given at Time 2. One participant was accidentally given Version B at both Time 1 and Time 2. Among the other 14, eight received Version A and six Version B at Time 1; hence, six (of these 14) received A and eight received B at Time 2.

The predictions were as follows. If the satiation on *Whether* items in Experiment I quickly faded, then there should be no significant difference between participants' judgments at the beginning of Experiment I (Blocks 0 and 1), and the same participants' judgments one month later at the beginning of Experiment II (Blocks 0 and 1). In contrast, if the satiation that was detected on *Whether* items showed Within-speaker Persistence, then, at least for *Whether* items, the frequency of acceptance at the beginning of Experiment II should be significantly higher.

Moreover, if the satiation on *Whether* items persisted, this should be evident when we examine the data by participant. For example, someone who accepted neither of the examples in Blocks 0 and 1 of Experiment I but accepted one of the examples in Blocks 3 and 4 of Experiment I would be expected to accept at least one of the two examples in Blocks 0 and 1 of Experiment II.

#### *4.1. Primary Analysis: Wilcoxon Tests*

Wilcoxon Signed-Rank Tests were performed to check for increased acceptance at Time 2. For each sentence type, each participant's responses in Blocks 0 and 1 of Experiment I were compared against Blocks 0 and 1 of Experiment II. As predicted by Within-speaker Persistence, there was a significant increase for *Whether* items from the beginning of Experiment I ("Time 1") to the beginning of Experiment II ("Time 2"; *W* = −45, ns/r = 9, *p* < .01). No other sentence type showed a significant increase. On *Whether*, when the participants are viewed individually, six were consistent across Times 1 and 2, and nine showed a change. In all cases, if there was a change, it was an increase: for four participants, from 0/2 "yes" responses at Time 1 to 1/2 "yes" at Time 2 and, for five participants, from 0/2 at Time 1 to 2/2 at Time 2. The full results are shown graphically in Figure 5.

**Figure 5.** Experiment II: Acceptance of initial sentences (Blocks 0 and 1) at Times 1 and 2; Error bars show standard error; "**\***" indicates significance (*p* < .05) by Wilcoxon Signed-Rank Test.

The by-participant results are shown in Figure 6. Of the 15 individuals who participated in Experiment II, 11 rejected both of the *Whether*-Island violations at the beginning (Blocks 0 and 1) of Experiment I, and four accepted both of them. Of the 11 who rejected them, one participant continued to reject them at the end (Blocks 3 and 4) of Experiment I, while the other ten accepted at least one (i.e., they had satiated). As can be seen in Figure 6, eight of the ten satiators accepted at least one of the two exemplars of a *Whether*-Island violation in Blocks 0 and 1 of Experiment II; the remaining two satiators both accepted one fewer than they had in Blocks 3 and 4 of Experiment I. (The four participants who accepted the exemplars in Blocks 0 and 1 of Experiment I continued to accept (in all but one case) the exemplars at the end of Experiment I and beginning of Experiment II.)


**Figure 6.** By-participant findings in Experiment II, showing the number (0–2) of Whether-Island violations accepted at three time points (Beginning and End of Experiment I and Beginning of Experiment II).

#### *4.2. Cross-Checking: ME Logistic Regression*

The complete Time 1 and Time 2 data for Blocks 0 and 1 were submitted to GLMER, with the following model specification.

(10) Response~Time \* Type + (1 + Time + Type|Participant) + (1 + Time|ItemCode) + (1|Version)

Unfortunately, GLMER failed to converge on a fit, even after the RE component for Version had been removed (its theta parameter had been estimated at zero). Given that the primary point of interest concerned the *Whether* items in relation to the grammatically well-formed (Good) items (i.e., because *Whether* was the only sentence type for which the Wilcoxon tests had indicated a significant effect), the dataset was next trimmed to include only the *Whether* and Good sentence types. At that point, using the model specification in (10), GLMER succeeded. The resulting parameters for the fixed effects are shown in Table 2. The results are fully consistent with those from the Wilcoxon tests: the effect of "changing" from the control items to the *Whether* items was a large reduction in acceptance at Time 1 but a significantly smaller reduction at Time 2.


**Table 2.** Table of fixed effects for Experiment II.

#### *4.3. Follow-Up Testing*

As indicated in Section 4.1, when we compare the initial responses (i.e., during the first two blocks of the stimuli) for Time 1 and Time 2, nine of the 15 participants showed an increase in their acceptance of *whether* items, and none showed a decrease. The results were very similar for each version. Of the eight participants who saw Version A at Time 1 and Version B at Time 2, five showed an increase from Time 1 to Time 2, and none showed a decrease. Of the six who saw Version B at Time 1 and A at Time 2, three increased, and none decreased. The one participant who saw Version B at both Time 1 and Time 2 also showed an increase.

In sum, Experiment II indicates that the characteristic of Within-speaker Persistence (3d), reported anecdotally by linguists, also holds (at least in the case of *Whether*-Island violations) for experimentally induced satiation in non-linguists. The increase in "yes" responses on *Whether* items that was statistically significant by the end of Time 1 testing was still statistically significant four weeks later. Indeed, when we examined the performance of the 15 participants individually (Section 4.1), we found that 10 had changed at Time 1 from zero "yes" responses in Blocks 0 and 1 to at least one "yes" response in Blocks 3 and 4 (i.e., they had satiated). In Blocks 0 and 1 of the Time 2 testing, nine of these 10 satiators still said "yes" to at least one *Whether* item. Indeed, given the extremely low likelihood of encountering any *Whether*-Island violations between testing sessions, and given that participants had judged only six examples at Time 1 (five with *wonder whether*, plus a post-test item with *ask whether*), the persistence of the satiation effect is remarkable. This degree of persistence suggests that the satiation the participants had experienced on *whether* items was a "learning" effect rather than a short-term priming effect.

#### **5. Experiment III: Variation in Effect Size**

*5.1. Overview*

A possible concern about Experiments I and II is that they are based on a sample of only 15–20 individuals. This is an especially important consideration given that satiation on CNPCs was weak in (Snyder 2000) (i.e., detected in only five of 20 participants) and not detected at all in Experiments I and II. Increasing the sample size will potentially allow us to reproduce, and better characterize, whatever satiation effect is present for CNPCs.

In Experiment III, the sample size was increased to 151 individuals. The participants were undergraduates taking a large introductory course on the philosophy of language. (None of them had participated in Experiment I or II.) The stimuli were nearly identical to those in Experiments I and II, but they were presented online (one item at a time, so as to control the order in which the judgments were made), and the two items shown in (11) were added to the post-test.

(11) a. *Wh*-Island violation with *wonder why*
(Context: Olga wonders why Sally likes Fred.)
Test Sentence: "Who does Olga wonder why Sally likes?"
b. *Wh*-Island violation with *know how*
(Context: Sue knows how Bill fixed the motorcycle.)
Test Sentence: "What does Sue know how Bill fixed?"

Two fully grammatical items were also added to the post-test. Hence, a participant judged 84 items in total.

Participants began by answering questions about their language background and then were randomly assigned to Version A or B. The initial sample included 194 individuals, but the data from 29 participants were discarded because they reported (in answer to the initial questions) that English was not the first language they had acquired. Data were also discarded if a participant's rate of "Yes" responses to fully grammatical items was more than two standard deviations below the group's average, or if the rate of "Yes" responses to "deviant" items was more than two standard deviations above the average. Fourteen additional individuals were excluded by these criteria, for a final sample of 151.

#### *5.2. Primary Analysis: Wilcoxon Tests*

Wilcoxon Signed-Rank Tests indicated possible satiation on four sentence types, namely the sentences violating *Whether*-Island, CNPC, *That*-trace, and Subject-Island constraints. On *Whether* Islands, 70 of 151 participants showed a change between the initial two and the final two blocks, and for 56, it was an increase (*W* = −1645, ns/r = 70, *Z* = −4.81, *p* < .0001). For CNPC violations, 36 showed a change, and for 28, it was an increase (*W* = −394, ns/r = 36, *Z* = −3.09, *p* < .005). For *That*-trace violations, 72 showed a change, and for 49, it was an increase (*W* = −940, ns/r = 72, *Z* = −2.64, *p* < .01), and for Subject Islands, 60 showed a change, and for 40, it was an increase (*W* = −650, ns/r = 60, *Z* = −2.39, *p* < .05). None of the other sentence types showed a significant change.
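The reported *Z* scores can be recovered from *W* and ns/r via the standard normal approximation for the signed-rank statistic, Z = W / sqrt(n(n+1)(2n+1)/6). A quick check (the small discrepancies in the third decimal place reflect rounding and any tie corrections applied in the original analysis):

```python
import math

def z_from_w(w, n):
    """Normal approximation to the signed-rank Z, given the signed rank
    sum W and the number n of nonzero differences (ns/r), without tie
    or continuity corrections."""
    return w / math.sqrt(n * (n + 1) * (2 * n + 1) / 6)

# (W, ns/r) pairs as reported for Experiment III.
for w, n in [(-1645, 70), (-394, 36), (-940, 72), (-650, 60)]:
    print(f"W = {w}, n = {n}: Z = {z_from_w(w, n):.3f}")
```

The computed values (−4.813, −3.095, −2.637, −2.393) agree with the reported −4.81, −3.09, −2.64, and −2.39 to within rounding.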

#### *5.3. Cross-Checking: ME Logistic Regression*

Findings were cross-checked using ME logistic regression with the same model structure (9) that was tried initially on the data from Experiment I (i.e., with random intercepts for Version and without any re-scaling of the Block number). The software converged on a model fit, as shown in Table 3, and indicated possible satiation on extraction from *Whether* Islands, Complex NPs, Subject Islands, and *That-trace* environments. Hence, the results were fully consistent with the results from the Wilcoxon tests.


**Table 3.** Table of fixed effects for Experiment III.

#### *5.4. Follow-Up Testing*

The next step was to check whether these cases met the additional criterion discussed above: Did acceptance increase significantly in both versions? For *wonder whether*, findings were fully consistent between versions. Recall that with the two versions combined, 70 participants showed a change in acceptance, and for 56, it was an increase. In Version A, 34 showed a change, with 28 increasing (Wilcoxon Signed-Rank Test, *W* = −399, ns/r = 34, *Z* = −3.41, *p* < .001), and in Version B, 36 showed a change, with 28 increasing (*W* = −434, ns/r = 36, *Z* = −3.41, *p* < .001). This qualifies as reliable evidence of a satiation effect on *Whether* Islands.

For CNPC violations as well, the findings were fully consistent across versions. Recall that, with the two versions combined, 36 participants showed a change, with 28 increasing. In Version A, 18 showed a change, with 15 increasing (*W* = −114, ns/r = 18, *Z* = −2.47, *p* < .05), and in B, 18 showed a change, with 13 increasing (*W* = −91, ns/r = 18, *Z* = −1.97, *p* < .05). Note that, in their initial acceptance rate, the CNPC items were quite similar to LBC items. Figure 7 provides a side-by-side comparison of LBC, where Block 0 acceptance was approximately 5% and no satiation was evident, versus CNPC, where Block 0 acceptance was just under 5% and satiation clearly occurred.

In the case of *That*-trace, recall that 72 participants showed a change, and for 49, it was an increase. Yet, this increase was overwhelmingly driven by Version B, where 39 showed a change, and for 30 (i.e., 77%), it was an increase (*W* = −444, ns/r = 39, *Z* = −3.09, *p* < .005). On A, however, where 33 showed a change, this was an increase in only 19 (58%) of the cases (*W* = −47, ns/r = 33, *Z* = −0.42, *p* > .10 NS). The lack of a significant change in Version A means the findings do not qualify as reliable evidence of satiation. Instead, they were quite possibly an artifact of the particular order in Version B. (For a side-by-side comparison of *That*-trace with a Block 0 acceptance of approximately 25% and *Whether*-Island violations with a Block 0 acceptance just above 30%, see Figure 8.)

**Figure 7.** Comparison of LBC violations and CNPC violations in Experiment III.

**Figure 8.** Comparison of *That*-trace and *Whether*-Island violations in Experiment III.

Figures 9 and 10 show the findings for the remaining sentence types. In the case of Subject Islands, recall that 60 participants showed a change, and for 40, it was an increase. When the versions are viewed separately, there is still a significant change in Version A (*W* = −213, ns/r = 30, *Z* = −2.19, *p* < .05), but the change observed in Version B does not reach significance (*W* = −116, ns/r = 30, *Z* = −1.19, *p >* .10 NS). The absence of a significant change in Version B means the findings from Experiment III do not qualify as reliable evidence of satiation, but this could well change in a follow-up study (as will be discussed momentarily).

In sum, Experiment III provides clear evidence of satiation on *Whether* Islands and complex NPs but not on the other sentence types examined. The results are entirely consistent with (Snyder 2000) and largely consistent (i.e., in all respects, except for CNPC) with Experiments I and II. Once again, there was no possibility of response equalization (in the sense of the REH), and yet, a familiar pattern emerged: clear-cut satiation on *Whether* Islands (and, this time, also on CNPC violations, as in Snyder 2000) but not on LBC violations and not on *That*-trace violations. Nor was there reliable evidence of satiation on Subject Islands, Adjunct Islands, or *Want-for* environments. Naturally, this does not preclude the possibility that one or more of these latter sentence types will show clear evidence of satiation in another study, especially if the experimental conditions are different. Indeed, in the literature review in Section 7, we will see that satiation is sometimes found for Subject-Island violations but chiefly in studies where participants judged a greater number of examples than they did in Experiments I–III.

**Figure 9.** Experiment III, *Yes*-items and Adjunct-Island violations.

**Figure 10.** Experiment III, Subject-Island and *Want-for* violations.

#### *5.5. Assessing Effect Size*

Finally, a key difference from Experiment I is the clear satiation on CNPC items. The findings of Experiment III actually serve to clarify why this difference exists: in absolute terms, the effect size for CNPC violations was extremely small. With our sample size of more than 150 participants, we can characterize the effect fairly precisely, and it turns out to have been unrealistic to expect reliable detection of satiation on CNPC items with only 20 participants, as in Experiment I.

In Experiment III, the average acceptance rate for CNPC violations increased from 5% of participants in Blocks 0 and 1 to 12% in Blocks 3 and 4. For present purposes, let us make the (generous) assumption that these rates are a good estimate for the larger population of English-speaking college students. In that case, a simple probability calculation indicates the following. Any participant drawn from this general population and presented with the same materials should have a .013 probability of accepting exactly two more CNPC items in Blocks 3 and 4 than in Blocks 0 and 1. This is because *p*("No" in Block 0) × *p*("No" in Block 1) × *p*("Yes" in Block 3) × *p*("Yes" in Block 4) = (1 − .05) (1 − .05) (.12) (.12) = .013. Likewise, there should be a .192 probability of accepting one more, a .074 probability of accepting one fewer, and a .00194 probability of accepting two fewer. By power analysis, it follows that *N* = 76 is the absolute smallest sample size for which the expected frequencies of participants in these four categories (i.e., an expected (76) (.013) = one participant who increases by two, 15 who increase by one, six who decrease by one, and zero who decrease by two) will result in a significant change by Wilcoxon test (at *p* < .05).
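These probabilities are easy to verify by enumerating the possible acceptance counts (0–2) in each pair of blocks, treating the two items as independent Bernoulli trials with the rates just estimated:

```python
# Per-item acceptance probabilities estimated from Experiment III:
# Blocks 0-1 vs. Blocks 3-4.
p_early, p_late = 0.05, 0.12

def binom2(p):
    """Distribution of the number of acceptances (0, 1, or 2) over two items."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

early, late = binom2(p_early), binom2(p_late)

# P(net change = d) for d in -2..+2, summing over compatible outcomes.
net = {d: sum(early[a] * late[b]
              for a in range(3) for b in range(3) if b - a == d)
       for d in range(-2, 3)}

print(round(net[2], 3), round(net[1], 3),
      round(net[-1], 3), round(net[-2], 5))
```

The computation reproduces the four figures in the text: .013 for a two-item increase, .192 for a one-item increase, .074 for a one-item decrease, and .00194 for a two-item decrease.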

The moral is that, even within the set of sentence types that are susceptible to satiation, the strength of the effect may vary as a function of the specific linguistic constraint that is violated. Of the 151 participants in Experiment III, there were 125 who rejected at least one of the two *wonder whether* items in Blocks 0 and 1 and therefore had the possibility of showing increased acceptance (satiation) in Blocks 3 and 4. In fact, 56 of the 125 (45%) satiated. In contrast, 149 of the 151 participants rejected at least one of the CNPC items in Blocks 0 and 1, and only 28 (19%) showed satiation. This raises several questions. For a start, we might ask why, even on *whether* items (where we saw quite a strong satiation effect), fewer than half the participants showed any detectable change. Is this purely a matter of chance, or does susceptibility to the effect relate systematically to some other aspect of an individual's cognitive profile? This would be an interesting question for future research.

At the same time, we might ask whether this dimension of the satiation phenomenon increases its value as a diagnostic tool: perhaps both the susceptibility of a sentence type to satiation in the first place and the strength of the satiation effect observed can provide useful information about the source of the initial unacceptability. This will be taken up again in Section 8.

#### **6. Carryover Effects of Satiation (Experiments I and III)**

Recall that (Snyder 2000) found evidence of carryover effects. Restricting attention to those participants who rejected both *wonder whether* items in the first two blocks, the ones who satiated on *wonder whether* (i.e., who accepted at least one of the two exemplars in the final two blocks) were significantly more likely than the others to accept a post-test item involving *ask whether*. Likewise, among those participants who rejected both CNPC items (with *believe the claim*) in the initial two blocks, the ones who accepted at least one CNPC item in the final two blocks were significantly more likely than the others to accept a post-test item involving *accept the idea*. This section presents the corresponding findings from Experiments I and III.<sup>12</sup>

In Experiment I, the participants showed clear satiation on *Whether* Islands (although not on CNPC items), and they judged the same post-test items used in (Snyder 2000). Following the procedure of (Snyder 2000) (described in Section 2), we will restrict our attention to the 15 participants who had rejected both of the *wonder-whether* items in Blocks 0 and 1. Of these 15, three also rejected the *wonder-whether* items in Blocks 3 and 4. Among these three non-satiators, none accepted the post-test item with *ask whether.* In contrast, of the 12 satiators (all of whom accepted at least one of the two *wonder whether* items in Blocks 3 and 4), four (=33%) accepted the post-test item. Hence, the data are consistent with the presence of carryover to the post-test item, although the small numbers make it difficult to assess the finding statistically. (In particular, the base rate of zero acceptance in the non-satiators makes the use of a binomial test inappropriate.)

When we turn to the data from the larger sample in Experiment III, we find evidence of satiation carryover with both CNPC items and *Whether* Islands. For CNPCs, 139 participants rejected both items in Blocks 0 and 1. Of these 139, 112 were non-satiators (i.e., they also rejected the CNPC items in Blocks 3 and 4), and of these 112, 28 (=25%) accepted the post-test item. In contrast, 15 (=56%) of the 27 satiators accepted it. Hence, there was significant carryover (binomial *p* < .005).
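One way to reconstruct this binomial comparison (a sketch; it treats the non-satiators' acceptance rate of 28/112 = .25 as the null hypothesis for the 27 satiators, which is one natural reading of the test reported above) uses only the exact binomial tail:

```python
from math import comb

# CNPC carryover in Experiment III: 15 of 27 satiators accepted the
# post-test item, against a baseline rate of 28/112 among non-satiators.
n, k, p0 = 27, 15, 28 / 112

# One-tailed exact binomial probability of observing >= k acceptances
# under the null rate p0.
p = sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))
print(p)
```

The resulting tail probability is well below .005, in line with the significance level reported in the text.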

As noted in Section 5, the post-test in Experiment III included an example of *wh*-extraction across *ask whether* (as was also the case in Experiment I), plus one example each for *wonder why* and *know how*. Possible carryover from satiation on *wonder whether* was checked for all three of these post-test items. For *wonder whether*, 79 participants rejected both examples in Blocks 0 and 1. Of these 79, 38 were non-satiators: three (8%) accepted *wonder why*, eight (21%) accepted *know how*, and six (16%) accepted *ask whether*. Of the 41 satiators, 11 (27%) accepted *wonder why*, 10 (24%) accepted *know how*, and nine (22%) accepted *ask whether*. Application of the binomial tests yields *p* < .001 for *wonder why* but *p* > .10 for both *know how* and *ask whether*. The lack of a significant carryover effect for *ask whether* is a departure from (Snyder 2000).

To sum up, in Experiment III (as in Snyder 2000), there was significant carryover from CNPC items involving *believe the claim* to items with *accept the idea*. Yet, the findings for *wh*-islands were more complex. On the one hand, there was significant carryover from *wonder whether* to *wonder why*, which is interesting insofar as it suggests the satiation on *Whether* Islands may be independent of the many special characteristics of the English *wh*-complementizer *whether*. Yet, in Experiment III, there was no significant carryover from *wonder whether* to *ask whether* (as there had been in Snyder 2000 and possibly also in Experiment I), nor was there significant carryover to *know how*. Hence, there is clearly a need for additional research.

Before concluding this section, one final point should be examined. Given that Experiment III yielded evidence of satiation on both *wonder-whether* and CNPC items, we can ask about the relationship between the two within individual participants. Did participants who satiated on one necessarily also satiate on the other? In particular, did individuals in the smaller set of participants who satiated on CNPCs necessarily also accept *wonder-whether* items by the end of the experiment?

The answer is "no". Overall, there were 55 individuals who satiated on *wonder whether* and 29 who satiated on CNPCs. There were only 14 individuals in the intersection. In other words, there were 15 individuals who satiated on CNPCs but showed no increase in their acceptance of *wonder whether*, and there were 41 individuals who satiated on *wonder whether* but showed no increase in their acceptance of CNPCs. Moreover, five of the individuals who satiated on CNPCs actually rejected *wonder whether* entirely (i.e., in both of Blocks 3 and 4). Hence, the satiation effects for these two sentence types appear to be independent.

#### **7. Comparison of Findings across Studies**

The work in (Snyder 2000) has given rise to a substantial, growing literature on satiation, which includes new findings from experimental studies (e.g., Hiramatsu 2000; Braze 2002; Francom 2009; Goodall 2011; Crawford 2012; Maia 2013; Christensen et al. 2013; Hofmeister et al. 2013; Chaves and Dery 2014; Do and Kaiser 2017), as well as some efforts to apply these findings to issues in theoretical syntax (e.g., Boeckx 2003, Chapter 3; Stepanov 2007).<sup>13</sup> The present section reviews this literature to assess the consistency of findings across studies. Attention will be focused on studies examining one or more of the same English sentence types studied in (Snyder 2000) (and, hence, Experiments I–III), in order to identify the possible effects of methodological differences across studies.<sup>14</sup> (For other recent surveys of the satiation literature, see Sprouse and Villalta 2021; Snyder 2021.)

#### *7.1. Response Equalization?*

First, while Experiment I was the most direct test to date of Sprouse's (2009) REH critique of (Snyder 2000) (as discussed in Section 2, note 7), the conclusions were certainly anticipated by others in the literature. For example, the following four researchers had all reported satiation effects in experiments with a perfect balance of "expected YES" and "expected NO" items: Hiramatsu (2000, Experiment II, henceforth "E2"), Braze (2002), Francom (2009, E2), and Crawford (2012).<sup>15</sup>

#### *7.2. Consistency of Findings and Points of Variation*

For the English sentence types examined in Experiments I–III, the results are very much in line with other studies in the literature. (A synopsis is provided in Tables 4 and 5.) A clear majority detect satiation on argument extraction from *Whether* Islands.<sup>16,17</sup> Sprouse's (2009) A3, B1, and B2 are exceptions and will be discussed in Section 7.3.1.<sup>18</sup>

**Table 4.** Sentence types on which satiation has been induced experimentally. ("ME" = magnitude estimation; "Y/N" = yes/no task.).



**Table 5.** Sentence types on which studies have consistently failed to induce satiation. ("ME" = magnitude estimation; "Y/N" = yes/no task.).

Two other sentence types have sometimes, but not always, shown satiation: CNPCs and Subject Islands. Studies testing CNPCs include Goodall 2011, which found clear satiation, as well as several others that did not.<sup>19</sup> As discussed in Section 5, Experiment III sheds considerable light on this variability; it appears the effect size for CNPCs is much smaller than for *Whether* Islands. Without a sizable number of participants (a bare minimum of *N* = 76, it seems, for the specific design and materials used in Experiment III), there is a high probability of failing to detect satiation on CNPCs (i.e., even if some degree of satiation is occurring). As seen in Table 4, two of the three experiments finding significant satiation on CNPCs (including Experiment III above) had at least 40 participants, while those not finding it all had fewer than 40.

Note that (Sprouse 2009) included four experiments attempting to induce satiation on CNPCs, each with 25 or fewer participants. Individually, these experiments probably had little chance of succeeding, but overall (with 81 participants in total), the chances of detecting it (at least once) were perhaps not so bad. The larger issue may have been that the stimuli in all but one of these experiments (A5) omitted the context sentence, with the result that there was no clear indication of the intended meaning. This may have been a critical difference, because the experiments in Table 4 that succeeded all provided context sentences. If so, the fact that only A5 included context sentences, together with the fact that A5 had only 20 participants, may explain the absence of satiation in Sprouse's experiments.

Turning to Subject Islands, ten of the other studies in Table 4 tested argument extraction from DPs in the subject position (especially DPs that were underlyingly direct objects, as with passives and unaccusatives). Six found satiation, and four did not. One problem in some of the latter studies may have been an insufficient number of exposures. Almost all the studies finding satiation (five out of six) increased the number of exposures beyond the original five in (Snyder 2000).<sup>20</sup> In fact, Hiramatsu (2000, E1), who employed seven exposures, noted that the satiation evident in Block 7 was not yet detectable in Block 5.<sup>21</sup>

For other sentence types examined in Experiments I–III above, the absence of detectable satiation is also largely consistent with the literature (see Table 5). For the LBC, seven other studies tested for satiation, and none found it. Adjunct Islands were checked in eleven other studies, and again, none found satiation.

*That*-trace violations have not in general shown satiation, but more needs to be said. Sprouse (2009, p. 331, Table 2) characterized Hiramatsu (2000) as having found satiation on *That*-trace, but the situation was unclear. Hiramatsu (p. 107) expressed concerns about the quality of her data for *That*-trace and *Want-for* (which were tested only in E1). She reported that multiple participants had eventually begun crossing out the word *that* or *for* and then marking "Yes". Moreover, on p. 111, she seems to disavow the data for these sentences altogether: "As we saw in the previous section, we do not have a clear picture of the results for [...] *That-trace* and *Want*-*for* sentences." Hence, the cautious approach would be to set those findings aside, and the other studies of *That*-trace in Table 5 found no satiation.

In the case of *want-for*, once we set aside Hiramatsu (2000), the main data in the literature (aside from Snyder 2000 and Experiments I–III above, none of which found satiation) come from Francom (2009, E1), who does report satiation. As it happens, Francom employed Snyder's (2000) method of counterbalancing the order of presentation. He did not originally provide information about the consistency of responses across the two orders, but he very kindly shared his data. This made it possible to check whether the change in acceptance was comparable across the two versions.

As it turned out, the evidence for satiation on *want-for* did not satisfy this criterion. Collapsing across the two versions, acceptance increased from 75% in Blocks 1 and 2 to 83% in Blocks 4 and 5 (*W* = −856, ns/r = 61, *Z* = −3.08, *p* < .005). Yet, this change was driven almost entirely by participants receiving Version B.<sup>22</sup> The acceptance on Version B shifted from 71% to 83% (*W* = −390, ns/r = 36, *Z* = −3.06, *p* < .005), but the acceptance on Version A went only from 79% to 83% (*W* = −86, ns/r = 25, *Z* = −1.15, *p* > .10, ns). Hence, the increased acceptance of *want-for* at the end of Francom's experiment was probably due to an accidental property of the presentation order in Version B. By the criteria used in Experiments I–III (specifically, the requirement for the effect to be present in both orders of presentation), the findings do not qualify as reliable evidence of satiation.

In sum, across the studies reviewed here, the sentence types showing a satiation effect have consistently been some combination of *Whether*-Island, CNPC, and Subject-Island violations. At least by the criteria employed here, no study has yielded reliable evidence of satiation on Adjunct-Island, Left-Branch, *That*-trace, or *Want-for* violations.

#### *7.3. Points of Variation in Method*

#### 7.3.1. Experimental Set-Up

Studies attempting to induce satiation have varied somewhat in their experimental set-up. As can be seen in Tables 4 and 5, one potentially important variable is whether a context sentence was provided. The studies that included a context sentence have mostly succeeded in inducing satiation, at least for *Whether* Islands, but the results have been less consistent when it was omitted.<sup>23</sup>

Note that providing a context sentence is one way of conveying the intended meaning of a sentence. Arguably, judgments of linguistic acceptability are always (at least implicitly) relative to an interpretation. For example, the acceptability of the English sentence *John likes him* depends critically on whether *him* is taken to mean the same person as *John*; hence, referential indexing is provided in the literature on binding theory. In other cases, one does find linguists simply placing an asterisk on a sentence without specifying an intended meaning, but in practice, this appears to mean one of two things: either the sentence is unacceptable on what the linguist takes to be the "obvious" interpretation, or the linguist believes the sentence is unacceptable no matter what the intended meaning is. Thus, in an experimental study of linguistic acceptability, one possible effect of including a context sentence is simply facilitation of the judgment task by making it easier for the participant to identify an intended meaning when making the judgment.

Yet, as suggested by an anonymous reviewer, the inclusion of a context sentence (and thus, clarification of the intended meaning) might have an additional, quite important role that would be specific to an experiment on satiation effects. This is because it helps the participant identify one particular way of parsing the test sentence. As will be discussed in Section 8, there could be a number of relevant consequences. For one, having this information could lead a probabilistic parser to increase the expected probability of an uncommon parse (e.g., in the context of *wh*-extraction from a subject island, the probability of a parse positing a gap inside the subject of an embedded clause). Another effect could be helping the participant recognize that adopting a "marked" syntactic option will render the sentence grammatically possible (e.g., in the context of *wh*-extraction from a CP inside an NP, adopting the option—which is potentially a marked option—of treating the CP as a complement to N rather than simply an N-bar adjunct). Thus, there are good reasons to expect that the inclusion of a context sentence might facilitate satiation and, moreover, that the facilitation might apply to certain sentence types more than others.

Another salient point of variation across satiation studies is the nature of the judgment task: Does the participant provide a Yes/No judgment, a rating on a numerical scale, or an estimate of magnitude? Most studies that successfully induced satiation employed a Yes/No task, although Crawford (2012) and Chaves and Dery (2014, E1–2) employed a numerical scale. Sprouse (2009, A1–5) differed in choosing magnitude estimation (ME). At present, it is unclear whether the choice of task affects the findings for satiation—and, if so, why this would be the case. (For a recent discussion of the task characteristics of ME, see Featherston 2021 and the references therein.) What is needed is a side-by-side comparison of these methods within a single satiation study.

As already noted, two other variables appear to be critically important: the number of exposures to each sentence type and the number of participants in the study. The information in Table 4 suggests that satiation on Subject Islands is difficult to obtain unless the number of exposures is at least seven, and that satiation on CNPCs is likewise difficult to obtain unless the number of participants is substantial (a bare minimum of 76 for the specific materials and design in Experiment III). These points will be taken up again in Section 8.<sup>24</sup>
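To see why a small per-participant effect demands many participants, consider a toy Monte Carlo power sketch. Every probability below is invented for illustration; the point is only qualitative: a fixed, genuine effect can go undetected in most small-*N* replications.

```python
# Toy Monte Carlo power sketch (all probabilities invented): why a small
# per-participant satiation effect can easily be missed with few participants.
import math
import random

def sign_test_p(k_up, k_down):
    """One-sided exact sign test: P(X >= k_up) for X ~ Binomial(n, 0.5)."""
    n = k_up + k_down
    return sum(math.comb(n, i) for i in range(k_up, n + 1)) / 2 ** n

def power(n_participants, p_up=0.20, p_down=0.05, n_sims=2000, seed=1):
    """Fraction of simulated experiments detecting satiation at alpha = .05.
    Each participant's judgment rises with probability p_up and falls with
    probability p_down; the null hypothesis is that rises and falls are
    equally likely among the participants who change at all."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        k_up = k_down = 0
        for _ in range(n_participants):
            r = rng.random()
            if r < p_up:
                k_up += 1
            elif r < p_up + p_down:
                k_down += 1
        if k_up + k_down > 0 and sign_test_p(k_up, k_down) < 0.05:
            hits += 1
    return hits / n_sims

# With a weak effect, quadrupling N changes the detection odds dramatically:
# power(20) is modest, while power(80) detects the same effect far more often.
```

The specific probabilities and the sign-test criterion are placeholders, not the analyses used in any of the studies reviewed here; the sketch illustrates only the general relationship between effect size, sample size, and the chance of a null result.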

#### 7.3.2. Variation in the Stimuli

Another salient point of variation concerns the detailed syntax of the test sentences. For example, Hiramatsu (2000) contrasted two types of Subject-Island violations, involving extraction from a subject DP that was either the underlying object or the underlying subject of a transitive verb. Interestingly, she found satiation only with underlying objects. In a similar vein, she contrasted the extraction of arguments versus adjuncts from a *Whether* Island and found satiation only for arguments.

#### 7.3.3. Variation in Data Analysis

Studies have varied in their statistical methods, but the differences seem to be immaterial. Francom (2009, pp. 32–35) applied sign tests, paired *t*-tests, ANOVA, and logistic regression, with identical results. Likewise, the datasets from Experiments I–III in this paper were analyzed both with mixed-effects logistic regression and with a more traditional method (the Wilcoxon Signed-Rank Test), and the results were effectively identical.
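The claim that the choice of statistical method is immaterial can be illustrated with a self-contained sketch. The data below are invented, and the bare-bones fitting routine is a stand-in for the mixed-effects models used in actual satiation studies: a logistic regression of Yes/No acceptance on block number recovers a positive slope exactly where a simple early-versus-late comparison also shows rising acceptance.

```python
# Illustrative sketch (invented data): a minimal logistic regression of
# Yes/No acceptance on block number, fit by gradient ascent. A positive
# slope on block number corresponds to satiation.
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit p(yes) = sigmoid(a + b*x) by gradient ascent on the log-likelihood."""
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    return a, b

# Invented Yes/No judgments (1 = accept), four items per block, blocks 1-5:
judgments = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]]
xs = [block for block, accs in enumerate(judgments, start=1) for _ in accs]
ys = [y for accs in judgments for y in accs]
a, b = fit_logistic(xs, ys)  # b > 0: acceptance rises across blocks
```

On data like these, the sign of the fitted slope agrees with what a nonparametric early-versus-late comparison would indicate, which is the sense in which the choice among such methods is immaterial here.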

In contrast, what is clearly of great importance is confirming that one's data are internally consistent: Do we see the consistency across orders of presentation that we should expect for an effect at the level of grammatical structure? As illustrated in Experiment III, this common-sense check can have a critically important influence on the conclusions drawn. Furthermore, in Section 7.2, this check helped eliminate an apparent conflict across studies in the findings for *want-for*.

#### **8. Directions for Future Research**

*8.1. Satiation as a Diagnostic Test*

We now return to the question of how investigating satiation could benefit generative linguistics. First, a number of potentially valuable ways to apply our current knowledge of satiation might follow from a proposal made by Goodall (2011, p. 35):

[I]f one unacceptable sentence type is satiation-inducing and another is not, it is unlikely that their unacceptability is attributable to the same underlying principle. This suggests, for instance, that violations of *whether* islands, which are susceptible to satiation, and *that*–trace violations, which are not, must be due to different underlying principles, in accord with the general consensus in the literature about these two phenomena.

Following this line of reasoning, and incorporating the findings discussed in this article, one can see a number of immediate applications. Whenever a linguistic theory (be it a theory in syntax, semantics, or morphophonology) posits a single source for the unacceptability of two different sentence types X and Y, testable predictions immediately follow.

For example, one possibility would be to run a pair of studies modeled on Experiment III. In one of the studies, we add a single example of sentence type X to each block. In the second study, we use examples of Y in place of X. Upon completion, we check to see if X and Y are alike (or disparate) in whether their initial unacceptability satiates. If satiation is present for both, we can also check whether the number of exposures required for satiation is comparable across X and Y, and we can check whether the percentage of participants who show a change in their judgment is comparable across X and Y.

If the satiation findings for X and Y are either highly similar, or highly dissimilar, the interpretation will be straightforward. More complex (and, no doubt, more interesting) will be the intermediate cases, where some of the diagnostics come out as expected under the hypothesis of a single source of unacceptability and others do not. This sort of mixed case might, for example, indicate that X and Y overlap only partially in the factors rendering them (initially) unacceptable.

Yet, there are a number of ways for the ideas just sketched to be too simplistic. In particular, there is a tacit assumption that the underlying source of unacceptability is the thing undergoing change. Suppose, for example, that a specific UG constraint on syntactic movement is what is rendering both X and Y unacceptable. If this constraint is somehow weakened by satiation, then both X and Y should become more acceptable. However, suppose that the UG constraint is immune to satiation and that something else is changing. For example, perhaps a speaker can learn to reanalyze structure X as a superficially similar but syntactically distinct structure X-prime, to which the UG constraint does not apply. If the reanalysis operation depends on surface characteristics that are present on X but absent from Y, only X will be able to satiate, even though the cause of the initial unacceptability of X and Y is exactly the same.

#### *8.2. Explaining Satiation*

Before we try to use satiability as a diagnostic, we will naturally want to know as much as we can about what exactly satiation is. A logical starting point is to ask whether satiation is a unitary phenomenon. Is there essentially the same process at work in every example of a sentence type that satiates (according to the operational definition of satiation in 3)? Alternatively, are there different mechanisms at work in different sentence types?

The findings in this article can at least help narrow down the possible answers. Consider the following "strongly unitary" scenario:

*Scenario 1.* Suppose that a kind of "mental alarm" goes off whenever a person's language-processing mechanisms are forced to postulate a grammatically deviant structure for a linguistic expression. Let's assume that the alarm system is highly similar from one speaker to another; the strength of the alarm varies along a single, smoothly continuous dimension, and violations of different grammatical constraints all trigger the same alarm, although the strength of the resulting alarm signal may vary with the type of violation. If so, satiation could perhaps be a kind of habituation effect: perhaps repeatedly experiencing a certain level of alarm, over a certain period of time, can make one tolerant.

Under Scenario 1, no matter which sentence types undergo satiation, the mechanism is exactly the same: habituation to alarm signals of a certain magnitude. Grammatical constraints associated with a weak signal should always satiate prior to constraints with a stronger signal. Indeed, satiation on a constraint with a strong signal should yield satiation not only on sentences violating that particular constraint but also on sentence types yielding weaker signals, even if those sentence types violate different constraints and even if those sentence types have never actually been encountered.
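Scenario 1's ordering prediction can be made explicit with a toy model; all numbers here are invented. A single tolerance level rises with exposure, and a sentence type is accepted once tolerance reaches its alarm strength, so types with weaker signals can never require more exposures than types with stronger signals.

```python
# Toy formalization of Scenario 1 (all numbers invented): a single, shared
# "alarm" system, where repeated exposure raises a global tolerance level.
# A sentence type stops triggering rejection once tolerance >= its alarm
# strength, so weaker-signal types must satiate no later than stronger ones.

ALARM = {"wonder_whether": 3.0, "cnpc": 5.0}  # hypothetical signal strengths

def exposures_to_satiate(strength, tolerance_per_exposure=0.5):
    """Exposures needed before the global tolerance reaches `strength`."""
    tolerance, n = 0.0, 0
    while tolerance < strength:
        tolerance += tolerance_per_exposure
        n += 1
    return n

# Scenario 1's prediction: anyone who has satiated on the stronger CNPC
# signal has a tolerance that also covers the weaker wonder-whether signal.
assert exposures_to_satiate(ALARM["cnpc"]) >= exposures_to_satiate(ALARM["wonder_whether"])
```

Because Experiment III found participants who satiated on CNPC violations while still rejecting *wonder whether* items, any model with this single-dimension structure is disconfirmed, regardless of the particular numbers chosen.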

The evidence presented in this article speaks against an account along these lines. Specifically, the fact that satiation on CNPC violations in Experiment III was found in a much smaller percentage of participants than satiation on *wonder whether* more or less forces us to conclude, under Scenario 1, that CNPC violations elicit a "louder" alarm signal than *wonder whether* violations. Hence, there is a strong prediction that every single individual who satiated on CNPC violations by the end of Experiment III must have ended up accepting *wonder whether* items as well. At the end of Section 6, however, it was noted that five of the 29 participants who satiated on CNPC violations actually rejected both of the *wonder whether* items in Blocks 3 and 4.

In place of a strongly unitary account, one might consider a "weakly unitary" account along the following lines:

*Scenario 2.* Suppose the language processor has a number of distinct alarm signals, each of which indicates the violation of a different grammatical constraint. In this case we might once again imagine that satiation results from habituating to an alarm signal (and, hence, that satiation is unitary in a certain sense), but now, satiation will proceed independently for different grammatical constraints (i.e., as a separate process of habituation for each of several different alarms). Satiation on a given constraint will require exposure to sentences violating that particular constraint.

Note that, under Scenario 2, the number of exposures required before full habituation occurs might still vary as a function of the constraint in question if (for example) some constraints have "louder" alarms than others.

Is Scenario 2 compatible with the evidence from Experiment III? This depends on our assumptions. If we assume—as seems fair—that, prior to the experiment, the participants had no exposure to either CNPC violations or *wonder whether* violations, and if we assume that each exposure during the experiment is equally effective at promoting habituation, then the same prediction that defeated Scenario 1 will probably exist for Scenario 2. Specifically, by the end of the experiment, every participant will have encountered the same number of CNPC violations and *wonder whether* violations; if that number (of CNPC violations) is sufficient to create habituation on the alarm signal for CNPCs (again assuming that these are the more difficult sentence type to satiate), then the same number (of *wonder whether* violations) should be sufficient to produce habituation on the (distinct, but weaker) alarm signal for *wonder whether*.

Yet, the prediction will change if we assume, for example, that habituation to a given alarm signal requires not only some number of encounters with relevant examples but also some particular internal state in the experimental participant (perhaps something like introspective awareness) that fluctuates from moment to moment. In this case, it might be possible, simply by chance, for a participant to have "genuinely" experienced a smaller number of alarm signals for *wonder whether* violations by the end of the experiment than alarm signals for CNPC violations.

In any case, a complete alternative to Scenarios 1 and 2 would be Scenario 3, which sketches a strongly non-unitary model.

*Scenario 3.* Suppose that different satiable constraints may owe their satiability to different mechanisms. Perhaps, in some cases, satiation results from habituation to a particular constraint's alarm signal, but in other cases, it results from, say, discovering an alternative syntactic analysis of a particular sentence type. For example, perhaps CNPC violations involving *wh*-extraction across ...*believe the claim that*... are usually assigned an "unmarked" structure in which the CP is treated as an appositive (i.e., an N-bar adjunct), but UG also permits another, more marked analysis (at least for epistemic nominals, like *claim* and *idea*) in which the CP is a complement selected by N. In terms of Chomsky's (1986) *Barriers* system, the appositive analysis forces the *wh*-phrase to cross two barriers (the lower CP, which is not L-marked, as well as the NP above it, which is a barrier by inheritance). In contrast, no barrier will be crossed if the lower CP is selected by the N. Hence, in this case, satiation is not habituation but, rather, the discovery of a new, UG-compatible (but "marked") parse, which (by hypothesis) was not being exploited before.

Under Scenario 3, it is perhaps less surprising (than under Scenarios 1 and 2) to find participants who have satiated on CNPC violations but who still firmly reject *wonder whether* violations. If satiation on CNPC violations results from a sudden (tacit) insight into UG-compatible structures but satiation on *wonder whether* violations results (say) from the gradual accumulation of a particular volume of experience over time, then an individual can easily satiate on one and not the other.25

At this point, it is interesting to note that Chaves and Dery (2018) proposed an explicit model of satiation on Subject Island (SI) violations, and their model seems far more compatible with Scenario 3 than with Scenarios 1 and 2. This is because their work does not address satiation on sentence types other than SIs, and the proposed mechanism of satiation appears to be specific to SIs. In brief, Chaves and Dery argued that SI violations are not ungrammatical but merely difficult to parse. They assumed the parsing difficulty results from "the fact that subject-embedded gaps are pragmatically unusual—as the informational focus does not usually correspond to a dependent of the subject phrase—and are therefore highly contrary to comprehenders' expectations about the distribution of filler gap dependencies" (Chaves and Dery 2018, p. 1). In their view, comprehenders' expectations can change fairly rapidly with exposure to clear examples of subject-embedded gaps.

Thus, the Chaves–Dery mechanism seems like a plausible candidate for a source of satiation that is specific to SI violations. Let's suppose this proposal is correct for SI violations. Then, as suggested above, perhaps satiation on CNPC violations will turn out to involve discovering a new, UG-compatible (but ordinarily nonpreferred) parse for a CP following an epistemic nominal. Perhaps satiation on extraction from certain *wh*-islands will turn out to involve habituating to a mental "alarm" triggered by a certain type of grammatical violation. This type of non-unitary scenario leads to distinctive predictions, such as the strict absence of satiation carryover effects between sentences of these three types. Experimental tests of such predictions would be a reasonable next step for research into the nature of satiation.

In conclusion, Experiments I–III have provided evidence (i) that experimentally induced satiation, like the satiation that sometimes affects linguists, is restricted to a small, stable set of sentence types; (ii) that, after satiation on one sentence type (e.g., *wh*-movement across ...*wonder whether*... or ...*believe the claim*...), acceptability sometimes increases for distinct but syntactically related sentence types (such as ...*wonder why*... or ...*accept the idea*...); (iii) that, for sentence types susceptible to satiation, the difficulty of inducing it (e.g., the number of exposures required) varies systematically; and (iv) that, much as satiation in linguists persists over time, experimentally induced satiation (at least in the case of *wonder whether*) can persist for at least four weeks. These findings may suggest an eventual role for satiation in determining whether the perceived unacceptability of two distinct sentence types has a common source, but more immediately, they suggest that satiation may be a powerful tool for examining the tacit mental operations that are responsible for our judgments of linguistic (un)acceptability.

**Funding:** This work was supported by the National Science Foundation under NSF IGERT DGE-1144399 and grant DGE-1747486.

**Institutional Review Board Statement:** This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the University of Connecticut.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Anonymized data are available from the author.

**Acknowledgments:** The author is grateful for helpful discussions with (among others) Dave Braze, Rui Chaves, Jean Crawford, Jerrod Francom, Grant Goodall, Kazuko Hiramatsu, and Whit Tabor. The author also thanks the anonymous reviewers for their numerous astute suggestions.

**Conflicts of Interest:** The author declares no conflict of interest. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

#### **Notes**


Scenarios 1 and 2), and they might be different still for the mechanism that Chaves and Dery proposed for Subject Islands (in terms of changing the probability associated with a given parse in a probabilistic parser). In fact, as noted at the end of Section 4.3, the persistence of satiation on *wonder whether* in Experiment II is strongly suggestive of a learning effect rather than the sort of habituation to an alarm signal (i.e., for subjacency violations or the like) suggested in Section 8.2.

#### **References**


Boeckx, Cedric. 2003. *Islands and Chains: Resumption as Stranding*. Amsterdam: John Benjamins.


Chacón, Dustin A. 2015. Comparative Psychosyntax. Doctoral dissertation, University of Maryland, College Park, MD, USA.

Chaves, Rui P., and Jeruen E. Dery. 2014. Which subject islands will the acceptability improve with repeated exposure to? Paper presented at the 31st West Coast Conference on Formal Linguistics, Tempe, Arizona, USA, 8 February 2013. Edited by Robert E. Santana-LaBarge. Somerville: Cascadilla Proceedings Project, pp. 96–106.

Chaves, Rui P., and Jeruen E. Dery. 2018. Frequency effects in Subject Islands. *Journal of Linguistics* 55: 475–521. [CrossRef]

Chomsky, Noam. 1986. *Barriers*. Cambridge: MIT Press.


Ross, John Robert. 1967. Constraints on Variables in Syntax. Doctoral dissertation, MIT, Cambridge, MA, USA.


Snyder, William. 2000. An experimental investigation of syntactic satiation effects. *Linguistic Inquiry* 31: 575–82. [CrossRef]

Snyder, William. 2021. Satiation. In *The Cambridge Handbook of Experimental Syntax (Cambridge Handbooks in Language and Linguistics)*. Edited by Grant Goodall. Cambridge: The Cambridge University Press, pp. 154–80.

Sobin, Nicholas. 1987. The variable status of Comp-trace phenomena. *Natural Language & Linguistic Theory* 5: 33–60.

Sprouse, Jon. 2009. Revisiting satiation: Evidence for an equalization response strategy. *Linguistic Inquiry* 40: 329–41. [CrossRef]

Sprouse, Jon, and Sandra Villalta. 2021. Island effects. In *The Cambridge Handbook of Experimental Syntax (Cambridge Handbooks in Language and Linguistics)*. Edited by Grant Goodall. Cambridge: The Cambridge University Press, pp. 227–57.

Stepanov, Arthur. 2007. The end of CED? Minimalism and extraction domains. *Syntax* 10: 80–126. [CrossRef]

## *Article* **Sources of Discreteness and Gradience in Island Effects**

**Rui P. Chaves**

Linguistics Department, University at Buffalo, Buffalo, NY 14260-1030, USA; rchaves@buffalo.edu

**Abstract:** This paper provides an overview of categorical and gradient effects in islands, with a focus on English, and argues that most islands are gradient. In some cases, the island is circumvented by the construction type in which the extraction takes place, and there is growing evidence that the critical factor is pragmatic in nature, contrary to classic and categorical accounts of island effects that are favored in generative circles to this day. In other cases, the island effect is malleable and can weaken with increased exposure to the extraction pattern, a phenomenon traditionally referred to as 'syntactic satiation'. However, it is not clear what satiation consists of. Some argue that it is nothing more than task adaptation (mere increased familiarity with the experimental paradigm, impacting difficult sentences more than easy ones), whereas others propose that it is an error-driven, structure-dependent form of learning. The present paper discusses this controversy, and the broader adaptation debate, and argues that both task adaptation and grammatical adaptation are taking place during the processing of complex sentences, and that both frequency and attention are plausible factors to stimulate adaptation.

**Keywords:** Islands; satiation; frequency; adaptation

#### **1. Introduction**

There is growing evidence that repeated exposure to infrequent syntactic structures can lead to adaptation, as measured in faster reading times and/or increased acceptability. For example, certain illicit *wh*-movement structures known as 'islands' (Ross 1967) can become more acceptable, and are processed faster, with repeated exposure, a phenomenon often referred to as *syntactic satiation* (Snyder 1994, 2000; Stromswold 1986). The precise nature of syntactic satiation is not known. It could be an instance of *task adaptation* (Heathcote et al. 2000), i.e., mere increased familiarity with the experimental paradigm, perhaps impacting difficult sentences more than easy ones, as argued by Prasad and Linzen (2021). Alternatively, it could be *syntactic adaptation* (Chang et al. 2006, 2012; Fine et al. 2010, 2013; Fine and Jaeger 2013; Sikos et al. 2016), an error-driven, structure-dependent form of statistical learning whereby unexpected structures cause the processor to adapt to the contingencies of the input, or a combination of the two. Such changes in behavior are important because they can shed light on whether grammar is gradient, and fundamentally probabilistic, or categorical after all. This in turn is connected to broader questions about how language changes, and how it is learned by children as well as adults.

In Section 2, I provide an overview of the evidence suggesting that there are two major kinds of island phenomena. On the one hand, we have categorical effects, which are due to some strict (syntactic or semantic) grammatical constraint; on the other, we have gradient effects, which are to a large extent caused by contextual or expectation-based factors. In some islands, there is a confluence of both types of phenomena, which are difficult to disentangle. In Section 3, I turn to amelioration effects caused by repeated exposure, which is a selective phenomenon, as certain island violations are more susceptible to amelioration than others. Several kinds of account for this effect are surveyed, and it is argued that Brown et al. (2021) are incorrect in regarding all satiation as a form of task adaptation having nothing to do with grammar or island phenomena. To further disentangle task adaptation from syntactic adaptation, I describe a self-paced reading

**Citation:** Chaves, Rui P. 2022. Sources of Discreteness and Gradience in Island Effects. *Languages* 7: 245. https://doi.org/ 10.3390/languages7040245

Academic Editors: Anne Mette Nyvad and Ken Ramshøj Christensen

Received: 4 March 2022 Accepted: 7 September 2022 Published: 21 September 2022



experiment, using a classic garden-path effect, to show that increased reward leads to more adaptive behaviour in the critical region. The experiment suggests that fine-grained error-driven learning is taking place, and that frequency can compound with reward to speed up the processing of complex sentences, over and above task adaptation.

#### **2. Discreteness and Gradience**

It has become increasingly clear that island effects are not created equal, and lie on a continuum constrained by multiple factors. At one end, we have islands that are categorical and exceptionless. These are construction-invariant (i.e., are active in any construction that involves unbounded extraction), immune to any form of principled circumvention (e.g., parasitism), insensitive to contextualization, and do not weaken with repeated exposure (i.e., satiation).<sup>1</sup>

A good example of how disparate island phenomena can be is the Coordinate Structure Constraint (Ross 1967), which is composed of two separate parts. One bans extraction *from* conjuncts, called the Element Constraint (Grosu 1973), and the other one bans extraction *of* conjuncts, named the Conjunct Constraint (Grosu 1973). There is good evidence that the two constraints are due to fundamentally different factors. Let us focus on the Conjunct Constraint first, illustrated in (1). This constraint is construction-invariant, since it arises in any kind of filler-gap dependency construction, be it interrogative, declarative or subordinate.

	- b. \*Who did you see and Robin yesterday?
	- c. \* It was Alex who you saw Robin and yesterday.
	- d. \* It was Alex who you saw and Robin yesterday.
	- e. \*The person who you saw and Robin yesterday was Alex.
	- f. \*The person who you saw Robin and yesterday was Alex.

All of the sentences in (1) become acceptable if the conjunction 'and' is replaced with a comitative like 'with', which serves to indicate that it is the coordination that hampers extraction. To my knowledge, nothing can improve the acceptability of Conjunct Constraint violations. This includes Across-the-Board (ATB) extraction, as in (2).<sup>2</sup>

	- b. \* It was Alex who you saw and yesterday.

The insensitivity to ATB extraction is noteworthy because ATB extraction circumvents the part of the Coordinate Structure Constraint that bans extraction from conjuncts, the Element Constraint. This is illustrated in (3).

	- b. \*Who did you say Alex dislikes __ and Mia absolutely loves Robin?
	- c. Who did you say Alex dislikes __ and Mia absolutely loves __?

What is more, filler-gap dependencies like (3a,b) can become more acceptable if the conjunction is interpreted asymmetrically (Kehler 2002; Lakoff 1986; Na and Huck 1992), as illustrated in (4). Here, the order of the conjuncts matters for the interpretation. For example, in (4a) the first conjunct is a preparatory action for the second conjunct, which expresses the main assertion. In (4b) the second conjunct is a consequence of the first, and in (4c) we have a more complex case of the same kind of pattern. No such meaning-based amelioration can salvage Conjunct Constraint violations.

	- b. How much can you drink __ and still stay sober?
	- c. What did Harry buy __, come home, and devour __ in thirty seconds?

Taken together, the foregoing data tell us that the Conjunct Constraint and the Element Constraint are due to very different factors. The former constraint is brought about by coordination itself (conjuncts cannot be extracted), which can be explained if conjunctions are markers that attach to heads rather than heads that select arguments (Abeillé and Chaves 2021; Chaves 2007). The Element Constraint, in contrast, seems to be caused by the symmetric interpretation of coordination, which can be predicted by independently motivated pragmatic factors; see Kehler (2002, chp. 5) and Kubota and Lee (2015) for a more detailed discussion.

Another island type that resists any form of amelioration is the Left Branch Condition (LBC). This prohibits the extraction of determiner expressions in languages like English, as seen in (5). Nothing can ameliorate the effect, including repeated exposure (Francom 2009; Goodall 2011; Hiramatsu 2000; Snyder 2000, 2017; Sprouse 2007, 2009).

	- b. \*Which did you buy book? (cf. 'Which book did you buy?')
	- c. \*It was Robin's I liked painting the most. (cf. 'It was Robin's painting I liked the most.')

Since English LBC effects appear in any construction (relative clauses, declaratives, and interrogatives), and are not subject to contextual amelioration of any kind, they are a good candidate for a syntactic constraint on extraction.

Languages that apparently allow LBC violations, like most Slavic languages, do not have determiners (Uriagereka 1988, p. 113), and therefore the extracted phrase is in apposition to the nominal head. No LBC violation occurs. This is best illustrated by languages, like French, that obey the LBC but have a special construction in which such extractions are apparently possible (Corver 2014). Consider the contrast illustrated by (6a,b).

(6) a. \**Quels* which *avez-vous* have-you *acheté* bought *livres?* books

'Which books have you bought?'

b. *Combien* how-many *a-t-il* has-he *vendu* sold *de* of *livres?* books

'How many books did he sell?'

There are good empirical reasons to believe that there is no LBC violation in (6b). The phrase *de livres* is a post-verbal NP in French, and *combien* behaves more like a nominal than a canonical quantifier (Abeillé et al. 2004; Kayne 1981), since the *de*-phrase can appear without *combien* in the presence of other licensors, including the preposition *sans* ('without') or negation, e.g., *Paul n'a pas lu* [*de livres*] ('Paul did not read any books'). If *combien* and the *de*-phrase are autonomous, then that means that no LBC violation occurs in (6b). I suspect something analogous occurs in Slavic languages.

There are other construction-invariant and categorical island effects, to be sure, such as the Preposition Stranding Ban in most languages that have prepositions, with the exception of some Germanic languages (including Danish, Dutch, English, Frisian, Norwegian and Swedish), as well as Berber, Hungarian, and Zoque (Emonds and Faarlund 2014, pp. 84–96).

At the other end of the spectrum we have island effects that are construction-specific (i.e., are only active in certain types of unbounded dependency construction), permit systematic circumvention, exhibit varying degrees of acceptability depending on the exact wording (e.g., the plausibility of the content expressed, parsing difficulty caused by lexical ambiguity, garden paths, infrequent words, and/or stylistic issues), and can weaken with repeated exposure. According to the survey in Chaves and Putnam (2021), this is true of the majority of known island effects; cf. with Szabolcsi and Lohndal (2017). In what follows I provide a brief overview of a number of island effects which are graded, malleable, and construction-specific.

#### *2.1. Subject Islands*

Subject Island violations, like the one in (7a), famously vanish with the presence of a second gap (Engdahl 1983) as illustrated by (7b), but see Chaves and Dery (2019) for concerns about such a paradigm.

	- b. Who did [the opponents of __] assassinate __?

The standard view that the second gap rescues the first by virtue of being outside the island is dubious, as Levine and Sag (2003), Levine and Hukari (2006, p. 256), and Culicover (2013, p. 161) note, because of examples like (8) in which both gaps are Subject Island violations. Such constructions should be completely ungrammatical.

(8) This is a man who [friends of __] think that [enemies of __] are everywhere.

More recently it has also been shown that Subject Island effects can vanish if the extraction is from a relative clause, as in (9); these are attestations found by Culicover and Winkler (2022). See also Abeillé et al. (2020) for supporting experimental evidence.

	- b. I'm looking for someone who I click with. You know, the type of person who*<sup>i</sup>* [spending time with *<sup>i</sup>*] is effortless. [https://3-instant.okcupid.com/profile/mpredds, accessed on 7 January 2020]
	- c. Survived by her children, Mae (Terry), Dale (Andelyn), Joanne (Gary), Cathy (Jordan), George, Betty (Tim), Danny (Angela); a proud grandmother of 14 grandchildren and 16 great-grandchildren, who*<sup>i</sup>* [spending time with *<sup>i</sup>*] was one of her finest joys; [http://www.mahonefuneral.ca/obituaries/111846, accessed on 7 January 2020]

Attestations involving extraction from subject-embedded verbal structures are shown in (10). To my knowledge, there are no attested Subject Island violations that do not involve extraction from the subject of a relative clause.

	- b. In his bedroom, which*<sup>i</sup>* [to describe *<sup>i</sup>* as small] would be a gross understatement, he has an audio studio setup. (Chaves 2012, p. 471)

	- c. Leaving the room, she is quick to offer you some Arabic coffee and dates which*<sup>i</sup>* [to refuse *<sup>i</sup>*] would be insane because both are delicious, and an opportunity to relax and eat is welcome when working twelve hours. [www.thesandyshamrock.com/being-an-rt-in-saudi-arabia/, accessed on 7 January 2020]

Still, various authors such as Ross (1967, p. 242), Kluender (1998, p. 268), Hofmeister and Sag (2010, p. 370), Sauerland and Elbourne (2002, p. 304), Jiménez-Fernández (2009, p. 111), and Chomsky (2008, p. 160, ft. 39), among others, have noted that slight rewording can attenuate Subject Island effects in interrogatives, as (11) illustrates.

	- b. Which President would [the impeachment of __] not shock most people? (Chaves and Putnam 2021, p. 80)
	- c. Which problem will [a solution to __] never be found? (Chaves and Dery 2014)

Interrogative Subject Island violations like the above sometimes ameliorate with repetition (Chaves and Dery 2014; Clausen 2011; Francom 2009; Goodall 2011; Hiramatsu 2000; Lu et al. 2021). According to Chaves and Dery (2019), extractions from subjects like (12a) are initially less acceptable than their object counterparts in (12b), but as the experiment progressed the former became more acceptable, and by 12 exposures the two types of extraction were equally acceptable. This was replicated by Chaves and Putnam (2021, p. 213).

	- b. Which stock does the value of the dollar often parallel [the price of __]?

The authors ensured that the acceptability differences in (12) were due to extraction (rather than to lexical biases, semantic plausibility, complexity, pragmatics, etc.), by making sure that their declarative counterparts shown in (13) were truth-conditionally near synonymous and expressed highly plausible propositions to begin with. This was done via a sentence acceptability norming experiment, with different participants.

	- b. The value of the dollar often parallels the price of this stock.

Since the items expressed essentially the same proposition, this design avoided the concern raised by Kim (2021) about the factorial design adopted by Sprouse (2007), which does not control for important non-syntactic factors and therefore has limited ability to identify the exact nature of island effects. Chaves and Dery (2019) also compared acceptability and the online processing of near-synonymous sentence pairs like (12), which express essentially the same proposition. Any acceptability differences must come from the extraction itself.

The fact that no such dramatic acceptability increase was observed in the ungrammatical controls (including in a later replication by the same authors) suggests that Subject Island effects can vanish under ideal conditions, that is, if the items are not too complex, express highly plausible propositions, and participants are sufficiently exposed to such structures. A similar effect was also observed in terms of reduced reading times around the gap site in a subsequent experiment in Chaves and Dery (2014). In other words, speakers can adjust to unusual gaps and the associated semantic-pragmatic consequences. The asymmetry between subject and object subextraction is not categorical, and can be countered in ideal conditions.

The conclusion is that English Subject Islands are most likely not purely syntactic phenomena. The effect is not present in relative clauses, and is sometimes graded elsewhere. But what, then, is behind such otherwise strong islands? One possibility is that extraction from subject phrases is dispreferred when the subject is expected to be discourse-old. Subject phrases are typically used for topic continuity (Chafe 1994; Kuno 1972; Lambrecht 1994). For example, subject phrases are more likely to be pronominal or elliptical than objects (Michaelis and Francis 2007). Consequently, there is a conflict between the discourse function of the extracted element (focus) and the discourse function of the subject phrase itself (Abeillé et al. 2020; Erteschik-Shir 2006b; Goldberg 2006; Takami 1988; Van Valin 1986). Extracting from a discourse-old subject not only involves a structurally unexpected move, so to speak, but also a contextually unusual state of affairs, one in which a discourse-old referent is linked to a subordinate referent that can be the focus. No such contradiction arises in relative clauses, because their subjects are under no obligation to be a main topic or focus.

According to Kluender (2004, p. 495), 'Subject Island effects seem to be weaker when the *wh*-phrase maintains a pragmatic association not only with the gap, but also with the main clause predicate, such that the filler-gap dependency into the subject position is construed as of some relevance to the main assertion of the sentence'. In other words, the subject-embedded referent must contribute to the interpretation of the main predication. For example, in (11a) the extraction is licit because whether or not the attempt to find *x* ends in failure crucially depends on the identity of *x*; the search failed precisely because of the nature of what was sought. Similarly, whether or not an impeachment shocks most people crucially depends on the one that is impeached, and whether or not a solution is found crucially depends on the identity of the problem.

Chaves and Putnam (2021, p. 228) found supporting experimental evidence for such a relevance constraint. A total of 20 experimental items were constructed, each of which had two versions, as seen in (14). The extracted referents in the –Relevant condition are less important for the situation described by the sentence as compared to the items in the +Relevant condition.

	- b. Which joke was [the punchline of __] overheard by the teacher? (–Relevant)

To ensure that the +Relevant condition items were indeed more relevant than the –Relevant condition items, a norming experiment was conducted in which a different group of participants used a 5-point Likert scale to state to what extent they agreed with statements like (15), created from the 20 original experimental items.

(15) Whether the punchline of a joke is [offensive / overheard by the teacher] depends on what the joke is.

To ensure that any difference in acceptability between the item pairs was due to extraction and not to semantic or pragmatic differences, a norming experiment was conducted to measure the acceptability of the declarative counterparts of the 20 items, illustrated in (16). The goal of this task was to ensure that the non-extracted counterparts of the items were equally acceptable to begin with.

(16) The punchline of this joke was extremely offensive/overheard by the teacher.

After these norming experiments, acceptability ratings were collected for the 20 Subject Island items like (14). A Cumulative Bayesian Linear Regression model with sentence acceptability ratings as the dependent variable and the mean relevance ratings per item from the questionnaire experiment as the independent variable (allowing the intercept to vary with items, and with declarative acceptability ratings as random effects) found a significant effect (*β* = 0.08, *SD* = 0, CI = [0.07,0.08], P(*β* > 0) = 1). These results suggest that the more important the extracted referent is for the proposition described by the utterance, the more acceptable the subject subextraction. This is consistent with the view that not all subject-embedded referents are equally biased to be assigned the same pragmatic function as the subject referent: this depends on the predication, the proposition, and the context. Moreover, whether or not a referent embedded in a discourse-old subject is interpreted as new and as having an impact on the main predication is a matter of degree, and therefore it is not surprising that with repeated exposure such constructions sometimes become more and more acceptable. To conclude, Subject Islands are not construction-invariant, and even when they are active, their effect is gradient. Although a syntactic account may be possible, stipulating that in certain constructions extraction is allowed, it is unclear how such an account can *explain* why things are the way they are on independently motivated grounds.
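For readers unfamiliar with cumulative (ordinal) regression, the link function underlying such a model can be sketched as follows. This is a minimal illustrative reconstruction, not the actual model or data from the study: the threshold values are made up, and only the slope *β* = 0.08 is taken from the reported results.

```python
import numpy as np

def cumulative_logit_probs(eta, cutpoints):
    """Return P(rating = k) for each ordinal category k under a
    cumulative-logit (proportional-odds) model.

    eta: the linear predictor, e.g. beta * relevance_rating
    cutpoints: sorted thresholds separating the K rating categories
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    # P(rating <= k) = logistic(c_k - eta)
    cum = 1.0 / (1.0 + np.exp(-(cutpoints - eta)))
    cum = np.append(cum, 1.0)                  # P(rating <= K) = 1
    return np.diff(np.concatenate(([0.0], cum)))

# Hypothetical illustration: a positive slope means that higher
# relevance ratings shift probability mass toward higher
# acceptability categories.
beta = 0.08                          # slope reported in the text
cuts = [-1.0, 0.0, 1.0, 2.0]         # made-up thresholds, 5 categories
low = cumulative_logit_probs(beta * 1.0, cuts)   # low relevance item
high = cumulative_logit_probs(beta * 5.0, cuts)  # high relevance item
# Expected acceptability (mean category index) rises with relevance:
print((low * np.arange(5)).sum() < (high * np.arange(5)).sum())  # True
```

The point of the sketch is only that, under a cumulative link, even a small positive slope systematically raises the expected acceptability rating as relevance increases, which is what the reported effect amounts to.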

#### *2.2. Adjunct Islands*

A similar situation arises in connection with Adjunct Islands. First, they can be circumvented by the presence of a secondary gap (Engdahl 1983), as illustrated by (17) and (18). But these sentence pairs have radically different meanings, and therefore it is not clear in which sense the main gap can be said to rescue the secondary gap. Indeed, it is well-known that such environments are not categorical boundaries to extraction, given examples like (19).

	- b. Which colleague*<sup>i</sup>* did John slander *<sup>i</sup>* [because he despises *<sup>i</sup>*]?
	- b. What*<sup>i</sup>* do you think Robin computed the answer [with *<sup>i</sup>*]? (Bouma et al. 2001, p. 45)
	- c. Which movies*<sup>i</sup>* does Sean Bean die [in *<sup>i</sup>*]? (Chaves and Putnam 2021, p. 87)
	- d. Which temperature*<sup>i</sup>* should I wash my jeans [at *<sup>i</sup>*]? (Chaves 2013)

There is no independently motivated empirical reason to assume that these adjuncts combine with their VP heads in different ways (Truswell 2011), which suggests that syntax is not the source of the island effect. Müller (2017) provides sentence acceptability evidence from Swedish suggesting that extraction from tensed adjuncts is contingent on the degree of semantic-pragmatic cohesion between the matrix and adjunct clauses, and similar results are reported for Norwegian by Bondevik (2018). More recently, Kohrt et al. (2020) and various others have shown that semantic factors play a critical role in English Adjunct Islands.

As in the case of Subject Islands, clausal Adjunct Island violations are usually stronger than phrasal violations. Compare (17a) and (18a) with (20).

	- b. \*Who*<sup>i</sup>* did Mary cry after John hit *<sup>i</sup>*? (Huang 1982, p. 503)

But Gibson et al. (2021) recently showed that if a supporting context is provided, then island effects in tensed adjuncts are significantly ameliorated, suggesting that pragmatics plays a role as well. Further evidence for the presence of semantic-pragmatic factors comes from the fact that the most acceptable Tensed Adjunct Island violations involve relative clauses which express assertions rather than backgrounded information. This is illustrated in (21).

	- b. I called the client who*<sup>i</sup>* the secretary worries if the lawyer insults *<sup>i</sup>*. (Sprouse et al. 2016)
	- c. This is the watch*<sup>i</sup>* that I got upset [when I lost *<sup>i</sup>*]. (attributed to Ivan A. Sag (p.c.) by Truswell (2011, pp. 175, ft.1))

Indeed, Sprouse et al. (2016) found evidence of an Adjunct Island effect in interrogatives but no such effect in relative clauses like (21b). See also Kush et al. (2018, 2019) and Müller and Eggers (2022) for similar findings about such relatives in English and other languages. In sum, Adjunct Islands are not construction-invariant, and seem to be sensitive to semantic and pragmatic factors (Kohrt et al. 2018a, 2018b; Müller and Eggers 2022). The parallel with Subject Islands does not stop here. Repeated exposure to interrogative Adjunct Islands can lead to amelioration effects (Chaves and Putnam 2021, pp. 232, 238). This includes clausal islands like (22), which by the end of the experiment were as acceptable as grammatical controls.

(22) Who would Amy be really happy [if she could speak to __]?

#### *2.3. Complex NP Constraint*

There are various other island effects that are similarly not construction-invariant, and which are attenuated when extraction occurs from structures that do not express backgrounded content. Complex NP Constraint (CNPC) phenomena, illustrated in (23), are one such case.

	- b. \*Which student*<sup>i</sup>* should we report [the teacher who punished *<sup>i</sup>*]?

These islands are graded, as has long been noted (Culicover 1999; Deane 1991; Erteschik-Shir and Lappin 1979; Kluender 1998; Kuno 1987). Compare (23b) with the isomorphic example in (24). Furthermore, CNPC violations ameliorate with repeated exposure, as shown by Snyder (2000) and Goodall (2011).

Erteschik-Shir (1977, chp. 2) first noted that in CNPC exceptions the matrix predicate is in general less informative than the embedded one, and main verbs like *hear* and *know* are almost devoid of semantics, which makes it more likely for the main action to be conveyed by the subordinate clause. See Vincent (2021) and Vincent et al. (2022) for experimental evidence confirming that English should be counted among the languages that allow extraction from relative clauses in environments such as the one in (24).

(24) Which kid*<sup>i</sup>* did you hear [a rumor that my dog bit *<sup>i</sup>*]? (Chaves and Putnam 2021, p. 67)

This also explains why CNPC effects tend to vanish in relative clauses that express the assertion of the utterance, as in (25). See Erteschik-Shir and Lappin (1979), McCawley (1981, p. 108), Chung and McCloskey (1983) for more examples, and Kush et al. (2013) and Sprouse et al. (2016) for supporting experimental evidence. The situation is not unlike that of Subject and Adjunct Islands.

	- b. John is the sort of guy*<sup>i</sup>* that I don't know [a lot of people who think well of *<sup>i</sup>*]. (Culicover 1999, p. 230)

#### *2.4. Factive Islands*

Factive Island phenomena exhibit various kinds of circumvention phenomena. As Szabolcsi and Zwarts (1993) originally noted, when a question necessarily has a unique true (and non-negative) answer then the presence of a factive verb hampers extraction, as illustrated in (26). See Oshima (2007), Schwarz and Simonenko (2018), and Abrusán (2014) for elaborations of this conclusion.

	- b. Who did Robin say that [Alex helped __ first]?

As a consequence, there are two ways to circumvent the effect in (26a). One way is to make the question not have a necessarily unique true answer, which can be achieved by replacing the one-time adverb *first* with any other kind of adverb:

(27) Who did Robin know that [Alex helped __ yesterday]?

The other way to circumvent the effect is to convert the unbounded dependency to a declarative, as in (28). This means that Factive islands are not construction-invariant, since they disappear in non-interrogative extractions.

	- b. I met the person who Robin knew that [Alex helped __].
	- c. KIM, Robin knew that [Alex helped __]. MIA he didn't.

But there are other, more subtle, island effects in clausal complements, illustrated in (29), where the interrogatives are not required to have a unique true answer. Here, it is the mere presence of a factive or manner-of-speaking verb that hampers extraction.


Most researchers seem to agree that the explanation for these Bridge verb effects is at least in part pragmatic, although they disagree on the details (Ambridge and Goldberg 2008; Erteschik-Shir 2006a; Kothari 2008; Liu et al. 2022), and if Tonhauser et al. (2018) and Degen and Tonhauser (2022) are correct that factivity is a matter of degree, this would explain why such island effects are fuzzy.

For example, Ambridge and Goldberg (2008) provide evidence suggesting a pragmatic explanation: the more backgrounded the proposition, the stronger the island effect. Liu et al. (2019) challenge these findings and instead provide evidence suggesting that the frequency with which verbs are used in the clausal complement frame is responsible for the acceptability contrasts observed when extracting from factive and manner clausal complements. Liu et al. (2022) conjecture that discourse, semantic, and structural factors might conspire to give rise to the observed frequency distributions, which in turn give rise to acceptability ratings.3

#### *2.5. Interim Summary*

Most of the islands discussed above are not construction-invariant. They are stronger in interrogatives than in relative clauses that express assertions, for example. This suggests a common thread between the Element Constraint, Subject Islands, Adjunct Islands, Factive Islands, and the Complex NP Constraint: asserted content more readily allows extraction than backgrounded (non-at-issue) content; cf. with Erteschik-Shir and Lappin (1979), Kuno (1987), Goldberg (2013), Chaves and Dery (2019), and Abeillé et al. (2020).

This observation allows us to make further predictions. For example, it means that extraction from parentheticals should be impossible, regardless of the construction, because parentheticals by definition express supplementary information, orthogonal to the main assertion. This prediction is borne out by the contrasts in (30) and (31).

	- b. \*It was that article that the union leaders – in case you missed __ – refused to sign the contract.
	- c. \*What the union leaders – in case you missed __ – refused to sign the contract was that article.
	- b. \*It was Robin who David Johnson – I am not sure __ told you this – refused to sign the contract.
	- c. \*Who David Johnson – I am not sure __ told you this – refused to sign the contract was Robin.

Why are island effects gradient, even in interrogative environments? Tonhauser et al. (2018) provide evidence that whether or not speakers commit to the content expressed by subordinate clauses is a matter of degree, as it depends on a number of factors, including the prior probability of the event that is described. If this is correct, then it would provide an explanation for why *wh*-phrases embedded in the subjects of certain interrogatives are more readily interpreted as Foci than others, i.e., more readily extracted, and so on. Another possibility is that the increase in acceptability is due to more general factors, independent of islands, which have more to do with how informants adapt to psycholinguistic tasks. I turn to this matter in the following section.

#### **3. Satiation**

As discussed above, even when the filler-gap construction type is island-inducing, it is often the case that the island effect can be attenuated with repeated exposure (Chaves and Dery 2014, 2019; Clausen 2011; Do and Kaiser 2017; Francom 2009; Goodall 2011; Hiramatsu 2000; Hofmeister 2015; Lu et al. 2021; Snyder 2000, 2017). To be sure, such amelioration is not consistently observed, suggesting that different results arise because different researchers have used different stimuli and different exposure rates (Chaves and Dery 2019; Hiramatsu 2000; Hofmeister 2011; Snyder 2017). In particular, the role of stimulus design cannot be overstated: if sentences that are too complex or awkward are used, satiation is less likely to be observed (Hofmeister 2011; Hofmeister and Sag 2010). Consider for example the sample of items in (32), from Sprouse et al. (2012).

	- b. \*What*<sup>i</sup>* do you sneeze if the dog owner leaves open *<sup>i</sup>* at night?
	- c. \*What*<sup>i</sup>* do you cough if the tourists photograph *<sup>i</sup>* in the exhibit?
	- d. \*What*<sup>i</sup>* do you laugh if the heiress buys *<sup>i</sup>* at the auction? (Sprouse et al. 2012)

Now contrast these with (33), which Chaves and Putnam (2021, p. 238) found induce satiation. Crucially, the adjunct clause coheres much better with the main predication because it expresses a cause that triggers the state described by the psychological predicate. In contrast, there is no obvious relation between the main predication and the conditional clause in (32).

	- b. What*<sup>i</sup>* would Jill get really angry if she missed *<sup>i</sup>*?
	- c. What*<sup>i</sup>* would Allison be really upset if she forgot *<sup>i</sup>*?

The low acceptability of such tensed Adjunct Island violations and their lack of satiation is likely to be due, at least to some extent, to the described propositions. For instance, people don't routinely faint when something is forgotten on stage as in (32a), or typically sneeze if dog owners leave something open at night, as in (32b). These are perfectly possible propositions, but they describe rather unusual situations. The event described by the matrix predication does not cohere particularly well with that of the adjunct's predication. In order to avoid this kind of problem, one would have to norm the declarative counterpart of these items, to ensure that all are equally felicitous and plausible.

The amelioration effect caused by repeated exposure is referred to as *syntactic satiation*, in analogy to the phenomenon of semantic satiation, whereby repetition causes a word or phrase to temporarily lose meaning for the listener. There are two problems with this terminology. First, it is perfectly reasonable that the increase in acceptability is caused by semantic and pragmatic factors, over and above syntactic factors. Second, whereas semantic satiation is a fairly well-understood general reactive inhibition phenomenon (a bottom-up process associated with lower-level neural mechanisms of inhibition), the increase in acceptability during sentence processing is selective: certain island violations robustly ameliorate with repetition, whereas others simply do not, as discussed above. In contrast, repetition of any lexical item can induce the semantic satiation effect. Syntactic satiation seems to be facilitatory in nature, rather than inhibitory, because repeated exposure to island violations does not lead to loss of sentence meaning. Thus, comprehension question accuracy does not decline as island effects ameliorate.

A more concerning problem is that it remains unclear what syntactic satiation actually amounts to. It could be a form of adaptation, caused by changes in the activation of the representations in declarative memory (i.e., a form of priming), residual activation (the mechanism that accounts for priming), a change in the procedural knowledge required to construct the relevant structures (adaptations to the parsing strategy), or belief-updating (violated expectations lead to probabilistic updates, under a Bayesian interpretation).

#### *3.1. Adaptation*

Sensory input is typically noisy and ambiguous, and individuals respond to the challenges created by such variation by using probabilistic expectations (Anderson 1990; Gigerenzer et al. 1999; Newell and Simon 1972). For example, infants already exhibit the ability to integrate prior beliefs, knowledge, and expectations about human actions with new evidence provided by the environment (Xu and Kushnir 2013), and use new evidence to modify their prior expectations (Brandone et al. 2014). Linguistic input is particularly noisy, ambiguous, and variable across individuals and contexts, and therefore it is expected that speakers can adapt to the contingencies of the input. This would enable individuals to make heuristic predictions and robustly cope with such a dynamic linguistic input. For example, it is known that comprehenders create expectations about upcoming words (Altmann and Kamide 1999; Arai and Keller 2013; Creel et al. 2008; DeLong et al. 2005; Kutas and Hillyard 1984; Metzing and Brennan 2003), about upcoming lexical categories (Gibson 2006; Levy and Keller 2013; Tabor et al. 1997), and about syntactic structures (Farmer et al. 2014; Fine et al. 2010, 2013; Fine and Jaeger 2013; Kamide and Mitchell 1997; Lau et al. 2006; Levy 2008; Levy et al. 2012; MacDonald et al. 1994; Malone and Mauner 2018; Stack et al. 2018; Staub and Clifton 2006; Wells et al. 2009), among other modalities of linguistic input. In what follows I will provide a brief survey of this literature and the controversy therein about the nature of adaptation. See Kaan and Chun (2018) for a detailed overview.

#### *3.2. Adaptation in Garden-Path Sentences*

Fine et al. (2010), Kamide and Mitchell (1997), Farmer et al. (2014) and others provide evidence suggesting that syntactic expectations are malleable and quickly adapt to changes in the input. Fine and Jaeger (2013) argue that repeated exposure to *a priori* unexpected structures can reduce, and even completely invert, their processing disadvantage, and *a priori* expected structures can become less expected (even eliciting garden paths) in environments where they are hardly ever observed. As illustrated in (34), past participle verb forms often give rise to a temporary ambiguity between a main verb parse like (34a) and a relative verb parse, seen in (34b). However, (34a) and (34b) differ in that the latter consistently elicits a garden-path effect, because the main verb use of *warned* is much more likely than the relative verb use according to corpus evidence (Roland et al. 2007). This effect has been detected by various researchers, including Stack et al. (2018), Malone and Mauner (2018), Prasad and Linzen (2019, 2021), and Dempsey et al. (2020).

(34) a. The experienced soldiers warned about the dangers before the midnight raid. (main verb parse)

b. The experienced soldiers warned about the dangers conducted the midnight raid. (relative verb parse)

But by making the relative verb use more frequent than the main verb use in a controlled experiment, Fine and Jaeger (2013) found that the garden-path effect can flip: the relative verb parse becomes the default preferential parse, and the main verb parse becomes dispreferred. By the end of the experiment, sentences like (34b) no longer exhibit a garden-path effect because the relative verb parse is now the most frequent and preferential parse, whereas sentences like (34a) now yield a garden-path effect. The latter is called a *reverse ambiguity effect*. Fine et al. (2013) argue that comprehenders adapt to the statistics of the current linguistic environment by generating expectations that reflect the distribution of actual events in the environment. This rational strategy allows comprehenders to reduce the average prediction error experienced during processing.

More recently, Lu et al. (2021) provide evidence suggesting that comprehenders can exhibit speaker-specific satiation to Subject Islands, and argue that syntactic satiation in island phenomena is a form of Bayesian learning *à la* Fine et al. (2010).

The reverse ambiguity effect penalizing *a priori* preferred structures found by Fine and Jaeger (2013) seems to be elusive, however: it was replicated by Sikos et al. (2016), but not by Stack et al. (2018); see also Jaeger et al. (2018) for a response. It is worth pointing out that these studies used different experimental items, different numbers of participants, different amounts of exposure, different compensation levels for participants, and different statistical methods. As I will discuss below, at least some of these differences may play a crucial role in promoting adaptation.

Second, although the reading times of garden-path sentences decreased in all of the aforementioned studies, the same was true of all other sentences, including controls that were not temporarily ambiguous. In fact, there is independent evidence that reading times generally decrease exponentially as a function of practice (Heathcote et al. 2000). Given this evidence, Stack et al. (2018), Prasad and Linzen (2019, 2021), and Dempsey et al. (2020) argue that the reduction in reading time attributed to syntactic adaptation is confounded with a more general adaptive phenomenon, called *task adaptation*: adaptation driven by increased familiarity with the experimental procedure rather than by syntactic structure. For Dempsey et al. (2020), what is commonly referred to as syntactic satiation simply is task adaptation. Task adaptation does not directly depend on the syntactic structure of the sentence, and could be due to a number of factors, such as word frequency, plausibility, predictability, and syntactic disambiguation difficulty.

#### *3.3. Syntactic Satiation as Adaptation in Islands*

For Brown et al. (2021), syntactic satiation in islands is a form of task adaptation, and has nothing to do with grammar or island phenomena. In their experiments, only the items with intermediate acceptability became more acceptable, and they did so only at the beginning of the experimental session, regardless of syntactic construction. However, other island satiation studies find different patterns. For example, Hiramatsu (2000) found Subject Island satiation with 7 exposures but not with 5, which should be impossible if satiation mainly occurred at the beginning of the experiment. Similarly, Hofmeister (2015) found that Adjunct Islands satiate after 8 exposures but not before (this experiment was since replicated; see Chaves and Putnam 2021, p. 232 for details). None of these results are expected if satiation is mainly confined to the beginning of the experiment.

What is more, different conditions usually satiate at different rates, contrary to the generalization put forth by Brown et al. (2021). Examples of the item types used by Hofmeister (2015) are given in (35).

	- b. Just a few years ago, terrorists would have thought twice before attacking the city of Mosul. [Non-island condition]
	- c. The rebels in the jungle captured the diplomat who pleaded with the villagers after they threatened to kill his family for not complying with their demands. [Right-branching]
	- d. The diplomat who the rebels who were in the jungle captured pleaded with the villagers after they threatened to kill his family for not complying with their demands. [Center-embedded]

Linear Mixed-Effect models with acceptability as the dependent variable and presentation order as the predictor (allowing the intercept to be adjusted by list and item, as random effects) reveal that the acceptability of the center-embedding condition increased significantly as the experiment progressed (*β* = 0.02, *SD* = 0.005, *t* = 4.042, *p* < 0.0001), as did the adjunct island condition (*β* = 0.01, *SD* = 0.004, *t* = 3.89, *p* = 0.0001), whereas the right-branching condition did not (*β* = −0.01, *SD* = 0.005, *t* = −0.19, *p* = 0.84). The non-island condition improved as well, but the effect size was by far the smallest (*β* = 0.007, *SD* = 0.002, *t* = 2.66, *p* = 0.007). This is seen in Figure 1.

Crucially, the right-branching condition received middle ratings, and yet did not experience any acceptability changes. Moreover, adjunct island items (at the very bottom of the acceptability range) rose sharply and consistently only in the last two thirds of the experiment. These results are unexpected under the account of Brown et al. (2021).

As a final example, consider the satiation patterns of three different types of clausal adjuncts from the data in Chaves and Putnam (2021, p. 238), shown in Figure 2. Again, extractions from one item type (in this case, conditional clauses like (22) above) exhibit a more consistent trajectory than the others. Again, these results challenge the generality of the conclusions of Brown et al. (2021).

**Figure 1.** Differential effect of repeated exposure in Hofmeister (2015).

**Figure 2.** Effect of repeated exposure in Clausal Adjuncts in Chaves and Putnam (2021, p. 238).

Prasad and Linzen (2021) suggest that sentences that are difficult to process undergo a sharper rate of task adaptation than easier sentences, which in turn overwhelms the effect of syntactic adaptation, if any exists. They argue that the effect of syntactic adaptation is very small, and requires very large numbers of participants (around 1000) to detect. If this is the case, then there should be a (negative) correlation between acceptability and the satiation coefficient. To test this hypothesis, data from three separate experiments were used. First, the clausal adjunct island satiation data mentioned above (Chaves and Putnam 2021, p. 238) were obtained, and the (significant) satiation coefficients, per item, were compared with the respective mean acceptability ratings. The correlation was not significant (*t* = 2.06, *p* = 0.13), and had it been significant, it would have been positive, not negative. Next, the significant satiation coefficients from the Adjunct Island violations in Hofmeister (2011) were computed, by item, as above, and correlated with the respective mean acceptability ratings. Again, no significant correlation was found (*t* = −0.28, *p* = 0.79). The same was done for the Subject Island satiation data from Chaves and Dery (2019), Chaves and Putnam (2021, p. 212), and again the correlation was not significant (*t* = 1.12, *p* = 0.34). These results are the opposite of what Prasad and Linzen (2021) would predict if island satiation affected low-acceptability sentences more than high-acceptability sentences. And if more extreme island violations do not satiate more, then island satiation cannot amount to just task adaptation, according to the logic of Prasad and Linzen (2021).

#### *3.4. Disentangling Task Adaptation from Syntactic Adaptation*

Malone and Mauner (2020) develop a new approach to decoupling syntactic adaptation from task adaptation, and show that the former is robust and detectable without large numbers of participants. In a nutshell, they propose that the effect of task adaptation be factored out by using *Task-Adapted Reading Times* (TART). These are conceptually similar to residual reading times that correct for the effect of word length on reading speed within individual participants. The TART procedure uses the speed-up in reading times in the distractor items as a proxy for task adaptation. The assumption is that, since distractors are structurally unambiguous, uncomplicated sentences, any reduction in their reading times over the course of the experiment should (i) not be due to syntactic adaptation, and therefore (ii) be due to task adaptation, as participants mechanically or cognitively adjust to the task. Distractor regions 4 through 11 were selected, and regressed onto stimulus order (not critical item order) for each participant. Because these regressions do not include critical items, and no syntactic learning should occur in distractors, this method measures task adaptation unconfounded with syntactic adaptation.

As TART involves regressing reading times over distractor item order, the first step is to correct reading times in the selected region of analysis by residualizing them on word length. The second step is to regress the length-corrected distractor reading times discussed above on item order, with participant as a random factor. The result should be a model that captures the unique rate of change in reading time over the course of the experiment for each participant. These TART values are then subtracted from the reading times of each of the critical item regions, and the resulting reading times are residualized based on word length, per region and participant, as is standard (Trueswell and Tanenhaus 1994). The new reading times, now adjusted for both character length and each participant's unique drift over time due to task adaptation, are ready to be fit in the primary analytic regression model. If distractors are structurally diverse, unambiguous, and uncomplicated sentences, then all syntactic adaptation must come from the regularities in the critical items.

#### *3.5. The Role of Reward*

It is now well-known that learning requires attention, alertness, and focus, and that predicted reward (dopamine) can not only help engage these systems but also promote synaptic plasticity by enhancing long-term potentiation and depression (Legenstein et al. 2008; Otmakhova and Lisman 1996; Reynolds et al. 2001; Schultz 1998). It follows that adaptation in language processing should be sensitive to predicted reward, not just to structural frequency and task adaptation. There is currently no standard for the compensation of participants in psycholinguistics experiments, and perhaps this is a problem. For example, Fine et al. (2010) compensated participants with course credit, Fine and Jaeger (2013) paid participants \$10, Stack et al. (2018) paid \$4, Dempsey et al. (2020) paid \$3, and Prasad and Linzen (2021) paid \$6.51 per hour. It is therefore possible that these participants experienced different levels of motivation and focus while performing the task, which may have affected the probability of learning regularities in the items. As Christianson et al. (2022) show, both online and offline measures of processing and comprehension are susceptible to focus and motivation levels, leading to results that are not reflective of normal human language processing.

To probe the effect of predicted reward and provide independent support for the TART methodology, an experiment was designed and conducted to determine whether syntactic adaptation is sensitive to predicted reward, over and above structural frequency. I focused on a garden-path effect rather than on an island because there is no question that such constructions are grammatical, and because all of the literature on task adaptation has focused on garden paths. Future work should probe island constructions.

#### **4. TART Reward Experiment: Adaptation to Complex Sentences**

#### *4.1. Methods*

#### **Subjects**

In this between-subjects experiment, 100 participants with US-based IP addresses were recruited via the Amazon Mechanical Turk marketplace, all of whom self-reported having grown up speaking English as a first language in a language questionnaire conducted after the experiment concluded. Participants were informed that their responses to the language questionnaire had no bearing on their compensation.

Only subjects with at least a 98% approval rating from previous jobs and with over 10,000 previously approved tasks were allowed to participate. Participants were told the experiment consisted of reading 32 sentences and answering yes/no comprehension questions correctly. Participants were compensated with \$2.40 for their participation.

Participants were randomly assigned to either the Control group or the Bonus group. All participants were informed that the experimenters might not be able to compensate them if their comprehension accuracy dipped significantly below 70%, although in practice no participants were excluded from compensation. The individuals in the Bonus group saw additional text and instructions informing them that if their comprehension question accuracy was above 75%, they would receive a bonus of \$4.80, for a total of \$7.20. The participants in each group saw the same stimuli.

#### **Ethics statement**

This study was conducted with the approval of the Institutional Review Board of the University of the State of New York at Buffalo. All participants gave their informed written consent.

#### **Materials**

A total of 16 items were constructed, all of which exhibited the classic subject/object ambiguity in (36), whereby a noun phrase (underlined) is a temporarily plausible object of the preceding verb, but is in fact the subject of the following main verb (bold font) (Christianson et al. 2001; Ferreira and Henderson 1990; Frazier and Rayner 1982; Jacob and Felser 2016).4 This late closure parse is well-known to be susceptible to priming, as reflected behaviorally by decreased reading times (Noppeney and Price 2004; Traxler 2015), and physiologically by attenuated responses in the left temporal pole (Noppeney and Price 2004).

(36) a. After <sup>1</sup>| the <sup>2</sup>| Mayor <sup>3</sup>| visited <sup>4</sup>| the <sup>5</sup>| patients <sup>6</sup>| **were** <sup>7</sup>| moved <sup>8</sup>| to <sup>9</sup>| different <sup>10</sup>| rooms. <sup>11</sup>|

[The Mayor paid a visit after the patients were moved. True or False?]

b. While <sup>1</sup>| the <sup>2</sup>| customers <sup>3</sup>| ate <sup>4</sup>| some <sup>5</sup>| food <sup>6</sup>| **was** <sup>7</sup>| cooking <sup>8</sup>| on <sup>9</sup>| the <sup>10</sup>| grill. <sup>11</sup>|

[The customers ate only after all the cooking was done. True or False?]

Half the items were disambiguated by 'was', the other half by 'were'. The prepended adverbs were 'after', 'although', 'as', 'though', 'when', and 'while', evenly distributed across items. To maximize the garden-path effect, the subordinate verbs came from the subset of verbs from Ferreira and Henderson (1991) and Staub (2007) that had the highest proportion of transitive uses relative to intransitive uses, according to both Gahl et al. (2004) and a corpus study using the Corpus of Contemporary American English (Davies 2008). The 16 items were pseudo-randomized and interspersed with 16 distractors, illustrated in (37). The latter used the same prepended adverbs (plus the adverbs 'because', 'if', and 'whenever'), evenly distributed across distractors, and a variety of verbal structures different from the items. Across items and distractors, no two stimuli contained the same verb, so as to avoid priming effects caused by verb repetition (Fine and Jaeger 2016; Traxler and Pickering 2005). Although all participants in the experiment saw the same stimuli, no two participants saw the same order of stimuli.

(37) a. Though <sup>1</sup>| the <sup>2</sup>| bus <sup>3</sup>| driver <sup>4</sup>| missed <sup>5</sup>| a <sup>6</sup>| street <sup>7</sup>| Sue <sup>8</sup>| was <sup>9</sup>| at school <sup>10</sup>| on time. <sup>11</sup>|

[Sue brought a child home after the bus missed its stop. True or False?]

b. If <sup>1</sup>| the <sup>2</sup>| radar <sup>3</sup>| is <sup>4</sup>| correct <sup>5</sup>| the <sup>6</sup>| storm <sup>7</sup>| will <sup>8</sup>| be <sup>9</sup>| here <sup>10</sup>| tomorrow. <sup>11</sup>|

[The radar data can be used to make predictions about the weather. True or False?]

#### **Procedure**

Subjects read sentences in a self-paced moving window display (Just et al. 1982), using the self-paced reading mode of the PCIbex platform (Zehr and Schwarz 2018). Three practice trials were conducted before the experiment proper started. All experimental items were followed by a Yes/No comprehension question probing the lingering of the initial interpretation. The form of the comprehension questions varied from item to item, to prevent participants from strategizing their answers. The correct answer was "yes" half of the time, and after submitting each answer participants were informed about whether their selection was correct. The stimuli were pseudo-randomized so that no two participants saw the items in the same order, and no more than two critical items were allowed to immediately follow each other. Participants took an average of 15 min to complete the experiment, meaning that Control group participants were paid at an hourly rate of about \$9.40, while Bonus group participants were paid at a \$28.80 hourly rate.

#### **Filtering**

Participants with comprehension question accuracy below 75% were excluded, resulting in 12% data loss (11.3% for the Control group, and 11.2% for the Bonus group). Only distractors were used for this exclusion criterion, since comprehension questions about garden-path sentences can be expected to be harder to answer than comprehension questions about unambiguous sentences (Dempsey et al. 2020). Finally, all observations with reading times below 100 ms or above 2000 ms were removed, excluding 1% of all observations. The results are qualitatively similar if reading times are left unfiltered, or if they are log-transformed.

#### *4.2. Results*

The mean accuracy of the Control group was 89% (SD = 0.3), and that of the Bonus group was 90% (SD = 0.29). Logit models with accuracy as the dependent variable and item order as the predictor were fit for each participant group, revealing that item accuracy increased for the Bonus group during the experiment (*β* = 0.04, SE = 0.01, z = 3.73, *p* = 0.0001), but not for the Control group (*β* = 0.01, SE = 0.006, z = 1.67, *p* = 0.09). This suggests that the Bonus group participants became better at correctly interpreting the sentences in the experiment, but Control group participants did not. In addition, the Control group read distractors about 10 ms faster than the Bonus group per exposure.
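Schematically, the logic of this analysis can be reproduced with a minimal logistic regression fit by gradient descent; the accuracy data below are hypothetical, and the actual models were fit with standard statistical software:

```python
# Schematic logit model (hypothetical data): probability of a correct
# answer as a function of item presentation order, fit by plain
# gradient descent on the negative log-likelihood.
import math

def fit_logit(xs, ys, lr=0.01, epochs=5000):
    """Fit P(correct) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y          # gradient w.r.t. intercept
            g1 += (p - y) * x    # gradient w.r.t. order coefficient
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical Bonus-group responses: accuracy rises over the 16 items.
orders  = list(range(1, 17))
correct = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logit(orders, correct)  # b1 > 0 means accuracy improves
```

A positive order coefficient `b1`, as in the Bonus group, corresponds to the predicted probability of a correct answer rising over the course of the experiment.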

In order to avoid the usual convergence problems of Linear Mixed-Effect models, power concerns, and the well-known limitations of frequentist significance testing (Kruschke 2015; Lavine 1999; Sorensen et al. 2016), Bayesian Linear Mixed-Effect models were fit using the BRMS package (Bürkner 2017). The dependent variable was the task-adapted, length-corrected residual reading time (TART), with item presentation order, participant group, and their interaction as predictors, allowing the intercept to be adjusted by participant and item. The model had a flexible threshold and weakly informative priors, and was checked for convergence (R̂ = 1) after fitting with four chains and 2000 iterations, half of which were the warm-up phase. A significant interaction between participant group and item order (1–16) was found at the spill-over region 8 (see Table 1):

**Table 1.** Coefficients for the effect on TARTs from the interaction between Control/Bonus reward group and the item order, per region, according to Bayesian Mixed-effect models.


Plots illustrating the TART values in regions 7 through 10 are shown in Figure 3.

**Figure 3.** Effect of repeated exposure and reward differential in spill-over region 8.

For completeness, a region-by-region plot of the plain residual reading times is provided in Figure 4. The behavior of the two groups of participants was generally the same, except that the Bonus group slowed down at region 5 (approaching the critical region) and exhibited greater variability than the Control group, which is consistent with those participants being more attentive and taking greater care to perform the task.

**Figure 4.** Mean residual reading times for all sentence regions.

#### *4.3. Discussion*

The results suggest that participants in the Bonus group used cues in the input to predict the upcoming structure, and adapted strategically to the critical items faster than the Control group participants did. Frequency can thus compound with reward and speed up the processing of complex sentences, in this case a classic garden-path construction.

It is possible that the null effects found in some studies of garden-path adaptation (and of island adaptation) were caused by factors that are usually not controlled for: the complexity of the items, their naturalness (norming their non-extracted counterparts would address this), and the motivation and focus that participants experience when performing a rather repetitive and artificial task (assigning numbers to sentences, or reading sentences in moving displays), as pointed out by Christianson et al. (2022). To be sure, further research is necessary to investigate this matter in more detail, but if it turns out that reward does in fact modulate syntactic adaptation, then a new tool can be added to the experimenter's toolkit, one which can reduce the chances of null effects caused by low motivation and focus due to low perceived reward.

#### **5. Conclusions**

It is increasingly clear that most island effects are not construction-invariant (Abeillé et al. 2020). Constructions that express assertions tend to yield weaker island effects, for example. Moreover, even in constructions where strong island effects are observed, these are far from categorical. In the present work I have drawn attention to a wide range of factors that likely contribute to that gradience. First, the complexity of the items and the plausibility of the expressed propositions likely play a role (Hofmeister and Sag 2010). Second, the number of exposures often has an effect, in that it can sometimes cause acceptability ratings to rise. Sometimes that acceptability increase is restricted to the first exposures, and sometimes it is not: it is a highly dynamic phenomenon. The acceptability increase instigated by repeated exposure is also selective, in that it does not always affect all sentence types equally. In particular, there is no correlation between sentence acceptability and the rate of acceptability change.

The mechanism that drives the amelioration effects remains poorly understood, but extant evidence suggests that speakers are highly sensitive to the items themselves, so that sentences that are excessively complex, lack semantic plausibility, or require unusual contexts in order to be felicitous in discourse are less likely to improve with repeated exposure. The amount of exposure also seems to matter, since a number of studies found thresholds after which acceptability increases are observed (Snyder 2021). A survey of the literature on syntactic/task adaptation suggests that syntactic satiation likely consists, at least in part, of syntactic adaptation (Fine et al. 2010). This is consistent with the notion that the grammar is gradient and flexible (Francis 2022).

Finally, the present paper puts forth a new factor that can promote adaptation to complex syntactic structures: predicted reward. The underlying mechanism is straightforward: the more motivated and focused comprehenders are, the faster they can adapt to unusual and complex input, over and above the effect of frequency and task adaptation. This can shed light on why syntactic adaptation, in garden paths and in certain islands, is not systematically observed in experimental research (Christianson et al. 2022; Kaan and Chun 2018).

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Notes**

	- i. [Succeed] he [[must ] or [be forever shamed]].
	- j. [[Choose wisely] or [be forever shamed]].

#### **References**


Kehler, Andrew. 2002. *Coherence, Reference, and the Theory of Grammar*. Stanford: CSLI Publications.

Kim, Ilkyu. 2021. A note on the factorial definition of island effects. *Second Language Research* 57: 211–23. [CrossRef]


Malone, Avery, and Gail Mauner. 2018. What do readers adapt to in syntactic adaptation? Paper presented at the 31st Annual CUNY Sentence Processing Conference, Davis, CA, USA, March 15–17.

Malone, Avery, and Gail Mauner. 2020. Syntactic adaptation for reduced relative clauses is not reducible to task adaptation. Paper presented at the Poster Session, 33rd CUNY Conference on Human Sentence Processing, Amherst, MA, USA, March 19–21.

McCawley, James D. 1981. The syntax and semantics of English relative clauses. *Lingua* 53: 99–149. [CrossRef]

Metzing, Charles, and Susan E. Brennan. 2003. When conceptual pacts are broken: Partner-specific effects on the comprehension of referring expressions. *Journal of Memory and Language* 49: 201–13. [CrossRef]

Michaelis, Laura, and Hartwell Francis. 2007. Lexical subjects and the conflation strategy. In *Topics in the Grammar-Pragmatics Interface: Papers in honor of Jeanette K. Gundel*. Edited by N. Hedberg and R. Zacharski. Amsterdam: John Benjamins.

Müller, Christiane. 2017. Extraction from adjunct islands in Swedish. *Norsk Lingvistisk Tidsskrift* 35: 67–85.

Müller, Christiane, and Clara Ulrich Eggers. 2022. Island extractions in the wild: A corpus study of adjunct and relative clause islands in Danish and English. *Languages* 7: 125.

Na, Younghee, and Geoffrey Huck. 1992. On extracting from asymmetrical structures. In *The Joy of Grammar*. Edited by D. Brentari, G. N. Larson, and L. A. MacLeod. Amsterdam: John Benjamins, pp. 251–74.

Newell, Allen, and Herbert A. Simon. 1972. *Human Problem Solving*. Englewood Cliffs: Prentice-Hall.

Noppeney, Uta, and Cathy J. Price. 2004. An fMRI study of syntactic adaptation. *Journal of Cognitive Neuroscience* 16: 702–13. [CrossRef]

Oshima, David Y. 2007. On factive islands: Pragmatic anomaly vs. pragmatic infelicity. Paper presented at the JSAI'06 20th Annual Conference on New Frontiers in Artificial Intelligence, Tokyo, Japan, June 5–9. Berlin/Heidelberg: Springer, pp. 147–61.

Otmakhova, Nonna A., and John E. Lisman. 1996. D1/D5 dopamine receptor activation increases the magnitude of early long-term potentiation at CA1 hippocampal synapses. *Journal of Neuroscience* 16: 7478–86. [CrossRef]

Pollard, Carl, and Ivan A. Sag. 1994. *Head-Driven Phrase Structure Grammar*. Chicago: University of Chicago Press and Stanford CSLI.

Prasad, Grusha, and Tal Linzen. 2019. Reassessing the evidence for syntactic adaptation from self-paced reading studies. Paper presented at the Poster Session, 32nd CUNY Conference on Human Sentence Processing, Boulder, CO, USA, March 31.

Prasad, Grusha, and Tal Linzen. 2021. Rapid syntactic adaptation in self-paced reading: Detectable, but only with many participants. *Journal of Experimental Psychology: Learning, Memory and Cognition* 47: 1156–72. [CrossRef]

Reynolds, John N., Brian I. Hyland, and Jeffrey R. Wickens. 2001. A cellular mechanism of reward-related learning. *Nature* 413: 67–70. [CrossRef]


Schultz, Wolfram. 1998. Predictive reward signal of dopamine neurons. *Journal of Neurophysiology* 80: 1–27. [CrossRef] [PubMed]

Schwarz, Bernhard, and Alexandra Simonenko. 2018. Factive islands and meaning-driven unacceptability. *Natural Language Semantics* 26: 253–79. [CrossRef]

Shafiei, Nazila, and Thomas Graf. 2020. The subregular complexity of syntactic islands. In *Proceedings of the Society for Computation in Linguistics*. Amherst: University of Massachusetts Amherst, vol. 3.

Sikos, Les, H. Martin, Laura Fitzgerald, and D. Grodner. 2016. Memory-based limits on surprisal-based syntactic adaptation. Paper presented at the 29th Annual CUNY Conference of Human Sentence Processing, Gainesville, FL, USA, March 3–5.

Snyder, William. 1994. A psycholinguistic investigation of weak crossover, islands, and syntactic satiation effects: Implications for distinguishing competence from performance. Paper presented at the 7th CUNY Conference of Human Sentence Processing, New York, NY, USA, March 17–19.

Snyder, William. 2000. An experimental investigation of syntactic satiation effects. *Linguistic Inquiry* 31: 575–82. [CrossRef]

Snyder, William. 2017. On the nature of syntactic satiation. *Unpublished manuscript*.

Snyder, William. 2021. Satiation. In *The Cambridge Handbook of Experimental Syntax*. Edited by G. Goodall. Cambridge: Cambridge University Press.

Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. *Quantitative Methods for Psychology* 12: 175–200. [CrossRef]

Sprouse, Jon. 2007. Continuous acceptability, categorical grammaticality, and experimental syntax. *Biolinguistics* 1: 118–29. [CrossRef]

Sprouse, Jon. 2009. Revisiting satiation: Evidence for an equalization response strategy. *Linguistic Inquiry* 40: 329–41. [CrossRef]

Sprouse, Jon, Ivano Caponigro, Ciro Greco, and Carlo Cecchetto. 2016. Experimental syntax and the variation of island effects in English and Italian. *Natural Language & Linguistic Theory* 34: 307–44.

Sprouse, Jon, Matt Wagers, and Colin Phillips. 2012. A test of the relation between working memory capacity and syntactic island effects. *Language* 88: 82–123. [CrossRef]

Stack, Caoimhe M. Harrington, Ariel N. James, and Duane G. Watson. 2018. A failure to replicate rapid syntactic adaptation in comprehension. *Memory & Cognition* 46: 864–77.

Staub, Adrian. 2007. The parser doesn't ignore intransitivity, after all. *Journal of Experimental Psychology: Learning, Memory, and Cognition* 33: 550–69. [CrossRef] [PubMed]


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Languages* Editorial Office E-mail: languages@mdpi.com www.mdpi.com/journal/languages

www.mdpi.com ISBN 978-3-0365-6317-6