1. Introduction
This paper is a corpus study of ‘not…until’ constructions across a sample of European standard languages extracted from the parallel text corpus Europarl (Koehn 2005). A typical Europarl example is (1).
(1) The guidelines are not implemented until the end of 2010. [sentence uttered before 2010]
Until in (1) is a temporal preposition linking two event phases: a negative pre-phase of not implementing the guidelines is followed by a positive post-phase of implementation of the guidelines. The change from the negative to the positive phase occurs at or shortly after the time denoted by the NP complement of until (the end of 2010). As a temporal connective, until can also link two clauses, as in (2).
(2) Naturally, Turkey cannot join the EU until all the criteria are met.
The speaker in (2) is a member of the European Parliament who argues that the situation of Turkey not-joining the EU will last until something happens that will lead to a change in state. Such clause linking is frequently encoded by not…until in Europarl, but we also find other means of expression, such as only…when in (3) or if in (4).
(3) Only when corruption has genuinely been eradicated in European countries should we try reverting to the imperious recommendations granted to various countries in the resolutions adopted, unfortunately, by us.
(4) Europe must mobilise the Solidarity Fund and we know that if the budget is not approved, the fund cannot be mobilised.
The examples report a change in state or a potential change in state. With the PP in (1), this is a purely temporal change from a negative phase to a positive phase. In complex sentences like (2)-(4), the change in state is driven by the event of the subordinate clause. The grammar of until in (2) builds on a negative main clause and an affirmative temporal subordinate clause, a configuration that we abbreviate as NA. Temporality and conditionality are often mixed (note the modal verbs should in (3) and can in (4)), which leads to various related expressions. To give an idea of the range of possibilities, a non-exhaustive list of English paraphrases is provided in Table 1.
Table 1 anticipates our results about the major types of expressions in the ‘not…until’ domain across European languages. Throughout the paper, we use small caps (until, before, if) to refer to cross-linguistic types of connectives, and italics (until, before, if) to refer to language-specific instantiations of a particular category.
Table 1 illustrates that the meaning conveyed by the original sentences in (1) and (2) can be expressed by various temporal connectives (until, before, when, after), exceptive phrases (without) or conditionals (if, unless). Depending on the connective used, we find a negation in the main clause (until, before: NA), a negation in both main and subordinate clause (as.long.as, if, without, unless: NN), or a focus particle (only) in the main clause that combines with an affirmative subordinate clause (temporal or conditional: AA). Interestingly, the configuration AN is missing: there is no paraphrase in Table 1 that combines an affirmative main clause with a negative temporal clause. Examples with temporal NPs (rather than full clauses) have equivalents to the NA- and AA-construals, but not to the NN-construal.
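The polarity configurations described above can be made concrete with a small sketch. The Python snippet below is illustrative only: the function names, the dot-separated strategy labels, and the lookup dictionary are assumptions introduced here, while the expected patterns themselves follow the description of Table 1 given in the text.

```python
# Illustrative sketch; names and label spellings are hypothetical,
# not taken from the authors' annotation scheme or code.

# Expected polarity pattern per connective type, as described for Table 1.
# First letter: main clause; second letter: subordinate clause
# (N = negated, A = affirmative).
EXPECTED_PATTERN = {
    "until": "NA",
    "before": "NA",
    "as.long.as": "NN",
    "if": "NN",
    "without": "NN",
    "unless": "NN",
    "only.when": "AA",
    "only.after": "AA",
    "only.if": "AA",
}


def polarity_pattern(main_negated: bool, sub_negated: bool) -> str:
    """Return the two-letter polarity code for a main/subordinate clause pair."""
    return ("N" if main_negated else "A") + ("N" if sub_negated else "A")


def matches_expectation(connective: str, main_negated: bool, sub_negated: bool) -> bool:
    """Check whether an observed clause pair shows the pattern expected for its connective."""
    return EXPECTED_PATTERN[connective] == polarity_pattern(main_negated, sub_negated)


# Example (2): negative main clause, affirmative 'until' clause.
print(polarity_pattern(True, False))
print(matches_expectation("until", True, False))
```

Note that no entry in the dictionary maps to AN, which mirrors the gap observed in the paraphrases.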
Table 1 illustrates that both temporal and conditional strategies are used. We know that these meanings are intertwined, for instance, in the use of English when as a temporal connective and a domain restrictor (see Farkas and Sugioka 1983 for discussion). Our corpus study looks at these overlapping domains from a new angle by investigating the distributional patterns within and across languages. In this paper, we will investigate to what extent grammatical paraphrases of not…until such as the ones listed in Table 1 occur in a range of European languages represented in the Europarl corpus, and what determines their choice language-internally and cross-linguistically. The research questions we will address in the paper are listed in Table 2.
The methodology relies on parallel corpus data, and we use multidimensional scaling as a statistical and visualization technique to reveal the patterns. This resembles the approaches in Wälchli and Cysouw (2012) and Wälchli (2018/2019), and has been dubbed Translation Mining by van der Klis et al. (2017). The methodology will be introduced in Section 3, but see van der Klis and Tellings (2021) for a more exhaustive overview. A special feature of this paper is that we use Translation Mining not only to investigate cross-linguistic variation in a lexical domain (in our case, choice of connective), but also to study the co-occurrence of two grammatical markers: connective and polarity marking in main and subordinate clauses. These markers interact compositionally to determine the semantics and pragmatics of the ‘not…until’ construction. We thereby contribute to the underexplored field of cross-linguistic variation with respect to compositional meaning (see von Fintel and Matthewson 2008 for the need to study variation of meaning composition). Finally, this work can be seen as connecting insights and methodology from the typological approach and the formal semantic approach.
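To give a rough idea of the kind of input that multidimensional scaling operates on, the following Python sketch builds a dissimilarity matrix over parallel contexts. It assumes, purely for illustration, that each context is annotated with one strategy label per language and that the dissimilarity of two contexts is the proportion of languages in which their labels differ; the distance measure actually used is described in Section 3 and the works cited. The toy data and all names are invented.

```python
# Illustrative sketch with invented toy data; not the authors' implementation.
from typing import List, Sequence


def mismatch_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of languages in which two contexts receive different labels."""
    if len(a) != len(b) or not a:
        raise ValueError("contexts must be annotated for the same, non-empty set of languages")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(contexts: List[Sequence[str]]) -> List[List[float]]:
    """Pairwise dissimilarities between all contexts (symmetric, zero diagonal)."""
    return [[mismatch_distance(a, b) for b in contexts] for a in contexts]


# Rows: parallel contexts; columns: four hypothetical languages.
toy_contexts = [
    ("until", "before", "until", "as.long.as"),
    ("until", "before", "only.when", "as.long.as"),
    ("if", "if", "only.when", "if"),
]

for row in distance_matrix(toy_contexts):
    print(row)
```

A matrix of this shape could then be passed to any standard multidimensional scaling routine, which places contexts with similar translation profiles close together in a low-dimensional map.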
Our corpus study proceeds in two steps that are based on two different multilingual datasets, named D1 (fewer languages, more parallel datapoints) and D2 (more languages, fewer parallel datapoints), both extracted from Europarl. We start in Section 3 with dataset D1, which is constructed based on information from the literature discussed in Section 2. It contains 7 European languages, which exemplify the main clusters of connectives found in Wälchli (2018/2019). The intermediate results of analyzing D1 in Section 4 reveal that there is stability with respect to the combination of connective and polarity pattern, as predicted by compositional semantics (research question Q2). Future vs. past time reference turns out to play a role in the balance between conditionality and temporality, and in that sense the Europarl data fill a gap in comparison to earlier discussions in the literature (Q4). Surprisingly, we find much more variation in connective choice than previous literature led us to expect (research questions Q1 and Q3). In order to deal with the large amount of variation, we created a second dataset D2 which contains fewer datapoints, but more languages. The increase in number of languages to 21 allows for more robust statistical testing of patterns of cross-linguistic variation and stability (Section 5). The analysis of D2 in Section 5 replicates the two main findings from D1 in terms of strategies (Q1) and compositionality (Q2). The larger set of languages reveals more language-internal and cross-linguistic stability in the data after all, and thus resolves some of the issues that arose after D1 (research question Q3). Before we proceed to the parallel corpus study, we provide a short background on the construction at hand from the perspective of the semantic and typological literature in Section 2.
6. Discussion
In this section, we address the research questions we formulated in Table 2 using the combined insights gained from datasets D1 and D2. We end with a brief discussion of the way our results fit into the larger context of cross-linguistic research by considering some additional issues.
D1 displayed much more language-internal variation than we expected, and this variation was replicated in D2. All strategies corresponding to the paraphrases in Table 1 were found in the cross-linguistic parallel text data, even though only three strategies were reflected in the search string for D1 and none in D2 (with Swedish förrän not corresponding to any of the strategies in Table 1).
However, variation is not infinite:
(i) We find predominantly until, before and as.long.as;
(ii) There are a few cases of alternative strategies revealing an extension into the domain of conditionals ((only)…if next to only…when/only…after) and into the domain of exceptive clauses (not…without);
(iii) Although none of the languages under investigation uses a single strategy to convey the ‘not…until’-meaning, we find some languages in which one or two forms are used as dominant strategies.
The interaction of connective type and polarity largely corresponds to the expectations we set out from in Table 1. There are exceptions, but they are all systematic. They can be accounted for by lexical negation (Section 4.5.2) and expletive negation (Section 4.5.3). Expletive negation is restricted to the until- and before-strategies. A third, arguable type of exception concerns ‘only’ containing a negative element, such as French ne…que ‘only’, as part of the only.when-strategy.
The analysis in Section 5, based on D2, suggests that variability across the ‘not…until’ domain can be explained by systematic language-internal and cross-linguistic variability. Two strategies, (not.)if and only.when, are mainly due to language-internal differences. Minor exceptions are German, Dutch and Danish, where the only.when-strategy is slightly more common than in other languages (which is no surprise for Dutch and German, see Section 2.1). Some languages have one marking strategy that occurs in more than 50% of the data in D2. This is either before (Finnish and Danish), a specific dedicated marker (Swedish förrän, diachronically deriving from before), until (English and Spanish) or an underspecified as.long.as/until marker (Bulgarian, Slovene, Czech and Slovak). Other languages, including Portuguese, Latvian, and Polish, are mixed. So far the results are largely the same as in Wälchli (2018/2019) for data from the New Testament. Our results differ, however, for French, Dutch, German and Estonian, which are also mixed in the Europarl data. Notably, the relevance of the as.long.as-strategies in French (tant que) and Dutch (zolang) was entirely missed in both de Swart (1996) and Wälchli (2018/2019). As expected (see Section 4.5.3), expletive negation only occurs in before-, until- and underdifferentiated as.long.as/until-connectives.
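The distinction between languages with a dominant strategy and mixed languages can be sketched as a simple threshold check. The Python snippet below assumes, for illustration, that each language comes with a list of strategy labels, one per datapoint, and applies the 50% criterion mentioned above; the function name and the label lists are invented and do not reproduce the corpus counts.

```python
# Illustrative sketch; the data below are invented, not the D2 counts.
from collections import Counter
from typing import Optional, Sequence


def dominant_strategy(labels: Sequence[str], threshold: float = 0.5) -> Optional[str]:
    """Return the strategy whose share exceeds the threshold, or None if the language is mixed."""
    if not labels:
        return None
    strategy, count = Counter(labels).most_common(1)[0]
    return strategy if count / len(labels) > threshold else None


# Label lists for two hypothetical languages.
language_a = ["before"] * 6 + ["until"] * 2 + ["only.when"] * 2
language_b = ["until"] * 4 + ["as.long.as"] * 3 + ["if"] * 3

print(dominant_strategy(language_a))  # a dominant strategy exists
print(dominant_strategy(language_b))  # mixed: no strategy exceeds 50%
```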
In the data considered, conventionalization (dominant markers) is so strong that it is the major signal in the multidimensional scaling and principal component analyses. Hence, it is safe to conclude that a large part of the ‘not…until’ domain is strongly conventionalized in European languages, but all languages also have less conventionalized parts where language-internal variation occurs. Our results demonstrate that the encoding of the ‘not…until’ domain can only be properly understood if cross-linguistic and language-internal variability are both taken into account at the same time.
We have shown that the various strategies in Table 1 are not entirely synonymous. Yet, we cannot say either that different markers in the ‘not…until’ domain in European languages have neatly distinct meanings. There are no strict semantic borders across the domain and thus no strict absence of synonyms. The various strategies can safely be considered to be near-synonyms since the semantic differences between them are entirely gradual: they differ in meaning only as a tendency. Two strategies, the ones at the extreme poles, if and only.when, are somewhat more different, whereas before, until and as.long.as overlap to a larger extent. We have attested both “underdifferentiation” and “overdifferentiation” in this domain:
Underdifferentiation: In some languages, not all strategies can be distinguished. Lithuanian kol, for instance, means both ‘as long as’ and ‘until’. Hence, the two strategies until and as.long.as are not easily distinguished.
Overdifferentiation: Some languages have more than one connective of the same “type”. Greek mexri (na), méxris ótu, éos otu and óspu na all mean ‘until’. Swedish has a connective (inte) förrän dedicated to ‘(not)…until’, which is different from both innan ‘before’ and tills ‘until’.
Our results show that the ‘not…until’ domain not only hosts temporal, but also conditional connectives. Given the nature of political linkage discussed in Section 2.3, this need not come as a surprise. Dimension analysis in Section 5 indicates that we can map the data on a scale of more temporal expressions (e.g., until) vs. more conditional expressions. There is no strict borderline between temporal and conditional meaning in this domain, as we can easily understand a phase change of an eventuality e1 at the time of another eventuality e2 as e2 being a condition for the occurrence of e1.
Further Issues
Research questions Q1-Q4 do not in any way exhaust the range of issues that could be picked up. We illustrate this with a brief note on the order of main and subordinate clauses in the constructions under investigation.
Figure 7 shows, by the size of the circles, the ratio of initial subordinate clauses averaged across all 21 languages of the D2 sample.
As can be seen, final position of the subordinate clause strongly prevails and there is no obvious pattern of distribution of deviant initial order across the clusters. As many as 44 contexts never have initial order in any translation and the maximum value is 0.9 (there is no context where all languages have initial subordinate clauses). These findings agree with our hypothesis that speakers typically put the ‘linkee’-perspective first in a configuration of linkage (see Section 2.3). However, it may also be the case that word order preference is biased by the choice of initial search strings.
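The per-context ratio discussed above can be computed as in the following Python sketch. It assumes, hypothetically, that every translation of a context is coded as either "initial" or "final" according to the position of the subordinate clause; the context identifiers and values are invented toy data.

```python
# Illustrative sketch with invented toy data; the coding scheme is an assumption.
from typing import Dict, List


def initial_ratio(orders: List[str]) -> float:
    """Share of translations of one context with the subordinate clause in initial position."""
    if not orders:
        raise ValueError("a context needs at least one translation")
    return sum(order == "initial" for order in orders) / len(orders)


# One list per context, one entry per (hypothetical) language.
toy_orders: Dict[str, List[str]] = {
    "c1": ["final", "final", "final", "final"],
    "c2": ["initial", "final", "final", "final"],
    "c3": ["initial", "initial", "initial", "final"],
}

ratios = {context: initial_ratio(orders) for context, orders in toy_orders.items()}
never_initial = sum(ratio == 0.0 for ratio in ratios.values())

print(ratios)
print(never_initial, max(ratios.values()))
```

In a plot like the one described, each ratio would determine the size of the circle drawn for its context.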
Other questions not addressed in this paper are the relationship of ‘not…until’ and tense and aspect forms in main and subordinate clauses and the great variability of expressions in different languages used in the only.when-strategy.
7. Conclusions
In this study we have investigated the expression of ‘not…until’ in the Europarl parallel corpus in two datasets: D1 (7 languages and 225 datapoints) and D2 (21 languages and a more restricted set of 79 datapoints). We set out with a set of paraphrases, and our research questions concerned the ways these strategies are reflected in the cross-linguistic corpus data, how the different strategies interact with polarity, to what extent diversity is constrained cross-linguistically and language-internally, and what the interplay of temporality and conditionality is in the ‘not…until’ domain. In both datasets we found that languages have a bewildering wealth of different constructions to convey the ‘not…until’-meaning. Further analysis of dataset D2, based on analysis of clusters and dimensions in semantic maps created by MDS, reveals that this variation is neither unlimited, nor purely a matter of free variation. We were able to identify clusters of meaning corresponding to before, if, and only.when, as well as a cluster of highly conventionalized expressions of the ‘not…until’-meaning. The interaction between connectives and polarity is stable in the sense that cross-linguistically, categories of connectives combine with a single polarity pattern in the main and subordinate clauses (see Table 5), unless there is a specific reason for deviation, such as expletive negation. This aligns with predictions from the formal literature that different semantic encodings of the ‘not…until’-meaning are semantically/pragmatically equivalent, but originate in different lexicalizations of the construction (de Swart 1996). We have thus shown how an analysis of parallel corpus data can verify predictions about meaning composition made in the semantic literature. However, the corpus data not only confirm earlier predictions but also expand our perspective. Many examples deal with possible future events in terms of linkage expressing a crossover of interests and control, contexts which have so far been largely ignored in the semantic literature.
To summarize, what we find is much more diversity than expected from earlier semantic and typological literature, but also some very clear trends in how diversity is constrained both cross-linguistically and language-internally. Despite some obvious methodological difficulties in using translation data, we cannot see any way in which the results we obtained could be reached by other methodologies. Our study demonstrates that cross-linguistic corpus research is indispensable in semantic studies. Semantic studies cannot abstract from cross-linguistic and language-internal diversity before having controlled for it, which presupposes empirical cross-linguistic and corpus research.