4.1.2. Results

To explore the nonlinear effects of frequency and collocate diversity on observed variance, we fitted generalized additive mixed models (GAMMs) [35] using the *mgcv* package for R. In baseline model 1, we model the normalized number of observed corpus variants as a function of a smooth over log frequency. In baseline model 2, we model the number of variants as a function of a smooth over collocate diversity, that is, the log normalized number of preceding words we observe in the corpus.
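In schematic form, the two baseline models can be written as below. This is only a sketch under stated assumptions: the symbols are introduced here for illustration ($V_i$ for the normalized number of observed variants of word $i$, $f_i$ for its frequency, $d_i$ for its collocate diversity, $s_1$ and $s_2$ for the smooth functions), and the random-effects terms, smoothing basis, and error distribution are omitted because the text does not specify them.

```latex
\begin{align}
\text{Model 1:}\quad V_i &= \beta_0 + s_1(\log f_i) + \varepsilon_i \\
\text{Model 2:}\quad V_i &= \beta_0 + s_2(d_i) + \varepsilon_i
\end{align}
```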

Model 1 shows a strong, nonlinear effect of frequency (*p* < 0.0001). It yields an *R*<sup>2</sup> of 0.435 and explains 43.5% of the deviance in the data (*edf* = 5.05, *AIC* = 19852.04). Model 2 shows a strong, nonlinear effect of diversity of collocates in the preceding position (*p* < 0.0001), explaining 74.6% of the variance in the data (*edf* = 6.922, *R*<sup>2</sup> = 0.746, *AIC* = 11777.66).

We assessed the goodness of fit of both models by the Akaike Information Criterion (AIC). Relative to model 1, model 2 improved the score by 8074.38. To contrast the contribution of both predictors, we modeled word variance as a function of smooths over log normalized word frequency and collocate diversity (the log normalized number of preceding words observed in the corpus) in a combined model 3. Model 3 (*R*<sup>2</sup> = 0.746, *AIC* = 11548.74) reduced the AIC by a further 228.92 relative to model 2. Both predictors are highly significant (*p* < 0.0001).
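The AIC differences quoted above follow directly from the reported scores. The short Python sketch below only reproduces that arithmetic; the dictionary and the helper `aic_drop` are illustrative names, not part of the original analysis, which was carried out in R.

```python
# AIC values as reported in the text for models 1-3.
aic = {"model 1": 19852.04, "model 2": 11777.66, "model 3": 11548.74}


def aic_drop(reference: str, candidate: str) -> float:
    """Return how far the candidate's AIC lies below the reference's.

    A positive value means the candidate model has the lower (better) AIC.
    """
    return round(aic[reference] - aic[candidate], 2)


print(aic_drop("model 1", "model 2"))  # 8074.38
print(aic_drop("model 2", "model 3"))  # 228.92
```

Lower AIC indicates a better trade-off between fit and model complexity, so the large drop from model 1 to model 2 and the much smaller drop from model 2 to model 3 match the pattern described in the text.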

Interestingly, the plots show that the frequency effects predicted by the baseline model 1 and the combined model diverge substantially across frequency ranges (see Figure 7a,c), suggesting that the effect of frequency is largely overestimated in the low-frequency and mid-frequency ranges by the baseline model. It further appears that a large part of the frequency effect is confounded by the correlation between word frequency and the number of collocate contexts a word appears within. There remains, however, a strong effect of frequency observable in high-frequency words. The high-frequency part of the data behind the effect comprises 82 function words, 57 nouns, and 47 verbs, representing 69%, 1%, and 2% of unique types, respectively.

Word frequency appears to influence the extent to which a word varies in form only in high-frequency words and thus holds for type variation across lexical categories to the degree to which the category is represented in the high-frequency tail of the word distribution. We further observe a stronger correlation between collocate diversity and word frequency in function words (*r*(143) = 0.882, *p* < 0.0001) than in verbs (*r*(2549) = 0.665, *p* < 0.0001) and nouns (*r*(5626) = 0.593, *p* < 0.0001).
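As a reading aid for the correlation notation, the sketch below recovers the implied sample size and the *t* statistic for each reported coefficient. It assumes the conventional reading of *r*(df), with df = *n* − 2, and the standard *t* test for a Pearson correlation; whether the original analysis used exactly this test is an assumption. The function names are illustrative.

```python
import math

# Correlations between collocate diversity and word frequency,
# as reported in the text: category -> (degrees of freedom, r).
reported = {
    "function words": (143, 0.882),
    "verbs": (2549, 0.665),
    "nouns": (5626, 0.593),
}


def sample_size(df: int) -> int:
    """Sample size implied by r(df), assuming df = n - 2."""
    return df + 2


def t_statistic(r: float, df: int) -> float:
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(df / (1.0 - r * r))


for category, (df, r) in reported.items():
    print(f"{category}: n = {sample_size(df)}, t = {t_statistic(r, df):.2f}")
```

Because the *t* statistic grows with both *r* and the sample size, all three correlations are far from zero even though the coefficient itself is clearly largest for function words.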

**Figure 7.** Baseline word variance model comparison: (**a**) the log normalized number of observed variants as a function of smooth over log frequency (derived from the spoken part of COCA); (**b**) the log normalized number of observed variants as a function of collocate diversity, the log number of preceding words; and (**c**,**d**) the effects shown in (a) and (b) as estimated in a combined model.

Finally, we fitted a set of combined models, adding in the log number of distinct parts of speech following each word for all words (model 4) and adding in lexical category as a covariate factor (model 5). In model 4, we observe a fairly weak effect of frequency (*p* < 0.006; see Figure 8a), while the effects of the context predictors were strong. The AIC score is reduced by 531.

**Figure 8.** Log normalized number of observed variants as a function of smooth over log frequency (row 1) and adjacent token diversity (rows 2 and 3) for all words (**a**), function words (**b**), verbs (**c**), and nouns (**d**): when collocate diversity is taken into account, frequency affects variation only in a minimal proportion of high-frequency nouns and appears to have no effect at all on verbs and function words.

In model 5, the introduction of lexical category as a covariate further reduces the AIC score by 254 points and explains 76.8% (*R*<sup>2</sup> = 0.766) of the deviance observed in the data. The effect of frequency is not significant in verbs (*p* < 0.816) or function words (*p* < 0.062) and is statistically significant but weak in nouns (*p* < 0.018). Again, all contextual predictors are highly significant in function words, nouns, and verbs. The same pattern was observed for all of the other analyzed categories, with the following exceptions: filled pauses and numbers show an effect of frequency (*p* < 0.002); contractions are unaffected by collocate diversity (*p* = 0.21); and there is no interaction between modal verbs and the upcoming collocate context (*p* = 0.1). Modals, numbers, and contractions comprise 0.008% of the analyzed data set. We observe differences in the effect of preceding collocate diversity between verbs and nouns: in nouns, the effect and the confidence interval both increase linearly, while in high-frequency verbs the effect levels off, showing an increase in variance.

A closer examination of the data reveals that the relationship between word frequency and collocate diversity differs significantly across frequency ranges for verbs and nouns. Collocate diversity is much higher in high-frequency verbs and function words than it is in high-frequency nouns. There is also far more variance in the effect in high-frequency nouns.
