1. Introduction
In corpus linguistics, we are interested in systematic analyses of large collections of texts (corpora) to gain insights into language usage patterns, structure, and meaning. Here, full-text corpora (i.e., collections where the sequential ordering of words and their distribution over documents are preserved) are the most general resource, in the sense that all the information that might become relevant during a research endeavor can be derived from them, especially all measures that rely on sequential information. However, many interesting language resources that we want to include in corpora are subject to restrictions under national copyright law or licenses. Licenses are (often) a matter of negotiation between licensees and licensors and might even differ between text sources [1]. This not only complicates corpus compilation considerably but is also an obstacle to efforts regarding standards of open science [2]. In other words, in an ideal world shaped for corpus-linguistic research, we would be able to distribute all corpus resources freely to everyone who wants to conduct original research or replicate findings based on these resources.
Since, unfortunately, we do not live in this ideal world, the corpus-linguistic community often has to compromise when distributing corpora. For example, we might prevent access to the full-text corpora themselves and give users the opportunity to access them via corpus platforms that are designed to search for specific patterns and allow for a specified set of analyses and the extraction of small parts of the corpus (e.g., keyword-in-context outputs). Several such corpus platforms are available for the German language. COSMAS II and KorAP provide access to the corpus on which the dataset presented here is based (COSMAS II is available via https://cosmas2.ids-mannheim.de; KorAP is available via https://korap.ids-mannheim.de (accessed on 7 November 2023)). Access to another German corpus resource is offered by the corpus search of the Digital Dictionary of the German Language (DWDS) (available via https://www.dwds.de/r (accessed on 7 November 2023)). While these platforms open many linguistic research avenues, there are also many large-scale corpus-linguistic procedures (e.g., calculations of transition probabilities for all 2-grams, i.e., two-word sequences, in a corpus) for which access through corpus platforms, including the ones mentioned above, is not sufficient.
For such applications, n-gram frequency lists are another possibility for allowing researchers to leverage large-scale corpora (n-grams are adjacent sequences of words where n encodes the window size: 1-grams (unigrams) are single words, 2-grams (bigrams) are adjacent sequences of two words, and so on). Maybe the best-known of such lists are the Google Books corpora [3], which are also available for German but come with their own restrictions, like a frequency threshold and other issues [4,5]. Frequency lists enable researchers to devise a range of measures (for example, the transitional probabilities mentioned above or more sophisticated language models based on n-grams) and can be used to train word-level machine learning models with a fixed context [6]. Basically, frequency lists become relevant whenever all items contained in the corpus need to be factored into the analyses. Frequency lists can also inform joint research with other linguistic disciplines. Word frequencies (or derived measures) might become relevant in choosing stimuli for psycholinguistic studies (e.g., if the corpus frequency of experimental items should be held constant or manipulated in an experimental setting). They are also used as covariates or predictors for behavioral data, such as eye movements during reading [7,8] or ERP data [9,10].
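To illustrate the kind of measure that requires complete frequency lists, the following minimal Python sketch derives maximum-likelihood transition probabilities P(w2 | w1) from 2-gram counts. The function name and the toy counts are hypothetical and only serve as an illustration; they are not taken from the dataset, and the sketch works on plain strings rather than on the integer codes used in the published lists.

```python
from collections import Counter


def transition_probabilities(bigram_counts):
    """Maximum-likelihood estimate of P(w2 | w1) from 2-gram counts.

    bigram_counts maps (w1, w2) tuples to corpus frequencies.
    """
    # Total frequency of each first word, summed over all continuations.
    first_totals = Counter()
    for (w1, _w2), freq in bigram_counts.items():
        first_totals[w1] += freq
    # Relative frequency of each continuation given its first word.
    return {
        (w1, w2): freq / first_totals[w1]
        for (w1, w2), freq in bigram_counts.items()
    }


# Hypothetical toy counts, not real DeReKoGram data.
toy_bigrams = {("der", "Hund"): 3, ("der", "Mann"): 1, ("ein", "Hund"): 2}
probs = transition_probabilities(toy_bigrams)
print(probs[("der", "Hund")])
```

Because every 2-gram starting with a given word contributes to the denominator, such estimates cannot be obtained from platforms that only return a sample of matches.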
Here, we introduce new (up to 3-gram) frequency lists based on the German Reference Corpus “DeReKo” [1] that contain lemma, part-of-speech (POS), and frequency information from a corpus of around 43.2 billion tokens. We will first describe which parts of DeReKo provide the basis for the frequency lists and introduce the data structure. We will then evaluate the distribution of data over the 16 parts (henceforth “folds”, Section 3) and present a case study (Section 4) on vocabulary growth utilizing several cleaning stages in the dataset. Further details and sample code for using the dataset are available at https://www.owid.de/plus/derekogram/ (accessed on 7 November 2023) (Supplementary Materials). With this accompanying webpage, we would like to facilitate work with DeReKoGram for as many researchers as possible. Based on a linguistic example, we give pointers on how to back-translate the integer codes (see Section 2 for an explanation) to human-readable wordforms and lemmas; how to aggregate, lower (i.e., transform all characters to lower case), and clean the dataset; and how to search for specific patterns. For Python, we also show how to train smoothed n-gram language models with DeReKoGram. For Stata and R code for another linguistic application, please see the supplementary material of a previous study [11].
3. Evaluation of Fold Distribution
In what follows, we evaluate the underlying sampling process, which was carried out using the cryptographic hash function BLAKE2b [15] to create the 16 folds, by comparing several frequency parameters over all folds. We did so for a corpus that had not been cleaned in any way, separately for the raw and the lowered datasets (i.e., where all characters had been converted to lower case).
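The general idea of hash-based sampling can be sketched in a few lines of Python. Note that the text only states that BLAKE2b was used; what exactly was hashed, the digest size, and the mapping from digest to fold are assumptions made here for illustration (the document identifiers below are hypothetical as well).

```python
import hashlib

N_FOLDS = 16


def assign_fold(doc_id: str, n_folds: int = N_FOLDS) -> int:
    """Map a document identifier to a fold number in 1..n_folds.

    Assumption: the identifier string is hashed and the digest is
    reduced modulo the number of folds. The actual procedure used
    for the dataset may differ.
    """
    digest = hashlib.blake2b(doc_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_folds + 1


# Hypothetical document identifiers.
for doc_id in ("doc-0001", "doc-0002", "doc-0003"):
    print(doc_id, assign_fold(doc_id))
```

A hash-based assignment is deterministic (the same document always lands in the same fold) while behaving like a random draw with respect to document content, which is the property the evaluations in this section rely on.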
We chose five “direct” measures: (1) the number of different wordforms, i.e., the number of wordform types; (2) the number of hapax legomena; (3) the number of different wordforms tagged as normal nouns; (4) the number of different wordforms tagged as function words; and (5) the ratio between the number of name/noun tokens and finite verb tokens. One may wonder why we do not evaluate the number of wordform tokens. However, this would be somewhat trivial, since the sampling process randomly assigned corpus documents to folds (see Section 2). The length of the documents (measured in tokens) may therefore differ only minimally between the folds. With the evaluations presented here, we aim to show that this similarity also holds for somewhat more complex measures.
Part-of-speech tags are taken from the automatic (and non-corrected) classification of the TreeTagger, which uses the Stuttgart-Tübingen-Tagset [16]. The following tags were classified as function words: APPO, APPR, APPRART, APZR, ART, KOKOM, KON, KOUI, KOUS, PAV, PDAT, PDS, PIAT, PIS, PPER, PPOSAT, PPOSS, PRELAT, PRELS, PRF, PTKA, PTKNEG, PTKZU, PWAT, PWAV, PWS.
We also report two “derived” measures: (6) the type-token ratio (where wordforms are treated as types) and (7) the entropy of the complete frequency distribution for wordforms. Entropy $h$ is defined as
$$h = -\sum_{i=1}^{N} p_i \log p_i,$$
where $p_i$ is the relative frequency of wordform $i$ and $N$ is the number of wordform types in the corpus fold.
For all these measures, we calculated the difference in percentages between the minimum and maximum value over the 16 folds:
$$\Delta_{1\text{-gram}} = \frac{\max(x_1, x_2, \ldots, x_{16}) - \min(x_1, x_2, \ldots, x_{16})}{\max(x_1, x_2, \ldots, x_{16})},$$
where $x_n$ is the value of the respective measure in the $n$th fold (accordingly for 2-grams). Ideally, the difference percentages should be close to zero, which would indicate no difference between the corpus folds. We also calculated coefficients of variation (CV, also known as relative standard deviation, RSD; also reported as percentages), defined as
$$\mathrm{CV} = \frac{\sigma_x}{\mu_x},$$
where $\sigma_x$ is the standard deviation and $\mu_x$ the mean of the respective measure over the 16 folds.
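The derived measures and the two dispersion statistics can be sketched in Python as follows. This is a minimal illustration, not the code used for the evaluation: the function names are hypothetical, the logarithm base (2, i.e., entropy in bits) and the use of the population standard deviation are assumptions, since neither is specified in the text.

```python
import math
from statistics import mean, pstdev


def type_token_ratio(freqs):
    """Number of wordform types divided by number of tokens."""
    return len(freqs) / sum(freqs.values())


def entropy(freqs):
    """Entropy of a frequency distribution (assumption: base-2 logarithm)."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total) for f in freqs.values())


def difference_percentage(values):
    """(max - min) / max over the per-fold values, as a percentage."""
    return (max(values) - min(values)) / max(values) * 100


def coefficient_of_variation(values):
    """Standard deviation divided by mean, as a percentage.

    Assumption: population standard deviation (pstdev).
    """
    return pstdev(values) / mean(values) * 100


# A measure whose per-fold minimum is 764 and maximum is 772:
print(round(difference_percentage([764, 772]), 2))
```

In practice, each function would be applied once per fold, and the two dispersion statistics would then be computed over the resulting 16 values.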
Table 2 shows that the highest difference percentages for 1-grams and the highest coefficients of variation can be observed for the number of different wordforms tagged as function words. This is presumably because the number of different function-word wordforms is much smaller than that of, e.g., nouns, so that small absolute differences translate into larger relative differences. For example, for the lowered folds, the number of different function-word wordforms ranges from 764 to 772. On the 2-gram level, the difference percentages and the coefficients of variation are also very low.
Another way to compare the folds on the 1-gram level is to compare the frequency ranks of wordforms over all 16 corpus folds. Since each wordform has been assigned a numerical code based on its token frequency rank in the overall corpus, we can correlate the codes of the first n wordforms (ranked by their frequency in each fold) with the respective first n codes in every other fold. This yields a 16-by-16 correlation matrix with Spearman’s correlation coefficients ρ. Ideally, all ρs should be very close to 1, which would indicate a perfect match of frequency ranks over all corpus folds. We calculated correlation matrices with increasing values of n for raw corpus folds without any cleaning and extracted the lowest ρ (ρmin) for each value of n.
Table 3 shows that all values for ρmin are very close to 1 but steadily decrease as n increases. This is because the token frequencies in higher ranks (= lower frequencies) tend to produce more ties, i.e., wordforms with equal frequencies. In other words, wordform frequencies lose their distinguishing power for higher values of n. For example, in fold 4, the token frequency for all wordforms with the in-fold token frequency ranks of 995,506 to 1,018,394 (22,889 wordforms) is 28. A similar pattern can also be observed for other folds: in fold 13, all wordforms with the in-fold frequency ranks 994,655 to 1,017,581 (22,927 wordforms) share a frequency of 28. This lack of distinguishing power in the lower frequency ranges consequently leads to lower values of ρmin.
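The rank comparison described above can be sketched in pure Python as follows. The sketch assumes that each fold is represented as a list of wordform codes sorted by descending in-fold frequency; the function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def rho_min(fold_codes, n):
    """Lowest pairwise rho over the first n codes of every fold."""
    return min(
        spearman(a[:n], b[:n]) for a, b in combinations(fold_codes, 2)
    )


# Toy data: three "folds" with codes ordered by in-fold frequency.
toy_folds = [[1, 2, 3, 4, 5], [1, 2, 3, 5, 4], [2, 1, 3, 4, 5]]
print(rho_min(toy_folds, 5))
```

For the full dataset, a library implementation of Spearman's ρ would be the more efficient choice; the manual version is used here only to keep the sketch free of dependencies.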
Furthermore, we Spearman-correlated all POS frequency distributions in the 16 folds with all other folds, again yielding a 16-by-16 correlation matrix. Here, all ρs were larger than 0.9999.
Finally, we checked whether the frequency distributions as captured by the Zipf-Mandelbrot power law [17,18] yielded comparable parameters over all folds. Using the R package zipfR [19], we fitted large-number-of-rare-events (LNRE) models to the frequency distributions of (unigram) wordforms for each fold. Indeed, the parameter ranges were very narrow, with the exponent parameter α ranging from 0.6316296 to 0.6316336 and the second free parameter β ranging from 0.00043567 to 0.00043572.
Given these comparisons, we concluded that the sampling process produced 16 homogeneous folds. Methodologically, this means that many analyses performed using all 16 parts should produce very similar results when performed using only one fold.
4. Case Study: Vocabulary Growth
First, it must be noted that we do not understand vocabulary growth here in the sense of language acquisition research. Here, we investigate how vocabulary develops as the corpus grows [20]. Of course, as corpus size increases, we would expect the vocabulary to grow. However, vocabulary growth curves should differ according to the amount and type of cleaning we apply to the frequency lists, because an increasing number of wordforms are being excluded from the dataset. For this case study, we applied various cleaning stages:
A. no cleaning at all;
B. exclusion of punctuation, names, and start-end-symbols (all identified via their respective POS tags), URLs, and wordforms only consisting of numbers (both identified by regular expressions);
C. exclusion of wordforms containing numbers;
D. exclusion of wordforms that contain upper-case letters following lower-case letters. In this cleaning stage, we exclude wordforms with non-conventional capitalization (e.g., dEr) while, at the same time, keeping capitalized abbreviations (e.g., NATO). For this cleaning stage, we only use the raw version because there cannot be any difference to the lowered version;
E. exclusion of wordforms where the TreeTagger could not assign a lemma;
F. selection of wordforms that are themselves (or whose associated lemma is) on a basic lemma list (BLL) of New High German standard language to identify a set of conventionalized wordforms [21]. Note that this means that, for example, the inflected wordform Weihnachtsmannes is also included: it is not on the BLL itself, but the associated lemma Weihnachtsmann is. Another example is the wordform u., which is shorthand for the lemma und. For more information regarding this basic lemma list, please refer to Koplenig et al. [11].
Cleaning stages A through D are cumulative. For example, cleaning stage D incorporates stages B and C. Stages E and F, however, both rely on stage D, because they can be understood as being equivalent regarding their aim: identifying “true” lemmas and wordforms of German. We chose these cleaning stages because they represent very general selections/exclusions in potential research projects in corpus or computational linguistics. One could also think of these cleaning stages as becoming more rigorous towards the core vocabulary of the language in each step. Of course, the datasets provided can also be used to test other selections adapted to specific research questions (e.g., only selecting certain POS or applying frequency thresholds).
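As a rough illustration of how the cumulative stages A through D could be applied to single entries of a frequency list, consider the following Python sketch. All names are hypothetical, and several details are assumptions: the URL pattern is a simple stand-in, the tag set only covers the STTS punctuation tags and the proper-name tag NE (the tag for start-end symbols is not given in the text and is therefore omitted), and “following” is interpreted as “anywhere after a lower-case letter”.

```python
import re

# STTS punctuation tags and the proper-name tag. Assumption: the tag
# used for start-end symbols is unknown and therefore not included.
EXCLUDED_TAGS = {"$.", "$,", "$(", "NE"}
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)  # simple stand-in
DIGITS_ONLY_RE = re.compile(r"^\d+$")
HAS_DIGIT_RE = re.compile(r"\d")


def has_upper_after_lower(wordform):
    """True if an upper-case letter occurs after a lower-case letter."""
    seen_lower = False
    for ch in wordform:
        if ch.islower():
            seen_lower = True
        elif ch.isupper() and seen_lower:
            return True
    return False


def passes_stage(wordform, pos, stage):
    """True if the entry survives the cumulative cleaning stage A-D."""
    if stage >= "B" and (
        pos in EXCLUDED_TAGS
        or URL_RE.search(wordform)
        or DIGITS_ONLY_RE.match(wordform)
    ):
        return False
    if stage >= "C" and HAS_DIGIT_RE.search(wordform):
        return False
    if stage >= "D" and has_upper_after_lower(wordform):
        return False
    return True


print(passes_stage("dEr", "ART", "D"), passes_stage("NATO", "NN", "D"))
```

Stages E and F would additionally require the lemma information from the dataset and the basic lemma list, respectively, and are therefore not covered by this sketch.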
4.1. Number of Wordform Types
We will first examine how the number of wordform types develops when including an increasing number of the 16 corpus folds. Figure 1 shows that the first four cleaning stages (panels A through D) exhibit roughly the same overall pattern for raw and lowered versions: the vocabulary growth curves do not show clear signs of approaching a ceiling value. This finding replicates several studies for English language corpora that were summarized by [22], who also “failed to find any flattening of the predicted linear curve, indicating that the pool of possible word types was still far from exhausted”. There is, in other words, “no indication of a stop to the growth”, which is an instantiation of Herdan’s [23] or Heaps’ [24] law.
The final two cleaning stages (panels E and F) quickly showed asymptotic behavior, but only in the vocabulary growth curves for the lowered dataset (grey lines). This is especially true for cleaning stage F, where we restricted the corpus to a fixed set of wordforms. For the raw corpus version, many new forms were still observed, even as the dataset approached the full corpus.
The same data can also be visualized as percentage increases as more folds are included (Figure 2). There is virtually no difference between the raw vs. lowered versions for the first three cleaning stages (panels A through C), and the number of observed wordforms still increases in the last step (15 to 16 folds), by approx. 4%. This is remarkably close to the figure reported by Miller and Biber [25], who used a corpus of ten introductory psychology textbooks and investigated the growing lexical diversity when adding whole textbooks to their sample one after another. However, this similarity might be coincidental, because Miller and Biber used lemmas (not wordforms) and a much more thematically restricted corpus in their study. Moreover, their overall corpus was much smaller.
The results for cleaning stages E and F are different: in panel E, the percentage increase for the lowered dataset falls below 1% in the fourth step (4 to 5 folds). In panel F, this already happens in the second step (2 to 3 folds). The lowest percentage increase is observed for the last step of the lowered corpora for cleaning stage F: 0.03% (or, in absolute numbers, 157 newly observed wordforms). One of these wordforms (habilitationsprofessur) appears 5 times, and 14 wordforms appear twice. Note that these numbers and the wordforms themselves depend on which fold is added last to the growing dataset. Here, we simply included folds 1 to 16 consecutively. So, for cleaning stages E and F, we can conclude that the boundless growth in vocabulary is far less pronounced than in the previous cleaning stages. This makes sense given that E and F are the cleaning stages where we tried to identify a basic set of German wordforms.
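The computation behind such growth curves can be sketched as follows: folds are merged one after another, and after each step the number of wordform types and the percentage increase relative to the previous step are recorded. The function names and toy frequency lists are hypothetical; real folds would be loaded from the dataset files.

```python
from collections import Counter


def vocabulary_growth(folds):
    """Number of wordform types after merging folds 1..k, for each k."""
    merged = Counter()
    sizes = []
    for fold in folds:
        merged.update(fold)  # adds the fold's frequencies to the totals
        sizes.append(len(merged))
    return sizes


def percentage_increases(sizes):
    """Percentage increase in types from each step to the next."""
    return [(b - a) / a * 100 for a, b in zip(sizes, sizes[1:])]


# Hypothetical toy folds mapping wordforms to frequencies.
toy_folds = [
    {"der": 5, "Hund": 1},
    {"der": 4, "Katze": 1},
    {"der": 6, "Hund": 2},
]
sizes = vocabulary_growth(toy_folds)
print(sizes, percentage_increases(sizes))
```

As noted above, the resulting numbers depend on the order in which folds are added, so a different ordering of the input list would yield a slightly different curve.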
4.2. Number of Hapax Legomena
Hapax legomena (henceforth: HL), items appearing only once in a frequency list, are often used in calculations of quantitative linguistic measures, for example in analyses concerning productivity [26]. It is therefore interesting to see how the number of HL is influenced by corpus size (=number of folds), cleaning stage, and lowering of the dataset.
In terms of frequency, we would expect more HL in larger corpora, especially when no or rather light (cleaning stages A through D) cleaning is performed, because we observe more and more non-canonical wordforms as the corpus becomes larger. However, it is hard to hypothesize what the pattern would look like for the last two cleaning stages E and F.
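How the number of HL can be tracked over a growing corpus, for both a raw and a lowered version, is sketched below. The function names and the toy folds are hypothetical. Note that a wordform counts as HL only if its frequency in the merged dataset is exactly 1, so the number of HL can decrease when a later fold contains a second occurrence.

```python
from collections import Counter


def lower_fold(fold):
    """Lower all wordforms and aggregate frequencies of merged entries."""
    lowered = Counter()
    for wordform, freq in fold.items():
        lowered[wordform.lower()] += freq
    return lowered


def hapax_growth(folds):
    """Number of hapax legomena after merging folds 1..k, for each k."""
    merged = Counter()
    counts = []
    for fold in folds:
        merged.update(fold)
        counts.append(sum(1 for freq in merged.values() if freq == 1))
    return counts


# Hypothetical toy folds mapping wordforms to frequencies.
raw_folds = [
    {"der": 5, "Hund": 1},
    {"der": 4, "HUND": 1},
    {"der": 6, "Katze": 1},
]
print(hapax_growth(raw_folds))
print(hapax_growth([lower_fold(f) for f in raw_folds]))
```

In the toy example, the irregularly capitalized form HUND is a separate hapax in the raw version but is merged with Hund in the lowered version, which illustrates why raw and lowered curves need not follow the same trajectory.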
There are no considerable differences between the first four cleaning stages (panels A through D in Figure 3) and, indeed, none of the curves show any signs of approaching a point where no new HL are being added after a specific corpus size, which would be indicated by the curve approaching a horizontal asymptote. The highest number of HL was observed for the full raw corpus without any cleaning in panel A of Figure 3 (66,149,313 wordforms, 58.0% of all wordforms).
The last two cleaning stages (panels E and F), which we consider quite “strict”, show a different pattern compared to the first four stages. In these cleaning stages, the trajectory of the curves for raw and lowered datasets differs. For the lowered dataset, the number of HL steadily decreases, which is what we would expect, given datasets where only recognized lemmas (stage E) or elements from a well-defined word list (stage F) are allowed. Consequently, the lowest number of HL was observed for the full corpus in cleaning stage F for a lowered dataset (2205 wordforms), which account for 0.37% of all wordforms at this point.
For the raw version, it is especially noteworthy that the number of HL first decreased and then increased again. In stage E, the lowest point of the curve was reached for 4 folds (239,461 HL, 8.6% of all wordforms) with steadily rising counts until the corpus was complete (304,293 HL, 9.6%). To get an idea of where this effect comes from, we can look at the HL in the complete 16-fold dataset that were not observed in the 4-fold dataset (247,607 wordforms). These 228,156 HL had to be added at some point during the inclusion of folds 5 to 16. A total of 51.8% of these wordforms (n = 118,076) consisted of upper-case letters only (in stage D, we only excluded wordforms where upper-case letters follow at least one lower-case letter). Another 7.9% (n = 18,111) began with more than one upper-case letter (e.g., COusinen or HERZstiche), also indicating irregular capitalizations of wordforms that would not be hapax legomena if capitalized in a regular way. Thus, irregular capitalization seemed to play a large role in the increasing number of hapax legomena after the few initial folds for the final cleaning stages of raw corpora.
A total of 35,014 HL (15.3%) had an upper-case letter only in the word-initial position. Most of these wordforms turned out to be sentence-initial or nominalized forms of adjectives (Flauschigsten, Abwischbarer) and verbs (Durchtauchte, Lullten) or compound nouns (Waldgesundheitsprogramm, Dampflokomotivkessel). We made this observation by sampling 200 of these wordforms: 128 are adjectives or verbs with a word-initial upper-case letter; 62 are compound nouns (41 with two, 18 with three, and 3 with four constituent parts). In addition, several forms show alternative spellings of umlauts (e.g., Gegenstaendlich instead of Gegenständlich) or ß/ss alternation (e.g., Fussten or Verkehrskompaß) or simply typos (Unwahrschienlich instead of Unwahrscheinlich). Since all the effects reported above involve capitalization, they were not observed for lowered corpora. This explains the diverging patterns for the respective curves in panels E and F in Figure 3.
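The capitalization patterns discussed above can be operationalized with a small classifier such as the following Python sketch. The function and category names are hypothetical, and the handling of non-letter characters (they are simply ignored) is an assumption; the example wordforms are taken from the text.

```python
from collections import Counter


def capitalization_class(wordform):
    """Assign a wordform to one of the capitalization patterns.

    Assumption: only alphabetic characters are considered, so hyphens
    or periods do not influence the classification.
    """
    letters = [ch for ch in wordform if ch.isalpha()]
    if not letters:
        return "other"
    if all(ch.isupper() for ch in letters):
        return "all_upper"
    if len(letters) >= 2 and letters[0].isupper() and letters[1].isupper():
        return "multi_upper_start"
    if letters[0].isupper() and not any(ch.isupper() for ch in letters[1:]):
        return "initial_upper"
    return "other"


examples = ["NATO", "COusinen", "HERZstiche", "Flauschigsten",
            "Verkehrskompaß", "lullten"]
print(Counter(capitalization_class(w) for w in examples))
```

Applied to the list of newly added HL, counting the resulting categories would yield the kind of breakdown reported above.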